Gene-set enrichment analysis with topGO (part-1)

 

Introduction

Data analysis performed on high-throughput experiments usually produces lists of significantly perturbed genes (RNASeq) or other entities that can be mapped to genes, like genetic variants (whole genome sequencing) or Transcription factor binding sites (chIPSeq). The long lists of genes (often in the order of hundreds or thousands) produced as the outcome of statistical analysis on these high-throughput datasets are difficult to interpret by eye. In many occasions, biological patterns become evident only when the genes are grouped in pathways or other functional categories. There are many ways to infer biological meaning from results generated by high-throughput experiments. A common approach to interpreting these gene lists is enrichment analysis based on the functional annotation of these genes. The results of this analysis can be used to generate biological hypotheses.

Enrichment analysis is a collection of statistical methods for estimating how enriched is a gene set in a list of genes of interest, in simpler terms, are genes in the gene list coming from a gene set more frequently than what would be expected by chance? Multiple categories of methods are available: overrepresentation analysis, ranked enrichment analysis and conditional enrichment analysis. The names of these methods are not standardized so synonymous terms can be found in the literature. This analysis has few prerequisites.

A. We need a gene list that contains all the genes we are interested in. This is usually generated by a high-throughput experiment. Example: a gene list might be the list of all genes differentially expressed between treated and untreated patients.

B. Pathway analysis also requires gene sets definitions. Gene sets are usually characterized by a name (which represents a meaningful category like toll like receptor pathway or inflammation) and by the list of genes that belong to them (the association between category names and genes is a process usually called annotation).

         Gene Ontology (GO) is a hierarchical and curated vocabulary of gene categories, developed to respond to the need for standardization in annotation terms. In the hierarchical tree structure of GO each child node is a more specific term than its parents. GO terms are classified into three broad categories: cellular component (where the gene product is localized inside the cell), molecular function (what is the function of a gene product) and biological process i.e. which series of events is the gene product involved in. Before the development of Gene Ontology every biologist used slightly different names for similar functional categories or annotation terms. Nonetheless, sometimes it can be useful to interrogate more than one resource to discover enriched pathways.

C. Finally, methods to define which gene sets (think pathways) are significantly perturbed by an experiment are also necessary. There are three different types of enrichment analysis:

  1. Over-representation analysis– In this we determine the significance of the overlap between a list of genes of interest and a gene set by a statistical hypothesis test (Fisher’s Exact Test or Hypergeometric Test). If the significance is high the overlap between our list of genes and the gene set we are testing is bigger than what we would get by randomly selecting gene lists with the same number of genes. One of the limitations of overrepresentation analysis is it requires a threshold to be applied to obtain a list of genes of interest and the choice of threshold has an impact on the results of the enrichment. Further, the choice of universe also strongly effects the results we obtain [2].
  2. Rank-based enrichment analysis– This method uses the entire gene list tested in an experiment regardless of their significance. Genes are ordered according to a meaningful metric (eg: fold change, p-value, absolute t-static etc.) and no arbitrary threshold is required. Say, if the list of genes came from comparing cases from controls, the genes most different between the two categories would appear at the top but all genes would be included in the list. The method then tests if genes from a gene set tend to fall at the top of the list more frequently than randomly picked lists of genes of similar sizes. The enrichment is significant if the genes of the gene set are mainly located at the top of the ranked list and not significant if the genes of the gene set are randomly scattered across the list. Gene Set Enrichment Analysis (GSEA) works on this approach.
  3. Conditional enrichment analysis– Since genes in a GO gene set also belong to the parental GO gene set, one GO gene set might look enriched only because one of its children is enriched. This is a behaviour that might not be desirable in some cases. Conditional enrichment analysis can be used to account for this phenomenon in the estimation of enrichment significance. In this approach, all leaf gene sets (those without any children) are tested as usual. Then parents of these gene sets are tested but if one of their children is significant the genes belonging to this child are removed from the parent gene set. The analysis then proceeds to higher and higher levels till the root gene set (which has no parent) has been reached. Example of this approach is topGO [1].

In this blog, we illustrate how to use topGO for enrichment analysis. The next blog will show the usage of clusterProfiler, another R Bioconductor package for enrichment analysis.

topGO

topGO is an R Bioconductor package for gene set enrichment analysis for Gene Ontology (GO) terms. It provides tools for testing GO terms while accounting for the topology of the GO graph (conditional enrichment). Both overrepresentation analyses and rank-based methods can also be used. In this sense, it is versatile from other methods for Gene enrichment analysis like DAVID, GOStat etc. which uses the standard way to test for overrepresentation i.e. hypergeometric test.

One can employ various default algorithm/test combinations for eliminating local similarities and dependencies between GO terms. In this respect, the different algorithm/test combinations available in topGO are:

table1

The default algorithm used by the topGO package is a mixture between the elim and the weight algorithms and is referred as weight01.


It’s important to have the latest version of the R packages to get the most updated information from the GO databases.

source("https://bioconductor.org/biocLite.R")
biocLite("topGO")
biocLite("GO.db")
biocLite("biomaRt")
biocLite("Rgraphviz")

# Load the required R packages
library(topGO)
library(GO.db)
library(biomaRt)
library(Rgraphviz)

Step 1: Preparing the data in the required format

We need to tell TopGO what is our list of genes of interest, and what is the list of all genes in the ‘gene universe’ that we want to compare our list of interest to. The ‘gene universe’ can be all the genes in your experiment:

# Gene universe file
exp_data= read.table('background_experiment1.csv', header=TRUE, sep=',')
bg_genes=as.character(exp_data[,1])

# Read in genes of interest
candidate_list =read.table('candidate_genelist_experiment1.csv', header=TRUE, sep=',')
candidate_list= as.character(candidate_list[,1])

The Identifiers in the gene universe file and the candidate gene list should be of the same category i.e gene symbol or ensembl IDS or entrez IDs etc.

length(bg_genes)
[1] 24193
head(bg_genes)
[1] "SRY" "RPS4Y1" "ZFY" "PCDH11Y" "AMELY" "TBL1Y"

length(candidate_list)
[1] 215
head(candidate_list)
[1] "SPTBN1" "EPAS1" "PSMD4" "STAT1" "PITPNA" "SRRM1"

Step 2: Create GO annotation

The next information we need are the GO terms and the mapping that associate each gene in our study with one or more GO term(s)

# create GO db for genes to be used using biomaRt - please note that this takes a while
db= useMart('ENSEMBL_MART_ENSEMBL',dataset='hsapiens_gene_ensembl', host="www.ensembl.org")
go_ids= getBM(attributes=c('go_id', 'external_gene_name', 'namespace_1003'), filters='external_gene_name', values=bg_genes, mart=db)

One can see the list of valid attributes in bioMart by using the function:

listAttributes(db)

What if our gene list were not gene symbol but entrez IDS? In that case, we simply replace ‘external_gene_name‘ in the above query to ‘entrezgene‘.

# build the gene 2 GO annotation list (needed to create topGO object)
gene_2_GO=unstack(go_ids[,c(1,2)])

# remove any candidate genes without GO annotation
keep = candidate_list %in% go_ids[,2]
keep =which(keep==TRUE)
candidate_list=candidate_list[keep]

# make named factor showing which genes are of interest
geneList=factor(as.integer(bg_genes %in% candidate_list))
names(geneList)= bg_genes

Step 3: Make topGO data object

We need,

  1. Ontology: character string specifying the ontology of interest (BP, MF or CC).
  2. allGenes: named vector of type numeric or factor. The names attribute contains the genes identifiers. The genes listed in this object define the gene universe.
  3. nodeSize: an integer larger or equal to 1. This parameter is used to prune the GO hierarchy from the terms which have less than nodeSize annotated genes (after the true path rule is applied).
  4. annotationFun: function which maps genes identifiers to GO terms. There are a couple of annotation function included in the package trying to address the user’s needs. The annotation functions take three arguments. One of those arguments is specifying where the mappings can be found, and needs to be provided by the user. annFUN.gene2GO this function is used when the annotations are provided as a gene-to-GOs mapping.
GOdata=new('topGOdata', ontology='BP', allGenes = geneList, annot = annFUN.gene2GO, gene2GO = gene_2_GO)

Step 4: Test for significance

# define test using the classic algorithm with fisher (refer to [1] if you want to understand how the different algorithms work)
classic_fisher_result=runTest(GOdata, algorithm='classic', statistic='fisher')

Selecting ‘algorithm=classic’ means that the GO hierarchy isn’t taken into account, so each GO term is tested independently (over-representation enrichment). The limitation of using this is that all genes annotated to a GO terms will be automatically annotated to its parents as well, therefore a GO term might look enriched just because its children are enriched. Thus, it is important that GO hierarchy is taken into account (conditional enrichment) to avoid redundancy.

# define test using the weight01 algorithm (default) with fisher
weight_fisher_result=runTest(GOdata, algorithm='weight01', statistic='fisher') 

# generate a table of results: we can use the GenTable function to generate a summary table with the results from tests applied to the topGOdata object.
allGO=usedGO(GOdata)
all_res=GenTable(GOdata, weightFisher=weight_fisher_result, orderBy='weightFisher', topNodes=length(allGO))

Step 5: Correcting for multiple testing

Generally, the p-values returned by enrichment methods in topGO are interpreted as corrected or not affected by multiple testing. However, one can perform an adjustment of the p-values if they consider that it is important for their analysis.

Note: It is important to mention that for the methods that account for the GO topology, the problem of multiple testing is complicated. Here, one computes the p-value of a GO term conditioned on the neighbouring terms. The tests are therefore not independent and the multiple testing theory does not directly apply.

#performing BH correction on our p values
p.adj=round(p.adjust(all_res$weightFisher,method="BH"),digits = 4)

# create the file with all the statistics from GO analysis
all_res_final=cbind(all_res,p.adj)
all_res_final=all_res_final[order(all_res_final$p.adj),]

#get list of significant GO before multiple testing correction
results.table.p= all_res_final[which(all_res_final$weightFisher<=0.001),]

#get list of significant GO after multiple testing correction
results.table.bh=all_res_final[which(all_res_final$p.adj<=0.05),]

#save first top 50 ontolgies sorted by adjusted pvalues
write.table(all_res_final[1:50,],"summary_topGO_analysis.csv",sep=",",quote=FALSE,row.names=FALSE)

# PLOT the GO hierarchy plot: the enriched GO terms are colored in yellow/red according to significance level

pdf(file='topGOPlot_fullnames.pdf', height=12, width=12, paper='special', pointsize=18)
showSigOfNodes(GOdata, score(weight_fisher_result), useInfo = "none", sigForAll=FALSE, firstSigNodes=2,.NO.CHAR=50)
dev.off()

In the above code, one can change useInfo=”all” if they want to plot the GO term description and p values (in addition to GO term) within each of the nodes.

topGO_Plot_GOterms

Step 6: Get all the genes in your significant GO TERMS

To find the genes enriched with each of the significant GO terms we use the following code as this information is not provided by default in topGO.

myterms =results.table.p$GO.ID # change it to results.table.bh$GO.ID if working with BH corrected values
mygenes = genesInTerm(GOdata, myterms)

var=c()
for (i in 1:length(myterms))
{
myterm=myterms[i]
mygenesforterm= mygenes[myterm][[1]]
mygenesforterm=paste(mygenesforterm, collapse=',')
var[i]=paste("GOTerm",myterm,"genes-",mygenesforterm)
}

write.table(var,"genetoGOmapping.txt",sep="\t",quote=F)

The information about the R packages can be found below.

# print the package versions used ---#
sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

## attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base

## other attached packages:
[1] Rgraphviz_2.20.0 biomaRt_2.32.1 topGO_2.28.0 SparseM_1.77 GO.db_3.4.1 AnnotationDbi_1.38.2 IRanges_2.10.2 S4Vectors_0.14.3 Biobase_2.36.2 graph_1.54.0 BiocGenerics_0.22.0 limma_3.32.5 statmod_1.4.30
[14] stringr_1.2.0

## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 compiler_3.4.1 bitops_1.0-6 tools_3.4.1 digest_0.6.12 bit_1.1-12 evaluate_0.10.1 RSQLite_2.0 memoise_1.1.0 tibble_1.3.3 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.1.1 DBI_0.7
## [15] yaml_2.1.14 knitr_1.16 bit64_0.9-7 rprojroot_1.2 XML_3.98-1.9 rmarkdown_1.6 blob_1.1.0 magrittr_1.5 backports_1.1.0 matrixStats_0.52.2 htmltools_0.3.6 stringi_1.1.5 RCurl_1.95-4.8

References

[1]. Alexa,A., Rahnenführer,J. and Lengauer,T. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22, 1600–1607.
[2]. Timmons,J. A., Szkop,K.J. and Gallagher,I.J. (2015) Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol., 16, 186.

 

Advertisements

2 thoughts on “Gene-set enrichment analysis with topGO (part-1)

Add yours

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Powered by WordPress.com.

Up ↑

%d bloggers like this: