%\VignetteIndexEntry{topGO}
%\VignetteDepends{ALL, hgu95av2.db, genefilter, xtable, multtest, Rgraphviz}
%\VignetteKeywords{topGO, GO, graph}
%\VignettePackage{topGO}

\documentclass[a4paper, oneside, 10pt]{article}

\usepackage[pdftex]{graphicx}
\usepackage{calc}
\usepackage{sectsty}
\usepackage{caption}
\renewcommand{\captionfont}{\it\sffamily}
\renewcommand{\captionlabelfont}{\bf\sffamily}
\allsectionsfont{\sffamily}

% page style %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage[a4paper, left=25mm, right=20mm, top=20mm, bottom=25mm, nohead]{geometry}
\setlength{\parskip}{1.5ex}
\setlength{\parindent}{0cm}
\pagestyle{empty}


\usepackage{Sweave}
\SweaveOpts{prefix.string = tGO}


\title{\vspace*{-6ex} Gene set enrichment analysis with {\bf topGO}}
\author{Adrian Alexa,  J\"org Rahnenf\"uhrer}
\date{\today \\%
  \texttt{http://www.mpi-sb.mpg.de/$\sim$alexa}}






%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%% This document is the short version of the topGO manual
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{document}
\maketitle

%%\newpage
\tableofcontents
\newpage

<<echo = FALSE>>=
options(width = 95)
@ 

\section{Introduction}
\label{sampleSession}

This manuscript provides a quick tour into {\tt topGO}. There is an accompanying document which provides
details on the functions used in this document as well as showing more advance functionality
implemented in the {\tt topGO} package.

The {\tt topGO} package is designed to work with different test statistics and
with multiple GO graph algorithms, see~\cite{Alexa}. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This section describes a simple working-session using {\tt topGO}. There are only a handful of commands
necessary to perform a gene set enrichment analysis and we will be briefly presented them bellow.

A typical session can be divided into three steps:
\begin{enumerate}
\item Collection and preprocessing of the data. The list of genes and the correspondent gene' score,
  the criteria for selecting genes based on their scores and the gene-to-GO annotations are all
  collected to form a suitable R object.
  
\item Using the object created in the first step the user can perform enrichment analysis using various
  statistical tests and methods that deal with the GO topology.
  
\item Analysis of the results.
\end{enumerate}

{\it But before going through each of these steps we need to decide which biological question we want to answer.
This will most likely dictate which test statistic we need to use. The choice of the test is also restricted by
the data available at hand. The knowledge of the test will dictate the way the gene expression data needs to be
process.}
Here we will test for enrichment of GO terms with differentially expressed genes using the
Kolmogorov-Smirnov test and Fisher's exact test as examples.


\subsection{Step 1: Preparing the data}
In this step a convenient R object of class {\tt topGOdata} is created containing all the information
required for the remaining two steps. The user needs to provide a list of genes, GO annotations and
a criteria for selecting interesting genes.

In most cases we want to test enrichment of GO terms with differentially expressed genes. Thus, the 
starting point is a list of genes and the respective $p$-values for differential expression. A toy
example of a list of gene identifiers and the respective $p$-values is provided by the {\tt geneList}
object.

<<results = hide>>=
library(topGO)
library(ALL)
data(ALL)
data(geneList)
@ 

The {\tt geneList} data is based on an differential expression analysis of the ALL dataset.
It contains just a small number, $\Sexpr{length(geneList)}$, of genes with the corespondent $p$-values.
For these genes we need GO annotations to be able to form the gene groups. The information on where to
find the GO annotations is stored in the ALL object and its easily accessible.

<<>>=
affyLib <- paste(annotation(ALL), "db", sep = ".")
library(package = affyLib, character.only = TRUE)
@ 

The chip used in the experiment is the {\tt \Sexpr{annotation(ALL)}} from Affymetrix, as we can see 
from the {\tt affyLib} object. When we loaded the {\tt geneList} object a function which will select 
the differentially expressed genes from the list was also loaded under the name of {\tt topDiffGenes}.
For example using this function we can see that there are \Sexpr{sum(topDiffGenes(geneList))} genes
with a row $p$-value less than $0.01$ out of a total of \Sexpr{length(geneList)} genes.

<<>>=
sum(topDiffGenes(geneList))
@ 

We now have all the data necessary to build on object of type {\tt topGOdata}. This object will
contain all the gene identifiers and their score, the GO annotations, the GO hierarchical structure
and all the other information needed to perform the enrichment analysis. 

<<results = hide>>=
sampleGOdata <- new("topGOdata", 
                    description = "Simple session", ontology = "BP",
                    allGenes = geneList, geneSel = topDiffGenes,
                    nodeSize = 10,
                    annot = annFUN.db, affyLib = affyLib)
@ 

A summary of the {\tt sampleGOdata} object can be seen by typing the object name at the R prompt.
Having all the data stored into this object facilitates the access to identifiers, annotations and 
to obtain basic data statics.

<<results = hide>>=
sampleGOdata
@ 


\subsection{Step 2: Running the desired tests}
Once we have an object of class {\tt topGOdata} we can start with the enrichment analysis. 
Since for each gene we have a score and there is also a procedure to select interesting 
genes based on the scores we will use two types of test statistics: Fisher's exact test which 
is based on gene counts, and a Kolmogorov-Smirnov like test which needs gene scores.

The function {\tt runTest} is used to apply the specified enrichment test and method to the data.
It has three basic arguments. The first argument needs to be on object of class {\tt topGOdata}.
The second argument is of type character and specifies which method for dealing with the GO graph
structure will be used (this argument is optional). And the third argument specifies the test
statistic and is of type character.

First we perform a classical enrichment analysis by testing the over-representation of GO terms with
differentially expressed genes. In the classical case each GO category is tested independently. 

<<results = hide>>=
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")
@ 

{\tt runTest} returns an object of class {\tt topGOresult}. A short summary of this object is shown bellow.

<<>>=
resultFisher
@ 

Next we will test the enrichment using the Kolmogorov-Smirnov test. We will use the both the {\sf classic}
and the {\sf elim} methods.

<<results = hide>>=
resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")
resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")
@ 

Please note that some statistical tests will not work with any method. The compatibility matrix between
the methods and statistical tests is shown in Table~\ref{tabletopGO}.

The $p$-values computed by the {\tt runTest} function are unadjusted for multiple testing. We do note
advocate against adjusting the $p$-values of the tested groups, but in many cases the an adjusted $p$-value
can be misleading. 


\subsection{Step 3: Analysis of results}
After the enrichment tests are performed the researcher needs tools for analysing and interpreting the
results. {\tt GenTable} is an easy to use function for analysing the most significant GO terms and the
corresponding $p$-values. For example, we want to see which are the top $10$ significant GO terms identified 
by the {\sf elim} method. We also want to compare the ranks and the $p$-values of these GO terms in the
case where the {\sf classic} method was employe used.

<<>>=
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher, 
                   classicKS = resultKS, elimKS = resultKS.elim,
                   orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 10)
@ 

The {\tt GenTable} function returns a data.frame containing the top {\tt topNodes} GO terms identified
by the {\sf elim} algorithm, see {\tt orderBy} argument. The data.frame includes some statistics on the
GO terms and the $p$-values obtained from the methods passes as arguments. Table~\ref{tab:sampleGOresults}
shows the results.

\begin{table}[!t]
  \centering\resizebox{.99\linewidth}{!}{%
<<echo = FALSE, results = tex>>=
if(require(xtable))
  print(xtable(apply(allRes, 2, as.character)), floating = FALSE)
@
}\caption{Significance of GO terms according to {\sf classic} and {\sf elim} methods.}
\label{tab:sampleGOresults}
\end{table}


For accessing the GO term's $p$-values from a {\tt topGOresult} object the user should use the {\tt score}
functions. As a simple example, we look at the differences between the results of the {\sf classic} and
the {\sf elim} methods in the case of the Kolmogorov-Smirnov test. Due to the fact that there are few 
significant GO terms, $\Sexpr{sum(score(resultKS) <= 0.01)}$ terms with a $p$-value less than $0.01$
in the {\sf classic} method, one would expect that only few GO terms will have different $p$-values
between these two methods. This, of course, depends on the value of the cutoff parameter used by the 
{\sf elim} method.


\setkeys{Gin}{width=.4\linewidth}
\begin{figure}[!h]
  \centering 
<<fig = TRUE>>=
pValue.classic <- score(resultKS)
pValue.elim <- score(resultKS.elim)[names(pValue.classic)]
plot(pValue.classic, pValue.elim, xlab = "p-value classic", ylab = "p-value elim", cex = .5)
@ 
\caption{$p$-values scatter plot for the {\sf classic} and {\sf elim} methods.}
\label{scatterClassicElim}
\end{figure}

We can see in Figure~\ref{scatterClassicElim} that there are indeed few differences between
the two methods. Some GO terms witch are found significant by the {\sf classic} method are less 
significant in the {\sf elim}, as expected.

Another insightful way of looking at the results of the analysis is to investigate how
the significant GO terms are distributed over the GO graph. For the {\tt elim} algorithm
the subgraph induced by the $5$ most significant GO terms is plotted. In the plot, the
significant nodes are represented as boxes. The plotted graph is the upper induced graph
generated by these significant nodes. 

<<eval = FALSE>>=
showSigOfNodes(sampleGOdata, score(resultKS.elim), firstTerms = 5, useInfo = 'all')
@ 

<<results = hide, echo = FALSE>>=
printGraph(sampleGOdata, resultKS.elim, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "all", pdfSW = TRUE)
@ 

\begin{figure}[!ht]
\centering 
\includegraphics[width=1.05\linewidth]{tGO_elim_5_all}
\caption{The subgraph induced by the top $5$ GO terms identified by the {\sf elim} algorithm
  for scoring GO terms for enrichment. Rectangles indicate the $5$ most significant terms.
  Rectangle color represents the relative significance, ranging from dark red (most significant)
  to light yellow (least significant). Black arrows indicate {\it is-a} relationships and red
  arrows {\it part-of} relationships.
  For each node, some basic information is displayed. The first two lines show the  GO identifier
  and a trimmed GO name. In the third line the raw p-value is shown. The forth line is showing the 
  number of significant genes and the total number of genes annotated to the respective GO term. 
}
\label{fig:sampleGOelim}
\end{figure}
\clearpage




\clearpage

\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE,results=tex>>=
toLatex(sessionInfo())
@


\addcontentsline{toc}{section}{References} \label{references}


\bibliography{ref_books}
\bibliographystyle{apalike}


\end{document}
