% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{affycoretools Overview}
% \VignetteDepends{affycoretools, affy, limma, hgfocuscdf}
% \VignetteDepends{genefilter, annaffy,}
% \VignetteKeywords{Expression Analysis, Postprocessing}
% \VignettePackage{affycoretools}
\SweaveOpts{keep.source=TRUE}
\documentclass[11pt]{article}


\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{times}
\usepackage{comment}

\parindent 0.5in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{\textit{#1}}

\bibliographystyle{plainnat}


\begin{document}

\title{\bf Using affycoretools}

\author{James W. MacDonald}

\maketitle



\section{Overview}

This package is made up of various 'wrapper' functions that I use 
to help automate some of the more routine aspects of my job as a 
Biostatistician in a microarray core facility. Since the vast majority 
of analyses I do are based on data from Affymetrix GeneChips, the 
focus of these functions are also directed towards this platform. This does
not however preclude these functions from being extended to other platforms
including other oligonucleotide arrays as well as spotted cDNA arrays.

\section{Introduction}

For most analyses, I follow the precepts of \textit{Literate Statistical practice} 
\citet{dsc:Rossini:2001}. The basic idea being that the document 
 used to present an analysis is also what is used to \textit{do} the
analysis. To do this, I use Emacs/ESS (\textit{Emacs speaks statistics}) 
\citet{ESS} and an \Rfunction{Sweave} document \citet{Leisch:2002}.

An \Rfunction{Sweave} document is a file (with an .Rnw extension) that
contains text and \LaTeX{} markup that will be used
to create (usually) a PDF document, as well as R code that will be used 
to create plots or tables in the resulting document, and/or to provide
finished output suitable for presentation to your client(s). Usually the
\Rfunction{Sweave} document does both. Note that this vignette (as are most
BioConductor vignettes) is produced using an \Rfunction{Sweave} document.

The learning curve for \LaTeX{} and the other markup required to create a 
functional \Rfunction{Sweave} document can be steep. However, it is well worth
the effort to learn for anybody who is routinely required to do statistical
analyses and present the results to others. For anybody interested in learning
about \Rfunction{Sweave}, the two best sources of information (in my opinion)
are the \href{http://www.ci.tuwien.ac.at/~leisch/Sweave}{Sweave User Manual},
and just about any BioConductor vignette. The .Rnw file for most BioC vignettes
can be found in the R-Home/library/<packagename>/doc directory. In addition,
there is an example \Rfunction{Sweave} document in the example directory for this 
package that is the basis for most of the analyses I do.

Because I do all of my analyses using \Rfunction{Sweave}, most of the functions
in this package are designed for both interactive and non-interactive use. In addition,
most functions will output information that may be useful to present in the resulting
text document.

\section{Interactive Analyses}
\subsection{Quality Control}

For these examples, I will be using some data that were generated in our microarray
core. The experiment is a simple comparison of two different cell lines, one of 
which is sensitive to a particular treatment, whereas the other is not. One set of
samples was prepared using the Affymetrix \textit{in vitro translation} kit, whereas
the other set of samples was prepared using the NuGen Ovation kit.  There are
three biological replicates for each sample type. The celfiles can be found in
the examples directory of this package (R-home/library/affycoretools/example).

For some analyses it may not be necessary to generate a report detailing the analysis,
or you simply may want to do a quick quality check to ensure the raw data
are of high enough quality to proceed with the analysis. In this case we can just
do some quality control plots and compute expression measures using \Rfunction{affystart}.

The \Rfunction{affystart} function may be used to compute \Rfunction{rma}, \Rfunction{gcrma}
or \Rfunction{mas5} expression values. In the case of \Rfunction{mas5} expression values, the
output (written to a text file in the working directory) includes the P/M/A calls
and associated $p$-values.

<<echo=false>>=
options(width = 70)
@ 

<<eval = false, width = 6>>=
library("affycoretools")
pd <- new("AnnotatedDataFrame", 
          data = read.table("pdata.txt", header = TRUE, row.names = 1))
eset <- affystart(groups = rep(1:4, each = 3), 
                  groupnames = unique(paste(pData(pd)[,1], 
                  pData(pd)[,2], sep = "-")), 
                  phenoData = pd)
@ 

<<echo = false, results=hide>>=
library(affycoretools)
load("abatch.Rdata")
load("exprSet.Rdata")
@ 

\begin{figure}
\centering
<<fig=true, width=6, height=6, echo = false>>=
plotHist(dat)
@ 
\caption{Density plot}
\label{fig:hist}
\end{figure}

\begin{figure}
\centering
<<fig=true, width=6, height=6, echo = false>>=
plotDeg(dat)
@ 
\caption{RNA degradation plot}
\label{fig:deg}
\end{figure}

\begin{figure}
\centering
<<fig=true, width=6, height=6, echo = false>>=
pd <- new("AnnotatedDataFrame", 
          data = read.table("pdata.txt", header = TRUE, row.names = 1))
sampleNames(pd) <- sampleNames(eset)
plotPCA(eset, groups = rep(1:4, each = 3), 
        groupnames = unique(paste(pData(pd)[,1], pData(pd)[,2], sep = "-")))
phenoData(eset) <- pd
@ 
\caption{PCA plot}
\label{fig:pca}
\end{figure}

\begin{figure}
\centering
<<fig = true, width = 6, height = 6, echo = false>>=
plotPCA(eset, screeplot = TRUE)
@ 
\caption{Screeplot}
\label{fig:scree}
\end{figure}

Figure~\ref{fig:hist} is the usual density plot -- we have found that this plot is
one of the more informative quality control plots available, at least for \Rfunction{rma}.
Any chips with high background (curve shifted to the right) invariably need to be re-done, 
which in this case usually means re-fragmenting the cRNA and re-hybridizing to a new chip.
With these data I expect much more variability due to the differences in the \textit{IVT}
kits that were used for the two sample sets. However, a case could be made that sample12 
needs to be re-done.


Figure~\ref{fig:deg} is an RNA degradation plot -- this is supposed to give some idea of how
much degradation of mRNA occured, and how well the \textit{IVT} step went. 
This plot is moderately useful, but not nearly as informative as the density plot. Here
we can see that the slope of the lines for the two groups is quite different, indicating
that the two \textit{IVT} kits give distinctly different results.

Figure~\ref{fig:pca} is a plot of the first two principal components from a principal components
analysis (PCA). Basically, this is used to show the overall structure of the data. This is another 
very useful plot. In most cases we expect replicate samples to group together, indicating general
similarity in overall expression patterns. It may be difficult however to determine from this plot
how closely samples are grouping -- for instance, the NuGen samples appear to be quite well separated
on the $y$-axis. To determine how meaningful this separation is, we need a \Robject{screeplot}.

Figure~\ref{fig:scree} shows the \Robject{screeplot} for this PCA. Each bar shows how much of the 
overall variance is captured by each principal component. Here we can see that the first PC captures
the vast majority of the variance, which indicates that the separation of the samples on the $y$-axis
(the second PC) is actually quite small, so the samples are grouping fairly tightly.

The \Rfunction{affystart} function calls three other functions to make these plots (\Rfunction{plotHist},
\Rfunction{plotDeg}, and \Rfunction{plotPCA}), which can all be called individually to make just one of
these plots, or in the case of \Rfunction{plotPCA}, to make either the PCA or the \Robject{screeplot}.

\subsection{Computing Differential Expression}

After checking the quality control plots (and maybe looking at other QC plots that are
available in the \Rpackage{affyPLM} package), the next step is to make comparisons and output
lists of differentially expressed genes. Because of the obvious differences between the
two sample sets, it is probably preferable to compute expression values separately and
then combine the data.


<<eval = FALSE>>=
eset1 <- affystart(filenames = list.celfiles()[1:6], 
                   plot = FALSE, pca = FALSE)
eset2 <- affystart(filenames = list.celfiles()[7:12], 
                   plot = FALSE, pca = FALSE)
eset <- new("ExpressionSet", 
            exprs = cbind(exprs(eset1), exprs(eset2)), 
            phenoData = pd,
            annotation = annotation(eset1))
@



I do most of my analyses using the \Rpackage{limma} package. I find that this package is
capable of analyzing most of the experiments that I see. I also like to use the
\Rpackage{annaffy} package for creating output to give to my clients. The HTML tables
that can be produced using this package can either be posted on the web or an intranet, or
simply emailed to the client. Because I use both of these packages together on a regular
basis, some of my functions are designed to link the results from a \Rpackage{limma} analysis
to the \Rpackage{annaffy} package.

The data set we are using for this vignette was originally produced in order to see how
comparable the results from the two \textit{IVT} kits were. One way to make this 
comparison is to fit a linear model, compute contrasts of sensitive and insensitive
samples for each sample set, and then look for genes that are significant in 
both contrasts.

First, we filter the data, removing those genes that appear not to be expressed in
either sample. The criterion here is at least three of the samples have to have 
expression values greater than $2^6$. 

<<>>=
library(genefilter)
f1 <- kOverA(3, 6)
filt <- filterfun(f1)
index <- genefilter(eset, filt)
eset <- eset[index,]
@ 

After filtering out the 'unexpressed' genes, we fit a linear model and extract
contrasts of interest. Explaining the following code is beyond the scope of this
vignette -- for more information on fitting linear models using \Rpackage{limma}, 
please see the ``LIMMA User's guide''.

<<>>=
library(limma)
grps <- paste(pData(eset)[,1], 
              pData(eset)[,2], sep = ".")
design <- model.matrix(~ 0 + factor(grps))
colnames(design) <- levels(factor(grps))
ugrps <- unique(grps)
contrasts <- matrix(c(1, -1, 0, 0, 0, 0, 1, -1),
                    ncol = 2, dimnames = list(ugrps,
                              paste(ugrps[c(1,3)], ugrps[c(2,4)],
                                    sep = " - ")))
fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2)
@ 

Printing out the design and contrast matrices may be helpful:

<<>>=
design
contrasts
@

Once we have fit the model and extracted contrasts of interest, the next step is to 
output some results. We might first want to look at a Venn diagram that shows how many
genes were differentially expressed in each sample. Note here that if we use the 
\Rfunction{vennCounts} function in \Rpackage{limma} with \Rfunarg{include} = ``both'',
then we will select genes that are significant in both comparisons, but without
requiring the genes be differentially expressed in the same direction. In this case
it does not make sense to count a gene as being differentially expressed in both
sample sets unless the direction is the same. In other words, if a given gene
appears to be upregulated in the sensitive samples when we use the Affy \textit{IVT}
kit, but downregulated in the sensitive samples when we use the NuGen Ovation kit,
it does not make sense to say that the results agree (e.g., are in the intersection
of the Venn diagram). Therefore, we will use the \Rfunction{vennCounts2} function,
with \Rfunarg{method} = ``same'', which will require the same direction as well.

<<>>=
rslt <- decideTests(fit2)
vc <- vennCounts2(rslt, method = "same")
vennDiagram(vc, cex = 0.8)
@ 

\begin{figure}
\centering
<<fig = true, width = 6, height = 6, echo = false>>=
vennDiagram(vc, cex = 0.6)
@ 
\caption{Venn Diagram}
\label{fig:venn}
\end{figure}

Figure~\ref{fig:venn} shows the Venn diagram for this analysis.

At this point we may wish to output lists of the genes that are unique to each
comparison, as well as the genes that are common to both. To do this, we use
the \Rfunction{vennSelect} function.

<<eval = false>>=
vennSelect(eset, design, rslt, contrasts, fit2)
@ 

This will output both HTML and text files containing the gene names, links to 
various online databases (for the HTML files), and the expression values for
the samples in question. The file names will be extracted from the column names
of the \Robject{TestResults} object (produced as a result of calling \Rfunction{decideTests}
above). Note that \Rfunction{decideTests} uses the column names of the contrasts
matrix to make the column names of the \Robject{TestResults} object, so it is 
important to set up the contrasts matrix with reasonable names. Reasonable being
defined here as:

\begin{itemize}
\item Something that will make sense as the name for the resulting tables.
\item Names that will be acceptable as part of a filename for your particular operating 
system.
\end{itemize}

Alternatively, we may simply want to output lists of genes that are significant in
each of the contrasts at a given $p$-value and/or fold change.

<<eval = false>>=
limma2annaffy(eset, fit2, design, 
              contrasts, annotation(eset), 
              pfilt = 0.05)
@ 

This will output two HTML tables containing all genes that are significant at an 
adjusted $p$-value of 0.05 (default multiplicity correction using false discovery
rate \citet{Benjamini:1995}). The gene lists will be sorted in descending $p$-value 
order, so theoretically
the more 'interesting' genes will be at the top of the list. We have the option
of outputting text files as well. I generally do so, because it is not uncommon
for my clients to want to open these files in a spreadsheet program and do some 
further exploration, and the HTML tables tend not to work well.

Another analysis that we may wish to perform (although it doesn't make much sense
here), is to look for Gene Ontology terms that are 'enriched' in the set of 
significant genes. The \Rpackage{GOstats} package is quite useful for this 
sort of analysis, but the output is not always as compact as one might like.
My clients generally just want to see a list of GO terms that are enriched, as
well as the $p$-values associated with each term.

We can get the Affy probe IDs for the genes in the intersection of the 
Venn diagram, and then use those IDs to look for GO terms that are 'enriched'
in that set of of probes.

<<eval = false>>=
index1 <- vennSelect(x = rslt, indices.only = TRUE)[[3]]
probids <- unique(getLL(featureNames(eset)[index1], 
                        annotation(eset)))
univ <- unique(getLL(featureNames(eset),
                     annotation(eset)))
params <- new("GOHyperGParams", geneIds = probids,
              universeGeneIds = univ, 
              annotation = annotation(eset),
              conditional = TRUE, ontology = "MF")
hyp <- hyperGTest(params)
htmlReport(hyp, file = "GO MF terms.html",
           categorySize = 10)
@ 

This function outputs an HTML file that can be opened using a web browser.

\section{Non-Interactive Analyses}

It is relatively simple to write a vignette for interactive analysis using
a given package. It is not simple to do the same for a non-interactive
analysis because by definition there is no active interaction with R. 
Therefore, instead of trying to explain things in this vignette, I have 
placed an example \Rfunction{Sweave} document in the examples directory
of this package (R-home/library/affycoretools/examples) that will re-create
the above analyses and output a PDF file as well as the HTML and text files.
Between this example and the Sweave User Manual, it should be relatively 
straightforward to figure out how to do something similar for your own
analyses.

\bibliography{affycoretools}

\end{document}
