% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{affycoretools Overview} % \VignetteDepends{affycoretools, affy, limma, hgfocuscdf} % \VignetteDepends{genefilter, annaffy,} % \VignetteKeywords{Expression Analysis, Postprocessing} % \VignettePackage{affycoretools} \SweaveOpts{keep.source=TRUE} \documentclass[11pt]{article} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{times} \usepackage{comment} \parindent 0.5in \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{\textit{#1}} \bibliographystyle{plainnat} \begin{document} \title{\bf Using affycoretools} \author{James W. MacDonald} \maketitle \section{Overview} This package is made up of various 'wrapper' functions that I use to help automate some of the more routine aspects of my job as a Biostatistician in a microarray core facility. Since the vast majority of analyses I do are based on data from Affymetrix GeneChips, the focus of these functions are also directed towards this platform. This does not however preclude these functions from being extended to other platforms including other oligonucleotide arrays as well as spotted cDNA arrays. \section{Introduction} For most analyses, I follow the precepts of \textit{Literate Statistical practice} \citet{dsc:Rossini:2001}. The basic idea being that the document used to present an analysis is also what is used to \textit{do} the analysis. To do this, I use Emacs/ESS (\textit{Emacs speaks statistics}) \citet{ESS} and an \Rfunction{Sweave} document \citet{Leisch:2002}. An \Rfunction{Sweave} document is a file (with an .Rnw extension) that contains text and \LaTeX{} markup that will be used to create (usually) a PDF document, as well as R code that will be used to create plots or tables in the resulting document, and/or to provide finished output suitable for presentation to your client(s). Usually the \Rfunction{Sweave} document does both. Note that this vignette (as are most BioConductor vignettes) is produced using an \Rfunction{Sweave} document. The learning curve for \LaTeX{} and the other markup required to create a functional \Rfunction{Sweave} document can be steep. However, it is well worth the effort to learn for anybody who is routinely required to do statistical analyses and present the results to others. For anybody interested in learning about \Rfunction{Sweave}, the two best sources of information (in my opinion) are the \href{http://www.ci.tuwien.ac.at/~leisch/Sweave}{Sweave User Manual}, and just about any BioConductor vignette. The .Rnw file for most BioC vignettes can be found in the R-Home/library//doc directory. In addition, there is an example \Rfunction{Sweave} document in the example directory for this package that is the basis for most of the analyses I do. Because I do all of my analyses using \Rfunction{Sweave}, most of the functions in this package are designed for both interactive and non-interactive use. In addition, most functions will output information that may be useful to present in the resulting text document. \section{Interactive Analyses} \subsection{Quality Control} For these examples, I will be using some data that were generated in our microarray core. The experiment is a simple comparison of two different cell lines, one of which is sensitive to a particular treatment, whereas the other is not. One set of samples was prepared using the Affymetrix \textit{in vitro translation} kit, whereas the other set of samples was prepared using the NuGen Ovation kit. There are three biological replicates for each sample type. The celfiles can be found in the examples directory of this package (R-home/library/affycoretools/example). For some analyses it may not be necessary to generate a report detailing the analysis, or you simply may want to do a quick quality check to ensure the raw data are of high enough quality to proceed with the analysis. In this case we can just do some quality control plots and compute expression measures using \Rfunction{affystart}. The \Rfunction{affystart} function may be used to compute \Rfunction{rma}, \Rfunction{gcrma} or \Rfunction{mas5} expression values. In the case of \Rfunction{mas5} expression values, the output (written to a text file in the working directory) includes the P/M/A calls and associated $p$-values. <>= options(width = 70) @ <>= library("affycoretools") pd <- new("AnnotatedDataFrame", data = read.table("pdata.txt", header = TRUE, row.names = 1)) eset <- affystart(groups = rep(1:4, each = 3), groupnames = unique(paste(pData(pd)[,1], pData(pd)[,2], sep = "-")), phenoData = pd) @ <>= library(affycoretools) load("abatch.Rdata") load("exprSet.Rdata") @ \begin{figure} \centering <>= plotHist(dat) @ \caption{Density plot} \label{fig:hist} \end{figure} \begin{figure} \centering <>= plotDeg(dat) @ \caption{RNA degradation plot} \label{fig:deg} \end{figure} \begin{figure} \centering <>= pd <- new("AnnotatedDataFrame", data = read.table("pdata.txt", header = TRUE, row.names = 1)) sampleNames(pd) <- sampleNames(eset) plotPCA(eset, groups = rep(1:4, each = 3), groupnames = unique(paste(pData(pd)[,1], pData(pd)[,2], sep = "-"))) phenoData(eset) <- pd @ \caption{PCA plot} \label{fig:pca} \end{figure} \begin{figure} \centering <>= plotPCA(eset, screeplot = TRUE) @ \caption{Screeplot} \label{fig:scree} \end{figure} Figure~\ref{fig:hist} is the usual density plot -- we have found that this plot is one of the more informative quality control plots available, at least for \Rfunction{rma}. Any chips with high background (curve shifted to the right) invariably need to be re-done, which in this case usually means re-fragmenting the cRNA and re-hybridizing to a new chip. With these data I expect much more variability due to the differences in the \textit{IVT} kits that were used for the two sample sets. However, a case could be made that sample12 needs to be re-done. Figure~\ref{fig:deg} is an RNA degradation plot -- this is supposed to give some idea of how much degradation of mRNA occured, and how well the \textit{IVT} step went. This plot is moderately useful, but not nearly as informative as the density plot. Here we can see that the slope of the lines for the two groups is quite different, indicating that the two \textit{IVT} kits give distinctly different results. Figure~\ref{fig:pca} is a plot of the first two principal components from a principal components analysis (PCA). Basically, this is used to show the overall structure of the data. This is another very useful plot. In most cases we expect replicate samples to group together, indicating general similarity in overall expression patterns. It may be difficult however to determine from this plot how closely samples are grouping -- for instance, the NuGen samples appear to be quite well separated on the $y$-axis. To determine how meaningful this separation is, we need a \Robject{screeplot}. Figure~\ref{fig:scree} shows the \Robject{screeplot} for this PCA. Each bar shows how much of the overall variance is captured by each principal component. Here we can see that the first PC captures the vast majority of the variance, which indicates that the separation of the samples on the $y$-axis (the second PC) is actually quite small, so the samples are grouping fairly tightly. The \Rfunction{affystart} function calls three other functions to make these plots (\Rfunction{plotHist}, \Rfunction{plotDeg}, and \Rfunction{plotPCA}), which can all be called individually to make just one of these plots, or in the case of \Rfunction{plotPCA}, to make either the PCA or the \Robject{screeplot}. \subsection{Computing Differential Expression} After checking the quality control plots (and maybe looking at other QC plots that are available in the \Rpackage{affyPLM} package), the next step is to make comparisons and output lists of differentially expressed genes. Because of the obvious differences between the two sample sets, it is probably preferable to compute expression values separately and then combine the data. <>= eset1 <- affystart(filenames = list.celfiles()[1:6], plot = FALSE, pca = FALSE) eset2 <- affystart(filenames = list.celfiles()[7:12], plot = FALSE, pca = FALSE) eset <- new("ExpressionSet", exprs = cbind(exprs(eset1), exprs(eset2)), phenoData = pd, annotation = annotation(eset1)) @ I do most of my analyses using the \Rpackage{limma} package. I find that this package is capable of analyzing most of the experiments that I see. I also like to use the \Rpackage{annaffy} package for creating output to give to my clients. The HTML tables that can be produced using this package can either be posted on the web or an intranet, or simply emailed to the client. Because I use both of these packages together on a regular basis, some of my functions are designed to link the results from a \Rpackage{limma} analysis to the \Rpackage{annaffy} package. The data set we are using for this vignette was originally produced in order to see how comparable the results from the two \textit{IVT} kits were. One way to make this comparison is to fit a linear model, compute contrasts of sensitive and insensitive samples for each sample set, and then look for genes that are significant in both contrasts. First, we filter the data, removing those genes that appear not to be expressed in either sample. The criterion here is at least three of the samples have to have expression values greater than $2^6$. <<>>= library(genefilter) f1 <- kOverA(3, 6) filt <- filterfun(f1) index <- genefilter(eset, filt) eset <- eset[index,] @ After filtering out the 'unexpressed' genes, we fit a linear model and extract contrasts of interest. Explaining the following code is beyond the scope of this vignette -- for more information on fitting linear models using \Rpackage{limma}, please see the ``LIMMA User's guide''. <<>>= library(limma) grps <- paste(pData(eset)[,1], pData(eset)[,2], sep = ".") design <- model.matrix(~ 0 + factor(grps)) colnames(design) <- levels(factor(grps)) ugrps <- unique(grps) contrasts <- matrix(c(1, -1, 0, 0, 0, 0, 1, -1), ncol = 2, dimnames = list(ugrps, paste(ugrps[c(1,3)], ugrps[c(2,4)], sep = " - "))) fit <- lmFit(eset, design) fit2 <- contrasts.fit(fit, contrasts) fit2 <- eBayes(fit2) @ Printing out the design and contrast matrices may be helpful: <<>>= design contrasts @ Once we have fit the model and extracted contrasts of interest, the next step is to output some results. We might first want to look at a Venn diagram that shows how many genes were differentially expressed in each sample. Note here that if we use the \Rfunction{vennCounts} function in \Rpackage{limma} with \Rfunarg{include} = ``both'', then we will select genes that are significant in both comparisons, but without requiring the genes be differentially expressed in the same direction. In this case it does not make sense to count a gene as being differentially expressed in both sample sets unless the direction is the same. In other words, if a given gene appears to be upregulated in the sensitive samples when we use the Affy \textit{IVT} kit, but downregulated in the sensitive samples when we use the NuGen Ovation kit, it does not make sense to say that the results agree (e.g., are in the intersection of the Venn diagram). Therefore, we will use the \Rfunction{vennCounts2} function, with \Rfunarg{method} = ``same'', which will require the same direction as well. <<>>= rslt <- decideTests(fit2) vc <- vennCounts2(rslt, method = "same") vennDiagram(vc, cex = 0.8) @ \begin{figure} \centering <>= vennDiagram(vc, cex = 0.6) @ \caption{Venn Diagram} \label{fig:venn} \end{figure} Figure~\ref{fig:venn} shows the Venn diagram for this analysis. At this point we may wish to output lists of the genes that are unique to each comparison, as well as the genes that are common to both. To do this, we use the \Rfunction{vennSelect} function. <>= vennSelect(eset, design, rslt, contrasts, fit2) @ This will output both HTML and text files containing the gene names, links to various online databases (for the HTML files), and the expression values for the samples in question. The file names will be extracted from the column names of the \Robject{TestResults} object (produced as a result of calling \Rfunction{decideTests} above). Note that \Rfunction{decideTests} uses the column names of the contrasts matrix to make the column names of the \Robject{TestResults} object, so it is important to set up the contrasts matrix with reasonable names. Reasonable being defined here as: \begin{itemize} \item Something that will make sense as the name for the resulting tables. \item Names that will be acceptable as part of a filename for your particular operating system. \end{itemize} Alternatively, we may simply want to output lists of genes that are significant in each of the contrasts at a given $p$-value and/or fold change. <>= limma2annaffy(eset, fit2, design, contrasts, annotation(eset), pfilt = 0.05) @ This will output two HTML tables containing all genes that are significant at an adjusted $p$-value of 0.05 (default multiplicity correction using false discovery rate \citet{Benjamini:1995}). The gene lists will be sorted in descending $p$-value order, so theoretically the more 'interesting' genes will be at the top of the list. We have the option of outputting text files as well. I generally do so, because it is not uncommon for my clients to want to open these files in a spreadsheet program and do some further exploration, and the HTML tables tend not to work well. Another analysis that we may wish to perform (although it doesn't make much sense here), is to look for Gene Ontology terms that are 'enriched' in the set of significant genes. The \Rpackage{GOstats} package is quite useful for this sort of analysis, but the output is not always as compact as one might like. My clients generally just want to see a list of GO terms that are enriched, as well as the $p$-values associated with each term. We can get the Affy probe IDs for the genes in the intersection of the Venn diagram, and then use those IDs to look for GO terms that are 'enriched' in that set of of probes. <>= index1 <- vennSelect(x = rslt, indices.only = TRUE)[[3]] probids <- unique(getLL(featureNames(eset)[index1], annotation(eset))) univ <- unique(getLL(featureNames(eset), annotation(eset))) params <- new("GOHyperGParams", geneIds = probids, universeGeneIds = univ, annotation = annotation(eset), conditional = TRUE, ontology = "MF") hyp <- hyperGTest(params) htmlReport(hyp, file = "GO MF terms.html", categorySize = 10) @ This function outputs an HTML file that can be opened using a web browser. \section{Non-Interactive Analyses} It is relatively simple to write a vignette for interactive analysis using a given package. It is not simple to do the same for a non-interactive analysis because by definition there is no active interaction with R. Therefore, instead of trying to explain things in this vignette, I have placed an example \Rfunction{Sweave} document in the examples directory of this package (R-home/library/affycoretools/examples) that will re-create the above analyses and output a PDF file as well as the HTML and text files. Between this example and the Sweave User Manual, it should be relatively straightforward to figure out how to do something similar for your own analyses. \bibliography{affycoretools} \end{document}