% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{HowTo: Use the edd package} %\VignetteKeywords{Expression Analysis} %\VignetteDepends{edd} %\VignettePackage{edd} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \bibliographystyle{plainnat} \begin{document} @ \title{HowTo Use the Bioconductor {\tt edd} package} \author{Vince Carey {\tt stvjc@channing.harvard.edu}} \maketitle \tableofcontents \section{Introduction} {\it edd} is a package that assists with one aspect of exploratory data analysis for microarrays. The basic question addressed in {\it edd} is the variety of shapes of gene-specific distributions of expression in collections of microarrays. Use of the package is most sensible when there are numerous arrays obtained under the same experimental condition or for a given clinical condition. The key idea is that marginal gene-specific distributions may have a relatively number of different qualitative shapes, some of which may be of considerable substantive interest (e.g., multimodal shapes), and some of which may be of methodologic importance (e.g., when one group of subjects has a skewed distribution for a gene, and another has a symmetric distribution for the same gene, use of a log transform is counterindicated). In this brief HOWTO, we illustrate directly the use of the {\it edd} package. We will investigate the diversity of distributions in the two main groups of Golub's leukemia dataset. \section{Important caveat} The {\tt edd} function will transform all gene-specific expression distributions to have common location and scale. This process can make noise have the appearance of signal. Before using edd, remove all genes that have small variability. See the next section for an example of this filtering process. \section{Distributional shapes in Golub's data} First we attach the necessary libraries and data frames. {\it edd} will require the {\it golubEsets} library. <>= library(edd) library(golubEsets) library(xtable) data(Golub_Merge) @ \subsection{Filtering out genes with low variation} Next we filter the Golub data to require reasonable dispersion (confine attention to upper half sample defined by size of MAD) and reasonable expression (confine attention to genes with minimum expression level 300). <>= madvec <- apply(exprs(Golub_Merge),1,mad) minvec <- apply(exprs(Golub_Merge),1,min) keep <- (madvec > median(madvec)) & (minvec > 300) gmfilt <- Golub_Merge[keep==TRUE,] @ \subsection{Forming stratum-specific ExpressionSets} Finally we split the dataset into the ALL and AML samples: <>= ALL <- gmfilt$ALL.AML=="ALL" gall <- gmfilt[,ALL==TRUE] gaml <- gmfilt[,ALL==FALSE] show(gall) @ \subsection{Running edd} We will apply edd using an nnet classifier with the default reference catalog. See the edd-Details vignette for information about the reference catalog. <>= set.seed(12345) alldists <- edd(gall, meth="nnet", size=10, decay=.2) amldists <- edd(gaml, meth="nnet", size=10, decay=.2) <>= greo <- match(sapply(eddDistList,tag),names(table(alldists))) if (any(ii <- is.na(greo))) greo = greo[-which(ii)] @ An example of the results is given by the classification calls for the first 5 genes in the filtered ExpressionSet: <>= gn <- row.names(exprs(gall)) alld <- as.character(alldists)[1:5] names(alld) <- gn[1:5] print(alld) @ We can use edd with other classification methods. <>= set.seed(123) alldistsKNN <- edd(gall, meth="knn", k=1, l=0) alldistsTEST <- edd(gall, meth="test", thresh=.3) @ The agreement between nnet and knn procedures is not exact. See table \ref{conc1}. Choice between these methods and selection of tuning parameters is context-dependent. <>= cap <- "Comparison of distribution shape classification by nnet (rows) and by knn (columns) methods in edd." print(try(xtable( latEDtable(table(alldists,alldistsKNN),reorder=greo),digits=rep(0,length(table(alldists))+1),caption=cap, label="conc1"))) @ The test procedure is the only one at present that allows an outcome of 'doubt'. <>= print(table(alldistsTEST)) @ \subsection{Assessing the results} We can assess the relative frequencies of the different shapes in the ALL samples with a table, see Table \ref{marg1}. <>= cap <- "Frequencies of distributional shapes in filtered ALL data." print(xtable(latEDtable( table(alldists),reorder=greo),digits=rep(0,length(table(alldists))+1),caption=cap,label="marg1")) @ We can use barplots also; see Figure \ref{compfig}. \begin{figure} \setkeys{Gin}{width=1.05\textwidth} <>= mymat <- matrix(c(1,2),nr=1) layout(mymat)#,heights=c(lcm(6),lcm(6)),widths=c(lcm(6),lcm(6)),respect=TRUE) barplot(table(alldists),las=2) barplot(table(amldists),las=2) @ \caption{Compositions of distributional shapes within strata.} \label{compfig} \end{figure} Discordance between distributional shapes in gene expression for the AML and ALL groups can be assessed using the cross-classification, see Table \ref{disco1}. <>= cap <- "Rows are gene-specific distribution shapes for ALL, columns for AML, and cell entries are counts of genes." print(xtable(latEDtable(table(alldists,amldists),reord=greo),cap=cap, label="disco1")) @ Let's see what these discordances mean. To begin, let's get some indices for genes with bimodally shaped expression distribution for ALL, but approximately gaussian expression distribution for AML: <>= print((1:540)[alldists==".75N(0,1)+.25N(4,1)" & amldists=="N(0,1)"][1:5]) @ We consider the gene with probe D87953\_at. The top left panel gives the model (solid density trace) and a kernel density estimate applied to the expression levels among ALL patients, and the top right is the corresponding histogram. \begin{figure} \setkeys{Gin}{width=0.95\textwidth} <>= par(mfrow=c(2,2)) plotED(MIXN1,data=exprs(gall)[65,]) hist(exprs(gall)[65,],xlim=c(0,3500)) plotED(N01,data=exprs(gaml)[65,]) hist(exprs(gaml)[65,],xlim=c(0,3500)) @ \caption{Two models for D87953\_at in ALL and AML patients.} \end{figure} While the specific mixture model used as reference is not a perfect fit to the ALL data, the neural net classifier was sensitive to the bimodality. The Gaussian model does not seem particularly appropriate for the AML data, but was the closest match in the reference catalog. \section{Extending the reference catalog} The reference catalog supplied with edd has components <>= names(eddDistList) @ There is nothing sacred about this set. Let's consider its scope (we'll look at 8 of nine reference distributions): \begin{figure} <>= par(mfrow=c(4,2)) for (i in 1:8) plotED(eddDistList[[i]]) @ \caption{Eight of the reference distributions in the eddDistList supplied with {\it edd}.} \end{figure} From the example above we see that it might be useful to have a mixture of Gaussians with modes separated by 6SD. To add such a model we construct an instance of the {\tt eddDist} class: <>= #> median(rmixnorm(10000,.75,0,1,6,1)) #[1] 0.4337298 #> mad(rmixnorm(10000,.75,0,1,6,1)) #[1] 1.550768 -- these are resistant stats! MIXN3 <- new("eddDist", stub="mixnorm", parms=c(p1=.75,m1=0,s1=1,m2=6,s2=1), median=.43, mad=1.55, tag=".75N(0,1)+.25N(6,1)", plotlim=c(-3,11), latexTag="$\\frac{3}{4}\\Phi+\\frac{1}{4}\\Phi_{6,1}$") eddDistList[["MIXN3"]] <- MIXN3 set.seed(12345) alldists2 <- edd(gall, meth="nnet", size=10, decay=.2) print(alldists2[65]) @ The symbol MIXN3 used to name the list element is arbitrary, as are the values of the tag and latexTag slots. But the user should choose meaningful values for those items. The new reference distribution is used for classification of probe D87953\_at. The two fits for the different mixtures are shown in Figures \ref{f3}, \ref{f1}. \begin{figure} <>= plotED(MIXN3,data=exprs(gall)[65,]) @ \caption{Reference catalog element: mixture with modes separated by 6SD. Superimposed is the kernel smooth of centered/scaled and then translated data for D87953\_at.} \label{f3} \end{figure} \begin{figure} <>= plotED(MIXN1,data=exprs(gall)[65,]) @ \caption{Reference catalog element: mixture with modes separated by 3SD. Superimposed is the kernel smooth of centered/scaled and then translated data for D87953\_at.} \label{f1} \end{figure} @ \end{document}