%\VignetteIndexEntry{VanillaICE Vignette}
%\VignetteKeywords{copy number, genotype, SNP}
%\VignettePackage{VanillaICE}
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{color}

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\R}{\textsf{R}}
\newcommand{\hmmoptions}{\Robject{HmmOptions}}
\newcommand{\hmmparam}{\Robject{HmmParameter}}

\newcommand{\cne}{\widehat{\text{CN}}}
\newcommand{\gte}{\widehat{\text{GT}}}
\newcommand{\gtehom}{\widehat{\text{HOM}}}
\newcommand{\gtehet}{\widehat{\text{HET}}}
\newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}}
\newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}}
\newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}}
\newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}}
\newcommand{\thom}{\text{HOM}}
\newcommand{\thet}{\text{HET}}
\newcommand{\bDelta}{\mbox{\boldmath $\Delta$}}
\newcommand{\real}{\mbox{$\mathbb R$}}      % real numbers
\newcommand{\bnu}{\mbox{\boldmath $\nu$}}
\newcommand{\ice}{\Rpackage{VanillaICE}}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\begin{document}
\title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal
  Alterations using High-throughput SNP Arrays}
\author{Robert Scharpf}
\maketitle


<<setup, echo=FALSE>>=
options(width=80)
@ 

\begin{abstract}
  Chromosomal DNA is characterized by variation between individuals at
  the level of entire chromosomes (e.g. aneuploidy in which the
  chromosome copy number is altered), segmental changes (including
  insertions, deletions, inversions, and translocations), and changes
  to small genomic regions (including single nucleotide
  polymorphisms). A variety of alterations that occur in chromosomal
  DNA, many of which can be detected using high density single
  nucleotide polymorphism (SNP) microarrays, are linked to normal
  variation as well as disease and therefore of particular
  interest. These include changes in copy number (deletions and
  duplications) and genotype (e.g. the occurrence of regions of
  homozygosity).  Hidden Markov models (HMM) are particularly useful
  for detecting such abnormalities, modeling the spatial dependence
  between neighboring SNPs.  Here, we extend previous approaches that
  utilize HMM frameworks for inference in high throughput SNP arrays
  by integrating copy number, genotype calls, and the corresponding
  measures of uncertainty when available.  Using simulated and
  experimental data, we demonstrate how confidence scores control
  smoothing in a probabilistic framework.  
  
  % The goal of this vignette
  % is to provide a simple interface for fitting HMMs and plotting
  % functions to help visualize the predicted states alongside the
  % experimental data.
\end{abstract}

\section{Overview}

This vignette describes how to fit a hidden Markov model to
locus-level estimates of genotype or copy number. This vignette does not
describe how to preprocess, genotype, or estimate copy number.  We
assume that locus-level estimates of genotype and/or copy number have
been obtained by software such as the \R{} package \Rpackage{crlmm}
(preferred).  Several HMM implementations are now available for the
joint analysis of copy number and genotype, including QuantiSNP
\citep{Colella2007} and PennCNV \citep{Wang2007a}.  

\paragraph{Citing this software.} 
%\bibitem{Scharpf2008}
Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski.
\newblock Hidden {M}arkov models for the assessment of chromosomal alterations
  using high-throughput {SNP} arrays.
\newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008.

\section{Data organization}
\label{sec:data}

\subsection{Basic data elements}
\label{data:matrices}

The basic elements required for fitting the vanilla hidden Markov
model in this vignette are
\begin{itemize}
\item a matrix of copy number estimates
\item a matrix of genotype calls (represented as integers: 1=AA, 2=AB,
  3=BB)
\item a mapping of platform identifiers to physical position and chromosome
\end{itemize}

When available, uncertainty estimates of copy number and genotypes
(CRLMM uncertainties) should also be represented as matrices. As several
platforms now have polymorphic and nonpolymorphic probes, the dimensions
of the genotype and copy number matrices do not necessarily
coincide. However, for this release of \Rpackage{VanillaICE} we do
impose this requirement. The genotypes for the nonpolymorphic probes can
be represented as NAs.  The assay data provided with this package
contains the basic elements for the HMM from an older Affymetrix
platform (100k):

<<assaydata>>=
library(VanillaICE)
data(locusLevelData)
copynumber <- locusLevelData[["copynumber"]]/100
genotypes <- locusLevelData[["genotypes"]]
@ 

\noindent The copy number estimates provided in the example data were
saved as [copy number estimate]*100 and stored as an integer -- we
divided the estimates by 100 above to revert to the original scale.
Genotypes should be represented as an integer: 1= AA, 2 = AB, 3=BB.  NA
or the integer 4 can be used to represent missing values.  Confidence
scores for the assay data are included in the \Robject{locusLevelData},
but for now we ignore these.

Meta-data on the locus-level estimates are also required. In particular,
we require a mapping of the platform identifiers to the chromosome and
physical position. The preferred representation for the chromosome is an
integer. Currently, the only supported chromosomes are 1-22, X, and Y.

<<featuredata>>=
chromosome <- locusLevelData[["annotation"]][, "chromosome"]
position <- locusLevelData[["annotation"]][, "position"]
names(position) <- names(chromosome) <- rownames(locusLevelData[["annotation"]])
@ 

Other chromosomes could potentially be supported, but the user will be
expected to obtain centromere start and stop sites for alternative
chromosomes.  For details on how chromosome annotation should be
organized, see

<<chromosomeAnnotation>>=
require(SNPchip)
data(chromosomeAnnotation, package="SNPchip")
@ 

<<checks>>=
all(c(identical(names(position), rownames(copynumber)),
      identical(names(position), rownames(genotypes)),
      identical(colnames(genotypes), colnames(copynumber))))
@ 

The assay data elements should be ordered by chromosome and physical
position:

<<order>>=
ordering <- order(chromosome, position)
chromosome <- chromosome[ordering]
position <- position[ordering]
genotypes <- genotypes[ordering, , drop=FALSE]
copynumber <- copynumber[ordering, , drop=FALSE]
@ 



<<TESTING, echo=FALSE, results=hide, eval=FALSE>>=
##For testing that the arm variable forces the HMM to run independently on each chromosomal arm
data(locusLevelData, package="VanillaICE")
cnConf <- cnConfidence(chromosome1)
callsConf <- callsConfidence(chromosome1)
callsConf <- matrix(as.integer(-1000*log(1-callsConf)), nrow(callsConf), ncol(callsConf))
callsConf <- rbind(callsConf, callsConf[1:100, , drop=FALSE])
dimnames(callsConf) <- dimnames(copynumber)
locusLevelData[["crlmmConfidence"]] <- callsConf

annotation <- locusLevelData[["annotation"]]
copynumber <- locusLevelData[["copynumber"]]/100
genotypes <- locusLevelData[["genotypes"]]
i <- order(annotation[, "chromosome"], annotation[, "position"])
genotypes <- genotypes[i, , drop=FALSE]
copynumber <- copynumber[i, , drop=FALSE]
annotation <- annotation[i, , drop=FALSE]
copynumber <- copynumber[1:9165, , drop=FALSE]
genotypes <- genotypes[1:9165, , drop=FALSE]
annotation <- annotation[1:9165,, drop=FALSE]
copynumber <- rbind(copynumber, copynumber)
genotypes <- rbind(genotypes, genotypes)
annotation2 <- annotation
annotation2[, 1] <- 2
rownames(annotation2) <- paste("SNP_A-", 1:nrow(annotation2), sep="")
annotation <- rbind(annotation, annotation2)
rownames(genotypes) <- rownames(copynumber) <- rownames(annotation)
@ 

\subsection{Using \Rpackage{Biobase}-derived classes}
\label{data:classes}


A principal advantage of using classes instead of matrices is that the
assay data elements are bound to the meta-data on the samples and loci.
In this section, we show how the above assay data and meta-data can be
encapsulated in a single object. The classes required for this operation
are defined in the \Rpackage{oligoClasses} package available at
BioConductor.  These class definitions are all extensions of
\Rpackage{eSet}. The methods for manipulating \Robject{eSets} in the
\R{} package \Rpackage{Biobase} are now available to us.  We will
instantiate these objects in two-steps: (i) instantiate an object that
contains the locus-level meta-data and (ii) instantiate an object that
contains the assay-data and the meta-data.

For (i), the \Robject{AnnotatedDataFrame} class defined in the
\Rpackage{Biobase} package is used as a container for the meta-data on
the locus-level summaries:

<<annotatedDataFrame>>=
require(oligoClasses)
locusAnnotation <- data.frame(list(chromosome=chromosome, position=position), row.names=names(chromosome))
featuredata <- new("AnnotatedDataFrame", data=locusAnnotation, varMetadata=data.frame(labelDescription=colnames(locusAnnotation)))
stopifnot(all(c("chromosome", "position") %in% varLabels(featuredata)))
@ 

\noindent The labels 'chromosome' and 'position' are controlled
vocabularies for the locus-level metadata that are later required when
fitting the HMM.  For (ii), three options are possible and depend on the
assay data available:
\begin{enumerate}
\item {\color{blue}{SnpCallSet}} (genotypes-only)

<<instantiateCallSet, eval=FALSE>>=
  new("SnpCallSet",
      calls=genotypes,
      phenoData=annotatedDataFrameFrom(genotypes, byrow=FALSE),
      featureData=featuredata)
@ 
\item {\color{blue}{SnpCopyNumberSet}} (copy number-only)

<<instantiateCopyNumberSet, eval=FALSE>>=
new("SnpCopyNumberSet",
    copyNumber=copynumber,
    phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
    featureData=featuredata)
@ 
 
\item {\color{blue}{oligoSnpSet}} (genotypes and copy number)
  
<<instantiateOligoSnpSet>>=  
locusset <- new("oligoSnpSet",
		copyNumber=copynumber,
		calls=genotypes,
		phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
		featureData=featuredata)
@ 
  
\end{enumerate}

Successfully instantiating an object of the class ensures that the assay
data elements and the meta-data are consistent, greatly simplifying
operations such as subsetting and reordering.  For instance, the assay
data and meta-data can be ordered by chromosome and physical position
with one line of \R{} code:

<<orderClass>>=
locusset <- locusset[order(chromosome(locusset), position(locusset)), ]
@ 

\section{Fitting the Vanilla HMM}
\label{hmm:vanilla}

We use the Vanilla HMM to smooth the locus-level summaries as a function
of physical position when confidence estimates of the assay data are
unavailable.  In Section \ref{fit:classes}, \Rpackage{Biobase}-derived
class representations from the previous section and pre-defined wrappers
for fitting a basic HMM are implemented.  In Section \ref{sec:matrices},
we reproduce the results from Section \ref{fit:classes} but in a
step-by-step fashion using only the matrices and vectors that were
created in Section \ref{data:matrices} (no classes).  The step-by-step
code in Section \ref{data:matrices} is a useful guide for those wishing
to alter the number of states in the HMM or develop other adaptations.

\subsection{Using \Rpackage{Biobase}-derived classes}
\label{fit:classes}

The objective of developing classes is to facilitate the process of
fitting an HMM and to more effectively keep the assay- and meta-data
bound in one object.  The following code uses classes to arrive at the
same HMM predictions as above.  The starting point is the \R{} object
\Robject{locusset} created as indicated in Section \ref{data:classes}.
This object may be any one of the following classes:
\Robject{SnpCallSet}, \Robject{SnpCopyNumberSet}, \Robject{oligoSnpSet}.
Note that we could repeat the steps of parameterizing the HMM with
transition- and emission-probabilities as described above, using the
accessor methods \Robject{chromosome}, \Robject{position},
\Robject{copyNumber}, and \Robject{calls} to access the meta- or
assay-data.  We provide a convenience wrapper for fitting the Vanilla
HMM, \Robject{vanilla}, that computes the transition probabilities and
emission probabilities.  \emph{For \Robject{oligoSnpSet} objects, the
  hidden states are assumed to be hemizygous deletion, normal, ROH, and
  amplification (in this order). }

<<vanillaSegments>>=
##featuredata <- new("AnnotatedDataFrame", data=locusAnnotation, varMetadata=data.frame(labelDescription=colnames(locusAnnotation)))
##locusset <- new("oligoSnpSet",
##		copyNumber=copynumber,
##		calls=genotypes,
##		phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
##		featureData=featuredata)
(brks <- hmm(object=locusset, 
	    states=c("hemDel", "normal", "ROH", "amplification"), 
	    mu=c(1, 2, 2, 3), 
	    probs=c(0.999, 0.7, 0.999, 0.7),
	    takeLog=TRUE))
@ 


\noindent To obtain the matrix of predicted states for each locus, set
\Robject{returnSegments} to \Robject{FALSE}:

<<vanillaMatrix>>=
fit2 <- hmm(object=locusset, 
	    states=c("hemDel", "normal", "ROH", "amplification"), 
	    mu=c(1, 2, 2, 3), 
	    probs=c(0.999, 0.7, 0.9999, 0.7),
	    takeLog=TRUE,
	    returnSegments=FALSE)
@ 

An example of how one may visualize the segmentation:

<<vanillaPlot, fig=TRUE, width=8>>=
show(plot(locusset[chromosome(locusset) == 1, ]))
fit2[fit2 == 3] <- 2  ##So LOH line will not plot at 3
fit2[fit2 == 4] <- 3
lines(position(locusset)[chromosome(locusset) == 1], fit2[chromosome(locusset) == 1, 1])
rohRegion <- as.integer(brks[brks$state == "ROH", ][c("start", "end")])
abline(v=rohRegion, col="royalblue", lty=2)
legend("topright", lty=c(1, 2), col=c("black", "royalblue"), legend=c("CN state", "ROH boundary"), bty="n")
@ 

\subsubsection{Extracting intermediate objects}

To extract intermediate objects that were created by the \Robject{hmm}
wrapper, one may pass an environment to store the intermediate objects
that are created prior to fitting the HMM.

<<passingAnEnvironment>>=
env <- new.env()
fit2 <- hmm(object=locusset, 
	    states=c("hemDel", "normal", "ROH", "amplification"), 
	    mu=c(1, 2, 2, 3), 
	    probs=c(0.999, 0.7, 0.9999, 0.7),
	    takeLog=TRUE,
	    returnSegments=FALSE, envir=env)
ls(env)
@ 

\subsection{Step by step}
\label{sec:matrices}
We now go through the previous example step-by-step without using
Biobase-derived classes or wrappers.  While we reproduce the results in
the previous section, one may modify any of these step (for instance, to
expand the of hidden states). The basic elements of the HMM are

\begin{itemize}
\item[1] hidden states
\item[2] initial state probabilities
\item[3] transition probabilities 
\item[4] emission probabilities 
\end{itemize}

Items (1) and (4) depend critically on the software that was used for
obtaining the locus-level estimates and the scientific goals of the
analysis.

\subsubsection{Hidden states}
\label{sec:hiddenStates}

In this section we discuss specification of the parameters for the
hidden states.  We assume that conditional on the underlying state the
copy number estimates and genotypes are independent \citep{Scharpf2008}.

\paragraph{Copy number.}

There are two important considerations when specifying the hidden copy
number states. First, the scale of the locus-level estimates are
dependent on the preprocessing software.  For instance, the BeadStudio
software for the Illumina platform provides logR ratios, whereas
\Rpackage{crlmm} provides an absolute estimate of copy number.
Secondly, the data source is important.  Integer copy number states are
not appropriate when mixtures of different cell populations are present.
Mixtures of cell populations can be diagnosed and the admixture
estimated by visual inspection.  For instance, a genomic region
containing copy number estimates between 1 and 2 on the absolute scale
with heterozygous genotype calls interspersed suggest a mixture of cell
populations.  Therefore, plot your data.

<<plotSnp, eval=FALSE>>=
chr1 <- annotation[, "chromosome"] == 1
plot(annotation[chr1, "position"], copynumber[chr1, 1], pch=".", cex=2, col="grey60")
@ 

If genotypes are available, color-code by the diallelic genotype call.
<<addCalls,fig=TRUE,width=8, height=5, eval=FALSE>>=
plot(annotation[chr1, "position"], copynumber[chr1, 1], pch=".", cex=2, col="grey60")
het <- genotypes[, 1] == 2
points(annotation[chr1 & !het, "position"], copynumber[chr1 & !het, 1], pch=".", cex=2, col="royalblue")
points(annotation[chr1 & het, "position"], copynumber[chr1 & het, 1], pch=".", cex=2, col="red")
abline(h=1:3)
@ 

While the \Robject{locusLevelData} is based on HapMap samples,
alterations were simulated to have integer copy number
states. Therefore, in the code chunk below we specify an integer copy
number for the hidden states hemizygous deletion (hemDel: copy number <
1), normal (copy number 2 and typical heterozygosity), region of
homozygosity (ROH: copy number 2 and unusually low heterozygosity), and
amplification (copy number >= 3).

<<hiddenStates>>=
states <- c("hemDel", "normal", "ROH", "amplification")
copynumberStates <- c(1, 2, 2, 3)
@

\paragraph{Genotypes}

In the absence of confidence scores for the genotype calls, the
parameters needed for the hidden states for the genotypes is the
probability that the true genotype is homozygous conditional on the
underlying state.

<<hiddenStateParameters-genotypes>>=
probs <- c(0.999, 0.7, 0.999, 0.7)
names(probs) <- states
@ 

\noindent One may expand the number of hidden states to include $>$ 3
copies and homozygous deletions, perform simple gain/loss analyses using
copy number only, or assess regions of homozygosity with genotypes only.

\subsubsection{Initial state probabilities.}
\label{sec:initial}

The only requirement for specifying the initial state probabilities is
that the probabilities sum to 1.  Here, we assume \emph{a priori} that
the states are equally likely:

<<initialProb>>=
initialP <- (rep(1, length(states)))/length(states)
@ 

\subsubsection{Transition probabilities.}
\label{sec:transition}

Our HMM uses locus-specific transition probabilities that are calculated
as a function of the physical distance between loci. Specifically, the
probability that the locus at position $t-1$ is \emph{not} informative
for the locus at position $t$ is calculated as $1-e^{-d/C}$, where $C$
is a constant specified by the user and $d$ is the physical distance
between locus $t$ and locus $t-1$.  The default for $C$ is $1 \times
10^8$ and can be specified by the \Robject{TAUP} argument to the
function \Robject{transitionProbability} to acheive a desired amount of
sensitivity and specificity.  Larger values of \Robject{TAUP} decreases
the probability of transitioning to other states, and therefore provides
a more smooth fit. In particular, \Robject{transitionProbability} (i)
transforms the physical distance between adjacent loci to an estimate of
the genomic distance and (ii) adds an 'arm' variable to the annotation
matrix.

<<transitionProbability>>=
tau <- transitionProbability(chromosome=chromosome,
			     position=position,
			     TAUP=1e8)
table(tau[, "arm"])
str(tau)
@

\noindent The values for \Robject{arm} indicate that the HMM will be fit
independently to three separate segments of the supplied data,
corresponding to arms 1p, 1q, and 2p.  By default this function uses the
chromosome annotation provided by the \R{} package \Rpackage{SNPchip}
for the centromere coordinates and will work only for chromosomes 1-22,
X, and Y.

%TODO: Discuss how one may scale the transition probability matrix.

% \section{Joint analysis of copy number and genotype}

\subsubsection{Emission probabilities}

\paragraph{Copy number}

Skip to \emph{Genotypes} if copy number estimates are not available.

One attractive feature of HMMs for smoothing copy number estimates as a
function of the physical position is that uncertainty estimates can be
incorporated into the emission probabilities.  While uncertainty
estimates for copy number and genotype calls can improve both the
sensitivity and specificity for detecting chromosomal alterations
\citep{Scharpf2008}, these estimates are not always available. However,
we can develop ad-hoc estimates of the uncertainty provided a reasonable
number of samples are available.

To obtain emission probabilities, we provide the matrix of copy number
estimates and the location parameter for the hidden states (mu).  The
object \Robject{mu} should have length equal to the number of hidden
states.  If the hidden states are locus-specific, one may specify
\Robject{mu} as a matrix with rows corresponding to loci and columns
corresponding to states.
%TODO: test SNP-speicific

<<copynumberEmission>>=
copynumberStates <- c(1, 2, 2, 3)
emission.cn <- copynumberEmission(copynumber=copynumber,
				  states=states,
				  mu=copynumberStates,
				  takeLog=TRUE,
				  verbose=TRUE)
##or
copynumberStates <- matrix(c(1, 2, 2, 3), nrow(copynumber), length(states), byrow=TRUE)
emission.cn2 <- copynumberEmission(copynumber=copynumber,
				  states=states,
				  mu=copynumberStates,
				  takeLog=TRUE)
stopifnot(identical(emission.cn, emission.cn2))
dim(emission.cn)
@ 

The third dimension of the array returned by
\Robject{copynumberEmission} corresponds to the number of hidden states.
The copy number and location parameter (mu) must be supplied on the same
scale.  For instance, in the above code chuck the copy number estimates
and mu are on the scale of copy number.  These estimates/parameters are
both log-transformed within the body of the function by setting the
argument \Robject{takeLog} to \Robject{TRUE}.  Again, the emission
probabilities are calculated under the assumption that the estimates, or
a suitable transformation, are approximately Normal. Hence, in the
simulated data we check that the middle 90\% is approximately Gaussian.

<<qq, fig=TRUE>>=
par(pty="s", las=1)
qqnorm(log2(copynumber), cex=0.6)
qqline(log2(copynumber))
abline(v=c(-1.645, 1.645))
@ 

When true uncertainty estimates for the copy number are not available,
copy number outliers are more likely to result in extreme emission
probabilities than can influence the HMM segmentation. There are several
approaches to reducing the influence of outliers on the HMM
segmentation: (i) improved uncertainty estimates (preferred), (ii)
increase \Robject{TAUP} for the transition probabilities, (iii)
threshold the emission probabilities, and (iv) threshold extreme values
for the copy number estimates.  For example, 

<<epsilon>>=
emission.cn[emission.cn[, , 2] < -10, , 2] <- -10
@ 

\Robject{copynumberEmission} estimates the scale parameter for the
Normal distribution from the supplied data using the median absolute
deviation (MAD). However, different standard deviations can be supplied
by the user with the argument \Robject{sds}. The supplied \Robject{sds}
must be of the same dimension as the copy number matrix. The following
discussion elaborates briefly on the default procedure used to estimate
the standard deviations.

In the example dataset, we have only one sample and no estimates of the
copy number uncertainty.  Therfore, we obtain a robust estimate of the
copy number standard deviation across SNPs and use this as a rough
estimate of the uncertainty.  As the log-transformed copy number
estimates are more nearly Gaussian, we calculate a robust estimate of
the standard deviation using the median absolute deviation (MAD) on the
log scale:

<<robustSds>>=
cn.sds <- VanillaICE:::robustSds(copynumber, takeLog=TRUE)
dim(cn.sds)
@ 

When multiple samples are available (e.g., 10 or more), SNP-specific
estimates of the copy number uncertainty can be obtained by scaling an
estimate of the variability across samples by a sample-specific estimate
of noise.  In the following code chunk, we simulate a matrix of copy
number for 200 samples and then compute a robust SNP-specific estimate
of the standard deviation.

<<multipleSamples, eval=FALSE>>=
CT <- matrix(sample(copynumber, 100*200, replace=TRUE), 100, 200)
sds <- VanillaICE:::robustSds(CT, takeLog=TRUE)
@ 

The \Robject{robustSds} function returns a SNP-specific matrix of
standard deviations provided that the copy number matrix has at least 3
samples.  The {\it preferred} approach when the sample size is small
(say, less than 10), is to develop SNP-specific estimates of the
variance on a larger reference set, such as HapMap, using the same
software, and then scale these estimates by a measure of the sample
variance.

\paragraph{Genotypes.}

Proceed to \emph{Fitting the hidden Markov model} if genotypes estimates
are not available.  If genotypes were estimated using the
\Rpackage{CRLMM} or \Rpackage{oligo} and confidence estimates are
available, see Section \ref{sec:ice}.

As copy number estimates are typically very noisy, genotype calls can
potentially provide increased power to identify small regions of
hemizygous deletions.  In addition, the genotypes can be used to
identify unusually long regions of homozygosity (ROH).  In our
experience, sequences of 70 - 100 homozygous genotypes are not uncommon
and likely represent normal regions of the genome with fairly uniform
haplotype structure.  Emission probabilities for the Vanilla HMM are
estimated from a Bernoulli with state-specific probabilities of a
homozygous genotype call as specified in Section \ref{sec:hiddenStates}.

<<genotypeEmission>>=
emission.gt <- genotypeEmission(genotypes=genotypes, states=states, probHomCall=probs)
@ 

If any of the genotype calls are missing and missingness is not
independent of the underlying hidden state, one may specify the
probability of a missing genotype calls for each hidden state
(\texttt{probMissing}).  By default, the HMM will assume that missing
genotype calls are independent of the underlying hidden state.  In
particular, this assumption may not be reasonable for homozygous
deletions.

Conditional on the hidden state, we assume that the copy number and
genotype are independent.  Therefore, the emission probabilities for an
HMM that models the copy number and genotypes jointly are computed by
adding the emission probabilities (log-scale) for copy number and
genotype:

<<emission>>=
emission <- emission.gt + emission.cn
@ 

If assay data for only genotypes or only copy number is available, the
emission probability is simply the genotype or copy number emission
probabilities, respectfully.

\subsubsection{Viterbi algorithm.}

The sequence of states that maximizes the likelihood is obtained by the
Viterbi algorithm.  Note that the argument \Robject{arm} to this
function is a factor indicating the chromosomal arms -- the Viterbi
algorithm is computed for independently for each chromosomal arm.


<<fit>>=
fit <- viterbi(initialStateProbs=log(initialP),
	       emission=emission,
	       tau=tau[, "transitionPr"],
	       arm=tau[, "arm"])
table(fit)
breaks(x=fit, states=states, position=tau[, "position"],
       chromosome=tau[, "chromosome"],
       sampleNames=colnames(copynumber))
@ 


%\paragraph{Updating elements of the HMM.} 
%
%To update the elements used, for instance, a different.  For instance,
%
%<<environment>>=
%##source("~/madman/Rpacks/VanillaICE/R/functions.R")
%e1 <- new.env()
%hmm(object=locusset, 
%    states=c("hemDel", "normal", "ROH", "amp"), 
%    mu=c(1, 2, 2, 3), 
%    probs=c(0.9999, 0.99, 0.9999, 0.99),
%    takeLog=TRUE, envir=e1)
%ls(e1)
%e1[["TAUP"]] <- 1e12
%##untrace(update, signature="environment", where=getNamespace("VanillaICE"))
%update(object=e1)
%@ 

\section{Integretating Confidence Estimates (ICE): Extending
  \Rpackage{CRLMM}}
\label{sec:ice}

In this section, we illustrate how one may fit an HMM that incorporates
confidence esimates of the SNP-level summaries for the genotype calls.
To reduce the amount of repeated code, we will use the
\Rpackage{Biobase}-derived classes to illustrate this approach.  We will
instantiate an object of class \Robject{oligoSnpSet} as described in
Section \ref{data:classes}, however we will also store the CRLMM
confidence scores (represented as an integer).

<<oligoSnpSetCrlmm>>=
##load("../../../tmp/VanillaICE/data/chromosome1.RData")
##cnConf <- cnConfidence(chromosome1)
copynumber <- locusLevelData[["copynumber"]]/100
crlmmConfidence <- locusLevelData[["crlmmConfidence"]]
genotypes <- locusLevelData[["genotypes"]]
fd <- locusLevelData[["annotation"]]
featuredata <- new("AnnotatedDataFrame", data=data.frame(fd), varMetadata=data.frame(labelDescription=colnames(fd)))
(locusset2 <- new("oligoSnpSet",
		  copyNumber=copynumber,
		  calls=genotypes,
		  callsConfidence=crlmmConfidence,
		  phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
		  featureData=featuredata,
		  annotation="genomewidesnp6"))
locusset2 <- locusset2[order(chromosome(locusset2), position(locusset2)), ]
@ 

The \Robject{annotation} slot must be specified so that the appropriate
reference distribution will be loaded.  Supported Affymetrix platforms
include genomewidesnp6, pd.mapping250k.nsp, and pd.mapping250k.sty. When
fitting the ICE HMM using the \Robject{hmm} wrapper, we again assume
that the hidden states are hemizygous deletion, normal, copy-neutral
region of homozygosity, and amplification (in this order).  Fitting the
ICE HMM using other states is possible, and one could follow the
step-by-step approach outlined in Section \ref{sec:matrices}.

<<ice>>=
fit3 <- hmm(object=locusset2, 
	    states=c("hemDel", "normal", "ROH", "amplification"), 
	    mu=c(1, 2, 2, 3), 
	    takeLog=TRUE,
	    returnSegments=FALSE, ice=TRUE)
brks <- hmm(object=locusset2, 
	    states=c("hemDel", "normal", "ROH", "amplification"), 
	    probs=c(0.05, 0.99, 0.7, 0.999),
	    mu=c(1, 2, 2, 3), 
	    takeLog=TRUE,
	    ice=TRUE)
@ 

\noindent Here we plot one of the regions that has different breakpoints
in the ICE and Vanilla HMMs.

<<iceFig, fig=TRUE, width=8>>=
i <- which(position(locusset2) >= 45*1e6 & position(locusset2) <= 55*1e6 & chromosome(locusset2) == 1)
gp <- plot(locusset2[i, ]) 
gp$pch <- 21; gp$col <- "black"; gp$cex <- 0.9; gp$ylim <- c(0.9, 6)
gp$add.cytoband <- FALSE
show(gp)
fit3[fit3 == 3] <- 2  ##So LOH line will not plot at 3
fit3[fit3 == 4] <- 3
lines(position(locusset2)[i], fit3[i, 1])
rohRegion <- as.numeric(as.matrix(brks[brks$state == "ROH", ][c("start", "end")]))
abline(v=rohRegion, col="royalblue", lty=2)
legend("topright", lty=c(1, 2), col=c("black", "royalblue"), legend=c("CN state", "ROH boundary"), bty="o", bg="white")
@ 

%TODO: Vignette for generating chromosome 21 data.
%TODO: Reference distribution for pd.mapping50k.nsp
%TODO: Reference distribution for pd.mapping50k.sty
%TODO: merging pd.mapping50k.sty and pd.mapping50k.sty

%\section{Affymetrix 6.0 example}
%
%Placeholder for an example using Affy 6.0 data (INCOMPLETE).
%
%<<crlmmResults>>=
%require(SNPchip) 
%data(crlmmResults, package="VanillaICE")
%crlmmResults[["crlmmVersion"]]
%@ 
%
%<<reformat>>=
%genotypes <- crlmmResults[["genotypes"]]
%confidence <- crlmmResults[["confidence"]]
%logsds <- crlmmResults[["logsds"]]/100
%copynumber <- crlmmResults[["copynumber"]]/100
%position <- crlmmResults[["position"]]
%dns <- list(crlmmResults[["nms"]], crlmmResults[["sns"]])
%copynumber[copynumber < 0.05] <- 0.05
%copynumber[copynumber > 6] <- 6
%copynumber[is.na(copynumber)] <- 2
%dimnames(confidence) <- dimnames(logsds) <- dimnames(copynumber) <- dimnames(genotypes) <- dimnames(confidence) <- dns
%@ 
%
%<<g6set>>=
%fD <- new("AnnotatedDataFrame", 
%	  data=data.frame(cbind(chromosome=21, position=position), row.names=rownames(copynumber)), 
%	  varMetadata=data.frame(labelDescription=c("chromosome", "position")))
%snpset <- new("oligoSnpSet", 
%	      copyNumber=log2(copynumber), 
%	      cnConfidence=1/logsds,
%	      calls=genotypes, 
%	      callsConfidence=confidence, featureData=fD,
%	      phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
%	      annotation="genomewidesnp6")
%snpset <- snpset[order(chromosome(snpset), position(snpset)), ]
%fit4 <- hmm(object=snpset, 
%	    states=c("hemDel", "normal", "ROH", "amplification"), 
%	    probs=c(0.01, 0.99, 0.9, 0.999),
%	    mu=c(0.05, 1, 1, 1.58), 
%	    takeLog=FALSE,
%	    returnSegments=FALSE, ice=TRUE)
%env <- new.env()
%brks <- hmm(object=snpset, 
%	    probs=c(0.01, 0.99, 0.9, 0.999),	    
%	    states=c("hemDel", "normal", "ROH", "amplification"), 
%	    mu=c(0.05, 1, 1, 1.58), 
%	    takeLog=FALSE,
%	    ice=TRUE, 
%	    envir=env)
%@ 

%<<>>=
%par(las=1)
%CT <- 2^copyNumber(snpset)
%HET <- calls(snpset)[, 1] == 2
%plot(position(snpset), CT[, 1], log="y", pch=".", type="n", ylab="copy number", xaxt="n", ylim=c(0.5, 5))
%points(position(snpset)[HET], CT[HET, 1], col="royalblue", pch=".")
%points(position(snpset)[!HET], CT[!HET, 1], col="red", pch=".")
%fit4[fit4 == 3] <- 2  ##So LOH line will not plot at 3
%fit4[fit4 == 4] <- 3
%lines(position(snpset), fit4[, 1])
%rohRegion <- as.integer(as.matrix(brks[brks$state == "ROH", c("start", "end")]))
%abline(v=rohRegion, col="royalblue", lty=2)
%legend("topright", lty=c(1, 2), col=c("black", "royalblue"), legend=c("CN state", "ROH boundary"), bty="n")
%@ 

\section{Appendix}

\subsection{Confidence scores for genotype calls.}

We suggest using the \Rfunction{CRLMM} algorithm \cite{Carvalho2007} for
genotype calls.  \Rfunction{CRLMM} (in the R package \Rpackage{oligo})
provides confidence scores ($\pgte$) of the genotype estimates ($\gte$).
From 269 HapMap samples assayed on the Affymetrix 50k Xba and Hind
chips, we have a gold standard of the true genotype defined by the
consensus of the HapMap centers.  We use kernal-based density estimates
to obtain

{\scriptsize
\begin{eqnarray}
\label{ingo2}
f\left\{\ \pgtehom\ |\ \gtehom,\thom\ \right\},\
f\left\{\ \pgtehom\ |\ \gtehom,\thet\ \right\},\
f\left\{\ \pgtehet\ |\ \gtehet,\thom\ \right\},\ \mbox{~ and~}
f\left\{\ \pgtehet\ |\ \gtehet,\thet\ \right\}
\end{eqnarray}
}

\noindent separately for the Xba and Hind 50k chips. The first term in
(\ref{ingo2}), for example, denotes the density of the scores when the
genotype is correctly called homozygous ($\gtehom$) and the true
genotype is homozygous ($\thom$). See \cite{Scharpf2008} for a more
complete description of the methods. 

%\paragraph{Confidence scores for copy number estimates}
%
%To illustrate how standard errors of the copy number estimate could be
%integrated in the HMM, the R object \Robject{chromosome1} contains
%standard errors simulated from a shifted Gamma: $\mbox{Gamma}(1, 2) +
%0.3$, where 1 is the shape parameter and 2 is the rate parameter.  To
%ascertain the effect of qualitatively high confidence scores on the
%ICE HMM, we scaled a robust estimate of the copy number standard
%deviation by $\frac{1}{2}$. Similarly, to simulate less precise $\cne$
%we scaled $\epsilon$ by 2.  For more detailed information about how
%the data in the \Robject{chromosome1} was generated, see the
%documentation for this object in the R package \ice.


<<vanillaPlot,fig=TRUE,width=8,height=5, eval=FALSE>>=
gp <- plot(snpset[chromosome(snpset) == 1, ])
lines(position(snpset)[chromosome(snpset)==1], viterbiResults[chromosome(snpset) == 1, ])
##show copy-neutral LOH by vertical lines
gp$abline.v <- TRUE ##plots vertical dashed lines
allParameters <- unlist(snpPar(gp))
gp$col.predict[3] <- "white"
gp$hmm.ycoords <- c(0.7,0.9)
show(gp)
@ 

\subsection{More examples}
%
\subsubsection{Copy number only: assessing gain/loss}

The method \Robject{hmm} has a different set of underlying hidden states
depending on whether copy number estimates, genotype calls, or both are
available.  When only copy number estimates are available, the hidden
states (for autosomes) are hemizygous or homozygous deletion (one or
fewer copies), normal (two copies), and amplification (three or more
copies).  The corresponding Biobase-derived class is {SnpCopyNumberSet}.

\subsubsection{Genotypes only: assessing ROH}

When only genotype calls are available, the hidden states are loss and
retention (ret) of heterozygosity.  We define \textit{loss} to be a
sequence of homozygous SNPs longer than what we would expect to observe
by chance. Note that many long stretches of homozygosity may occur as a
result of a population sharing a common underlying haplotype structure;
loss predictions from an HMM fit to an indvidual do not necessarily
reflect the 'loss' of an allele in that individual.  The corresponding
Biobase-derived class is {SnpCallSet}.


\subsection{Expanding the number of hidden states}

Section is incomplete.  For homozygous deletion, we specify a small
value (less than 1) but greater than zero to account for background
fluorescence.

<<hiddenStates, eval=FALSE>>=
states <- c("homozygousDeletion", "hemizygousDeletion",
	    "normal", "LOH", "3copyAmp", "4copyAmp")
mu <- c(0.05, 1, 2, 2, 3, 4)
probs <- c(0.99, 0.999, 0.99, 0.999, 0.99, 0.99)
probMissing <- c(0.95, rep(0.5, 5))
@ 

\subsection{Classes for segmented data}

TODO

\subsection{Plotting methods for segmented data classes}

TODO

\section{Session Information}

The version number of R and packages loaded for generating the vignette
were:

<<echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@  

\bibliography{ice}{}
\bibliographystyle{plain}


\end{document}
