%\VignetteIndexEntry{Classes Used in the Oligo Packages}
%\VignetteDepends{oligo}
%\VignetteKeywords{Expression, SNP, Affymetrix, NimbleGen, Oligonucleotide Arrays}
%\VignettePackage{oligo}

\documentclass{article}

\usepackage{hyperref, amsfonts}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\oligo}{\Rpackage{oligo }}

\begin{document}
\title{Classes Used in the Oligo Package}
\date{March, 2007}
\author{Benilton Carvalho}
\maketitle

\section{Introduction}

This document describes the classes used in the \oligo package. The \oligo package uses essentially two groups of classes:
\begin{itemize}
\item Static data classes: these hold chip-specific information. Each chip design has its own annotation, which is shared across all experiments that use that array. Objects of these classes are generated by the \Rpackage{makePlatformDesign} and \Rpackage{pdInfoBuilder} packages.
\item Experimental data classes: these hold the experimental data, i.e., the CEL and XYS files that the user has. All the experimental data classes derive from \Rclass{eSet}, defined in \Rpackage{Biobase}.
\end{itemize}

The \Rclass{platformDesign} class is one of the static data classes; its objects are generated by the \Rpackage{makePlatformDesign} package. It is a container for the chip-specific information. We are transitioning the creation of the chip-specific packages to the \Rpackage{pdInfoBuilder} package, which makes more efficient use of memory (via SQLite) and is more flexible than the environment-based approach used by \Rpackage{makePlatformDesign}.

\section{\Rclass{platformDesign} Class}

The \Rclass{platformDesign} class is the container for information on
the expression (NimbleGen), tiling (Affymetrix, NimbleGen) and exon
(Affymetrix) arrays. It contains the following slots:

\begin{table}[h]
  \centering
  \begin{tabular}{|l|l|} \hline
    Slot                            &  Type \\ \hline
    \Robject{manufacturer}           & \Rclass{character} \\
    \Robject{genomebuild}            & \Rclass{character} \\
    \Robject{featureInfo}            & \Rclass{environment}\\
    \Robject{featureTypeDescription} & \Rclass{list}      \\
    \Robject{type}                   & \Rclass{character} \\
    \Robject{nrow}                   & \Rclass{numeric}   \\
    \Robject{ncol}                   & \Rclass{numeric}   \\
    \Robject{nwells}                 & \Rclass{numeric}   \\
    \Robject{lookup}                 & \Rclass{data.frame}\\
    \Robject{indexes}                & \Rclass{list}      \\
    \Robject{platforms}              & \Rclass{character} \\
    \hline
  \end{tabular}
  \caption{Description of the \Rclass{platformDesign} class}
  \label{tab:platformDesign}
\end{table}

\begin{itemize}
\item \Robject{manufacturer}: lower case string containing the name of
  the manufacturer of the array (e.g., \Rcode{``affymetrix''} or
  \Rcode{``nimblegen''}).
\item \Robject{genomebuild}: lower case string containing the genome
  release information using the UCSC notation, as described at
  \url{http://genome.ucsc.edu/FAQ/FAQreleases#release1}.
\item \Robject{featureInfo}: an environment containing vectors of the
  same length that fully characterize the array being used. See details
  below.
\item \Robject{type}: a string describing the type of the array (e.g.,
  \Rcode{``expression''}, \Rcode{``tiling''}, \Rcode{``exon''},
  \Rcode{``SNP''}).
\item \Robject{nrow} and \Robject{ncol}: array dimensions.
\item \Robject{nwells}: number of wells (specific for NimbleGen data).
\item \Robject{lookup}: data.frame used to map features in complex
  NimbleGen designs.
\item \Robject{indexes}: not used anymore. To be removed.
\item \Robject{platforms}: not used anymore. To be removed.
\end{itemize}

\subsection{Details on the \Robject{featureInfo} slot}

The \Robject{featureInfo} slot holds the majority of the
information used by \Rpackage{oligo}. It is an
\Rclass{environment} containing the following vectors:

\begin{table}[h]
  \centering
  \begin{tabular}{|l|c|c|c|} \hline
    \textbf{Field}               & \textbf{Expression} & \textbf{Tiling} & \textbf{Exon} \\ \hline
    \Robject{X} and \Robject{Y}  &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{feature\_set\_name} &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{feature\_ID}        &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{feature\_type}      &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{target\_strand}     &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{sequence}           &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{order\_index}       &     \checkmark      &  \checkmark     &   \checkmark  \\ \hline
    \Robject{length}             &                     &  \checkmark     &   \checkmark  \\ \hline
    \Robject{chromosome}         &                     &  \checkmark     &               \\ \hline
    \Robject{ambiguous\_feature} &                     &  \checkmark     &               \\ \hline
    \Robject{position}           &                     &  \checkmark     &               \\ \hline
    \Robject{location}           &     \checkmark      &                 &   \checkmark  \\ \hline
    \Robject{atomID}             &                     &                 &   \checkmark  \\ \hline
    \Robject{gc\_count}          &                     &                 &   \checkmark  \\ \hline
  \end{tabular}
  \caption{Fields in \Robject{featureInfo}}
  \label{tab:featureInfo}
\end{table}

\begin{itemize}
\item \Robject{X} and \Robject{Y}: X/Y coordinates on the array. Class:
  \Rclass{integer}.
\item \Robject{feature\_set\_name}: name of the featureset
  (probeset). Class: \Rclass{character}.
\item \Robject{feature\_ID}: match ID between PM and MM. Class:
  \Rclass{integer}.
\item \Robject{feature\_type}: type of the feature. Class:
  \Rclass{factor}. (PM/MM)
\item \Robject{target\_strand}: target strandness. Class:
  \Rclass{factor}. (antisense/sense)
\item \Robject{sequence}: probe sequence. Class: \Rclass{character}.
\item \Robject{length}: probe length. Class: \Rclass{integer}.
\item \Robject{chromosome}: chromosome. Class:
  \Rclass{character}. (chr1/chr22/chrX)
\item \Robject{ambiguous\_feature}: indicator of whether the sequence
  maps to more than one genomic location. Class: \Rclass{logical}.
\item \Robject{position}: genomic location within
  chromosome. Class: \Rclass{numeric}.
\item \Robject{location}: genomic location within chromosome. To be
  removed and merged with \Robject{position}.
\item \Robject{atomID}: pairing key between PM and MM features.
\item \Robject{gc\_count}: number of GC bases. To be removed, as this
  can be obtained from the sequence information.
\end{itemize}
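
As a minimal sketch of this layout, the following base R code builds a toy
environment that holds a few of the fields above as parallel vectors and
checks that they all have the same length. The name
\Robject{toyFeatureInfo} and every value in it are hypothetical; real
environments are generated by \Rpackage{makePlatformDesign}.

\begin{verbatim}
# Toy stand-in for a featureInfo environment (all values are made up).
toyFeatureInfo <- new.env()
assign("X", c(0L, 1L, 0L, 1L), envir = toyFeatureInfo)
assign("Y", c(0L, 0L, 1L, 1L), envir = toyFeatureInfo)
assign("feature_set_name", c("ps1", "ps1", "ps2", "ps2"),
       envir = toyFeatureInfo)
assign("feature_type", factor(c("PM", "MM", "PM", "MM")),
       envir = toyFeatureInfo)

# Every field is a vector with one element per feature on the array.
fieldLengths <- sapply(ls(toyFeatureInfo),
                       function(f) length(get(f, envir = toyFeatureInfo)))
stopifnot(length(unique(fieldLengths)) == 1)

# Element i of each vector describes the same feature.
i <- 3
toyFeatureInfo$feature_set_name[i]
as.character(toyFeatureInfo$feature_type[i])
\end{verbatim}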

\subsubsection{Particularities of Tiling Arrays}

For tiling arrays, we have been using the genomic position as
\Robject{feature\_set\_name}, but it is not uncommon for a probe
sequence to match $k>1$ genomic positions. In this situation, the
\Robject{feature\_set\_name} is set to the concatenation of the $k$
genomic positions, using \Robject{``;''} as separator, and
\Robject{ambiguous\_feature} is set to \Robject{TRUE}. For example:

\begin{table}[h]
  \centering
  \begin{tabular}{|c|c|c|c|} \hline
    \Robject{sequence}       &  \Robject{position}  & \Robject{feature\_set\_name}   & \Robject{ambiguous\_feature} \\ \hline
    \Robject{AAATC...GCCAT}  &  12345               & \Robject{``12345''}            & \Robject{FALSE}              \\
    \Robject{CCACG...ATTCC}  &  34567 / 87654       & \Robject{``34567;87654''}      & \Robject{TRUE}               \\ \hline
  \end{tabular}
  \caption{Naming convention for tiling arrays}
  \label{tab:namingTiling}
\end{table}

An even more effective naming convention would be
\Robject{CHRnnPmmmmmm}, which would be more robust on designs that
involve multiple chromosomes.
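
The concatenation rule for ambiguous probes can be sketched in base R as
follows. This is only an illustration under assumed inputs: the data.frame
\Robject{matches} and the probe identifiers are hypothetical, and the code
is not taken from the package that builds the annotation.

\begin{verbatim}
# Hypothetical probe-to-position matches; probe "p2" matches two positions.
matches <- data.frame(probe = c("p1", "p2", "p2"),
                      position = c(12345L, 34567L, 87654L),
                      stringsAsFactors = FALSE)

# Concatenate the positions of each probe using ";" as separator.
positionsByProbe <- split(matches$position, matches$probe)
feature_set_name <- sapply(positionsByProbe, paste, collapse = ";")

# A probe is ambiguous when it matches more than one genomic position.
ambiguous_feature <- sapply(positionsByProbe, length) > 1

data.frame(feature_set_name, ambiguous_feature)
\end{verbatim}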

\subsubsection{Particularities of Exon Arrays}

The \Rpackage{oligo} package offers basic support for exon arrays.

\subsubsection{Particularities of SNP Arrays}

The data packages for SNP arrays are now built via the \Rpackage{pdInfoBuilder} package. The packages for the Affymetrix 100K and 500K sets are available via Bioconductor.

\subsubsection{About \Robject{order\_index}}

In the final version of the data packages, the fields described in
Table~\ref{tab:featureInfo} are ordered by \Robject{feature\_set\_name},
\Robject{feature\_type} and \Robject{target\_strand}. Note that this
breaks the link between the intensity file (which is often ordered by
X/Y location) and the annotation available in the
\Rclass{platformDesign} object.

To keep this link, we initially order the
\Rclass{platformDesign} object by X/Y location, so that it matches the
intensity files. Then we add the field \Robject{order\_index}, which
is simply the row number. Later, the \Robject{featureInfo} object is
reordered by \Robject{feature\_set\_name}, \Robject{feature\_type} and
\Robject{target\_strand}. Because \Robject{order\_index} is carried
along with each row, we can still map the intensities correctly to their
probe-level annotations.
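
The role of \Robject{order\_index} can be illustrated with a small base R
sketch. The toy annotation and intensity values are made up, and a
data.frame is used instead of an environment purely for brevity.

\begin{verbatim}
# Toy annotation, initially ordered by X/Y location like the intensity file.
annot <- data.frame(X = c(0L, 1L, 0L, 1L),
                    Y = c(0L, 0L, 1L, 1L),
                    feature_set_name = c("ps2", "ps1", "ps2", "ps1"),
                    feature_type = c("PM", "PM", "MM", "MM"),
                    target_strand = rep("sense", 4),
                    stringsAsFactors = FALSE)
intensities <- c(110, 220, 330, 440)  # same order as the rows of annot

# Record the original row number before reordering.
annot$order_index <- seq_len(nrow(annot))

# Reorder by feature_set_name, feature_type and target_strand.
annot <- annot[order(annot$feature_set_name, annot$feature_type,
                     annot$target_strand), ]

# order_index recovers the intensity that belongs to each annotation row.
annot$intensity <- intensities[annot$order_index]
annot
\end{verbatim}

After the reordering, row $i$ of the annotation no longer corresponds to
element $i$ of the intensity vector; indexing the intensities by
\Robject{order\_index} restores the correspondence.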

\section{\Rclass{DBPDInfo} Class}

The \Rclass{DBPDInfo} class is the database-backed counterpart of the \Rclass{platformDesign} class. Table~\ref{DBPDInfo} describes the class structure.

\begin{table}[h]
  \centering
  \begin{tabular}{|l|l|}
    \hline
    Slot & Type \\ \hline
    \Robject{getdb}        & \Rclass{function}   \\
    \Robject{tableInfo}    & \Rclass{data.frame} \\
    \Robject{geometry}     & \Rclass{integer}    \\
    \Robject{manufacturer} & \Rclass{character}  \\
    \Robject{genomebuild}  & \Rclass{character}  \\ \hline
  \end{tabular}
  \caption{Description of the \Rclass{DBPDInfo} class}
  \label{DBPDInfo}
\end{table}

\begin{itemize}
\item \Robject{getdb}: function that accesses the external database (we use SQLite, via RSQLite);
\item \Robject{tableInfo}: a data.frame with two columns (\Robject{tbl}
  and \Robject{row\_count}), containing the name and number
  of rows of each table available in the database;
\item \Robject{geometry}: an integer vector of length 2, containing the
  number of rows and columns of the array;
\item \Robject{manufacturer}: a string with the manufacturer's name;
\item \Robject{genomebuild}: a string with the genome build information.
\end{itemize}
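
The following sketch shows how the \Robject{getdb} and
\Robject{tableInfo} components could fit together. It assumes that the
\Rpackage{DBI} and \Rpackage{RSQLite} packages are installed; the table
name \Robject{toy\_feature}, its columns and the in-memory database are
hypothetical and do not reflect the schema created by
\Rpackage{pdInfoBuilder}.

\begin{verbatim}
library(DBI)  # assumes the DBI and RSQLite packages are installed

# In-memory SQLite database with one toy table (hypothetical schema).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "toy_feature",
             data.frame(fid = 1:4,
                        x = c(0L, 1L, 0L, 1L),
                        y = c(0L, 0L, 1L, 1L)))

# A getdb-style accessor: a function that returns the connection.
getdb <- function() conn

# Build a tableInfo-style data.frame with columns tbl and row_count.
tbl <- dbListTables(getdb())
row_count <- sapply(tbl, function(tb)
  dbGetQuery(getdb(), paste("SELECT COUNT(*) AS n FROM", tb))$n)
tableInfo <- data.frame(tbl = tbl, row_count = as.integer(row_count),
                        stringsAsFactors = FALSE)

geometry <- c(2L, 2L)  # number of rows and columns of the toy array
dbDisconnect(conn)
\end{verbatim}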

\section{\Rclass{FeatureSet} Class}

The \Rclass{FeatureSet} class is a virtual class for
feature-level data that extends the \Rclass{eSet}
class. Several concrete classes extend it:
\begin{itemize}
\item \Rclass{ExpressionFeatureSet}: for expression arrays;
\item \Rclass{SnpFeatureSet}: for SNP arrays;
\item \Rclass{ExonFeatureSet}: for exon arrays;
\item \Rclass{TilingFeatureSet}: for tiling arrays.
\end{itemize}
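
The relationship between a virtual class and its concrete subclasses can be
sketched with the S4 system from the \Rpackage{methods} package. All class
names below carry a \Robject{Toy} prefix to make clear that they are
hypothetical stand-ins: \Robject{ToyESet} replaces \Rclass{eSet} so that
\Rpackage{Biobase} is not required, and the definitions are not those used
by \Rpackage{oligo}.

\begin{verbatim}
library(methods)

# Hypothetical stand-in for eSet, so that Biobase is not required here.
setClass("ToyESet", representation(assayData = "list"))

# Virtual class: it cannot be instantiated, only extended.
setClass("ToyFeatureSet", representation("VIRTUAL"), contains = "ToyESet")

# One concrete class per array type.
setClass("ToyExpressionFeatureSet", contains = "ToyFeatureSet")
setClass("ToyTilingFeatureSet", contains = "ToyFeatureSet")

obj <- new("ToyExpressionFeatureSet")
is(obj, "ToyFeatureSet")         # the object is also a ToyFeatureSet
is(obj, "ToyESet")               # and inherits from the base class
isVirtualClass("ToyFeatureSet")  # the parent itself is virtual
\end{verbatim}

With this design, a method written once for the virtual class applies to
every array-specific subclass, while array-specific behaviour can be added
by defining methods for the concrete classes.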

\section{\Rclass{SnpQSet} Class}
Objects of the \Rclass{SnpQSet} class are created by the \Rmethod{snprma()} method. The class holds the summarized information for SNP data in four matrices:
\begin{itemize}
\item \Robject{antisenseThetaA}: summarized data at the SNP-level for the antisense strand and allele A;
\item \Robject{antisenseThetaB}: summarized data at the SNP-level for the antisense strand and allele B;
\item \Robject{senseThetaA}: summarized data at the SNP-level for the sense strand and allele A;
\item \Robject{senseThetaB}: summarized data at the SNP-level for the sense strand and allele B.
\end{itemize}

This is the expected input to the genotyping algorithm, \Rmethod{crlmm()}.

\section{\Rclass{SnpCallSet} Class}

The \Rclass{SnpCallSet} class is a container for the output of a genotyping algorithm, e.g., \Rmethod{crlmm()}. It contains two matrices, \Robject{calls} and \Robject{callsConfidence}, which hold the genotype calls and the associated measures of confidence, respectively.

\section{\Rclass{SnpCopyNumberSet} Class}

The \Rclass{SnpCopyNumberSet} class is a container for the output of copy number analysis. It contains two matrices, \Robject{copyNumber} and \Robject{copyNumberConfidence}, which hold the copy number estimates and the associated measures of confidence, respectively.

\end{document}