%\VignetteIndexEntry{An Introduction to IRanges}
%\VignetteDepends{}
%\VignetteKeywords{Ranges}
%\VignettePackage{IRanges}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\IRanges}{\Rpackage{IRanges}}

\title{An Introduction to \Rpackage{IRanges}}
\author{Michael Lawrence, Patrick Aboyoun, Herv\'e Pag\`es}
\date{\today}

\begin{document}

\maketitle

<<options,echo=FALSE>>=
options(width=60)
@

\section{Introduction}

The \IRanges{} package is designed to represent sequences, ranges representing
indices along those sequences, and data related to those ranges. In
this vignette, we will rely on simple, illustrative example datasets,
rather than large, real-world data, so that each data structure and
algorithm can be explained in an intuitive, graphical manner.  We
expect that packages that apply \IRanges{} to a particular problem domain
will provide vignettes with relevant, realistic examples.

\section{Sequences}

The \IRanges{} package is primarily interested in the
representation, manipulation and analysis of sequences. A
\textit{sequence} is mathematically defined as an ordered list of
objects, or elements. It differs from a \textit{set} in that order
matters and the same element can appear multiple times. 

\IRanges{} supports the use of R vectors to represent
sequences, and we also formally define a virtual class
\Rclass{Sequence}, the derivatives of which convey the sequence
semantic of ordered elements. There are currently two
\Rclass{Sequence} implementations in \Rpackage{IRanges}: \Rclass{Rle},
which compresses a sequence through run-length encoding, and
\Rclass{XSequence}, which refers to its data through an external
pointer. \Rclass{XSequence} and its derivatives are considered
low-level infrastructure and, as such, will not be covered by this
vignette.

We begin our demonstration by loading the \IRanges{} package.
<<initialize, results=hide>>=
library(IRanges)
@

\subsection{Run Length Encoding}

% sparse representation, and more (manipulating runs is useful)

The \Rclass{Rle} class represents a run-length encoded (compressed) sequence of
\Rclass{logical}, \Rclass{integer}, \Rclass{numeric}, \Rclass{complex}, 
\Rclass{character}, or \Rclass{raw} values.

<<Rle-construction>>=
set.seed(0)
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length = 500), seq(10,0.001, length = 500))
x <- Rle(rpois(1e7, lambda))
y <- Rle(rpois(1e7, lambda[c(251:length(lambda), 1:250)]))
x
y
@

<<Rle-accessors>>=
head(runValue(x))
head(runLength(x))
@

<<Rle-ops>>=
x > 0
x + y
x > 0 | y > 0
@


<<Rle-summary>>=
range(x)
sum(x > 0 | y > 0)
@

<<Rle-math>>=
log1p(x)
@


<<Rle-cor>>=
cor(x, y)
shiftApply(249:251, y, x, FUN = function(x, y) var(x, y) / (sd(x) * sd(y)))
@

\subsection{Sequence Extraction}

It is often necessary to extract a sequence from another, in
analogous manner to ordinary vector extraction or
subsetting. A simple operation is to select a list
of consecutive elements from a sequence. This is the purpose of the
\Rfunction{subseq} function.

%% EXAMPLE: subseq()

Mathematically, a subsequence is slightly more general: selected
elements need only to be in the same order, not consecutive. We can
generalize this further to sequence extraction, where the order of the
elements is no longer fixed. As the order constraint is rarely
broken, we will use the term \textit{subsequence} to represent the
result of sequence extraction. The most general way to describe such a
subsequence would be a vector of indices into the sequence. As it is
common to extract consecutive values from the sequence, the indices
are usually more efficiently encoded as a list of ranges, i.e. a
vector of start positions and a parallel vector of widths.

The general interface for extracting subsequences is
\Rfunction{seqextract}, which is supported by all \Rclass{Sequence}
objects, as well as ordinary vectors.

%% EXAMPLE: seqextract(), [

\section{Lists}

\subsection{Basic operations}

One often wants to organize and manipulate multiple sequences
simultaneously. We could place multiple \Rclass{Rle} instances, for
example, into a list. However, a list is too generic; it does not
confer any information about the specific class of its elements. There
is no type safety, and it is not possible to define methods
specifically for homogeneous lists with elements of a particular
class. For example, for a list of \Rclass{Rle} objects, we may wish to
define a method that retrieves the run values for each element,
without special type checking. To enable this, we define a specific
collection class, \Rclass{RleList}, for storing \Rclass{Rle} objects.

%% EXAMPLE (construct RleList)

In fact, we have done the same for many of the other classes in
\Rpackage{IRanges}, as well as the base atomic vectors (raw, logical,
integer, numeric, complex and character). All of the collection
classes derive from the virtual class \Rclass{ListLike}, the
derivatives of which are all obligated to support two basic list
operations: element extraction and length retrieval.

%% EXAMPLE (element extraction, length)

Most collection types, including \Rclass{RleList}, derive from
\Rclass{TypedList}, a \Rclass{ListLike} derivative that implements
the requisite operations, as well as a number of additional
features. These include familiar list functions, such as
\Rfunction{c} and \Rfunction{lapply}.

%% EXAMPLE (c, lapply)

\subsection{Annotated Lists} 

Often when one has a collection of objects, there is a need to attach
metadata that describes the collection in some way. The
\Rclass{AnnotatedList} class extends \Rclass{TypedList} to add two
metadata components: an ordinary \Rclass{list} to hold arbitary
objects, called \Robject{metadata}, and \Robject{elementMetadata}, a
data frame with one row per element and any number of columns. Many of
the high-level list objects in \IRanges{} (described later) are
\Rclass{AnnotatedList} objects.

\subsection{Advanced Notes}

%% Perhaps this belongs in a separate advanced vignette or the paper?

\Rclass{TypedList} also provides some more advanced features. First,
as its name suggests, subclasses of \Rclass{TypedList} can specify a
type from which the class of each of its elements must inherit. For
example, \Rclass{RleList} directs \Rclass{TypedList} to ensure that
all of its elements are \Rclass{Rle} objects. One benefit of this type
safety is that it allows methods dispatching on \Rclass{RleList} to
assume that all of the elements are really of class \Rclass{Rle}.

A second advanced feature of \Rclass{TypedList} relates to its
internal representation. Behind the scenes, \Rclass{TypedList} is
simply an R \Rclass{list}. By default, there is a one-to-one mapping
between elements of the \Rclass{TypedList} and elements in the
\Rclass{list}. This is a simple design; however, it is not always
ideal. If one has many, small objects, the storage overhead,
especially for S4 objects, would be relatively high. Our solution is
to \textit{compress} the list by concatenating the elements together,
if possible. This forms a single, long element that is virtually split
by the \Rclass{TypedList} interface. A beneficial side effect of this
approach is that unlisting (concatenating all of the elements)
is cheap, as it reduces to returning the internal representation.

\section{Sequence Ranges}

When analyzing sequences, we are often interested in specific
segments, or consecutive subsequences, of the sequence. It is not
uncommon for an analysis task to focus only on the segments
themselves, while ignoring the underlying sequence values. A
consecutive list of indices would be a simple way to select a
consecutive subsequence. However, a sparser representation would be a
a range, a pairing of a start position and a width, as used when
extracting sequences with \Rfunction{subseq} and
\Rfunction{seqextract}, above.

When analyzing subsequences in \IRanges{}, each range is treated as an
observation. The virtual \Rclass{Ranges} class represents lists of
ranges, or, equivalently, sequences of consecutive integers. The most
commonly used implementation of \Rclass{Ranges} is \Rclass{IRanges},
which stores the starts and widths as ordinary integer vectors. To
construct an \Rclass{IRanges} instance, we call the
\Rfunction{IRanges} constructor. Ranges are normally specified by
passing two out of the three parameters: start, end and width (see
\Sexpr{help(IRanges)} for more information).

%% Example: IRanges() with different SEW combinations

Accessing the starts, widths and ends is supported by every
\Rclass{Ranges} implementation.

%% Example: start(), width(), end()

For \Rclass{IRanges} and some other \Rclass{Ranges} derivatives,
subsetting is also supported.

%% Example: Ranges subsetting

\subsection{Normality}

\Rclass{NormalIRanges}

\subsection{Lists of Ranges}

\Rclass{RangesList}
\Rclass{IRangesList}
\Rclass{MaskCollection} % make distinction with RangesList

\subsection{Sequence Extraction}

As \Rclass{Ranges} objects encode subsequences, they may be used
directly in sequence extraction. Note that when a \textit{normal}
\Rclass{Ranges} is given as the index, the result is a true
subsequence, in the mathematical sense.

%% EXAMPLE: subseq() and [ with Ranges

\subsection{Transforming Ranges}

% most utilities fall in here

\paragraph{Making ranges normal}

\Rfunction{reduce}

\paragraph{Making ranges disjoint}

\Rfunction{disjoin}, \Rfunction{disjointBins}

\paragraph{Adjusting starts, ends and widths}

\Rfunction{shift}
\Rfunction{*} % zooming
\Rfunction{narrow} % like 'pintersect'
\Rfunction{threebands} % like 'pdisjoin', if it existed

\paragraph{Other transformations}

\Rfunction{restrict} % like 'intersect'
\Rfunction{reflect}
\Rfunction{flank}

\subsection{Set Operations}

\Rfunction{gaps}, \Rfunction{pgaps}
\Rfunction{setdiff}, \Rfunction{psetdiff}
\Rfunction{union}, \Rfunction{punion}
\Rfunction{intersect}, \Rfunction{pintersect}

\subsection{Finding Overlapping Ranges}

\Rclass{RangesMatching}
\Rclass{RangesMatchingList}
\Rclass{IntervalTree}
% need an IntervalTreeList?

\Rfunction{overlap}, \Rfunction{\%in\%}

\subsection{Finding Neighboring Ranges}

\Rfunction{nearest}, \Rfunction{precede}, \Rfunction{follow}

\subsection{Mapping Ranges Between Sequences}

\Rclass{RangesMapping}
\Rfunction{map}

\section{Sequence Views}

When we extract a sequence with \Rfunction{seqextract}, we can pass
multiple ranges, each selecting a single consecutive
subsequence. Those subsequences are extracted and concatenated into a
single sequence. There are many cases where the user wishes to avoid
the concatenation step and instead treat each consecutive subsequence
as a separate element in a list.  

While one could simply store each extracted sequence as an element in
a list object like a \Rclass{TypedList}, this is undesirable for a
couple of reasons. First, the user often wants to preserve the
original sequence and declare a set of interesting regions as an
overlay. This allows retrieving sequence values even after the ranges
have been adjusted. Another benefit of an overlay approach is
performance: the sequence values need not be copied.

For representing such an overlay, the \IRanges{} package provides the
virtual \Rclass{Views} class, which derives from \Rclass{Ranges} but
also stores a sequence. Each range is said to represent a
\textit{view} onto the sequence.

Here, we will demonstrate the \Rclass{RleViews} class, where the
sequence is of class \Rclass{Rle}. Other \Rclass{Views}
implementations exist, such as \Rclass{XStringViews} in the
\Rpackage{Biostrings} package.

\subsection{Manipulating Views}

\subsection{Aggregating Views}


\section{Data Sets}

\Rclass{XDataFrame}
\Rclass{XDataFrameList}
\Rclass{SplitXDataFrameList}

\section{Sequence Ranges with Data Sets}

\Rclass{RangedData}
\Rclass{RangedDataList}

\subsection{Applying Over Spaces}

\Rclass{RDApplyParams}
\Rclass{FilterRules}

%% Classes Not Covered (16)
%%\Rclass{ANYTHING}
%%\Rclass{characterORNULL}
%%\Rclass{expressionORfunction}
%%\Rclass{expressionORlanguage}
%%\Rclass{functionORNULL}
%%\Rclass{IntegerPtr}
%%\Rclass{NumericPtr}
%%\Rclass{RangesORmissing}
%%\Rclass{RawPtr}
%%\Rclass{SequencePtr}
%%\Rclass{vectorORfactor}
%%\Rclass{XDataFrameORNULL}
%%\Rclass{XInteger}
%%\Rclass{XIntegerORNULL}
%%\Rclass{XIntegerViews}
%%\Rclass{XNumeric}

\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}

\end{document}
