

%
%\VignetteIndexEntry{ontoTools archived data}
%\VignetteDepends{GO.db, org.Hs.eg.db, ontoTools}
%\VignetteKeywords{ontology, semantics}
%\VignettePackage{ontoTools}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Archiving data for use with \Rpackage{ontoTools}}
\author{VJ Carey {\tt <stvjc@channing.harvard.edu>}}
\maketitle
\tableofcontents

\section{Introduction}

Effective use of \Rpackage{ontoTools} depends upon a number
of potentially laborious computations involving large-scale
metadata resources.  Following Bioconductor's conventions, and using
the \Rpackage{org.Hs.eg.db} and \Rpackage{GO.db} packages,
one can create `object-ontology complexes' (OOCs), also called
annotated corpora, for use in a variety of investigations.

In general, the \Rpackage{ontoTools} maintainer will attempt
to supply the most important complexes with the package,
to secure uniformity of results across diverse applications.
However, users may need to create their own OOCs or derived
concept probability scores.  This vignette indicates how
this can be done.

As noted, the computations are laborious in realistic
applications.  To spare the reader intensive computation,
the slowest steps are guarded by a flag and skipped by default.
To run this vignette in full, set the variable {\tt DontRun}
to {\tt FALSE} in the next chunk.

<<setup>>=
DontRun <- TRUE
@
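
A related way to avoid repeating a slow step is to archive its result
on first use and read the archive back afterwards.  The chunk below is
a minimal sketch of that idea using base R only.  It is a different
technique from the {\tt DontRun} flag, and the helper
\Rfunction{cachedValue} and all {\tt toy} names are hypothetical;
none of them is part of \Rpackage{ontoTools}.

<<toyCache>>=
# Hypothetical helper (not part of ontoTools): call 'compute' only
# when no archive exists at 'file'; otherwise read the archive back.
cachedValue <- function(file, compute) {
  if (file.exists(file)) {
    return(readRDS(file))
  }
  value <- compute()
  saveRDS(value, file)
  value
}

# Count how often the (pretend) slow computation really runs.
toyCount <- new.env()
toyCount$n <- 0
toySlow <- function() {
  toyCount$n <- toyCount$n + 1
  sum(1:10)
}

toyFile <- tempfile(fileext = ".rds")
toyFirst <- cachedValue(toyFile, toySlow)   # computes and archives
toySecond <- cachedValue(toyFile, toySlow)  # reads the archive
print(c(toyFirst, toySecond, toyCount$n))
@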

\section{Nomenclature details}

At this point, a standard nomenclature is lacking, but the
following conventions may be of some use.  The data subdirectory
of \Rpackage{ontoTools} is guaranteed to hold the following objects,
where \verb+[x.y]+ stands for a Bioconductor release tag.

\begin{verbatim}
goMFgraph.[x.y].rda    -- graph::graphNEL representing goMF DAG
goMFamat.[x.y].rda     -- namedSparse representing accessibility matrix 
                          of goMFgraph
LL2GOMFooMap.[x.y].rda -- namedSparse representing map from LocusLink 
                          to GO MF, using org.Hs.eg.db
LL2GOMFcp.[x.y].rda    -- vector of concept probabilities for LL2GOMF
\end{verbatim}
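
As an illustration of the naming convention, the following sketch
builds archive file names from an object stem and a release tag.
The helper \Rfunction{archiveName} is hypothetical and is not part of
\Rpackage{ontoTools}; the release tag {\tt 1.18} is the one used
in the remainder of this vignette.

<<toyNames>>=
# Hypothetical helper (not part of ontoTools):
# build "<stem>.<x.y>.rda" from a stem and a release tag.
archiveName <- function(stem, release) {
  stopifnot(is.character(stem), is.character(release))
  paste(stem, ".", release, ".rda", sep = "")
}

toyStems <- c("goMFgraph", "goMFamat", "LL2GOMFooMap", "LL2GOMFcp")
toyFiles <- archiveName(toyStems, "1.18")
print(toyFiles)
@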


\section{Illustration: working with LocusLink and GO molecular function}

To begin, we attach the current \Rpackage{org.Hs.eg.db}
package and extract all the LocusLink (now Entrez Gene) tags.

<<getLLtags>>=
library(org.Hs.eg.db)
lltags <- ls(org.Hs.egGO)
@
We see that there are \Sexpr{length(lltags)} tags for human loci:
<<countTags>>=
print(length(lltags))
@

We now prepare to collect the GO annotations for these tags.
Annotations are saved in a list, \Robject{kvmap}, indexed
by locus, and later unlisted.  The restricted
(molecular function) mapping is also stored in
an environment, \Robject{hllgoEnv}.

<<getAnnot>>=
kvmap <- list()
hllgoEnv <- new.env(hash=TRUE)
library(GO.db)
@

We want to confine attention to molecular function (MF)
terms.
<<getGOMF>>=
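GOtags <- ls(GOTERM)              # all GO term identifiers
library(Biobase)
library(annotate)
GOlabs <- mget(GOtags, GOTERM)    # one term record per identifier
# keep only the tags whose ontology is molecular function
GOMFtags <- GOtags[sapply(GOlabs, Ontology) == "MF"]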
#GOMFterms <- unlist(mget(GOMFtags,env=GOTERM))
#ntags <- length(GOMFtags)
#if (any(duplicated(GOMFterms)))
# {
# dups <- (1:ntags)[duplicated(GOMFterms)]
# GOMFterms[dups] <- paste(GOMFterms[dups],".2",sep="")
# }
#names(GOMFterms) <- GOMFtags

@

Now we iterate over loci, checking for presence of tag annotations
in the MF ontology before saving to the \Robject{kvmap} list.

<<iterateLoci>>=
if (!DontRun) {
  cat(length(lltags), "loci\n")
  for (i in seq_along(lltags)) {
    if (i %% 200 == 0) cat(i, "")   # progress indicator
    tmp <- get(lltags[i], org.Hs.egGO)
    tmp <- gsub("@.*$", "", tmp)    # drop any suffix introduced by "@"
    tmp <- tmp[tmp %in% GOMFtags]   # keep molecular function tags only
    if (length(tmp) > 0) {
      kvmap[[lltags[i]]] <- tmp
      assign(lltags[i], tmp, envir = hllgoEnv)
    }
  }
}
@
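
The filtering step can be tried on a small scale without any
annotation package.  The chunk below is a sketch on made-up data:
the locus names, the term tags and the {\tt @} suffixes are
invented to mimic the pattern that the loop above strips, and all
{\tt toy} names are hypothetical.

<<toyKeyValue>>=
# Made-up annotations: three loci, tags carrying an "@" suffix.
toyAnnot <- list(L1 = c("T:a@IEA", "T:b@TAS"),
                 L2 = c("T:x@IEA"),
                 L3 = c("T:b@IDA", "T:c@IEA"))
toyMFtags <- c("T:a", "T:b", "T:c")   # stand-in for GOMFtags

toyKv <- list()
toyKvEnv <- new.env(hash = TRUE)
for (loc in names(toyAnnot)) {
  tags <- gsub("@.*$", "", toyAnnot[[loc]])  # strip the suffix
  tags <- tags[tags %in% toyMFtags]          # keep admissible tags
  if (length(tags) > 0) {
    toyKv[[loc]] <- tags
    assign(loc, tags, envir = toyKvEnv)
  }
}
# L2 has no admissible tag, so it does not appear in the result.
print(names(toyKv))
print(sort(ls(toyKvEnv)))
@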

The resulting map has one element for each locus that carries at least
one MF annotation, so at most \Sexpr{length(lltags)} elements.  These
define the rows of the OOC map matrix (mapping objects to terms).

We now get the unique GO target tags.  These define the columns
of the OOC map.

<<getGOtargets>>=
if (!(DontRun)) {
 print(length(kvmap))
 gotargs <- sort(unique(unlist(kvmap)))
 llused <- names(kvmap)
 print(length(gotargs))
}

@

Now we use the \Rpackage{ontoTools} utility that creates
a named sparse matrix from an object-term key-value environment.
The \Rfunction{otkvEnv2namedSparse} function is quite
slow for the roughly 10000 by 2000 map that arises with late 2003
LocusLink and GO.

<<makeNamedSparse>>=
library(ontoTools)
if (!DontRun) {
  LL2GOMFooMap.1.18 <- otkvEnv2namedSparse( llused, gotargs, hllgoEnv )
  # file name follows the [object].[x.y].rda convention
  save(LL2GOMFooMap.1.18, file="LL2GOMFooMap.1.18.rda", compress=TRUE)
  save.image()
} else {
  data(LL2GOMFooMap.1.18)
}
@
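
To make the shape of the object-to-term map concrete, the chunk
below builds a small dense 0/1 matrix from a key-value environment.
It is only a stand-in on made-up data, written in base R: it does not
use or reproduce the \Robject{namedSparse} representation, and
all {\tt toy} names are hypothetical.

<<toyMap>>=
# Made-up key-value environment: loci as keys, term tags as values.
toyEnv2 <- new.env(hash = TRUE)
assign("L1", c("T:a", "T:b"), envir = toyEnv2)
assign("L3", c("T:b", "T:c"), envir = toyEnv2)

toyRows <- sort(ls(toyEnv2))                       # objects
toyCols <- sort(unique(unlist(mget(toyRows, envir = toyEnv2))))  # terms

# Dense 0/1 matrix: entry [object, term] is 1 when the object
# is annotated to the term.
toyMap <- matrix(0, nrow = length(toyRows), ncol = length(toyCols),
                 dimnames = list(toyRows, toyCols))
for (r in toyRows) {
  toyMap[r, get(r, envir = toyEnv2)] <- 1
}
print(toyMap)
@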

\section{The GO ontology}

In \Rpackage{ontoTools}, an ontology is a lightly
annotated DAG.  The package includes a function, \Rfunction{buildGOgraph},
which works by default on the \Robject{GOMFPARENTS} map of \Rpackage{GO.db}.
<<gogr>>=
if (!DontRun) {goMFgraph.1.18 <- buildGOgraph()} else data(goMFgraph.1.18)
save.image()
@
This is then installed in an \Robject{ontoTools::ontology}
object via the following steps:

<<rDAG>>=
if (!DontRun) {
  go1.18DAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.18)
  GOMF1.18 <- new("ontology", name="GOMF",  version="bioc 2.0", rDAG=go1.18DAG)
}
if (!DontRun) {goMFamat.1.18 <- accessMat(GOMF1.18)} else {data(goMFamat.1.18)}
save.image()
@
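
The idea of an accessibility matrix can be illustrated on a small
made-up DAG.  The chunk below is a sketch in base R that records
direct child-to-parent edges and then takes the transitive closure.
It is not the \Rfunction{accessMat} implementation; the terms, the
dense logical representation and the orientation (rows are terms,
columns are their ancestors) are assumptions made for the example.

<<toyAccess>>=
# Made-up child -> parents map; "root" has no parents.
toyParents <- list(root = character(0),
                   a = "root",
                   b = "root",
                   c = c("a", "b"),
                   d = "c")
toyNodes <- names(toyParents)
toyN <- length(toyNodes)

# Direct edges: entry [child, parent] is TRUE.
toyAdj <- matrix(FALSE, toyN, toyN, dimnames = list(toyNodes, toyNodes))
for (ch in toyNodes) {
  p <- toyParents[[ch]]
  if (length(p) > 0) toyAdj[ch, p] <- TRUE
}

# Transitive closure (Warshall's algorithm):
# toyAcc[x, y] is TRUE when y is reachable from x via parent links.
toyAcc <- toyAdj
for (k in toyNodes) {
  toyAcc <- toyAcc | outer(toyAcc[, k], toyAcc[k, ], "&")
}
print(toyAcc)
@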
Finally, we make the formal OOC instance:
<<makeOOC>>=
if (!DontRun) LL2GOMFooc1.18 <- new("OOC", ontology=GOMF1.18, OOmap=LL2GOMFooMap.1.18)
save.image()

@
\section{Concept probabilities in LL2GOMF}

Concept probabilities are computed on the basis of an OOC.
At present this is extremely slow; the calculation should
be refined so that only terms actually used in the OOC
are tested.

<<conceptProbs>>=
if (!DontRun) LL2GOMFcp.1.18 <- conceptProbs( ooc=LL2GOMFooc1.18, acc=goMFamat.1.18 )
save.image()

@
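
For orientation, the chunk below sketches one common way to define
a concept probability: the fraction of objects annotated to a term
or to any of its descendants.  This definition is an assumption made
for the example and may differ from what \Rfunction{conceptProbs}
computes; the data are made up and all {\tt toy} and {\tt cp} names
are hypothetical.

<<toyConceptProbs>>=
# Made-up object-to-term map (rows: objects, columns: terms).
cpTerms <- c("root", "a", "b", "c")
toyOO <- matrix(0, 3, 4, dimnames = list(c("o1", "o2", "o3"), cpTerms))
toyOO["o1", "a"] <- 1
toyOO["o2", "c"] <- 1
toyOO["o3", "b"] <- 1

# Made-up ancestor matrix: entry [x, y] is 1 when y is a proper
# ancestor of x.  Here a, b, c descend from root, and c also from a.
toyAnc <- matrix(0, 4, 4, dimnames = list(cpTerms, cpTerms))
toyAnc[c("a", "b", "c"), "root"] <- 1
toyAnc["c", "a"] <- 1

# An object annotated to x also counts for every ancestor of x.
toyFull <- (toyOO + toyOO %*% toyAnc) > 0

# Fraction of objects counted at each term.
toyCP <- colMeans(toyFull)
print(toyCP)
@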
\end{document}

