
%
%\VignetteIndexEntry{Rredland}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{RDF processing for Bioconductor: \Rpackage{Rredland}}
\author{\copyright 2005 VJ Carey \texttt{<stvjc@channing.harvard.edu>}}

\maketitle

\tableofcontents

\section{Introduction}

The Resource Description Framework (RDF) is a graph-based model for representing information.
RDF statements are ordered triples of the form (subject, predicate, object).
Subjects and objects are viewed as nodes in a directed graph, and predicates
are viewed as arcs in the graph.  RDF is a key component of current
developments towards a semantic web, with considerable work completed on web resource
metadata representation and exchange using RDF.
A richer metadata model is provided by OWL (Web Ontology Language), but
most OWL models are serialized as RDF/XML.  Thus, as we will illustrate,
various public OWL resources can be processed by this package.
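
As a minimal sketch of the triple model, independent of Redland, a handful
of statements can be held in an ordinary data frame with one row per statement.
The node names and predicates below are invented for illustration and are not
taken from any real resource.

<<sketchTriples>>=
# Toy triples: each row is one (subject, predicate, object) statement.
toyTriples <- data.frame(
  subject   = c("ex:geneA", "ex:geneA", "ex:pathwayP"),
  predicate = c("ex:participatesIn", "ex:hasName", "ex:hasName"),
  object    = c("ex:pathwayP", "\"gene A\"", "\"pathway P\""),
  stringsAsFactors = FALSE)
# Nodes of the directed graph are the distinct subjects and objects;
# each row contributes one arc, labelled by its predicate.
toyNodes <- unique(c(toyTriples$subject, toyTriples$object))
length(toyNodes)
nrow(toyTriples)
@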

Redland is the name of an open source software project downloadable
from \url{librdf.org}.
Redland is a C language library with bindings provided to a variety
of other languages.  Redland is highly modular, and allows developers
to drop in components to substitute for base functionalities.  Because
metadata resources can be very voluminous, such flexibility is important.
A solution to the problem of persistent storage of indexed metadata
is provided through the use of BerkeleyDB serializations of Redland models.

\Rpackage{Rredland} is an R package that provides interfaces to facilities
of Redland.  Configuration support is currently limited.  You will be
able to use \Rpackage{Rredland} if you have a stock installation of librdf and BerkeleyDB.
If these resources are in nonstandard locations, you can edit the
Makevars variables in \texttt{src} to reflect your configuration.  You
may also need to set \verb+LD_LIBRARY_PATH+ so that the shared libraries
are found at load time.

\section{Illustration}

\subsection{Simple manipulations with a fragment of GO}

Eric Jain of ISB-CH has provided an RDF serialization of the UniProt
database and associated annotation resources, including an RDF serialization
of GO.  A fragment of this serialization is distributed with the
\Rpackage{Rredland} package.

<<setup>>=
library(Rredland)
gofrag <- system.file("RDF/gopart.rdf", package="Rredland")
@

Here we dump the first 10 lines of this document as text:
<<dump>>=
readLines(gofrag,n=10)
@

This could be processed as an XML document, but let's
use Redland's modeling facilities.  First we need to
set up a URI object for the model source document.

<<seturi>>=
gouri <- makeRedlURI( paste("file:", gofrag, sep="") )
@

Now we read from this document.  The call below relies on the
default arguments of \Rfunction{readRDF}, which is adequate for a document of this size.
<<dored1>>=
gof <- readRDF( gouri )
gof
@
We are handed back an S4 object of class \Rclass{redlModel}.
<<lkcls>>=
getClass("redlModel")
@
We need to use the \Rfunction{model} accessor to get to the
model reference.

We can easily compute the number of statements (which is also
reported by the \Rmethod{show} method):
<<lksiz>>=
#getRedlModelSize(model(gof))
size(gof)
@

We can also transform to a data frame:
<<lkdf>>=
godf <- as(gof, "data.frame")
godf[1:4,]
@

We see that long text strings can cause a problem for rendering.
<<lkobj>>=
as.character(godf[1:4,3])
@

The data frame representation is useful for splitting up the
statement set.
<<bypred>>=
bypred <- split(godf, as.character(godf$predicate))
names(bypred)
sapply(bypred, nrow)
@

The \texttt{subClassOf} predicate helps determine the DAG structure:
<<lktree>>=
bypred$"http://www.w3.org/2000/01/rdf-schema#subClassOf"[,-2]
@
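
To make the DAG remark concrete, here is a minimal sketch in base R that walks
\texttt{subClassOf} statements upward from one term to collect its ancestors.
The data frame and its term identifiers are made up for illustration; with real
data one would pass the corresponding statements, assuming the same
subject, predicate, object column order.  The helper \texttt{ancestorsOf} is
hypothetical and not part of \Rpackage{Rredland}.

<<sketchDag>>=
subClassOfURI <- "http://www.w3.org/2000/01/rdf-schema#subClassOf"
# Hypothetical statements: child (subject) is a subclass of parent (object).
toyGo <- data.frame(
  subject   = c("ex:T3", "ex:T3", "ex:T2", "ex:T4"),
  predicate = subClassOfURI,
  object    = c("ex:T2", "ex:T4", "ex:T1", "ex:T1"),
  stringsAsFactors = FALSE)
ancestorsOf <- function(term, df, pred = subClassOfURI) {
  edges <- df[as.character(df[, 2]) == pred, ]
  child <- as.character(edges[, 1])
  parent <- as.character(edges[, 3])
  found <- character(0)
  frontier <- term
  while (length(frontier) > 0) {
    # parents of the current frontier that have not been visited yet
    up <- setdiff(unique(parent[child %in% frontier]), found)
    found <- c(found, up)
    frontier <- up
  }
  found
}
sort(ancestorsOf("ex:T3", toyGo))
@

Keeping track of the terms already found means the walk stops even if the
statements accidentally contain a cycle.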


\subsection{BioPAX Level 1}

The BioPAX pathway ontologies are distributed with the package as OWL files.
<<dobp1>>=
bp1 <- makeRedlURI(paste("file:",system.file("RDF/biopax-level1.owl", 
   package="Rredland"),sep=""))
bp1m <- readRDF( bp1 )
size(bp1m)
@
This is a manageable object, so we convert it to a data frame:
<<lkbp1>>=
bp1df <- as(bp1m, "data.frame")
sapply(bp1df[1:5,], substring, 1, 70)
@

The namespace qualifications make the strings difficult to render.
A simple remedy is to eliminate any XSD postfix information with
\Rfunction{cleanXSDT} and then delete everything up to and including the pound sign.
<<defstp>>=
strip2pound <- function(x) gsub(".*#","",cleanXSDT(as.character(x)))
sapply(bp1df[1:5,], strip2pound)
@
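
The order of the two steps matters, because an XSD datatype URI itself
contains a pound sign.  The sketch below shows this in base R on invented strings.
Here \texttt{dropXSDsketch} is a hypothetical stand-in written for this
illustration; it assumes that typed literals end in a datatype marker of the
form shown in the code, and it is not the implementation of \Rfunction{cleanXSDT}.

<<sketchStrip>>=
# Hypothetical stand-in for cleanXSDT: drop a trailing ^^<datatype-URI> marker.
dropXSDsketch <- function(x) sub("\\^\\^<[^>]*>$", "", x)
stripSketch <- function(x) gsub(".*#", "", dropXSDsketch(as.character(x)))
resource <- "http://www.biopax.org/release/biopax-level1.owl#pathway"
literal <- "\"a comment\"^^<http://www.w3.org/2001/XMLSchema#string>"
# Stripping to the pound sign without cleaning leaves only datatype debris.
gsub(".*#", "", literal)
# Cleaning first preserves the literal and still shortens the resource URI.
stripSketch(c(resource, literal))
@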

Working with a data frame, it is easy to filter statements of interest.
Suppose we wish to determine all the instances of \texttt{owl\#Class} in the model.
<<getcl>>=
isTypeOwlClass <- grep("owl#Class", as.character(bp1df[,3]))
strip2pound( bp1df[isTypeOwlClass,1] )
@

We see a number of decipherable terms, and some tokens of the form (rnnn...).
The latter are called blank nodes.  These are created to define classes that
have no names, but that are implicitly defined in the model.  For example,
a class that is the union of entity and physicalEntity is a blank node in this
model.
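
A filter of this kind can also be written with exact comparisons on both the
predicate and the object, which avoids accidental matches on unrelated strings
that happen to contain \texttt{owl\#Class}.  The sketch below uses an invented
data frame; the blank node identifier is made up to mimic the parenthesized
form described above, and the rule that blank nodes start with an opening
parenthesis is an assumption of this sketch.

<<sketchClasses>>=
rdfType <- "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
owlClass <- "http://www.w3.org/2002/07/owl#Class"
bpNs <- "http://www.biopax.org/release/biopax-level1.owl#"
toyBp <- data.frame(
  subject = c(paste(bpNs, c("entity", "pathway"), sep = ""), "(r1001)",
              paste(bpNs, "NAME", sep = "")),
  predicate = rdfType,
  object = c(owlClass, owlClass, owlClass,
             "http://www.w3.org/2002/07/owl#DatatypeProperty"),
  stringsAsFactors = FALSE)
# A class declaration has predicate rdf:type and object owl:Class.
isClassDecl <- toyBp$predicate == rdfType & toyBp$object == owlClass
classSubjects <- toyBp$subject[isClassDecl]
# Assumption: blank nodes are rendered with a leading "(".
isBlank <- substring(classSubjects, 1, 1) == "("
gsub(".*#", "", classSubjects[!isBlank])
sum(isBlank)
@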

To get the detailed commentary on a class definition, the following
function can be used:
<<echo=FALSE>>=
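# Break a long string into lines of nword words each.
chopLong <- function(x, nword=12) {
 tvec <- strsplit(x, " ")[[1]]
 ltvec <- length(tvec)
 if (ltvec %% nword != 0) {
    # pad with blanks up to the next multiple of nword so the matrix fills exactly
    pad <- rep(" ", ceiling(ltvec/nword)*nword)
    pad[1:ltvec] <- tvec
 }
 else pad <- tvec
 # one column per output line; the appended row terminates each line
 ss <- matrix(pad, nrow=nword)
 ss <- rbind(ss, "\n")
 paste(ss, collapse=" ")
}
@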
<<getcldef>>=
getClassComment <- function(term, df,
    nsPref="http://www.biopax.org/release/biopax-level1.owl#",
    commPred="http://www.w3.org/2000/01/rdf-schema#comment",
    doChop=TRUE, nword=12) {
  # rows whose subject is the namespaced term and whose predicate is rdfs:comment
  ind <- which(as.character(df[,1]) == paste(nsPref, term, sep="") &
               as.character(df[,2]) == commPred)
  # read from the df argument rather than the global bp1df
  txt <- cleanXSDT(as.character(df[ind,3]))
  if (doChop) chopLong(txt, nword=nword) else txt
}
cat(getClassComment("chemicalStructure", bp1df ))
cat(getClassComment("biochemicalReaction", bp1df ))
@
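
Base R's \Rfunction{strwrap} offers another way to break long comments into
lines.  It wraps at a target character width rather than after a fixed number
of words, so it is a different technique from the helper used above; it is
shown here only as a sketch on an invented string.

<<sketchWrap>>=
longText <- paste(rep("a fairly long class comment", 6), collapse = " ")
# strwrap breaks at whitespace so that no line exceeds 'width'.
wrapped <- strwrap(longText, width = 60)
cat(wrapped, sep = "\n")
max(nchar(wrapped))
@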

\subsection{BioPAX Level 2}

Here we check the classes available in BioPAX level 2.
<<lk2>>=
bp2 <- makeRedlURI(paste("file:",system.file("RDF/biopax-level2.owl", 
	package="Rredland"),sep=""))
bp2m <- readRDF( bp2 )
size(bp2m)
bp2df <- as(bp2m, "data.frame")
isTypeOwlClass <- grep("owl#Class", as.character(bp2df[,3]))
strip2pound( bp2df[isTypeOwlClass,1] )
@

%\subsection{HumanCyc}
%
%The BioCyc project (\url{www.biocyc.org}) is a collection of
%pathway/genome databases in a variety of structures.  The data resources
%are available to academic researchers, and a registration/download process
%must be completed for access.  We illustrate use of \Rpackage{Rredland}
%to work with the BioPAX encoding of HumanCyc.  This is 19MB of RDF
%and an in-core storage model is not likely to be satisfactory.
%We will use the default BerkeleyDB storage approach.

<<getHum,eval=FALSE,echo=FALSE>>=
humu <- makeRedlURI(paste("file:","humancyc.owl",sep=""))
humm <- readRDF( humu, storageType="bdb", storageName="hucyc")
@

%Note that the vignette cannot assume that you have this OWL file.
%After the above commands, we have
%
%\begin{verbatim}
%-rw-r--r--   1 stvjc  stvjc  59723776 Jul 28 13:09 test-sp2o.db
%-rw-r--r--   1 stvjc  stvjc  39538688 Jul 28 13:07 test-po2s.db
%-rw-r--r--   1 stvjc  stvjc  57499648 Jul 28 13:07 test-so2p.db
%\end{verbatim}
%These are the BerkeleyDB hashes representing aspects of the graph.
%
%It is not too difficult to transform into a data frame.

<<gethdf,eval=FALSE,echo=FALSE>>=
hudf <- as(humm, "data.frame")
husubs <- as.character(hudf[,1])
hupreds <- as.character(hudf[,2])
huobs <- as.character(hudf[,3])
table(hupreds)
@

%To find the named pathways,
<<getnpw,eval=FALSE,echo=FALSE>>=
isPw <- grep("pathway", husubs)
isNa <- grep("NAME", hupreds)
isnp <- intersect(isPw, isNa)
cleanXSDT(huobs[isnp][1:10])
@

%So we see in the predicate set what kinds of relationships are
%described, and we get a glimpse of the pathway names addressed
%in this resource.
%
%Note that there is no need to parse the data once the Berkeley
%DB hashes are made available.  The BDBSexists option on readRedlModel
%can be used to revive a model-hash association.


\subsection{HumanCyc}

The BioCyc project (\url{www.biocyc.org}) is a collection of
pathway/genome databases in a variety of structures.  The data resources
are available to academic researchers, and a registration/download process
must be completed for access.  We illustrate use of \Rpackage{Rredland}
to work with the BioPAX encoding of HumanCyc.  This is 19MB of RDF
and an in-core storage model is not likely to be satisfactory.
We will use the BerkeleyDB storage approach, selected with \verb+storageType = "bdb"+.

\begin{Schunk}
\begin{Sinput}
> humu <- makeRedlURI(paste("file:", "humancyc.owl", sep = ""))
> humm <- readRDF(humu, storageType = "bdb", storageName = "hucyc")
\end{Sinput}
\end{Schunk}

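Because the vignette cannot assume that you have this OWL file, the commands
and output in this section are shown as recorded from an earlier session rather
than evaluated when the vignette is built.
After the above commands, we have the following files: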

\begin{verbatim}
-rw-r--r--   1 stvjc  stvjc  59723776 Jul 28 13:09 test-sp2o.db
-rw-r--r--   1 stvjc  stvjc  39538688 Jul 28 13:07 test-po2s.db
-rw-r--r--   1 stvjc  stvjc  57499648 Jul 28 13:07 test-so2p.db
\end{verbatim}
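These are the BerkeleyDB hashes that index the graph.  Judging by their names,
they map subject and predicate to object (\texttt{sp2o}), predicate and object
to subject (\texttt{po2s}), and subject and object to predicate (\texttt{so2p}).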
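
The role of such indexes can be illustrated without BerkeleyDB.  The sketch
below builds, in plain R, a lookup table keyed on subject and predicate that
returns the matching objects, which is our reading of the \texttt{sp2o} file
name.  It is a conceptual illustration on invented statements, not a
description of how Redland lays out its hashes.

<<sketchIndex>>=
toyStmts <- data.frame(
  subject   = c("ex:rxn1", "ex:rxn1", "ex:rxn2"),
  predicate = c("ex:LEFT", "ex:LEFT", "ex:LEFT"),
  object    = c("ex:atp", "ex:glucose", "ex:adp"),
  stringsAsFactors = FALSE)
# Key each statement by its (subject, predicate) pair; the tab separator
# is an arbitrary choice that does not occur in the toy terms.
spKey <- paste(toyStmts$subject, toyStmts$predicate, sep = "\t")
sp2oSketch <- split(toyStmts$object, spKey)
# Answer "which objects does ex:rxn1 have for ex:LEFT?" by key lookup.
sp2oSketch[[paste("ex:rxn1", "ex:LEFT", sep = "\t")]]
@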

As before, the model can be transformed into a data frame.

\begin{Schunk}
\begin{Sinput}
> hudf <- as(humm, "data.frame")
> husubs <- as.character(hudf[, 1])
> hupreds <- as.character(hudf[, 2])
> huobs <- as.character(hudf[, 3])
> table(hupreds)
\end{Sinput}
\begin{Soutput}
hupreds
                   http://www.biopax.org/release/biopax-level1.owl#AUTHORS 
                                                                     31432 
         http://www.biopax.org/release/biopax-level1.owl#CELLULAR-LOCATION 
                                                                      2800 
                  http://www.biopax.org/release/biopax-level1.owl#COFACTOR 
                                                                        11 
                   http://www.biopax.org/release/biopax-level1.owl#COMMENT 
                                                                      1231 
                http://www.biopax.org/release/biopax-level1.owl#COMPONENTS 
                                                                        36 
              http://www.biopax.org/release/biopax-level1.owl#CONTROL-TYPE 
                                                                        36 
                http://www.biopax.org/release/biopax-level1.owl#CONTROLLED 
                                                                      2216 
                http://www.biopax.org/release/biopax-level1.owl#CONTROLLER 
                                                                      2216 
               http://www.biopax.org/release/biopax-level1.owl#DATA-SOURCE 
                                                                       167 
                        http://www.biopax.org/release/biopax-level1.owl#DB 
                                                                     12251 
                   http://www.biopax.org/release/biopax-level1.owl#DELTA-G 
                                                                        23 
                 http://www.biopax.org/release/biopax-level1.owl#EC-NUMBER 
                                                                       872 
                        http://www.biopax.org/release/biopax-level1.owl#ID 
                                                                     12251 
                      http://www.biopax.org/release/biopax-level1.owl#LEFT 
                                                                      1968 
          http://www.biopax.org/release/biopax-level1.owl#MOLECULAR-WEIGHT 
                                                                       666 
                      http://www.biopax.org/release/biopax-level1.owl#NAME 
                                                                      6046 
                 http://www.biopax.org/release/biopax-level1.owl#NEXT-STEP 
                                                                       895 
                  http://www.biopax.org/release/biopax-level1.owl#ORGANISM 
                                                                      1730 
        http://www.biopax.org/release/biopax-level1.owl#PATHWAY-COMPONENTS 
                                                                      1049 
           http://www.biopax.org/release/biopax-level1.owl#PHYSICAL-ENTITY 
                                                                      2800 
                     http://www.biopax.org/release/biopax-level1.owl#RIGHT 
                                                                      2020 
                  http://www.biopax.org/release/biopax-level1.owl#SEQUENCE 
                                                                        12 
                    http://www.biopax.org/release/biopax-level1.owl#SOURCE 
                                                                      5534 
               http://www.biopax.org/release/biopax-level1.owl#SPONTANEOUS 
                                                                         3 
         http://www.biopax.org/release/biopax-level1.owl#STEP-INTERACTIONS 
                                                                      2869 
http://www.biopax.org/release/biopax-level1.owl#STOICHIOMETRIC-COEFFICIENT 
                                                                      2783 
                 http://www.biopax.org/release/biopax-level1.owl#STRUCTURE 
                                                                       776 
            http://www.biopax.org/release/biopax-level1.owl#STRUCTURE-DATA 
                                                                       776 
          http://www.biopax.org/release/biopax-level1.owl#STRUCTURE-FORMAT 
                                                                       776 
                  http://www.biopax.org/release/biopax-level1.owl#SYNONYMS 
                                                                     10032 
                http://www.biopax.org/release/biopax-level1.owl#TAXON-XREF 
                                                                         1 
                      http://www.biopax.org/release/biopax-level1.owl#TERM 
                                                                        10 
                     http://www.biopax.org/release/biopax-level1.owl#TITLE 
                                                                      5534 
                      http://www.biopax.org/release/biopax-level1.owl#XREF 
                                                                     13605 
                      http://www.biopax.org/release/biopax-level1.owl#YEAR 
                                                                      5460 
                           http://www.w3.org/1999/02/22-rdf-syntax-ns#type 
                                                                     22984 
                              http://www.w3.org/2000/01/rdf-schema#comment 
                                                                         1 
\end{Soutput}
\end{Schunk}

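To find the named pathways, we intersect the statements whose subject
mentions \texttt{pathway} with those whose predicate mentions \texttt{NAME}: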
\begin{Schunk}
\begin{Sinput}
> isPw <- grep("pathway", husubs)
> isNa <- grep("NAME", hupreds)
> isnp <- intersect(isPw, isNa)
> cleanXSDT(huobs[isnp][1:10])
\end{Sinput}
\begin{Soutput}
 [1] "\"biosynthesis of aspartate and asparagine; interconversion of aspartate and asparagine.\""
 [2] "\"serine and glycine biosynthesis\""                                                       
 [3] "\"alanine biosynthesis II\""                                                               
 [4] "\"alanine biosynthesis I\""                                                                
 [5] "\"alanine biosynthesis III\""                                                              
 [6] "\"superpathway of alanine biosynthesis\""                                                  
 [7] "\"arginine biosynthesis III\""                                                             
 [8] "\"citrulline biosynthesis\""                                                               
 [9] "\"asparagine biosynthesis I\""                                                             
[10] "\"aspartate biosynthesis and degradation\""                                                
\end{Soutput}
\end{Schunk}

So we see in the predicate set what kinds of relationships are
described, and we get a glimpse of the pathway names addressed
in this resource.

Note that there is no need to parse the data again once the BerkeleyDB
hashes are available.  The \Rfunarg{BDBSexists} option on \Rfunction{readRedlModel}
can be used to revive a model-hash association.


\section{Future work}

We will need to take unions of RDF models, and C code will be required
for that.  We also need R interfaces to Redland's approaches to model filtering.
Some graph- and set-theoretic operations can be introduced to support a
degree of RDF/RDFS inferencing.
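
Some of these operations can be approximated at the data frame level.
The following sketch, on invented statements, takes the union of two statement
sets and then adds the \texttt{subClassOf} statements implied by transitivity.
It operates on data frames only and does not touch the underlying Redland
models; the helper \texttt{mkStmts} is hypothetical.

<<sketchUnion>>=
sco <- "http://www.w3.org/2000/01/rdf-schema#subClassOf"
mkStmts <- function(s, o) data.frame(subject = s,
  predicate = rep(sco, length(s)), object = o, stringsAsFactors = FALSE)
setA <- mkStmts(c("ex:C", "ex:B"), c("ex:B", "ex:A"))
setB <- mkStmts(c("ex:B", "ex:D"), c("ex:A", "ex:C"))
# Union of statement sets: stack the rows and drop duplicates.
both <- unique(rbind(setA, setB))
nrow(both)
# Transitive closure: join child->parent with parent->grandparent pairs
# and repeat until no new statement appears.
closure <- both
repeat {
  kids <- data.frame(child = closure$subject, mid = closure$object,
                     stringsAsFactors = FALSE)
  ups <- data.frame(mid = closure$subject, top = closure$object,
                    stringsAsFactors = FALSE)
  step <- merge(kids, ups, by = "mid")
  grown <- unique(rbind(closure, mkStmts(step$child, step$top)))
  if (nrow(grown) == nrow(closure)) break
  closure <- grown
}
nrow(closure)
@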


\end{document}
