% rm(list=ls());library("weaver");Sweave("SuppMat.Rnw", driver=weaver)

%\VignetteIndexEntry{Reading PSI-25 XML file from IntAct}
%\VignetteDepends{}
%\VignettePackage{Rintact}

\documentclass[11pt]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{geometry}
\usepackage{longtable}

\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,include=FALSE,prefix=FALSE,width=4,height=4} 

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{Reading PSI-25 XML files from IntAct with the \Rpackage{Rintact} package}
\author{Tony Chiang and Nianhua Li}
\begin{document}
\maketitle

\begin{abstract}
This document serves as a user's vignette to the R package
\Rpackage{Rintact}.  We present examples of how to use the two main
functions of \Rpackage{Rintact}, and of how to turn the output of
these two functions into input data for the statistical methods for
proteomic analysis provided by other Bioconductor packages.
\end{abstract}

\section{Introduction}
\Rpackage{Rintact} is an R package mainly used to parse the PSI-25
files generated by the \textit{IntAct} data repository which collects,
curates and stores thousands of protein interactions. Currently, there
are two main functions within \Rpackage{Rintact}:

\begin{enumerate}
\item \Rfunction{psi25interaction}
\item \Rfunction{psi25complex}
\end{enumerate} 

The first function, \Rfunction{psi25interaction}, takes either a
PSI-25 XML file from IntAct or a URL giving the web address from
which such an XML file can be obtained.  The XML file must contain
\emph{binary} protein-protein interaction data.  Examples of such data
are direct physical interactions, complex co-membership, and synthetic
genetic interactions.  The second function, \Rfunction{psi25complex},
also takes a PSI-25 XML file or URL as its input parameter, but the
file must contain protein complex membership information. In
principle, these two functions can take any XML file that adheres to
the PSI-25 standard. We have constructed these functions, however, to
work primarily with the \textit{IntAct} PSI-25 XML files, because there are
subtle implementation differences between repositories such as \textit{IntAct} and
\textit{DIP}, even though both use the PSI-25 standard. In this vignette, we
demonstrate the use of these functions on the data generated by
\cite{Ewing2007} and on the manually curated protein complexes derived by
the \textit{IntAct} curators.

\subsection{Loading R Libraries}
We begin by loading the R libraries that we shall use. Our
primary focus will be on the \Rpackage{Rintact} package, but we will also examine and
exploit statistical methods found in various Bioconductor packages for the analysis
of the interaction data obtained from \textit{IntAct}.
 
<<loadlibs, results=hide>>=
library("Rintact")
library("graph")
library("Rgraphviz")
#library("ppiStats")
library("RBGL")
library("apComplex")
library("xtable")
@ 

\section{Obtaining the Interaction Information}
\subsection{\Rfunction{psi25interaction}}
We first demonstrate the use of the function \Rfunction{psi25interaction}. Its input
can be either the path to an \textit{IntAct} PSI-25 XML file in a local directory or
a URL from which the file can be obtained. Here we use a small sample file that is
shipped with the package and locate it with \Rfunction{system.file}; a URL would be
passed in the same way:

<<psi25int, echo=TRUE>>=
url <- system.file("PSI25XML", "interactionSample.xml", package="Rintact")
ewing <- psi25interaction(url)
@ 

Once the XML file has been parsed by \Rfunction{psi25interaction}, we can look at 
its overall structure. 
<<ewingStruc, echo=TRUE>>=
class(ewing)
@ 

We can see that the output of \Rfunction{psi25interaction} is
an instance of the class \Rclass{interactionEntry}. This class has 
\Sexpr{length(slotNames(ewing))} slots:
%
<<slotEntries>>=
slotNames(ewing)
@ 
Three of them contain simple character vectors:
<<simpleSlots>>=
ewing@organismName
ewing@taxId
ewing@releaseDate
@ 
%FIXME: We should get rid of organismName and taxId or make sure that they will get all of
%the organisms tested in the XML file.

The \Rclass{organismName} slot
records all the organisms for which interactions were assayed. For each organism, we have
also included its taxonomy identification code. Because \textit{IntAct} does not 
currently version its weekly releases, we have added the \Rclass{releaseDate} slot as
a time stamp to act as a surrogate for the version number.

Let us investigate the structure of the \Rclass{interactions} slot. This slot contains
a list which holds all the binary interactions given within the XML file (along with 
information about each particular interaction). Each element of the 
list is an instance of the class \Rclass{intactInteraction}. This class has 
\Sexpr{length(slotNames(interactions(ewing)[[1]]))} slots:
%
<<interactionSlot>>=
length(interactions(ewing))
class(interactions(ewing)[[1]])
interactions(ewing)[[1]]
slotNames(interactions(ewing)[[1]])
@ 

The various slots contain information which is relevant for each
individual interaction.  The \Rclass{interactionType} slot details
what manner of interaction was found between the bait protein and the
prey protein, which are specified in the \Rclass{bait} and
\Rclass{prey} slots.  Another important attribute is the experimental
confidence value given in the \Rclass{confidenceValue} slot. This
confidence value is reported by the experimenters; it does not report
scores derived by third parties.

We can extract the names of the bait and prey proteins for all of the
interactions in the \Robject{ewing} dataset:
%
<<getBaitPrey>>=
ewbait <- sapply(interactions(ewing), bait)
ewprey <- sapply(interactions(ewing), prey)
@ 
%
We now have two character vectors, \Robject{ewbait} and \Robject{ewprey},
that are aligned to each other: the $i^{\mathrm{th}}$ protein in \Robject{ewprey} 
was found by the $i^{\mathrm{th}}$ protein in \Robject{ewbait}.
%
<<baitVec>>=
ewbait
ewprey
@ 
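
Many downstream methods expect such aligned vectors in matrix form.
The following chunk is a minimal sketch of that conversion, not part of
\Rpackage{Rintact}: the helper \Rfunction{baitPreyMatrix} and the toy
accession codes are hypothetical and serve only to illustrate the idea
with base R.
%
<<toyBaitPreyMatrix, echo=TRUE>>=
## Toy accession codes (hypothetical, for illustration only)
toyBait <- c("B1", "B1", "B2")
toyPrey <- c("P1", "P2", "P1")

## Hypothetical helper: build a 0/1 bait-by-prey matrix from two
## aligned character vectors
baitPreyMatrix <- function(b, p) {
  stopifnot(length(b) == length(p))
  m <- matrix(0L, nrow = length(unique(b)), ncol = length(unique(p)),
              dimnames = list(unique(b), unique(p)))
  ## a two-column character matrix indexes cells by (row, column) name
  m[cbind(b, p)] <- 1L
  m
}
toyMat <- baitPreyMatrix(toyBait, toyPrey)
toyMat
@ 
%
Under these assumptions, the same helper could be applied to
\Robject{ewbait} and \Robject{ewprey}.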

The \textit{IntAct} accession codes are useful as unique and uniform
identifiers in the \textit{IntAct} repository, but we will usually
want to translate them to other identifier schemes such as HUGO gene
names or Ensembl gene identifiers.  The PSI-25 XML files from
\textit{IntAct} contain a look-up table for this purpose.  This
look-up table is stored in the \Rclass{interactors} slot of the
\Rclass{interactionEntry} object \Robject{ewing}, in the form of a
character matrix. Its rows are indexed by the \textit{IntAct} accession 
numbers of the molecules in the data structure, 
and it has \Sexpr{ncol(interactors(ewing))} columns.
%
<<interactors>>=
interactors(ewing)
@ 
%
The \textit{IntAct} accession codes can be translated into any of the
associated identifier schemes.  Two further properties are given for
each molecule: the organism to which the molecule is native and the
corresponding taxonomy ID. Most of the interactions found in
\textit{IntAct} will be protein-protein interactions; other types of
interactions, however, are also stored, such as small molecule to
protein interactions and gene-gene interactions. As a result,
there will be times when a molecule cannot be mapped to a locus name,
an ORF, etc. We also remark that interactions have been tested
between proteins of different organisms (e.g., human proteins against
mouse proteins). Thus the organism attribute is vital to keep such
interactions in the proper context.
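
As a minimal sketch of how the organism attribute might be used, the
following chunk flags bait-prey pairs whose partners come from
different organisms. The accession codes and the named vector
\Robject{toyTax} are hypothetical toy data; with real data the
taxonomy IDs would be taken from the look-up table.
%
<<toyCrossSpecies, echo=TRUE>>=
## Toy taxonomy IDs indexed by (hypothetical) accession code
toyTax <- c("AC-1" = "9606", "AC-2" = "9606", "AC-3" = "10090")
## Two toy bait-prey pairs, aligned by position
pairBait <- c("AC-1", "AC-1")
pairPrey <- c("AC-2", "AC-3")
## TRUE where bait and prey belong to different organisms
crossSpecies <- unname(toyTax[pairBait] != toyTax[pairPrey])
crossSpecies
@ 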

Using the look-up table is quick and efficient because of the
subsetting functionality of R. For instance, say we would like
to translate the following \textit{IntAct} accession codes
%
<<translateEG>>=
wh <- ewbait[3:4]
@ 
into gene names:
<<lookUP>>=
interactors(ewing)[wh, "geneName"]
@ 
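
Subsetting a matrix by row name stops with an error when an identifier
is not among the row names. The following sketch shows one way to
obtain \Robject{NA} for unknown identifiers instead. The toy table, its
accession codes and the column name \Robject{taxId} are hypothetical;
only the column name \Robject{geneName} is taken from the example above.
%
<<toyLookup, echo=TRUE>>=
## Toy look-up table (hypothetical accession codes and gene names)
toyTable <- matrix(c("GENE1", "GENE2", NA,
                     "9606",  "9606",  "10090"),
                   ncol = 2,
                   dimnames = list(c("AC-1", "AC-2", "AC-3"),
                                   c("geneName", "taxId")))
ids <- c("AC-2", "AC-3", "AC-9")   # "AC-9" is not in the table
## match() gives NA for unknown identifiers, so the look-up
## returns NA for them rather than stopping with an error
toyNames <- toyTable[match(ids, rownames(toyTable)), "geneName"]
names(toyNames) <- ids
toyNames
@ 
%
Note that \Robject{NA} then covers two cases: an identifier that is
absent from the table, and a molecule that is present but has no gene
name.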

\section{Obtaining Protein Complex Composition Information}
Now we will demonstrate the parser function \Rfunction{psi25complex}. We remark here 
that the protein complexes which this function obtains have been 
manually curated from literature sources by \textit{IntAct} curators. 

%FIXME: Why?
%Those protein complexes estimated by the experimenters (e.g. \cite{Gavin2006}) should not be
%culled from the XML files.

The parameters of \Rfunction{psi25complex} are identical to those of 
\Rfunction{psi25interaction}, while its output contains only 
three slots:
%
<<psi25complex, echo=TRUE>>=
url2 <- system.file("PSI25XML/complexSample.xml", package="Rintact")
comps <- psi25complex(url2)
slotNames(comps)
@ 
%
Again the \Rclass{releaseDate} slot serves as a surrogate version
number, and the \Rclass{interactors} slot again holds a look-up table that
can be used to translate the \textit{IntAct} accession codes. The
\Rclass{complexes} slot is a list in which each entry is an instance
of the class \Rclass{intactComplex},
which itself has \Sexpr{length(slotNames(comps@complexes[[1]]))}
slots.
<<complex1>>=
length(complexes(comps))
class(complexes(comps)[[1]])
slotNames(complexes(comps)[[1]])
@
These slots describe the multi-protein complex. 
The three most important ones are the
\Rclass{fullName}, \Rclass{attributes} and \Rclass{members} slots. 
The \Rclass{fullName} slot gives the exact name of the multi-protein complex, while
the \Rclass{attributes} slot gives a short description of the known functionality
of the complex. The \Rclass{members} slot gives the members of the complex and
their multiplicity. 
<<check, echo=FALSE>>=
stopifnot(all(c("fullName", "attributes", "members")%in%slotNames(complexes(comps)[[1]])))
@ 
<<showComplex1, echo=TRUE>>=
complexes(comps)[[1]]
@ 
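
Complex membership is often summarized as a protein-by-complex
incidence matrix. The following chunk is a minimal sketch of that
representation in base R. The list \Robject{toyComplexes} and its
protein and complex names are hypothetical; with real data, the member
lists would first have to be extracted from the \Rclass{members} slots,
whose exact layout we do not assume here.
%
<<toyIncidence, echo=TRUE>>=
## Toy complex membership (hypothetical complex and protein names)
toyComplexes <- list(cplxA = c("P1", "P2", "P3"),
                     cplxB = c("P2", "P4"))
proteins <- sort(unique(unlist(toyComplexes)))
## One row per protein, one column per complex; 1 marks membership
toyInc <- sapply(toyComplexes,
                 function(m) as.integer(proteins %in% m))
rownames(toyInc) <- proteins
toyInc
@ 
%
This sketch records only presence or absence and ignores the
multiplicity of a member within a complex.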

%---------------------------------------------------------
% SessionInfo
%---------------------------------------------------------
\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}


\begin{thebibliography}{12}

\bibitem[Gavin et~al.(2006)]{Gavin2006}
Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ,
  Bastuck S, D{\"u}mpelfeld B, et~al. 
\newblock{Proteome survey reveals modularity of the yeast cell machinery}. 
\newblock \emph{Nature} 2006,
  \textbf{440}:631--636.

\bibitem[Ewing et~al.(2007)]{Ewing2007}
Ewing RM et~al.
\newblock {Large-scale mapping of human protein-protein interactions by mass spectrometry}.
\newblock \emph{Molecular Systems Biology} 2007,
  \textbf{3}:89.
\end{thebibliography}

\end{document}
