%% LyX 1.6.0 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage[latin9]{inputenc}
\setlength{\parskip}{\medskipamount}
\setlength{\parindent}{0pt}
\usepackage{url}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
%% Because html converters don't know tabularnewline
\providecommand{\tabularnewline}{\\}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\usepackage{Sweave}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rcommand}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
% Meta information - fill between {} and do not remove %
% \VignetteIndexEntry{An R Package for retrieving data from DAVID into R objects. }
% \VignetteDepends{RCurl}
% \VignetteKeywords{}
% \VignettePackage{DAVIDQuery}

\usepackage{babel}

\begin{document}

\title{Vignette for the \Rpackage{DAVIDQuery} package:\\
Retrieving data from the DAVID Bioinformatics Resource}

\author{Roger Day\\
Departments of Biomedical Informatics and Biostatistics\\
University of Pittsburgh}


\date{January 10, 2009}

\maketitle

\section{Introduction}

DAVID (Database for Annotation, Visualization and Integrated Discovery)
is a bioinformatics resource developed by the National Institute of
Allergy and Infectious Diseases at Frederick in conjunction with the
Laboratory of Immunopathogenesis and Bioinformatics (LIB), SAIC Frederick.
This resource is described as {}``a graph theory evidence-based method
to agglomerate species-specific gene/protein identifiers {[}from] the
most popular resources including NCBI, PIR and Uniprot/SwissProt.
It groups tens of millions of identifiers into 1.5 million unique
protein/gene records.'' Further information can be found in published
articles {[}1]{[}2].

At the time of writing, maintenance of the DAVID resource is supervised by
Dr. Richard Lempicki. The resource is accessed interactively at \url{http://david.abcc.ncifcrf.gov/}.
The interactive interface provided there is suitable for many purposes,
but a bioinformatician using R needs an automated, procedural solution.
The convention for executing queries by forming URL
attribute-value strings is described at \url{http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html}.
Although this is described as an application program interface (API),
the immediate return page does not directly provide the desired query
result; two rounds of {}``screen-scraping'' and URL formulation
are required to retrieve the query results from a program.


\section{Types of identifiers and reports}

As of this version, there are three important attributes in the URL
specification. The \Robject{"id"} attribute will hold the proband
identifiers about which information is to be retrieved. The \Robject{id}
values are combined in a single string joined by commas. The \Robject{"type"}
attribute will hold a string indicating the type of the identifiers.
The legitimate values for \Robject{type} are:

\begin{tabular}{|c|c|c|}
\hline 
\Robject{AFFY\_ID} & \Robject{ENTREZ\_GENE\_ID} & \Robject{GENBANK\_ACCESSION}\tabularnewline
\hline 
\Robject{GI\_ACCESSION} & \Robject{PIR\_ID} & \Robject{PIR\_NREF\_ID}\tabularnewline
\hline 
\Robject{REFSEQ\_MRNA} & \Robject{REFSEQ\_PROTEIN} & \Robject{REFSEQ\_RNA}\tabularnewline
\hline 
\Robject{UNIPROT\_ACCESSION} & \Robject{UNIPROT\_ID} & \Robject{UNIREF100\_ID}\tabularnewline
\hline 
\Robject{GENPEPT\_ACCESSION} & \Robject{REFSEQ\_GENOMIC} & \Robject{UNIGENE}\tabularnewline
\hline
\end{tabular}

The third attribute is \Robject{"tool"}, which refers to the type
of report to be generated. Values which return useful results are
the strings \Robject{"gene2gene"}, \Robject{"list"}, \Robject{"geneReport"}
(the latter two nearly equivalent), \Robject{"annotationReport"},
and \Robject{"geneReportFull"}. The other choices for \Robject{tool},
related to DAVID's Functional Annotation tools, generate much more
complex output and cannot be handled by this package at this time.

A fourth attribute, \Robject{"annot"}, is relevant
to the \Robject{"annotationReport"} tool. It names the additional
columns to appear in the annotation report. For other tools, \Robject{"annot"}
does not appear to affect the returned results, and is generally set
to \Robject{NULL}.
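
As a rough illustration of how these attributes combine into a query
URL, the following sketch assembles the attribute-value string by hand.
The helper \Rfunction{buildQueryURL} is hypothetical and is not part of
\Rpackage{DAVIDQuery}; the endpoint \Rcode{api.jsp}, the parameter
spelling \Rcode{ids}, and the two example accessions are assumptions
made for illustration, and should be checked against the DAVID API page.

<<sketchBuildURL>>=
## Hypothetical helper, not part of DAVIDQuery: assemble a query URL
## from the attributes described above.
buildQueryURL <- function(ids, type, tool, annot = NULL,
        baseURL = "http://david.abcc.ncifcrf.gov/api.jsp") {  # assumed endpoint
    ## Identifiers are combined in a single comma-separated string.
    query <- c(type = type, ids = paste(ids, collapse = ","), tool = tool)
    ## annot is only added when supplied (e.g. for annotationReport).
    if (!is.null(annot))
        query <- c(query, annot = paste(annot, collapse = ","))
    ## Join name=value pairs with "&" and attach them to the base URL.
    pairs <- paste(names(query), query, sep = "=", collapse = "&")
    paste(baseURL, pairs, sep = "?")
}
buildQueryURL(ids = c("O00161", "O75396"),
    type = "UNIPROT_ACCESSION", tool = "geneReport")
@

This covers only the first of the URL formulation steps; retrieving
the result still requires the follow-up requests that \Rfunction{DAVIDQuery}
performs.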

If the query contains \Rcode{tool=list} or \Rcode{tool=geneReport},
then the result (after formatting) is a three-column character data
frame. If the query contains \Rcode{tool=geneReportFull}, then the
result (after formatting) is a list with each element corresponding
to an identifier in the ID list. If the query contains \Rcode{tool=gene2gene},
then the result (after formatting) is a list with each element corresponding
to a functional group selected by a DAVID algorithm. The formats are
documented in detail in the help page for the function \Rfunction{formatDAVIDResult}.


\section{Motivating setting}

Our group received results of a proteomic mass spectrometry experiment
that generated over 12,000 protein UNIPROT identifiers, and needed
to compare these results to a microarray experiment that utilized
the Affymetrix U133 Plus 2 chip. Therefore the 12,000 identifiers
needed to be mapped as well as possible to Affymetrix probe-sets which
could confidently be assigned to protein-coding genes. There are numerous
strategies for accomplishing this mapping, such as utilizing the Affymetrix
NetAffx resource or NCBI Entrez, but each approach is known to generate
an occasional incorrect answer. Utilizing DAVID appears to be at minimum
competitive with the others, and possibly the best approach. 

An early version of \Rfunction{DAVIDQueryLoop} was used to retrieve
matching probe-sets. These results, together with comparisons to alternative
mapping methods, are to be reported in a manuscript in preparation;
the bulk of that work is being performed by Kevin McDade at the University
of Pittsburgh.

It should be noted that, when last checked, the retrieval of Affymetrix
probe-set IDs via the DAVID API did not allow for restricting the
result to a specified chip. Lists of probe-sets by chip name are available
at DAVID. The function \Rfunction{AffyProbesetList} is provided in
this package to retrieve the list for the chip of interest, for intersection
with lists of probe-sets retrieved from DAVID via \Rfunction{DAVIDQueryLoop}.
(We caution that there is no guarantee that these probe-set lists
match comparable lists obtained elsewhere.)
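
The intersection step described above might look like the following
sketch. The two character vectors are made-up stand-ins, one for the
probe-sets returned by a DAVID query and one for the probe-set list
of the chip; in practice they would come from \Rfunction{DAVIDQueryLoop}
and \Rfunction{AffyProbesetList}, whose arguments are not shown here.

<<sketchIntersect>>=
## Made-up stand-in values, for illustration only.
retrievedProbesets <- c("probeA_at", "probeB_at", "probeC_s_at")
chipProbesets <- c("probeB_at", "probeC_s_at", "probeD_at")
## Keep only the retrieved probe-sets that are present on the chip.
onChip <- intersect(retrievedProbesets, chipProbesets)
onChip
@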


\section{Launching a single query}

A single query is accomplished with the function \Rfunction{DAVIDQuery}.
The mechanics involve formulating a query URI, launching it and retrieving
identifiers from the returned HTML, formulating and launching a new
query, retrieving a result file name from the returned HTML, and finally
retrieving the file itself. Formatting of the final result is the
default option. (The result file remains on the server for 24 hours.) 


\subsection{Structured and unstructured}

A raw HTML character stream is transmitted by DAVID. By default, an
attempt to structure the results will be made. A structuring function
is defined for each tool. There is no guarantee that the structuring
functions will continue to work if or when the formats of the pages
returned by DAVID change. Also, not all combinations of the query
arguments have been tested, and there may be combinations of \Robject{ids},
\Robject{type}, \Robject{annot}, \Robject{tool} for which the tool's
structuring function does not work correctly. To inspect the raw
stream, for example when the structuring fails or the result
is unexpected, the call can be made with the argument assignment
\Rcode{DAVIDQuery(structureIt=FALSE)}. The user then receives
the raw character table actually returned.


\subsection{Examples}

<<chunk1>>= 
library("DAVIDQuery")
result = DAVIDQuery(type="UNIPROT_ACCESSION", annot=NULL, tool="geneReportFull")
names(result)

@

The result has been structured into a list of lists. Printing is suppressed
due to the size of the output. The code \Rcode{DAVIDQuery(testMe=TRUE)}
is the equivalent of the \Rfunction{DAVIDQuery} call above.

The result of the simpler query using \Rcode{tool="geneReport"} is
a matrix:

<<chunk2>>=
Sys.sleep(10)  ### Assure that queries are not too close in time.
result = DAVIDQuery(type="UNIPROT_ACCESSION", annot=NULL, tool="geneReport")
result$firstURL
result$secondURL
result$downloadURL
result$DAVIDQueryResult

@

The Gene Functional Classification query is obtained by the query
clause \Rcode{tool="gene2gene"}. The returned value has a complex
structure which we attempt to translate into a corresponding R object
respecting the structure, using the function \Rfunction{formatGene2Gene}. 

<<chunk3>>=
Sys.sleep(10)  ### Assure that queries are not too close in time.
result = testGene2Gene(details=FALSE)
length(result)
names(result$DAVIDQueryResult[[1]])
@

Convenience functions are provided to assist with integrating genomic
and proteomic data:

<<chunk4>>=
Sys.sleep(10)  ### Assure that queries are not too close in time.
affyToUniprot(details=FALSE)
Sys.sleep(10)  ### Assure that queries are not too close in time.
uniprotToAffy(details=FALSE)
@


\section{Launching large queries}

To control the load on the DAVID website, and to ensure that queries
launched through the website can be successfully processed, policy limits
are enforced. When a user needs to retrieve answers that would
exceed these limits in a single query, the function \Rfunction{DAVIDQueryLoop}
can be used. It attempts to space out successive calls and to reduce the
query size sufficiently to meet the website policies with a little
to spare.
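
The general idea of splitting a large identifier list into batches and
pausing between them can be sketched as follows. This is not the
\Rfunction{DAVIDQueryLoop} implementation: the function \Rfunction{queryInBatches},
its arguments, and the batch size and pause length are all hypothetical
and do not reflect the actual DAVID policy limits.

<<sketchQueryLoop>>=
## Hypothetical sketch, not the DAVIDQueryLoop implementation:
## split the identifiers into batches and pause between batches.
queryInBatches <- function(ids, queryFun, batchSize = 100, pause = 10) {
    ## Assign each identifier to a batch of at most batchSize.
    batches <- split(ids, ceiling(seq_along(ids) / batchSize))
    results <- vector("list", length(batches))
    for (i in seq_along(batches)) {
        if (i > 1) Sys.sleep(pause)  # space out successive queries
        results[[i]] <- queryFun(batches[[i]])
    }
    results
}
## Stand-in query function that only reports the size of each batch;
## a real call would submit each batch to DAVID.
batchSizes <- queryInBatches(paste("ID", 1:250, sep = ""),
    queryFun = length, batchSize = 100, pause = 0)
unlist(batchSizes)
@

Passing the query function as an argument keeps the batching logic
separate from the network call, so the loop can be tried out without
contacting the server.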


\section{Limitations}

This package cannot use semantic interoperability, due to the nature
of the DAVID API. This entails a risk that future modifications to DAVID
will cause functions in this package to fail. From communication with
the DAVID team, it appears that improvements in the DAVID API itself
are desired but unlikely to reach the top of the work queue in the
foreseeable future. This \Rpackage{DAVIDQuery} package has withstood
the transition from DAVID 2007 to DAVID 2008. We therefore anticipate
that the package will not prove as brittle in the face of changes as
one might fear. If or when the API is modified, this package may be
adapted accordingly.


\section{Future improvements and adaptations}

We would like to create a package targeted more generally to data
analysis combining protein expression data with mRNA expression data.
The main focus, initially at least, will be to provide support for
mapping between protein identifiers, for example those returned by
Sequest from mass spectrometry experimental results, and probe-set
identifiers for microarray chips. Multiple mapping methods will be
implemented and compared, extending ongoing research in our group. 

Ideally, the information in DAVID would be directly available via
a grid service. Neither the DAVID team nor we have current plans to
implement that, but we note that Martin Morgan's team working with caBIG
has developed extensive tools for bridging between R and caBIG's
caGRID, using the package \Rpackage{RWebServices} from Bioconductor.


\section{Session information }

This version of DAVIDQuery has been developed with R 2.8.0 GUI 1.26
(5256). 

R session information:

<<sessionInfo, results=tex>>=
toLatex(sessionInfo())
@


\section{Acknowledgements }

Brad Sherman and Da Wei Huang of the DAVID project kindly reviewed
this package and documentation. Their corrections and encouragement
were invaluable.

Thanks are due to Drs. Larry Maxwell and Thomas Conrads for provision
of the data and scientific collaborations that motivated this work,
Kevin McDade and Uma Chandran for discussions on the identifier-mapping
problem, and Richard Boyce for careful review of the package and documentation.
Grant support includes funding from the Gynecologic Diseases Program,
a collaboration whose bioinformatics components include Walter Reed
Army Medical Center, University of Pittsburgh, and Windber Research
Institute. Additional support came from the Telemedicine and Advanced
Technology Research Center (TATRC).


\section{References}

{[}1] Huang D.W., Sherman B.T., Tan Q., Kir J., Liu D., Bryant D.,
Guo Y., Stephens R., Baseler M.W., Lane H.C. et al. (2007) DAVID Bioinformatics
Resources: expanded annotation database and novel algorithms to better
extract biology from large gene lists. Nucleic Acids Res., 35, W169-W175.

{[}2] Huang D.W., Sherman B.T. and Lempicki R.A. (2008) Systematic
and integrative analysis of large gene lists using DAVID bioinformatics
resources. Nat. Protoc., doi: 10.1038/nprot.2008.211.
\end{document}