GOXMLParser            package:AnnBuilder            R Documentation

_F_u_n_c_t_i_o_n_s _t_o _r_e_a_d/_p_a_r_s_e _t_h_e _X_M_L _d_o_c_u_m_e_n_t _o_f _G_e_n_e _O_n_t_o_l_o_g_y _d_a_t_a

_D_e_s_c_r_i_p_t_i_o_n:

     These functions are used by 'GO-class' to read and parse the Gene
     Ontology data file (in XML format) and to work out the
     parent-child relations.

_U_s_a_g_e:

     GOXMLParser(fileName)
     getChildNodes(goid, goData)
     getOffspringNodes(goid, goData, keepTree = FALSE)
     getParentNodes(goid, goData, sep = ";")
     getAncestors(goid, goData, sep = ";", keepTree = FALSE, top = "GO:0003673")
     getTopGOid(what = c("MF", "BP", "CC", "GO"))
     mapGO2Category(goData)
     getGOGroupIDs(onto = FALSE)
     mapGO2AllProbe(go2Probe, goData, goid = "", sep = ";", all = TRUE)

_A_r_g_u_m_e_n_t_s:

fileName: 'fileName' a character string for the name of a locally
          stored file of Gene Ontology XML data

  goData: 'goData' a matrix with three columns for GO ids, parent GO
          ids, and the ontology terms

    goid: 'goid' a character string for the id of Gene Ontology term
          (e.g. GO:006742)

keepTree: 'keepTree' a boolean indicating whether the tree structure
          showing parent-child relations will be preserved

     sep: 'sep' a character string for separator used to separate
          multiple entries

     top: 'top' a character string for the GO id that is the root for
          all the other GO ids along parent-child relation tree

    what: 'what' a character string that has to be one of "MF", "BP",
          "CC", or "GO"

    onto: 'onto' a boolean that is set to TRUE if the GO id for the
          topmost node is to be returned or FALSE if the GO ids for the
          three categories (BP, MF, and CC) are to be returned

go2Probe: 'go2Probe' a matrix that maps GO ids to probe ids

     all: 'all' a boolean to indicate whether to map all the GO ids
          contained in goData to probe ids (TRUE) or just the GO ids
          specified by goid (FALSE)

_D_e_t_a_i_l_s:

     The GO site provides an XML document for the molecular function,
     biological process, and cellular component of genes. The basic XML
     structure is something like:

          <go:term>
            <go:accession>GO:000xxx</go:accession>
            <go:name>a string for the function, process, or component</go:name>
            <go:isa rdf:resource="http://www.geneontology.org/go#GO:000xxxx" />
            <go:part-of rdf:resource="http://www.geneontology.org/go#GO:000xxxx" />
            . .
          </go:term>
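
     As a rough illustration of how one such term could be reduced to
     a row of GO id, parent ids, and ontology term, the sketch below
     uses base R regular expressions on a single hand-written term.
     It is not the parser used by 'GOXMLParser'; the variable names
     and the sample ids are assumptions made for the example.

```r
# One hand-written term in the structure shown above (sample values only)
term <- paste0(
  '<go:term>',
  '<go:accession>GO:0005575</go:accession>',
  '<go:name>cellular_component</go:name>',
  '<go:isa rdf:resource="http://www.geneontology.org/go#GO:0003673" />',
  '<go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003674" />',
  '</go:term>')

# Text between the go:accession tags and between the go:name tags
accession <- sub(".*<go:accession>([^<]*)</go:accession>.*", "\\1", term)
name <- sub(".*<go:name>([^<]*)</go:name>.*", "\\1", term)

# Every GO id that follows "go#" in a go:isa or go:part-of reference
refs <- regmatches(term, gregexpr("go#GO:[0-9]+", term))[[1]]
parents <- paste(sub("go#", "", refs, fixed = TRUE), collapse = ";")

# GO id, direct parents separated by ";", ontology term
goRow <- c(accession, parents, name)
goRow
```

     Regular expressions are used here only to keep the sketch free
     of extra packages; a real XML parser is more robust.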

     The XML document read from the Gene Ontology site does not
     differentiate among the molecular function, biological process,
     and cellular component of genes, as the same go:name tag is used
     for all three. To determine whether a go:name tag is for the
     function, process, or component of a given gene identified by a
     GO accession number, the go:isa or go:part-of tags that record
     the parent-child relationships have to be retained for later use
     to move up the tree to find the correct category. As a result,
     the matrix returned by 'GOXMLParser' has columns for the GO ids,
     the GO ids of the direct parents (a ";" is used to separate
     multiple GO ids), and the ontology terms defined, together with
     some columns for other data.

     'getChildNodes' finds the direct children of a given GO id based
     on a matrix containing the parent-child relationships (e. g. the
     one returned by 'GOXMLParser'). 
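
     The lookup of direct children can be sketched in base R as
     follows. This is only an illustration on a toy matrix in the
     three-column layout described for 'goData', not the code of
     'getChildNodes'; the helper name 'directChildren' and the ids
     starting with GO:9 are made up.

```r
# Toy matrix: GO id, direct parent id(s) separated by ";", ontology term.
# The GO:9xxxxxx ids are invented for illustration.
goData <- matrix(c(
  "GO:0003674", "GO:0003673",            "molecular_function",
  "GO:0005575", "GO:0003673",            "cellular_component",
  "GO:9000001", "GO:0003674",            "toy term A",
  "GO:9000002", "GO:0003674;GO:0005575", "toy term B"),
  ncol = 3, byrow = TRUE)

# Hypothetical helper: ids whose parent field lists `goid`
directChildren <- function(goid, goData, sep = ";") {
  parents <- strsplit(goData[, 2], sep, fixed = TRUE)
  hits <- vapply(parents, function(p) goid %in% p, logical(1))
  goData[hits, 1]
}

directChildren("GO:0003674", goData)
```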

     'getOffspringNodes' finds all the direct or indirect children of a
     given GO id based on a matrix containing the parent-child
     relationships (e. g. the one returned by 'GOXMLParser').
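
     One way to collect every descendant is to repeat the child
     lookup level by level until no new ids turn up. The sketch below
     does this on a toy matrix and returns a flat vector (comparable
     to keepTree = FALSE); it is an assumption about the approach,
     not the code of 'getOffspringNodes', and 'allOffspring' is a
     hypothetical name.

```r
# Toy matrix: GO id, direct parent id(s) separated by ";", ontology term.
# The GO:9xxxxxx ids are invented for illustration.
goData <- matrix(c(
  "GO:0003674", "GO:0003673",            "molecular_function",
  "GO:0005575", "GO:0003673",            "cellular_component",
  "GO:9000001", "GO:0003674",            "toy term A",
  "GO:9000002", "GO:0003674;GO:0005575", "toy term B"),
  ncol = 3, byrow = TRUE)

# Hypothetical helper: breadth-first walk down the parent-child relations
allOffspring <- function(goid, goData, sep = ";") {
  parents <- strsplit(goData[, 2], sep, fixed = TRUE)
  found <- character(0)
  frontier <- goid
  while (length(frontier) > 0) {
    # rows that list any id of the current level as a parent
    hits <- vapply(parents, function(p) any(frontier %in% p), logical(1))
    # keep only ids not seen before, so shared children are counted once
    frontier <- setdiff(goData[hits, 1], found)
    found <- c(found, frontier)
  }
  found
}

allOffspring("GO:0003673", goData)
```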

     'getParentNodes' finds the direct parents of a given GO id based
     on a matrix containing the parent-child relationships (e. g. the
     one returned by 'GOXMLParser').

     'getAncestors' finds all the direct or indirect parents of a given
     GO id based on a matrix containing the parent-child relationships
     (e. g. the one returned by 'GOXMLParser').
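
     Moving up the tree works the same way in the other direction:
     split the parent field of the current ids and repeat until no
     new parents are found. The sketch below is illustrative only and
     is not the code of 'getAncestors'; it has no 'top' argument and
     simply stops at ids that have no row of their own. The name
     'allAncestors' and the GO:9 ids are made up.

```r
# Toy matrix: GO id, direct parent id(s) separated by ";", ontology term.
# The GO:9xxxxxx ids are invented for illustration.
goData <- matrix(c(
  "GO:0003674", "GO:0003673",            "molecular_function",
  "GO:0005575", "GO:0003673",            "cellular_component",
  "GO:9000001", "GO:0003674",            "toy term A",
  "GO:9000002", "GO:0003674;GO:0005575", "toy term B"),
  ncol = 3, byrow = TRUE)

# Hypothetical helper: walk up the parent-child relations level by level
allAncestors <- function(goid, goData, sep = ";") {
  parentsOf <- function(id) {
    field <- goData[goData[, 1] == id, 2]
    unlist(strsplit(field, sep, fixed = TRUE))
  }
  found <- character(0)
  frontier <- goid
  while (length(frontier) > 0) {
    up <- as.character(unique(unlist(lapply(frontier, parentsOf))))
    # keep only parents not seen before
    frontier <- setdiff(up, found)
    found <- c(found, frontier)
  }
  found
}

allAncestors("GO:9000002", goData)
```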

     'getTopGOid' figures out the root GO id for "MF" - molecular
     function, "BP" - biological process, "CC" - cellular component,
     and "GO" - the whole Gene Ontology tree.

     'mapGO2Category' maps GO ids to the three categories (MF, BP, CC)
     they belong to. 
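
     A category can be assigned by climbing from a GO id towards the
     root until one of the three category nodes is reached. The
     sketch below shows the idea on a toy matrix. It is not the code
     of 'mapGO2Category': the helper 'categoryOf' is hypothetical, it
     follows only the first listed parent, and it labels categories
     with "MF", "BP", and "CC" as an assumption. The root ids are
     written out by hand rather than obtained from 'getTopGOid'.

```r
# Toy matrix: GO id, direct parent id(s) separated by ";", ontology term.
# The GO:9xxxxxx ids are invented for illustration.
goData <- matrix(c(
  "GO:0003674", "GO:0003673", "molecular_function",
  "GO:0008150", "GO:0003673", "biological_process",
  "GO:0005575", "GO:0003673", "cellular_component",
  "GO:9000001", "GO:0003674", "toy term A",
  "GO:9000002", "GO:9000001", "toy term B",
  "GO:9000003", "GO:0005575", "toy term C"),
  ncol = 3, byrow = TRUE)

# Category root ids, written out by hand for this sketch
roots <- c(MF = "GO:0003674", BP = "GO:0008150", CC = "GO:0005575")

# Hypothetical helper: follow the first listed parent upward until a
# category root is reached; NA if the walk runs out of rows or loops.
categoryOf <- function(goid, goData, roots, sep = ";") {
  current <- goid
  seen <- character(0)
  while (!(current %in% roots)) {
    if (current %in% seen) return(NA_character_)
    seen <- c(seen, current)
    field <- goData[goData[, 1] == current, 2]
    if (length(field) == 0) return(NA_character_)
    current <- strsplit(field[1], sep, fixed = TRUE)[[1]][1]
  }
  names(roots)[roots == current]
}

# Two-column matrix: GO id and category label
go2cat <- cbind(goData[, 1],
                vapply(goData[, 1], categoryOf, character(1),
                       goData = goData, roots = roots, USE.NAMES = FALSE))
go2cat
```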

     'getGOGroupIDs' returns the GO id(s) for the topmost or the three
     nodes corresponding to the three categories (MF, BP, and CC).

     'mapGO2AllProbe' maps GO ids to probe ids that are related to the
     GO id and all its offspring.
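
     The pooling of probe ids over a GO id and its offspring might
     look like the sketch below. The layout assumed for 'go2Probe'
     (GO id in column one, probe ids joined by ";" in column two) is
     a guess, and the helpers 'offspringOf' and 'probesFor', the
     probe names, and the GO:9 ids are all made up; this is not the
     code of 'mapGO2AllProbe'.

```r
# Toy matrix: GO id, direct parent id(s) separated by ";", ontology term.
# The GO:9xxxxxx ids are invented for illustration.
goData <- matrix(c(
  "GO:0003674", "GO:0003673", "molecular_function",
  "GO:9000001", "GO:0003674", "toy term A",
  "GO:9000002", "GO:9000001", "toy term B"),
  ncol = 3, byrow = TRUE)

# Assumed layout: column 1 = GO id, column 2 = probe ids joined by ";"
go2Probe <- matrix(c(
  "GO:9000001", "probe_1;probe_2",
  "GO:9000002", "probe_2;probe_3"),
  ncol = 2, byrow = TRUE)

# Hypothetical helper: all direct and indirect children of `goid`
offspringOf <- function(goid, goData, sep = ";") {
  parents <- strsplit(goData[, 2], sep, fixed = TRUE)
  found <- character(0)
  frontier <- goid
  while (length(frontier) > 0) {
    hits <- vapply(parents, function(p) any(frontier %in% p), logical(1))
    frontier <- setdiff(goData[hits, 1], found)
    found <- c(found, frontier)
  }
  found
}

# Hypothetical helper: unique probe ids of `goid` and all its offspring
probesFor <- function(goid, go2Probe, goData, sep = ";") {
  ids <- c(goid, offspringOf(goid, goData, sep))
  fields <- go2Probe[go2Probe[, 1] %in% ids, 2]
  paste(unique(unlist(strsplit(fields, sep, fixed = TRUE))), collapse = sep)
}

probesFor("GO:0003674", go2Probe, goData)
```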

_V_a_l_u_e:

     'GOXMLParser' returns a matrix.

     'getChildNodes' returns a vector of character strings.

     'getOffspringNodes' returns a vector or list of vectors depending
     on whether the tree structure of parent-children will be preserved.

     'getParentNodes' returns a vector of character strings.

     'getAncestors' returns a vector or list of vectors depending on
     whether the tree structure of parent-children will be preserved.

     'mapGO2Category' returns a matrix with two columns containing GO
     ids and letters representing one of the three categories (MF, BP,
     and CC).

     'getGOGroupIDs' returns a vector of string(s) for GO id(s).

     'mapGO2AllProbe' returns a matrix with GO ids as one column and
     mappings to probe ids related to each GO id and all its offspring
     as the other column.

     'getTopGOid' returns a character string for a GO id.

_N_o_t_e:

     These functions are part of the Bioconductor project within a
     package at the Dana-Farber Cancer Institute to provide
     Bioinformatics functionalities through R.

_A_u_t_h_o_r(_s):

     Jianhua (John) Zhang

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://www.geneontology.org>

_S_e_e _A_l_s_o:

     'GO-class'

_E_x_a_m_p_l_e_s:

     # Create the XML doc
       cat(paste("<?xml version='1.0'?>",
              "<!-- A test file for the examples in GOXMLParser.R Doc -->",
              "<go>",            
                  "<go:term>",
                      "<go:accession>GO:0003674</go:accession>",
                      "<go:name>molecular_function</go:name>",
                      "<go:is_a rdf='http://wwww.myurl.org/go#GO:0003673' />",
                      "<go:part_of rdf = 'http://wwww.myurl.org/go#GO:0003672' />",
                  "</go:term>",
                  "<go:term>",
                      "<go:accession>GO:0005575</go:accession>",
                      "<go:name>cellular_component</go:name>",
                      "<go:is_a rdf= 'http://wwww.myurl.org/go#GO:0003673'/>",
                      "<go:part_of rdf = 'http://wwww.myurl.org/go#GO:0003674' />",
                  "</go:term>",
               "</go>"), file = "testDoc")

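       # Parse the dummy file using GOXMLParser
       goData <- GOXMLParser("testDoc")
       # Get the direct child nodes for a GO id
       getChildNodes("GO:0003674", goData)
       # Get all the offspring without keeping the tree structure
       getOffspringNodes("GO:0003673", goData, FALSE)
       # Get the direct parent nodes for a GO id
       getParentNodes("GO:0005575", goData)
       # Get all the ancestors, using GO:0003674 as the top node
       getAncestors("GO:0005575", goData, ";", FALSE, "GO:0003674")
       # Get the root GO id of the whole Gene Ontology tree
       getTopGOid("GO")
       # Remove the dummy file
       unlink("testDoc")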

