GOHyperG               package:GOstats               R Documentation

_H_y_p_e_r_g_e_o_m_e_t_r_i_c _T_e_s_t_s _f_o_r _G_O

_D_e_s_c_r_i_p_t_i_o_n:

     Given a set of unique LocusLink identifiers, a microarray chip, and
     the GO category of interest, this function computes hypergeometric
     p-values for overrepresentation of the interesting genes (as
     indicated by the unique LocusLink identifiers) at each node of
     the induced GO graph.

_U_s_a_g_e:

     GOHyperG(x, lib="hgu95av2", what="MF")

_A_r_g_u_m_e_n_t_s:

       x: A vector of unique LocusLink identifiers. 

     lib: The name of the annotation library for the chip that was
          used. 

    what: One of "MF", "BP", or "CC" indicating which of the GO
          categories the computations should be made for.

_D_e_t_a_i_l_s:

     Typical usage will be to have a microarray experiment from which a
     set of interesting genes/probes has been obtained. To determine
     whether there is an overrepresentation of these genes at
     particular GO terms a simple hypergeometric calculation has often
     been made. Two substantial issues arise. First, and most
     importantly, it is not clear how to do any form of p-value
     correction in this case. The tests are not independent, and the
     underlying structure of the GO graph presents certain problems
     that still need to be addressed. The second substantial issue is
     that the mappings are based on LocusLink identifiers, and hence
     all computations should also be based on unique LocusLink
     identifiers. In 'GOHyperG' every attempt has been made to correct
     appropriately for non-uniqueness of mappings.

     The user provides a vector of unique LocusLink identifiers, and
     these are used, together with the name of the chip, to create the
     necessary counts. It is important that the correct chip be
     identified, as it determines the overall counts; all inference
     will be incorrect if the wrong chip is specified.
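
     The counting step can be illustrated with toy data. The sketch
     below is not the code used inside 'GOHyperG'; the probe names,
     LocusLink identifiers and node names are all made up, and the
     variable names only mirror the components described under Value.
     It shows why counts are taken over unique LocusLink identifiers
     rather than over probes.

```r
# Toy data: every identifier below is invented for illustration.
# Probe -> LocusLink ID; probes p1 and p2 measure the same gene.
probe2ll <- c(p1 = "101", p2 = "101", p3 = "202", p4 = "303", p5 = "404")

# GO node -> probes annotated at that node (hypothetical node names).
go2probe <- list(nodeA = c("p1", "p2", "p3"),
                 nodeB = c("p3", "p4", "p5"))

# The "interesting" genes, supplied as unique LocusLink IDs.
myLL <- c("101", "303")

# Unique LLIDs from the chip at each node.
# Counting probes instead would overcount nodeA (3 probes, 2 genes).
go2ll <- lapply(go2probe, function(p) unique(unname(probe2ll[p])))
goCounts <- sapply(go2ll, length)

# Number of supplied LLIDs annotated at each node.
intCounts <- sapply(go2ll, function(ll) sum(ll %in% myLL))

# Totals over all LLIDs that map to at least one node.
numLL <- length(unique(unlist(go2ll)))
numInt <- sum(myLL %in% unlist(go2ll))
```

     Because the chip determines 'go2probe' and 'probe2ll', naming
     the wrong chip changes 'goCounts' and 'numLL' and therefore
     every p-value derived from them.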

     The test performed is a hypergeometric test, using 'phyper', where
     at each GO node we determine how many LocusLink identifiers
     (LLIDs) from the chip were annotated there and how many of the
     supplied LLIDs were annotated there, and compute a p-value. This
     is equivalent to using Fisher's exact test.
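
     The calculation at a single node can be sketched in base R. The
     counts below are invented, and the sketch assumes that the
     quantity of interest is the upper tail P(X >= k), as is usual
     when testing for overrepresentation; the exact call made inside
     'GOHyperG' may differ.

```r
# Invented counts for one GO node (illustration only).
N <- 1000  # unique LLIDs on the chip mapped to the GO category
m <- 40    # of those, LLIDs annotated at this node
n <- 100   # supplied (interesting) LLIDs mapped to the GO category
k <- 10    # supplied LLIDs annotated at this node

# Upper-tail hypergeometric probability P(X >= k).
# phyper(q, ...) with lower.tail = FALSE gives P(X > q), hence k - 1.
p_hyper <- phyper(k - 1, m, N - m, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test
# on the corresponding 2 x 2 table.
tab <- matrix(c(k,     n - k,
                m - k, N - m - n + k),
              nrow = 2,
              dimnames = list(c("interesting", "other"),
                              c("at node", "not at node")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The two p-values agree up to numerical tolerance.
all.equal(p_hyper, p_fisher)
```

     With these counts about 4 of the 100 supplied LLIDs would be
     expected at the node by chance (100 * 40 / 1000), so observing
     10 yields a small p-value.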

_V_a_l_u_e:

     The returned value is a list with components: 

 pvalues: The ordered p-values.

goCounts: The vector of counts of LLIDs from the chip at each node.

intCounts: The vector of counts of the supplied LLIDs annotated at each
          node.

   numLL: The number of unique LLIDs on the chip that are mapped to
          some term in the specified GO category.

  numInt: The number of unique LLIDs from those supplied that are
          mapped to some term in the specified GO category.

    chip: A string identifying the chip used.

  intLLs: The input vector 'x'.

 go2Affy: A list with one element for each GO node, containing the
          Affymetrix identifiers associated with that node, for the
          whole chip (not just the interesting genes).

_A_u_t_h_o_r(_s):

     R. Gentleman

_S_e_e _A_l_s_o:

     'phyper'

_E_x_a_m_p_l_e_s:

     library(hgu95av2)
     library(GO)
     w1 <- as.list(hgu95av2LOCUSID)
     w2 <- unique(unlist(w1))
     set.seed(123)
     # pick a hundred interesting genes
     myLL <- sample(w2, 100)
     xx <- GOHyperG(myLL)
     xx$numLL
     xx$numInt
     sum(xx$pvalues < 0.01)

