Free Text Extraction

This document describes a system that populates the attribute values of entities
from the free text found in documents.


Extractors

An extractor maps strings of free text to attribute values. An example extractor
might map a string to 'true' if it contains a match for the regular expression
"kid'?s menu" and 'false' otherwise. An extractor can return values of any type:
there could also be an extractor that maps a string to the number of times a
particular word occurs in it.
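
As a rough illustration of the two kinds of extractor just described, the
following sketch defines one boolean classifier and one counting classifier as
plain Clojure functions. The names kids-menu? and count-word are hypothetical
and are not part of the leafgrabber code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical boolean classifier: true when the text mentions a kids' menu.
(defn kids-menu? [text]
  (boolean (re-find #"kid'?s menu" text)))

;; Hypothetical counting classifier: how many times `word` occurs in the text,
;; ignoring case and punctuation.
(defn count-word [word text]
  (count (filter #{word} (str/split (str/lower-case text) #"\W+"))))

(kids-menu? "Ask about our kids menu")          ; => true
(count-word "menu" "Menu, menu, and more menu") ; => 3
```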

An extractor also knows how to aggregate its values. For instance, a single POI
(point of interest) may be described by several different stretches of free
text. Some of these stretches may match the regular expression "kid'?s menu" and
some may not. The extractor knows how to aggregate these values into a single
value.
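
A minimal sketch of this idea, assuming three made-up stretches of text for one
POI and a deliberately simple aggregation rule (true if any stretch matched).
The rule and the names poi-texts, kids-menu? and any-true? are assumptions
chosen for illustration; the real aggregators may behave differently.

```clojure
;; Hypothetical free-text snippets that all describe the same POI.
(def poi-texts
  ["Great patio and a kid's menu."
   "Parking is tight."
   "The kids menu has pasta."])

;; Classifier: one boolean per stretch of text.
(defn kids-menu? [text]
  (boolean (re-find #"kid'?s menu" text)))

;; Assumed aggregation rule: the POI gets true if any stretch matched.
(defn any-true? [values]
  (boolean (some true? values)))

(map kids-menu? poi-texts)             ; => (true false true)
(any-true? (map kids-menu? poi-texts)) ; => true
```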

The namespace leafgrabber.free-text.extractor contains the var extractor-table,
which is a map from keys (which identify individual extractors) to maps
containing information about those extractors. Each of the individual maps
contains a :classifier key, whose value is the function that computes values for
a single string, and an :aggregator key, whose value is the function that
determines a single value from a sequence of values produced by the classifier.

The namespace also contains functions which return appropriate classifier and
aggregator functions. The function regex-classifier takes a regular expression
and returns a function that returns true when given a string containing a match
for that regular expression. Similarly, the function bool-counter returns an
aggregator function which counts 'true' values, both in absolute terms and as a
ratio of all values.
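
The sketch below shows one way these two function factories could look. It is
not the project's implementation. In particular, this document does not say what
the two arguments to bool-counter mean or what the aggregator returns, so the
sketch assumes they are a minimum count and a minimum ratio of 'true' values,
and that the aggregator returns true only when both thresholds are met.

```clojure
;; Returns a classifier that is true when `re` matches somewhere in the text.
(defn regex-classifier [re]
  (fn [text] (boolean (re-find re text))))

;; ASSUMPTION: min-count and min-ratio are thresholds on the number and the
;; proportion of true values. The real bool-counter may differ.
(defn bool-counter [min-count min-ratio]
  (fn [values]
    (let [total (count values)
          trues (count (filter true? values))]
      (and (pos? total)                       ; avoid dividing by zero
           (>= trues min-count)
           (>= (/ trues total) min-ratio)))))

((regex-classifier #"carry out") "We do carry out") ; => true
((bool-counter 1 (/ 1 10)) [true false false])      ; => true
((bool-counter 1 (/ 1 10)) (cons true (repeat 10 false))) ; => false
```

Under this reading, the second threshold guards against a single stray mention
among many texts: one match in eleven falls below the 1/10 ratio.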

Here is a sample extractor-table defining two extractors:

(def extractor-table
  {:kids_menu_1
   {:classifier (regex-classifier #"kid'?s menu")
    :aggregator (bool-counter 1 (/ 1 10))}
   :carry_out_1
   {:classifier (regex-classifier #"carry out")
    :aggregator (bool-counter 1 0)}})


Attributes

In general there may be more than one extractor that goes into determining the
value of an attribute for a POI. For instance, we might have one extractor that
looks for "kid'?s menu" and one that looks for "menu for kids". Both of these
may go into determining the value of a kids_menu attribute. Accordingly, the
definition of an attribute includes both a set of extractors and an aggregator
function which combines the values of those extractors.
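
To make the relationship concrete, here is a small sketch of an attribute fed by
two extractors. The extractor key :kids_menu_2, the sample values, and the
"true if any extractor is true" aggregator are all assumptions made for
illustration.

```clojure
;; Already-aggregated extractor values for one POI (hypothetical data).
(def ext-values
  {:kids_menu_1 false
   :kids_menu_2 true
   :carry_out_1 false})

;; Hypothetical attribute definition: two extractors feed one attribute.
;; The aggregator is an assumed "true if any extractor is true" rule.
(def kids-menu-attribute
  {:extractors #{:kids_menu_1 :kids_menu_2}
   :aggregator (fn [values] (boolean (some true? values)))})

;; Look up the value of each contributing extractor, then combine them.
(defn attribute-value [{:keys [extractors aggregator]} ext-values]
  (aggregator (map ext-values extractors)))

(attribute-value kids-menu-attribute ext-values) ; => true
```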

The namespace leafgrabber.free-text.attribute contains the var attribute-table,
defining this information. Here is an example defining two attributes:

(def attribute-table
  {:kids_menu
   {:extractors #{:kids_menu_1}
    :aggregator (balanced-bool-counter 1 (/ 1 2))}
   :carry_out
   {:extractors #{:carry_out_1}
    :aggregator (balanced-bool-counter 1 0)}})


Process

The free-text attribute extractor takes as input an attribute definition table
and an extractor definition table, as described above, as well as a file
containing UUID/URL pairs, where each URL points to a web page containing text
which describes the POI identified by the UUID.

Extraction proceeds in three steps. First, the extractor creates a table of raw
data, including all the individual results of the various extractors. The result
is a table that has the following form:

{uuid1 {ext_key_1 (val1 val2 ...)
        ext_key_2 (val3 val4 ...)
        ...}
 uuid2 {ext_key_1 (val5 val6 ...)
        ext_key_2 (val7 val8 ...)
        ...}
 ...}


Second, it makes a table where the values are aggregated so each extractor has a
single value for each POI. The result is a table with this form:

{uuid1 {ext_key_1 val9
        ext_key_2 val10
        ...}
 uuid2 {ext_key_1 val11
        ext_key_2 val12
        ...}
 ...}

Third, it combines the values for the extractors into final values for the
attributes, resulting in a table like this:

{uuid1 {att_key_1 val13
        att_key_2 val14
        ...}
 uuid2 {att_key_1 val15
        att_key_2 val16
        ...}
 ...}
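
The three steps can be sketched in plain Clojure as nested-map transformations.
This is only an illustration of the table shapes above. It replaces the
UUID/URL file with an in-memory map of texts (texts-by-uuid), and the helper
names contains-re, any-true and map-vals, along with the simplified aggregators,
are assumptions rather than leafgrabber functions.

```clojure
;; Hypothetical stand-in for fetched pages: UUID -> texts describing that POI.
(def texts-by-uuid
  {"uuid1" ["We have a kid's menu." "They offer carry out."]
   "uuid2" ["Fine dining only."]})

;; Simplified, assumed classifier and aggregator factories.
(defn contains-re [re] (fn [text] (boolean (re-find re text))))
(defn any-true [values] (boolean (some true? values)))

(def extractor-table
  {:kids_menu_1 {:classifier (contains-re #"kid'?s menu") :aggregator any-true}
   :carry_out_1 {:classifier (contains-re #"carry out")   :aggregator any-true}})

(def attribute-table
  {:kids_menu {:extractors #{:kids_menu_1} :aggregator any-true}
   :carry_out {:extractors #{:carry_out_1} :aggregator any-true}})

;; Apply f to every value of the map m.
(defn map-vals [f m]
  (into {} (for [[k v] m] [k (f v)])))

;; Step 1: uuid -> extractor key -> one raw value per text.
(defn raw-table [ext-table texts-by-uuid]
  (map-vals (fn [texts]
              (map-vals (fn [{:keys [classifier]}] (map classifier texts))
                        ext-table))
            texts-by-uuid))

;; Step 2: uuid -> extractor key -> single aggregated value.
(defn extractor-values [ext-table raw]
  (map-vals (fn [by-ext]
              (into {} (for [[k vs] by-ext]
                         [k ((get-in ext-table [k :aggregator]) vs)])))
            raw))

;; Step 3: uuid -> attribute key -> final value.
(defn attribute-values [att-table ext-vals]
  (map-vals (fn [by-ext]
              (map-vals (fn [{:keys [extractors aggregator]}]
                          (aggregator (map by-ext extractors)))
                        att-table))
            ext-vals))

(->> (raw-table extractor-table texts-by-uuid)
     (extractor-values extractor-table)
     (attribute-values attribute-table))
;; => {"uuid1" {:kids_menu true,  :carry_out true}
;;     "uuid2" {:kids_menu false, :carry_out false}}
```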


Revised Process, to fit better with Cascalog

First query: make a [uuid url ext] source from a dir:

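(defn uuid-url-source [dir ext-table]
  (let [text-source (hfs-textline dir)]
    (<- [?uuid ?url ?ext]
        ;; read each input line, then split it into its UUID and URL
        (text-source ?line)
        (extract-uuid ?line :> ?uuid)
        (extract-url ?line :> ?url)
        ;; pair the line with the extractor keys from ext-table
        (add-exts ext-table :> ?ext))))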

Second query: make a set of tuples like [uuid ext_key val]

(defn raw-data [dir ext-table]
  (let [source (uuid-url-source dir ext-table)]
    (<- [?uuid ?ext ?value]
        (source ?uuid ?url ?ext)
        ;; run extractor ?ext over the text found at ?url
        (add-value ?url ?ext :> ?value))))

Third query: aggregate the tuples' values

(defn agg-ext [dir ext-table]
  (let [source (raw-data dir ext-table)]
    (<- [?uuid ?ext ?agg-value]
        (source ?uuid ?ext ?raw-value)
        (aggregate-ext-values ext-table ?raw-value :> ?agg-value))))

Fourth query: aggregate the extractors into attributes

(defn agg-att [dir ext-table att-table]
  (let [source (agg-ext dir ext-table)]
    (<- [?uuid ?att ?att-value]
        (source ?uuid ?ext ?agg-value)
        ;; bind the output variables that the query head expects
        (aggregate-att-values att-table ?ext ?agg-value :> ?att ?att-value))))
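
For readers less familiar with Cascalog, the flat-tuple shape used by these
queries can be mimicked in plain Clojure. The sketch below covers only the third
query: it groups [uuid ext value] tuples by uuid and extractor key and
aggregates each group. The sample tuples, the aggregators map and the function
agg-ext-tuples are hypothetical, and the grouping is an assumed approximation
of what the query would do, not Cascalog code.

```clojure
;; Hypothetical raw tuples, shaped like the output of the second query.
(def raw-tuples
  [["uuid1" :kids_menu_1 true]
   ["uuid1" :kids_menu_1 false]
   ["uuid1" :carry_out_1 false]
   ["uuid2" :kids_menu_1 false]])

;; Assumed per-extractor aggregators: true if any raw value is true.
(defn any-true [values] (boolean (some true? values)))
(def aggregators {:kids_menu_1 any-true
                  :carry_out_1 any-true})

;; Group by [uuid ext], then reduce each group to one aggregated tuple.
(defn agg-ext-tuples [aggregators tuples]
  (for [[[uuid ext] group] (group-by (juxt first second) tuples)]
    [uuid ext ((aggregators ext) (map last group))]))

(set (agg-ext-tuples aggregators raw-tuples))
;; => #{["uuid1" :kids_menu_1 true]
;;      ["uuid1" :carry_out_1 false]
;;      ["uuid2" :kids_menu_1 false]}
```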
