# bionlp

A Clojure library of tools for biomedical NLP tasks, such as Named Entity Recognition (NER) for diseases, chemicals, genes, and procedures.
bionlp uses Hugging Face transformers models to perform NER, and the UMLS Lexical Tools to resolve named entities to UMLS CUIs (Concept Unique Identifiers).

## Usage

Add the dependency to your Leiningen `project.clj`:
```clojure
[md.datum/bionlp "0.1.0"]
```
### Required Python Packages
- transformers >= 3.1.0
```
$> pip install --user transformers==3.1.0
```
- onnxruntime >= 1.8.1
```
$> pip install --user onnxruntime
```
- onnx_transformers
```
$> pip install --user git+https://github.com/patil-suraj/onnx_transformers
```
### Additional Dependencies
- bionlp depends on [libpython-clj](https://github.com/clj-python/libpython-clj), a Clojure interface to Python. Add it to your project as a dependency:
```clojure
[clj-python/libpython-clj "2.00-beta-15"]
```
- bionlp uses [clojure-opennlp](https://github.com/dakrone/clojure-opennlp). You don't need to add it as a dependency, but you do need to download the POS model [en-pos-maxent.bin](http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin) into the `resources/models` directory.
- bionlp uses the UMLS RRF files, in particular MRCONSO.RRF and MRSTY.RRF. Copy these files into the resources folder.
- bionlp uses the [UMLS Lexical Tools](https://lhncbc.nlm.nih.gov/LSG/Projects/lvg/current/web/download.html) to perform term inflection and variant generation. After downloading, install them into your local Maven repository as follows:
**Note:** Run these commands from the root folder of the download, i.e. `lvg2021`.
 ```
 $> mvn install:install-file -Dfile=lib/lvg2021api.jar -DpomFile=pom.xml
 $> mvn install:install-file -Dfile=lib/lvg2021dist.jar -DpomFile=pom.xml
 ```
**Note:** You also need to copy the `lvg.properties` file from the `data/config` folder into your resources folder and rename it to `data.config.lvg`:
```
 $> cp data/config/lvg.properties ~/projects/bionlp-proj/resources/data.config.lvg
```
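For orientation, MRCONSO.RRF (one of the RRF files mentioned above) is a pipe-delimited text file with one concept name per row. The sketch below shows how a single row could be split into a map. It is only an illustration of the file format, not how bionlp itself reads the file; `parse-mrconso-line` and `sample-row` are hypothetical names, and the sample row is made up rather than taken from UMLS.

```clojure
(require '[clojure.string :as str])

;; Column order of MRCONSO.RRF as listed in the UMLS Reference Manual.
(def mrconso-columns
  [:cui :lat :ts :lui :stt :sui :ispref :aui :saui :scui :sdui
   :sab :tty :code :str :srl :suppress :cvf])

(defn parse-mrconso-line
  "Splits one pipe-delimited MRCONSO.RRF row into a map keyed by column name.
  Hypothetical helper for illustration; not part of bionlp."
  [line]
  ;; A limit of -1 keeps empty trailing fields instead of dropping them.
  (zipmap mrconso-columns (str/split line #"\|" -1)))

;; Made-up row for illustration only, not real UMLS data.
(def sample-row
  "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D000001|Example Term|0|N||")

(select-keys (parse-mrconso-line sample-row) [:cui :lat :str])
;; => {:cui "C0000001", :lat "ENG", :str "Example Term"}
```

Each row ends with a trailing `|`, so the split yields one extra empty string; `zipmap` stops at the shorter collection and ignores it.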
### Basic Usage (From REPL)
To run BioBERT NER, first instantiate a transformers NER pipeline, then pass it to the `batched-ner` function along with the text you want to classify.

```clojure
user> (require '[bionlp.biobert :as biobert])
nil

user> (def condition-nlp (biobert/nlp-pipeline "resources/models/output/NCBI-disease"))
#'user/condition-nlp

user> (def results (biobert/batched-ner condition-nlp "The objective of this study was to provide more accurate frequency estimates of breast cancer susceptibility gene 1 (BRCA1) germline alterations in the ovarian cancer population"))
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
#'user/results

user> results
({:token "breast cancer", :index 15} {:token "ovarian cancer", :index 38})
```
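
The result shown above is an ordinary sequence of maps with `:token` and `:index` keys, so it can be post-processed with core Clojure functions. The following is a minimal sketch that uses a literal copy of that output (bound to the hypothetical name `sample-results`) so it runs without loading a model:

```clojure
;; Literal copy of the result printed in the REPL session above,
;; so this snippet runs without loading any model.
(def sample-results
  [{:token "breast cancer", :index 15}
   {:token "ovarian cancer", :index 38}])

;; Distinct entity strings, in order of first appearance.
(def entity-names
  (distinct (map :token sample-results)))

;; How many times each entity string was recognized.
(def entity-counts
  (frequencies (map :token sample-results)))

entity-names
;; => ("breast cancer" "ovarian cancer")
```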

Once you've identified tokens, you can look up the UMLS CUI of each matching concept as follows:

```clojure
user> (require '[bionlp.umls :as umls])
nil

user> ;; First create a concept trie based on the TUIs of semantic groups
(def concept-trie (umls/create-concept-lookup-trie :disease))
#'user/concept-trie

user> ;; Do a lookup for each token
(map #(umls/lookup-concept concept-trie (:token %)) results)
("C0678222" "C1140680")
```
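
To keep each CUI next to the token it was resolved from, the two sequences can be merged into one. The sketch below uses literal copies of the values printed above (the names `sample-results`, `sample-cuis`, and `annotated` are hypothetical) so it runs without the UMLS files. In a live session the same shape could presumably be produced by calling `umls/lookup-concept` inside the mapping function; what that function returns for an unmatched token is not covered here.

```clojure
;; Literal copies of the NER result and the CUIs printed above.
(def sample-results
  [{:token "breast cancer", :index 15}
   {:token "ovarian cancer", :index 38}])

(def sample-cuis ["C0678222" "C1140680"])

;; Pair each result with its CUI by position.
(def annotated
  (mapv (fn [result cui] (assoc result :cui cui))
        sample-results
        sample-cuis))

(map (juxt :token :cui) annotated)
;; => (["breast cancer" "C0678222"] ["ovarian cancer" "C1140680"])
```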

## License

Copyright © 2021 datum.md

This program and the accompanying materials are made available under the
terms of the Eclipse Public License 2.0 which is available at
http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary
Licenses when the conditions for such availability set forth in the Eclipse
Public License, v. 2.0 are satisfied: GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or (at your
option) any later version, with the GNU Classpath Exception which is available
at https://www.gnu.org/software/classpath/license.html.
