# dbwalk

Dependency: [vincit/dbwalk "0.2.1"]


A Clojure library for reading nested maps from, and writing them to, a database or other source of relational data.

Dbwalk has multiple output formats, including one that just puts everything in a nested map exactly as it is
in the database and another where you can select which tables and columns you want to see.

Other ideas in dbwalk:

* The main components' input and output formats are just data. If you don't like the included API, roll your own.
* When you have a complex database and don't feel like specifying the full database path in every query, you can
  declare it just once. See QueryPath below.
* Is your data split between two databases or a db and file system? In dbwalk, relations do not have to begin and end in the same datasource. 
  See Relation and Datasource below.
* Automatic database schema detection using the information schema (for PostgreSQL).

## What can it do?


Let's assume that we have a db-spec for a database. The database contains companies and droids,
which are connected via the company_workers join table.

First, we'll autodetect the structure of the database in *demo-db-spec*. This queries the information schema for primary and foreign keys.
    
    (def db-conf (db-spec->configuration demo-db-spec))
    => #'dbwalk.api.simple/db-conf
    
Let's select everything in the database. For that, we need a nested tree of database queries.
It should look like this: 

    {:dbwalk/query {:select (:*), :from (:company)},
     :dbwalk/eager
     [{:dbwalk/query {:select (:*), :from (:company_workers)},
       :dbwalk/eager [{:dbwalk/query {:select (:*), :from (:droid)}}]}]}

But we do not want to write all that, so let's use a helper:

    (require '[dbwalk.api.vector :as v])
    => nil
    (v/vec->query-tree [:company [:company_workers [:droid]]])
    =>
    {:dbwalk/query {:select (:*), :from (:company)},
     :dbwalk/eager
     [{:dbwalk/query {:select (:*), :from (:company_workers)},
       :dbwalk/eager [{:dbwalk/query {:select (:*), :from (:droid)}}]}]}

Inside the vectors, a set of keywords will select only those columns and a HoneySQL WHERE clause can be used for filtering.

Now to get the data:

    (def result (simple-query->graph db-conf (v/vec->query-tree [:company [:company_workers [:droid]]])))
    => #'dbwalk.api.simple/result

It is returned as a [Loom](https://github.com/aysylu/loom) digraph, which can be formatted in different ways. This is a nested map projection:
    
    (dbwalk.output.map/graph->map result)
    =>
    [{:id 1,
      :name "Universal Exports",
      :company_workers
      [{:company_id 1, :person_id [{:id 1, :name "Drone One"}]}
       {:company_id 1, :person_id [{:id 2, :name "Drone Two"}]}]}
     {:id 2,
      :name "Universal Imports",
      :company_workers
      [{:company_id 2, :person_id [{:id 3, :name "Bot One"}]}
       {:company_id 2, :person_id [{:id 4, :name "Bot Two"}]}]}]
       
Notice how it follows the database structure?

Here is the same data, but with properties namespaced to tables for use with clojure.spec:

    (dbwalk.output.map/graph->namespaced-map result)
    =>
    [{:company/id 1,
      :company/name "Universal Exports",
      :company_workers
      [{:company_workers/company_id 1,
        :company_workers/person_id 1,
        :person_id [{:droid/id 1, :droid/name "Drone One"}]}
       {:company_workers/company_id 1,
        :company_workers/person_id 2,
        :person_id [{:droid/id 2, :droid/name "Drone Two"}]}]}
     {:company/id 2,
      :company/name "Universal Imports",
      :company_workers
      [{:company_workers/company_id 2,
        :company_workers/person_id 3,
        :person_id [{:droid/id 3, :droid/name "Bot One"}]}
       {:company_workers/company_id 2,
        :company_workers/person_id 4,
        :person_id [{:droid/id 4, :droid/name "Bot Two"}]}]}]
    

Most of the time, the data in the company_workers table is not interesting because the nesting contains the same information.
We can select only some columns:

    (dbwalk.output.filtered/graph->filtered-map [:company/name :droid/name] result)
    =>
    [{:company/name "Universal Exports",
      :droid [{:droid/name "Drone One"} {:droid/name "Drone Two"}]}
     {:company/name "Universal Imports",
      :droid [{:droid/name "Bot One"} {:droid/name "Bot Two"}]}]

:droid/* would select all columns and :droid/id :droid/name would select columns id and name.
    

There is one more formatter, which is more like a database dump:

    (dbwalk.output.flat/graph->flat result)
    =>
    {:droid
     #{{:id 1, :name "Drone One"} {:id 2, :name "Drone Two"}
       {:id 3, :name "Bot One"} {:id 4, :name "Bot Two"}},
     :company_workers
     #{{:company_id 2, :person_id 3} {:company_id 1, :person_id 1}
       {:company_id 1, :person_id 2} {:company_id 2, :person_id 4}},
     :company #{{:id 1, :name "Universal Exports"} {:id 2, :name "Universal Imports"}}}
    
There is also a version that outputs a map from each table to its rows, indexed by primary key.
    
Note that these were all different projections of the same result.
    
Here's an example of filtering with HoneySQL:
    
    ; Make an output formatter to shorten example
    (def names-only (partial dbwalk.output.filtered/graph->filtered-map [:company/name :droid/name]))
    => #'dbwalk.api.simple/names-only
    
    
    ;Use HoneySQL to filter
    (def result (simple-query->graph db-conf
                                     (v/vec->query-tree [:company (sql/where [:= :id 1])
                                                         [:company_workers
                                                          [:droid]]])))
    => #'dbwalk.api.simple/result
    
    (names-only result)
    =>
    [{:company/name "Universal Exports",
      :droid [{:droid/name "Drone One"} {:droid/name "Drone Two"}]}]
    

Dbwalk works by selecting rows from a table and then generating a SELECT ... FROM <next table> WHERE <foreign/primary key> IN (<values from previous table>).
This means that filtering does not work the way it would if we were using SQL JOINs:
    
    (def result (simple-query->graph db-conf
                                     (v/vec->query-tree [:droid
                                                         [:company_workers
                                                          [:company (sql/where [:= :id 1])]]])))
                                                         
    
    => #'dbwalk.api.simple/result
    
    (names-only result)
    =>
    [{:droid/name "Drone One",
      :company [{:company/name "Universal Exports"}]}
     {:droid/name "Drone Two",
      :company [{:company/name "Universal Exports"}]}
     {:droid/name "Bot One"}
     {:droid/name "Bot Two"}]
    
Notice how we got all droids even though we got only one company?
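
The behaviour above can be reproduced without a database. The following is a minimal sketch in plain Clojure, not dbwalk's implementation: the `droids`, `company-workers` and `companies` vectors and the `select-in` and `droids-with-companies` functions are hypothetical stand-ins that emulate one `WHERE ... IN` lookup per table.

```clojure
;; Hypothetical in-memory stand-ins for the demo tables.
(def droids
  [{:id 1 :name "Drone One"} {:id 2 :name "Drone Two"}
   {:id 3 :name "Bot One"}   {:id 4 :name "Bot Two"}])

(def company-workers
  [{:company_id 1 :person_id 1} {:company_id 1 :person_id 2}
   {:company_id 2 :person_id 3} {:company_id 2 :person_id 4}])

(def companies
  [{:id 1 :name "Universal Exports"} {:id 2 :name "Universal Imports"}])

(defn select-in
  "Emulates SELECT * FROM rows WHERE column IN (values)."
  [rows column values]
  (let [wanted (set values)]
    (filter #(contains? wanted (get % column)) rows)))

(defn droids-with-companies
  "Walks droid -> company_workers -> company. `company-pred` is applied
  only to the last step, like a WHERE clause on the innermost query."
  [company-pred]
  (let [links           (select-in company-workers :person_id (map :id droids))
        matched         (filter company-pred
                                (select-in companies :id (map :company_id links)))
        companies-by-id (into {} (map (juxt :id identity)) matched)]
    (vec
     (for [droid droids]
       (let [found (for [link links
                         :when (= (:person_id link) (:id droid))
                         :let [company (companies-by-id (:company_id link))]
                         :when company]
                     {:company/name (:name company)})]
         ;; The droid row is kept even when no company survived the filter.
         (cond-> {:droid/name (:name droid)}
           (seq found) (assoc :company (vec found))))))))

(droids-with-companies #(= 1 (:id %)))
```

The filter only removes rows at the level where it is applied. Rows already read from earlier tables stay in the result, which is why all four droids appear.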
    
    
But is it really necessary to always write the join tables into the vector? Surely that information is in the database?
    
It is. You can specify the tree structure you want to use only once, either manually or by breadth-first search:

    (def from-company (dbwalk.api.subgraph/build-path-from db-conf :company))
    => #'dbwalk.api.simple/from-company
    
    from-company
    => {:start :company, :link-map {:company #{:company_workers}, :company_workers #{:droid}}}
    
And then just ask for the columns you want:
    
    (dbwalk.input.filtered/query-for-columns from-company [:company/* :droid/name])
    =>
    {:dbwalk/query {:select (:*), :from (:company)},
     :dbwalk/eager
     [{:dbwalk/query {:select (:*), :from (:company_workers)},
       :dbwalk/eager [{:dbwalk/query {:select (:name), :from (:droid)}}]}]}
    
This does have some limitations: the path can contain each table only once, and one of the tables given as parameters must be at the root of the generated query tree.

When these output formats are not what you want, write your own output plugin or try using 

* clojure.walk
* [Specter](https://github.com/nathanmarz/specter)
* [camel-snake-kebab](https://github.com/qerub/camel-snake-kebab)

## Limitations

Multicolumn foreign keys are not supported. 

## Overview

Dbwalk is based on the idea from [Objection.js](https://www.fi/en/blog/nested-eager-loading-and-inserts-with-objection-js/), namely that given a set of database rows, 
it is possible to extract primary/foreign key columns and fetch a set of related rows from another table 
using one WHERE *foreign key column* IN (*list of primary key column values*) or vice versa. 
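
As an illustration of that step, here is a small sketch that builds the follow-up query as a HoneySQL-style data map. The function name `next-query` and its argument order are assumptions made for this example and are not part of dbwalk.

```clojure
(defn next-query
  "Builds a HoneySQL-style query map that selects the rows of `table`
  whose `column` matches any `source-column` value found in `rows`."
  [rows source-column table column]
  {:select [:*]
   :from   [table]
   :where  [:in column (vec (distinct (map source-column rows)))]})

;; Rows already read from the company table lead to one query
;; against the company_workers table.
(next-query [{:id 1 :name "Universal Exports"}
             {:id 2 :name "Universal Imports"}]
            :id :company_workers :company_id)
;; => {:select [:*], :from [:company_workers], :where [:in :company_id [1 2]]}
```

Swapping the roles of the two columns gives the "vice versa" case, where the previous table holds the foreign key and the next table is matched on its primary key.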

As this is a Clojure library, we do not use Objects and therefore dbwalk is not an ORM. 
The relations in dbwalk are the foreign key relations of the database tables themselves and there is no abstraction over the data contained in the database.

### Core idea, simplified

The "main" namespace is v.d.crawler. It contains functions that take a database description and a nested query,
read the database, and produce a [Loom](https://github.com/aysylu/loom) graph of the results.

The database description is a listing of all primary and foreign keys in the database. 
See v.d.schema-detect and v.d.config for PostgreSQL autodetection and further formatting. 
This is a two-part process so that implementing schema autodetection for other databases is simpler.

The nested query is a tree in which every node corresponds to a single SELECT to that node's table.

The result (a Loom graph) contains a node for each row read from the database. 
The edges in the graph contain the database relations that were used to travel between tables and rows.

Most of the codebase consists of helper functions for generating the input data and formatting the output graph. 
Their use is encouraged but not required, as everything is just data.
The user is expected to roll their own API. See v.d.api.sweet for examples. 
Note that the schema detection should be run only once (after db migrations) in actual use.

The idea of a Datasource can be simply thought of as "the database" if only one database is used. 
See v.d.api.sweet/filtered->graph for how to write a single-database API function.

## Concepts in more detail

### QueryTree

Dbwalk's query format is a tree of SQL queries, each to one table.

For each level in the query tree, :dbwalk/query selects properties from that level's table and :dbwalk/eager contains the tables to branch the query to.

The query functions take HoneySQL queries. The FROM clause must have only one table, 
as it is used to determine the next table to walk to.
You can use WHERE, ORDER BY and pretty much everything provided by HoneySQL. 
Selecting only some columns is allowed, but dbwalk will automatically add all columns required for the relations in the query tree.

    {:dbwalk/query (-> (s/select :*)
                       (s/from :owners))                   ;; Select from "owners"
     :dbwalk/eager [{:dbwalk/query (-> (s/select :*)
                                       (s/from :items))}]}  ;; Then query "items" and match owners to items.

The above query is presented without the required datasource information, as it should be added using dbwalk.utils/with-datasource. 
See the examples in utils_test.

A simple helper for generating query trees can be found in v.d.api.vector. It transforms a nested vector into a query tree.

Inside a vector,

* the first keyword is the target table
* the first set of keywords contains the columns to select (default :*)
* the first map contains a base query to start from. SELECT and FROM clauses will be overwritten. Note that HoneySQL's helpers return a map.
* contained vectors are handled recursively and placed under the :dbwalk/eager key
    
As an example,

    [:my-table #{:id} 
     [:another-table #{:name}]
     [:third-table #{:foo} (sql/where [:= :name "bar"])]]

becomes

    {:dbwalk/query {:select (:id), :from (:my-table)},
     :dbwalk/eager [{:dbwalk/query {:select (:name)
                                    :from   (:another-table)}}
                    {:dbwalk/query {:select (:foo)
                                    :from   (:third-table)
                                    :where  [:= :name "bar"]}}]}
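
The rules listed above can be sketched in a few lines of plain Clojure. This is a hypothetical re-implementation for illustration only, not the code in v.d.api.vector. The name `vec->tree` is an assumption, and the base query is written as a literal map where the README uses a HoneySQL helper.

```clojure
(defn vec->tree
  "Turns a nested vector into a query tree following the rules above:
  first keyword = table, first set = columns (default :*),
  first map = base query, nested vectors = eager children."
  [[table & more]]
  (let [columns  (or (first (filter set? more)) #{:*})
        base     (or (first (filter map? more)) {})
        children (filter vector? more)
        ;; :select and :from always overwrite whatever the base query had.
        query    (assoc base :select (seq columns) :from (list table))]
    (cond-> {:dbwalk/query query}
      (seq children) (assoc :dbwalk/eager (mapv vec->tree children)))))

(vec->tree [:my-table #{:id}
            [:another-table #{:name}]
            [:third-table #{:foo} {:where [:= :name "bar"]}]])
```

Because a set has no defined order, the column order inside :select is unspecified when more than one column is requested.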

### QueryPath

The problem with the QueryTree format is that it complects the data you want to get with the path used to gather the data. 
When using SQL JOINs, the type of join affects the format and content of your result data. 
With the dbwalk method of getting data, when you go from table A to B you will always get 0..n rows of B in relation to one row from A.

If you consider the structure of a database as an undirected graph, a QueryPath is one of its directed acyclic subgraphs. 
As duplicate nodes in a QueryPath are also forbidden, a QueryPath is in fact a tree. 

QueryPaths can be generated in the REPL using the helpers in v.d.api.subgraph. 
It also contains a function which will generate a QueryPath for the entire database structure using a breadth-first search starting from a given table.
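
To show what such a breadth-first search might look like, here is a sketch that works on a plain adjacency map instead of a dbwalk configuration. The `build-path` function and the shape of its `adjacency` argument are assumptions for this example. Only the shape of the returned map is taken from the `from-company` example earlier in this README.

```clojure
(defn build-path
  "Breadth-first search over an undirected adjacency map of tables.
  Each table is linked from the first table that reaches it, so every
  table appears only once in the resulting path."
  [adjacency start]
  (loop [queue   (conj clojure.lang.PersistentQueue/EMPTY start)
         visited #{start}
         links   {}]
    (if-let [table (peek queue)]
      (let [unseen (remove visited (get adjacency table))]
        (recur (into (pop queue) unseen)
               (into visited unseen)
               (cond-> links
                 (seq unseen) (assoc table (set unseen)))))
      {:start start :link-map links})))

(build-path {:company         #{:company_workers}
             :company_workers #{:company :droid}
             :droid           #{:company_workers}}
            :company)
;; => {:start :company,
;;     :link-map {:company #{:company_workers}, :company_workers #{:droid}}}
```

Starting the same search from :droid yields the opposite direction of travel over the same relations.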

The primary advantage of using QueryPaths is that in a "normal" database there are only a few directions of travel that produce meaningful data.
If, for example, your database queries join tables A -> B -> C -> D for one query and C -> D for another, you are using only one direction of travel.
Using a QueryPath, you can just state that the path is A -> B -> C -> D and request data from tables A and D in one query 
and data from C and D in another without repeating the path used to gather the data.

In practice, a QueryPath and a list of required data such as [:A/* :C/id :C/name] are enough to generate a query tree
from the smallest subtree of the QueryPath that still covers the requested tables. However, the helpers will not 
return a query tree unless one of the given tables is at the root of the minimal subtree. 
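
The minimal-subtree rule can be made concrete with a short sketch. It is not the library's algorithm. `path-to` and `minimal-tables` are hypothetical helpers that operate on the `{:start ... :link-map ...}` shape shown earlier and return only the set of tables that would need to be walked.

```clojure
(defn path-to
  "Returns the tables from the path's start down to `table`, or nil
  when `table` is not part of the QueryPath."
  [{:keys [start link-map]} table]
  (letfn [(search [node trail]
            (let [trail (conj trail node)]
              (if (= node table)
                trail
                (some #(search % trail) (get link-map node)))))]
    (search start [])))

(defn minimal-tables
  "Returns the set of tables in the smallest subtree covering
  `requested`, or nil when the root of that subtree was not requested."
  [query-path requested]
  (let [paths (map #(path-to query-path %) requested)]
    (when (and (seq requested) (every? some? paths))
      (let [;; Number of leading tables shared by every path.
            shared (count (take-while #(apply = %) (apply map vector paths)))
            root   (nth (first paths) (dec shared))]
        (when (contains? (set requested) root)
          (set (mapcat #(drop (dec shared) %) paths)))))))

(def from-company
  {:start :company
   :link-map {:company #{:company_workers} :company_workers #{:droid}}})

(minimal-tables from-company [:company :droid])
;; => #{:company :company_workers :droid}
(minimal-tables from-company [:company_workers :droid])
;; => #{:company_workers :droid}
```

In a branching path such as `{:start :a :link-map {:a #{:b :c}}}`, asking for only :b and :c returns nil, because the root of the covering subtree (:a) was not requested.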

See also Output formats/Filtered for the companion output formatter. The query generator is in dbwalk.input.filtered and the output formatter is in dbwalk.output.filtered.

### Datasource

A Datasource is anything that holds relational data. When building queries, each database is its own Datasource. 
Datasources are abstracted by using multimethods. See dbwalk.relations for the abstractions.

The datasource description for an SQL db can be anything that the functions in clojure.java.jdbc accept. 
The user is expected to handle transaction and connection pool management.

### Relation

A Relation in dbwalk is based on the idea of a foreign key in databases. 
It describes a unique identifier for some data item and the way it is contained in another item. The only Relations
currently implemented are

* source table's primary key in a column in the target table (OneToMany relation)
* a column in the source table, containing the target table's primary key (ManyToOne relation)

A join table as commonly used is seen as a OneToMany relation (into the join table) followed by a ManyToOne relation (out of it).

Currently an SQL endpoint has been implemented, so most SQL databases should be walkable. 
There is no restriction that both ends of a relation must be in the same database.

Other endpoints are fairly easy to implement, see 
test/dbwalk.data_source_abstraction_test.clj for a proof-of-concept EDN file endpoint. 
It allows a query to move from a table to an EDN file as if it were a table.

## Output formats

Look at the tests in writer_test to see inserting and querying examples for Tree output. 
Note that running the tests requires an empty PostgreSQL database. See test_setup.clj. 

### Map

All of the data from the database as a vector of top-level row items, each of which is a map. 

Each map in the result represents a row in the database and its related rows in other tables. 
The related rows are placed in a vector and assoc'ed to the row either replacing the foreign key or, 
when the foreign key is in the other table, under a new keyword key named after the database table the related rows came from.

The map will contain the requested columns plus all foreign and primary keys required for dbwalk to function.

### Flat

A one-level map of tablename -> set of rows read from the table. This may be used when building a web application
where the frontend's in-memory database is denormalized and you want the backend to gather related items from the db
and just add them to the frontend db.

### Filtered

This is an input generator/output formatter pair. Note that they can and should also be used separately.

See filtered_test for usage. The query generator (query-for-columns) takes a QueryPath and a list of namespaced keywords,
each of which describes a column. For example, :items/id is the column "id" in table "items". 
It is also possible to add HoneySQL queries (without SELECT or FROM) so that filtering and sorting are simple.
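
The namespaced-keyword convention is easy to take apart with core functions. The sketch below groups requested columns by table. `columns-by-table` is a hypothetical helper, not a dbwalk function, and it assumes that every keyword carries a table namespace.

```clojure
(defn columns-by-table
  "Groups namespaced column keywords into a map of table -> set of columns.
  :items/id means column id in table items."
  [columns]
  (reduce (fn [acc column]
            (update acc
                    (keyword (namespace column))
                    (fnil conj #{})
                    (keyword (name column))))
          {}
          columns))

(columns-by-table [:company/* :droid/id :droid/name])
;; => {:company #{:*}, :droid #{:id :name}}
```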

dbwalk.input.filtered/query-for-columns returns a QueryTree that walks the minimal set of tables needed to SELECT all of the requested columns.
Note that one of the selected columns must be in a table that will be the root of the QueryTree.

The filtered output is similar to the Map output, except that it contains only the requested columns 
and all related items are placed in a vector under a key named after the related items' source table. 

## Writing
  
Writing operations are under development and are not ready for production. The implementation is divided using the same principle as 
querying.
 
The writer component (v.d.a.writer) takes a Loom Digraph that is somewhat similar to the output graph
except the edges represent the insertion/deletion order. The writer component
always applies the requested operations to nodes without incoming edges, updates their successors with
the generated foreign keys and removes the nodes. This is done recursively until the graph has no nodes.
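
The processing order described above is a topological traversal. The sketch below covers only the ordering. It uses a plain map of node -> set of successors instead of a Loom digraph and leaves out the foreign key propagation. `write-order` is a hypothetical name, and every node is assumed to appear as a key, with an empty set when it has no successors.

```clojure
(defn write-order
  "Repeatedly takes the nodes without incoming edges, appends them to
  the result and removes them, until the graph has no nodes left."
  [successors]
  (loop [graph successors
         order []]
    (if (empty? graph)
      order
      (let [targets (set (mapcat val graph))
            ready   (remove targets (keys graph))]
        (when (empty? ready)
          (throw (ex-info "Cycle in write graph" {:remaining (keys graph)})))
        (recur (apply dissoc graph ready)
               (into order ready))))))

;; Both parent rows must exist before the join table row can refer to them.
(write-order {:company         #{:company_workers}
              :droid           #{:company_workers}
              :company_workers #{}})
```

With this input, :company and :droid are processed in the first round and :company_workers in the second.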

There is currently only one option for generating the graph, v.d.a.full-map. It takes a nested map in the
'map' output format with operations set as metadata for the maps. See the test in v.d.a.operations-test.

## Plans

Everything will be moved to clojure.spec when it becomes stable. 

## FAQ

### I get an exception when my dataset is too large!

This may be caused by the number of parameters in the generated JDBC query, specifically in the
WHERE IN part. Figure out what your database engine's maximum count is and assoc it to the config:

    (assoc-in dbwalk-config [:dbwalk/options :dbwalk/partition-size] 32000) ;; For PostgreSQL 

Note that this will perform multiple SELECTs to a single table at the same place you got an
exception. Unfortunately this means that using the database engine for sorting, aggregations etc.
will not work properly.
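
A rough sketch of what partitioning means for the generated queries, using HoneySQL-style maps. `partitioned-queries` is a hypothetical function written for this FAQ entry and does not show dbwalk's internals.

```clojure
(defn partitioned-queries
  "Splits `values` into chunks of at most `partition-size` and builds
  one WHERE ... IN query per chunk."
  [table column values partition-size]
  (for [chunk (partition-all partition-size values)]
    {:select [:*]
     :from   [table]
     :where  [:in column (vec chunk)]}))

;; Five ids with a partition size of two produce three queries,
;; for ids [1 2], [3 4] and [5].
(partitioned-queries :droid :id (range 1 6) 2)
```

Each query is sorted and aggregated by the database on its own, so an ORDER BY or a COUNT applies per chunk and not to the combined result.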

## License

Copyright © 2016 Vincit. All rights reserved.

Distributed under the [Eclipse Public License v 1.0](http://www.eclipse.org/org/documents/epl-v10.html), the same as Clojure.



## Changelog

### 0.2.1

Everything required for simple use is now in the dbwalk.api.simple ns.

### 0.1.6-SNAPSHOT

Major refactoring of the "simple" ns (previously "sweet").
Filtered ns was split into input/output.
RuntimeExceptions are thrown where appropriate.

Added update and delete to the insert ns and renamed it to dbwalk.action. It will hopefully be renamed again before the first public release.

### 0.1.5

Last release that is compatible with older versions.

