# sedan [![Travis CI status](https://secure.travis-ci.org/QuiltProject/sedan.png)](http://travis-ci.org/#!/QuiltProject/sedan/builds)

![1972 Oldsmobile Delta 88](docs/1972delta88.jpg)

Sedan implements an encoding of scalar and simple aggretate data with
a static lexicographical ordering to match a defined runtime sort/collation
order, both intended to be language- and runtime-neutral.

Sedan is part of the [Quilt Project](http://quilt.org).

(Historical note re: the name: sortable
[EDN](https://github.com/edn-format/edn/) => sedan.)

## Why?

A lot of systems provide features that rely on ordered data (the most common
example perhaps being database indexes), but those systems' data models are
rarely very powerful, at least in their default/off-the-shelf form — often
limited to strings or sometimes even only bytes/blobs — and introducing any
runtime to such systems to support additional datatypes is almost always
impractical.

If we could encode rich data values into (byte)strings that retained those
values' runtime collation order, many of the benefits of our data model can be
layered on top of these systems' string/byte-only capabilities.

## Mise en garde

_Use may cause personal injury._

This encoding will change, at a minimum to en/decode to/from binary (see TODOs
below). When no further changes that will impact the encoding are pending, a
`1.0.0` release will be cut.

That said, it's been dog-fooded pretty heavily in conjunction with other parts
of Quilt, so the brave should hack away.

## "Installation"

Sedan is available in Clojars and Maven central. Add this `:dependency` to your
Leiningen `project.clj`:

```clojure
[org.quilt/sedan "0.0.4"]
```

Or, add this to your Maven project's `pom.xml`:

```xml
<dependency>
  <groupId>org.quilt</groupId>
  <artifactId>sedan</artifactId>
  <version>0.0.4</version>
</dependency>
```

Sedan requires Clojure 1.6.0+, or a recent version of ClojureScript (check the
`project.clj` file for the current dependency; assume it or a later rev of
ClojureScript is needed).

## "Specification"

Data serialized with Sedan maintains the natural runtime sort order among values
of the same type as implemented in Clojure[Script].<sup><small><a
href="#altlang">NOTE</a></small></sup> This implies a simple invariant for any
pair of values of the same type, `x` and `y`, leaving aside the detail of the
magnitude of the return value of `compare` (which is irrelevant to collation
order, `sort`, etc):

```clojure
; (require '[cemerick.sedan :as sedan])
(== (compare x y) (compare (sedan/encode x) (sedan/encode y)))
```

Sedan deviates from this intuition in only a few cases:

First, values of different types are generally not comparable (as per
`java.lang.Comparable` or `cljs.core.IComparable`), so Sedan
encoding stratifies values by prefixing all serializations with a _tag_, a
single-byte type identifier.  This ensures that values with very similar
serializations (e.g. the string `"foo"` and the symbol `'foo`) are kept very
separate when collated.

Second, of all of Clojure's collections, only vectors are comparable.  However,
their collation semantics (smaller collections sort before larger ones, and
same-sized collections sort based on the first pairwise difference between their
elements) are readily extended to other finite, ordered collections, including
seqs and lists.

Sedan's serialization does this.

Third, there are some types for which the "natural order" as defined by Clojure
or the JDK is not carried forward to the encoded representation.  The only such
values that are so affected currently are:

* UUIDs
* Certain numeric edge cases where enabling a distinct decodable encoding
  necessitates deviating from the ordering implementated by Clojure at runtime
  (e.g. `-0.0`, which is equivalent to other zero values as far as `compare` is
  concerned, but which has its own "partition" within numbers as encoded by
  Sedan). Doing otherwise would make it impossible to represent such values in
  Sedan-encoded data.

These are the rules of Sedan's encoding, which are implemented for use at
runtime by the `cemerick.sedan/compare` function.  So, revisiting the invariant
above, this expression exclusively describes Sedan's function:

```clojure
(== (sedan/compare x y) (compare (sedan/encode x) (sedan/encode y)))
```

### Non-objectives

* Compactness: efficiency is obviously preferable when possible, but reliable
  effective collation semantics are the ballgame here.  Sedan _is_ quite compact
  for what it does (e.g. the size of Sedan-encoded numbers grows sub-linearly
  with with the scale+precision of the values in question, though a larger
  "constant factor" is involved relative to other serialization strategies
  that). A reasonable generalization is that data encoded with Sedan is likely
  to be _very_ slightly larger than the same data encoded as EDN; the
  differential is effectively only due to Sedan's type tags, and will be totally
  dominated by the actual data, esp. string and byte[] values.

### Supported types/values

<a name="altlang"></a>The scalars and aggregates supported by Sedan are largely
a consequence of its Clojure/ClojureScript origins, but their encoding and the
semantics of Sedan's comparison function at runtime are intended to be language-
and runtime-neutral. It is hoped that Sedan encoding will eventually be
implemented in other languages for other runtimes.

Supported types (listed in the same order as Sedan's stratification):

* `nil`
* booleans
* all numbers, except `Ratio`s and `NaN`; includes:
 * shorts
 * ints
 * longs
 * floats
 * doubles
 * `BigInteger`
 * `BigDecimal`
* strings
* symbols
* keywords
* dates
* UUIDs
* vectors, lists, seqs (these are collapsed into a single partition, and
  materialized as vectors on read)
* all custom types (see next section)

The same set of types is supported in JavaScript / ClojureScript, with the
difference that big numerics are represented via opaque non-computable
containers, which represent the values as strings. You can bring those strings over
to your arbitrary-precision numerics library of choice in JavaScript. This is a
stopgap, pending developments in ClojureScript support for arbitrary-precision
numerics, and/or compelling demand for more concreate JS bignum support here.

### Extending Sedan

It is possible to extend your own types to Sedan's protocols so they may
participate in its encoding, decoding, and comparison operations, just as you
might add support for your own types' tagged literals to an EDN reader (with all
of the same tradeoffs and shortcomings).

The particulars of this are still under consideration from an API standpoint, so
are an active TODO item.  Feel free to check out the source and utilize as you
wish, but at your own risk.

## TODOs

### Items potentially affecting the encoding itself

* Encode/decode binary instead character data
  * IndexedDB supports binary _values_ already,
    [binary keys](https://www.w3.org/Bugs/Public/show_bug.cgi?id=23332) coming
    soon (already
    [in chrome](https://code.google.com/p/chromium/issues/detail?id=297304),
    though only when
    `chrome://flags/#enable-experimental-web-platform-features` is set)
* add support for additional types:
  * byte arrays
  * "references"
  * URI/URLs?
  * rationals, outside of the numeric partition as described below

#### (Space) Optimizations

(none planned a.t.m.)

### All other items (API / runtime related)

* `s/cemerick/quilt` namespaces
* Encoding/decoding runtime optimizations: yes, please.
* roll [Splice](http://github.com/QuiltProject/splice)'s custom Sedan types
  (references and ordered attributes) here? Tension between Sedan being a
  general-purpose library vs. one dedicated to supporting Quilt.
* disallow NaN in :number impls in cemerick.sedan.impl
* verify cross-runtime compat by generating N values, sorting them based on
  `sedan/compare`, writing them to disk, and then in the complementary runtime,
  read them in, and verify that the collation on disk is valid according to
  _its_ `sedan/compare` impl.  Do this in both directions (JVM <==> JS)
* provide a way to specify how to decode sequential things (list vs. vector
  vs. seq [maybe to support lazy decoding?)
* provide a way to indicate numeric decode preferences (e.g. force bigdec or
  doubles, the former ensuring that no precision inflation occurs, etc)
* eliminate (arbitrary) limit on "arbitrary precision" integers
* Dates _suck_, per usual:
  * Use Joda JVM-side
  * Find a JS Date type that can at least support the full range of longs as
    millis (even if we have to use goog.math.Integer)
* consume/produce JS big numerics using a single blessed arbitrary-precision
  library, maybe [bignumber.js](https://github.com/MikeMcl/bignumber.js)

## Design decision non sequiturs

Documentation of various design decisions made with rationales, hopefully:

* Despite the potential for snafus arising from necessary changes to the
  encoding (e.g. situations like [this](https://github.com/uwiger/sext/issues/5)
  that lead to unfortunate things like
  [that](https://github.com/basho/riak_kv/pull/504), rejected the notion of
  "versioning" each Sedan string/byte-array.  Given the desire to Sedan-encode
  small datums, even a one-byte overhead (for _each_ value!) seems bad.  The
  revision of Sedan encoding should be managed operationally, with up/downgrades
  happening as necessary at storage/transmission boundaries.
* Not going to support sorted set and map encodings at this time. No compelling
  use case yet, and the testing/usage complexity is potentially daunting.
* There is no benefit to having separate partitions for lists and vectors. All
  sequential values are encoded to the same partition. The concrete type could be
  determined upon decode per user preference, but currently defaults to vector.
* Rationals will need to be encoded as an "opaque" type in a separate partition
  from other numbers. Impossible to define appropriate collation order within
  other numeric types, and an approximation of one would be worse than having
  none at all.
* Support -0.0, Infinity, and -Infinity via dedicated _number_ (sub)partitions
* No support for encoding `NaN`. Since it's definitionally neither greater or
  less than any other number, collating it in any way would be highly confusing,
  and perhaps very bad.  Applications probably should encode a _representation_
  of `NaN`, since it's a fundamentally uncomputable artifact of IEEE 754.
  * The "uncomputable" assertion is somewhat undermined by the (ill-advised?)
    use of signalling `NaN`s in certain calcuations, but such things are
    probably best considered to be hacks, not valid instances of floating-point
    usage that should influence how Sedan should partition the space of possible
    encodable values.

## Credits

There is obviously a great deal of prior art around language- and
runtime-neutral data encoding schemes.

Sources for general inspiration for encoding values into strings (or bytes) that
lexicographically collate in a way consistent with the runtime sort order of
those values include:

* [sext](https://github.com/uwiger/sext), "A sortable serialization library
  [that] offers a serialization format (a la `term_to_binary()`) that preserves
  the Erlang term order."
* [bytewise](https://github.com/deanlandolt/bytewise), "Binary serialization of
  arbitrarily complex [JavaScript] structures that sort bytewise"
  
The strategy(ies) for encoding arbitrary-precision numbers were influenced /
informed by a number of prior efforts:

* ["Creating ASCII-Sortable Strings from Numbers"](http://www.xanthir.com/blog/b4K70),
  and the XKCD forum thread linked therein:
  (["Generating savegame names so they appear in sorted order"](http://forums.xkcd.com/viewtopic.php?f=12&t=81923)).
* The encoding implementation for arbitrary-precision integers
  (`java.math.BigInteger`s, specifically) in `org.apache.solr.util.BCDUtils`
  (originally written by Yonik Seeley AFAICT).  This implementation's strategy
  forms the basis for the treatment of numbers here, though extended to support
  decimals in addition to integers.
* Steven Rowe's speculation of a "thoroughly untested idea" to encode
  `java.math.BigDecimal`s for indexing in Lucene / Solr:
  http://grokbase.com/t/lucene/java-user/08bm2ysjxx/bigdecimal-values
* Doug Wood's draft IETF proposal for ["Directory string representation for
  floating point
  values"](http://tools.ietf.org/html/draft-wood-ldapext-float-00), and the
  implementation of it by D. Kavanagh in the [typica SimpleDB
  client](https:./code.google.com/p/typica)
  (`com.xerox.amazonws.simpledb.DataUtils`, which was later slurped into
  [Rummage](https://github.com/cemerick/rummage), a SimpleDB client largely
  written in Clojure).

## License

Copyright © 2012-* Chas Emerick.

Distributed under the terms of the
[Mozilla Public License, v2.0+](https://www.mozilla.org/MPL/2.0/).
