Last week we released version 0.2.0 of datahike, which brings new features and marks a step away from its datascript origins. The major new features are schema flexibility and time travel. After integrating the latest datascript code we extended it through protocols at the storage and index levels.

Datascript Integration and Cleanup

While datahike began as a fork of datascript combined with the hitchhiker tree, we are now adding novel features and moving away from the original project. For this release we integrated the latest datascript code and refactored it in preparation for the upcoming features.

Additionally, protocols were created for both the index and the storage layer. Alejandro Gómez provided protocols for the storage backends that support different solutions like in-memory, file-based, LevelDB, or PostgreSQL. New stores can be added by implementing four functions on top of konserve bindings; the PostgreSQL backend is a good example.
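To give a rough idea of how a backend is selected, here is a minimal sketch. The URI schemes and connection parameters below are assumptions for illustration, so please check the datahike README for the released API.

  (require '[datahike.api :as d])

  ;; Illustrative URIs for different storage backends; the exact
  ;; schemes and paths are assumptions, not a reference.
  (def mem-uri  "datahike:mem://example")
  (def file-uri "datahike:file:///tmp/datahike-example")
  (def pg-uri   "datahike:pg://user:password@localhost:5432/example")

  ;; Create and connect to a file-backed database.
  (d/create-database file-uri)
  (def conn (d/connect file-uri))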

The underlying index is now also configurable, supporting both the persistent sorted set used in datascript and the hitchhiker tree. Through these protocols, other interesting index data structures can be added easily.
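The index is chosen when a database is created. The following sketch, reusing the URIs from above, illustrates the idea; the :index option and its keywords are assumptions rather than the released names.

  ;; Hypothetical sketch: picking the index at database creation.
  ;; The :index option and the keywords below are assumptions.
  (d/create-database file-uri :index :datahike.index/hitchhiker-tree)

  ;; ... or the persistent sorted set known from datascript:
  (d/create-database mem-uri :index :datahike.index/persistent-set)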

Since our partners from the datopia project needed the datalog parsing functionality, we moved all functions and tests around datalog query parsing into a separate project. It will also support functionality not used in datahike, like creating datalog queries from a syntax tree, so you can parse a datalog query, add additional data, and generate an optimized query for a database to use.
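As a rough sketch of what the extracted parser does (the namespace and function name are assumptions based on how such a library is typically laid out):

  ;; Assumed namespace and function name for the extracted parser.
  (require '[datalog.parser :as parser])

  ;; Parse a query into a syntax tree of :find and :where clauses.
  (parser/parse '[:find ?e ?name
                  :where [?e :user/name ?name]])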

Schema Flexibility

Up until now datahike did not enforce any schema on its data, so badly shaped data could be added without complaint. Starting with datahike 0.2.0, a schema-on-write approach is supported: the schema has to be defined explicitly, and the transactor ensures that all transacted data conforms to it. The previous schema-on-read behavior is still available through configuration.

datahike supports large parts of Datomic's schema definition, except for tuples, full-text search, and byte types. Data validation happens at the transaction level using clojure.spec.
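A small example of what this looks like in practice, reusing the connection from the earlier sketch; the attribute definitions follow Datomic's conventions:

  ;; Transact a Datomic-style schema first ...
  (d/transact conn [{:db/ident       :user/name
                     :db/valueType   :db.type/string
                     :db/cardinality :db.cardinality/one}
                    {:db/ident       :user/age
                     :db/valueType   :db.type/long
                     :db/cardinality :db.cardinality/one}])

  ;; ... then transact data. Values are checked against the schema,
  ;; so e.g. a string :user/age is rejected at transaction time.
  (d/transact conn [{:user/name "Alice" :user/age 42}])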

In contrast to datascript, which keeps the schema only in the database record, datahike additionally stores the schema as transactions in the index. This allows queries against the schema itself and time travel over its history, which helps when auditing schema changes.
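For example, the attributes defined above can be queried like any other data; a hedged sketch, assuming the schema and connection from the previous example:

  ;; Query the schema itself: list attribute idents and value types.
  (d/q '[:find ?ident ?type
         :where
         [?a :db/ident ?ident]
         [?a :db/valueType ?type]]
       @conn)
  ;; => e.g. #{[:user/name :db.type/string] [:user/age :db.type/long]}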

Time Travel

Modern database solutions require an auditable data history without adding timestamp attributes to every data model. Similar to Datomic's historical databases, datahike supports time travel capabilities that allow datalog queries against the whole history, against any point in the past, or against the changes since a certain point in time.
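In terms of the API this looks roughly as follows, reusing the earlier connection; the function names mirror Datomic's, so treat the exact signatures as assumptions and consult the time travel example in the repository.

  ;; Query the full history, including retracted values.
  (d/q '[:find ?n :where [?e :user/name ?n]]
       (d/history @conn))

  ;; Query the database as it was at a given point in time ...
  (d/q '[:find ?n :where [?e :user/name ?n]]
       (d/as-of @conn #inst "2019-01-01"))

  ;; ... or only what has changed since that point.
  (d/q '[:find ?n :where [?e :user/name ?n]]
       (d/since @conn #inst "2019-01-01"))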

In accordance with the GDPR, data purging capabilities were added that completely remove data either from all indices or only from the historical indices.
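A hedged sketch of what purging could look like; the purge operation names below are assumptions, so check the documentation of your datahike version.

  ;; Assumed purge operations; the exact names may differ.
  ;; Look up the entity id of the entity to purge ...
  (def alice-id
    (d/q '[:find ?e . :where [?e :user/name "Alice"]] @conn))

  ;; ... then remove the datom from all indices, current and historical:
  (d/transact conn [[:db/purge alice-id :user/name "Alice"]])

  ;; Remove only historical data older than a retention date:
  (d/transact conn [[:db.history.purge/before #inst "2018-01-01"]])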

In order to keep the current data view fast and clean, a separate set of indices was created to track all past data. This separation makes it easy to clean up data after certain retention periods without interfering with the current view of the data.

Configuration

With the addition of the latest features, datahike required better configuration. Therefore a clean configuration concept was implemented that lets you pick, at database creation, only the datahike features you need. For example, both time travel and schema-on-write capabilities can be deactivated if you have no use for them.
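A minimal sketch of opting out of both features at creation time, reusing the in-memory URI from above; the option names :temporal-index and :schema-on-read are assumptions, so check the configuration documentation.

  ;; Assumed option names; a sketch of switching features off.
  (d/create-database mem-uri
                     :temporal-index false  ; no historical indices, no time travel
                     :schema-on-read true)  ; keep the old schemaless behavior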

Examples

Since it is always helpful to have examples, we provide projects inside the code base that show the different features, as well as documentation about new features like time travel, schema, and configuration.

These projects just need a REPL and can be worked through at your own pace. Where possible we added explanations and showed different ways to use a feature. Have a look at the basic example, which introduces topics like stores, schema, and time travel. Over the next weeks we will add more examples covering queries and transactions.

Future Plans

For the next release we are planning to make datalog available to a wider audience by releasing Java/Scala API bindings for datahike.

In collaboration with our partners from the 3DF project we aim to support remote capabilities with an HTTP layer, so that only a thin client is needed on the application side instead of the full datahike stack.

With the support of a datalog query planner we will be able to predict the cost of each query and recommend possible ways to improve your queries.

For security reasons it would be useful to have an identity and access management concept ready to use. Therefore, user and role definitions will be added to the system entities.

Since we have experience with probabilistic programming in anglican, we will try to integrate probabilistic reasoning into datalog and datahike itself.

On the development side, we will try to ship improvements on a smaller scale, with more minor and patch releases and a more transparent development process around GitHub issues.

Thanks

Many thanks to all the people from the Job Tech project who supported us in implementing the schema and time travel functionality. Thanks also to Chrislain Razafimahefa for his code reviews, to Alejandro Gómez for his storage contributions, and to everyone who shared fruitful discussions and feedback with us on slack, GitHub, or in person.

Have fun with the latest version of datahike and let us know what you think about it.