TerminusDB is an open-source (GPLv3), full-featured, in-memory graph database management system with a rich query language: WOQL (the Web Object Query Language).
TerminusDB originated in Trinity College Dublin in Ireland in 2015 when we started working on the information architecture for ‘Seshat: the Global Historical Databank’, an ambitious project to store lots of information about every society in human history. We needed a solution that could enable collaboration among a highly distributed team on a shared database whose primary function was the curation of high-quality datasets with a very rich structure, storing information about everything from religious practices to geographic extent.
The historical databank was a very challenging data storage problem. While the scale of data was not particularly large, the complexity was very high. First, there were a large number of different types of facts that needed to be stored: information about populations, carrying capacity, religious rituals, varieties of livestock, etc. In addition, each fact had to be scoped with the period over which it was likely to be true, which meant we needed ranges with uncertainty bars on each of the endpoints. Then the reasoning behind each data point had to be apparent on inspection. If the value for population was deduced from the carrying capacity, it is critical for analysis to understand that this is not a "ground fact". But even ground facts require provenance. We needed to store information about which method of measurement gave rise to the fact, and who had undertaken it. And if that wasn't bad enough, we needed to be able to store disagreement - literally the same fact might have two different values as argued by two different sources, potentially using different methods.
On top of this we needed to allow data entry by graduate students who may or may not be reliable in their transcription of information, so additional provenance information was required about who put the fact into the database.
Of course, all of this is possible in an RDBMS, but it would be a difficult modeling task to say the least. The richness of the data and the extensive taxonomic information make a knowledge graph look more appropriate. Given that Trinity College Dublin had some linked data specialists, we opted to try a linked data approach to solving the problem.
Unfortunately, the linked-data and RDF tool-chains were severely lacking. We evaluated several tools in an attempt to architect a solution, including Virtuoso and the most mature technology, StarDog, but found that the tools were not really up to the task. While StarDog enforced the knowledge graph or ontology as a constraint on the data, it did not provide us with usable refutation witnesses. That is, when something was said to be wrong, insufficient information was given to attempt automated resolution strategies. In addition, the tools were set up to facilitate batch approaches to processing data, rather than the live transactional approach which is standard for RDBMSs.
Our first prototype was simply a Prolog program using the Prolog database and temporary predicates, which could be used within a transaction such that running updates could have constraints tested without disrupting the reader view of the database. (We thought that over time the use of Prolog would be a hindrance to adoption, as logic programming is not particularly popular - though there are now several green shoots!)
Not long after this we were asked to attempt the integration of a large commercial intelligence database into our system. The data was stored in a database in the tens of gigabytes, holding information about companies, boards of directors, and various relationships between companies and important people across the Polish economy since the 1990s. This, unsurprisingly, brought our prototype database to its knees. The problem gave us new design constraints. We needed a method of storing the graph that would allow us to find chains quickly (one of the problems they needed to solve), would be small enough that we didn't run out of main memory, and would also allow the transaction processing to which we had grown accustomed.
The natural solution was to reuse the idea of a temporary transaction layer, but to keep these layers around longer - and to make them very small and fast to query. We set about looking for alternatives. We spent some time trying to write a graph storage engine in Postgres. Ultimately, we found this solution too large and too slow for multi-hop joins on the commercial intelligence use-case. Eventually we found HDT (Header-Dictionary-Triples). On evaluation, this approach seemed to work very well to represent a given layer in our transactions, and it performed much better, so we built a prototype using the library.
Unfortunately, the HDT library exhibited a number of problems. First, it was not really designed to allow the programmatic access required during transactions, so we found it quite awkward. Second, we had a lot of problems with re-entrancy leading to segfaults. This was a serious problem on the Polish commercial use case, which needed multi-threading to make search and update feasible. Managing all of the layering information in Prolog was obviously less than ideal - this should clearly be built into the solution at the low level. We either had to fix and extend HDT, or build something else.
We opted to build something else, and we opted to build it in Rust. Our specific use cases didn't require a lot of the code that was in HDT. We were not as concerned with using it as an interchange format (one of the design principles for HDT), and in addition to the layering system, we had plans for other aspects to change fairly drastically as well. For instance, we needed to be able to conduct range queries and we wanted to segment information by type. HDT was standardized for a very different purpose and it was going to be hard to force our design into that shoe. The choice of Rust was partly one born of the pain of tracking down segfaults in HDT (written in C++). We were willing to pay some upfront cost in development time not to search, oftentimes fruitlessly, for segfaults.
At this point our transaction processing system had linear histories. The possibility of tree histories, i.e. of having something with branching, was now obvious and relatively simple to implement. It went onto the pile of things which would be "nice to have" (a very large pile). It wasn't until we started using the product in a production environment for a machine learning pipeline that this "nice to have" became more and more obviously a "need to have".
As with any technical product of this nature, there were many different paths we could have followed. After spinning out from the university and pursuing a strategy of implementing large-scale enterprise graph systems with some limited success, we decided to shift to open source and become TerminusDB (https://github.com/terminusdb/terminus-server). Potentially a long road, but a happier path for us. We also decided to double down on the elements in TerminusDB that we had found useful in project implementation and felt were weakest in existing information architectures: very rich schemas combined with very fine-grained, data-aware revision control, enabling the types of Continuous Integration / Continuous Deployment (CI/CD) used extensively in software engineering to be used with data. We think that DataOps - basically DevOps for data, or the ability to pipeline and reduce the cycle time of data - is going to become more and more important for data-intensive teams.
The name ‘TerminusDB’ comes from two sources. The Roman god of boundaries is named Terminus, and good databases need boundaries. Terminus is also the home planet of the Foundation in the Isaac Asimov series of novels. As our origin is in building the technical architecture for the Global History Databank, the parallels with Hari Seldon’s psychohistory are compelling.
To take advantage of this DataOps/’Git for data’ use case, we first implement a graph database with a strong schema so as to retain both simplicity and generality of design. Second, we implement this graph using succinct immutable data structures: prudent use of memory reduces cache contention, while write-once, read-many data structures simplify parallel access significantly. Third, we adopted a delta-encoding approach to updates, as is used in source control systems such as Git. This provides transaction processing and updates using immutable database data structures, recovering standard database management features while also providing the whole suite of revision control features - branch, merge, squash, rollback, blame, and time-travel - facilitating CI/CD approaches on data. Some of these features are implemented elsewhere (the immutable data structures of both Fluree and Datomic, for example), but it is the combination that makes TerminusDB a radical departure from historical architectures.
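Roughly, the shape of the delta-encoding idea is as follows. This is a toy sketch in Prolog with invented names (the real storage layer is the Rust library described above), showing how each layer records only the triples it adds and deletes, and how the visible state of a branch is resolved by walking the stack of layers from newest to oldest:

    %% Toy sketch of delta-encoded layers (illustrative names only).
    %% layer(Id, Additions, Deletions): the triples a commit adds and deletes.
    layer(base,   [t(rome, population, 450000), t(rome, capital_of, empire)], []).
    layer(delta1, [t(rome, population, 500000)], [t(rome, population, 450000)]).

    %% A branch is a stack of layers, newest first.
    branch(master, [delta1, base]).

    %% visible(+Layers, ?Triple): Triple was added in some layer and has not
    %% been deleted by any newer layer.
    visible([Layer|_], Triple) :-
        layer(Layer, Additions, _),
        member(Triple, Additions).
    visible([Layer|Older], Triple) :-
        layer(Layer, _, Deletions),
        visible(Older, Triple),
        \+ member(Triple, Deletions).

    %% Time-travel is just querying an older suffix of the stack:
    %% ?- visible([base], t(rome, population, P)).                  % P = 450000
    %% ?- branch(master, Ls), visible(Ls, t(rome, population, P)).  % P = 500000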
In pursuing this path, we see that innovations in CI/CD approaches to software design have left RDBMSs firmly planted in the past. Flexible distributed revision control systems have revolutionized the software development process. The maintenance of histories, of records of modification, of distribution (push/pull) and the ability to roll back, branch or merge enables engineers to have confidence in making modifications collaboratively. TerminusDB builds these operations into the core of our design with every choice about transaction processing designed to ease the efficient implementation of these operations.
Current databases are shared knowledge resources, but they can also be areas of shared destruction. The ability to modify the database for tests, for upgrades, etc. is hindered by having only the single current view. Any change is forced on all. The lack of fear of change is perhaps the greatest innovation that revision control has given us in code. Now it is time to have it for data. Want to reorganize the structure of the database without breaking all of the applications which are using it? Branch first, make sure it works, then you can rebase master with confidence.
While we are satisfied with progress, databases are hard and typically take a number of years to harden. It is a difficult road as you need to have something which works while keeping your options for progress open. How do you eat an elephant? One bite at a time.
We hope that others will see the value in the project and contribute to the code base. We are committed to delivering all features as open source and will not have an enterprise version. Our monetization strategy is to roll out decentralized data sharing and a central hub (if TerminusDB is Git for data, TerminusHub will be the ‘GitHub for data’) in the near future. Any cash that we generate by charging for seats will subsidize feature development for the core open source database. We are not sure if this is the best strategy, but we are going to see how far it takes us.
> First there were a large number of different types of facts that needed to be stored. ... So additional provenance information was required about who put the fact into the database.
Wikibase (the underlying datastore of Wikidata.org, based on a MediaWiki extension) has had to include many of these features, if probably not all of them. It's organized as a document wiki with full history for each "subject" entry, and claims about "subject" entities can include references or other sorts of justification, including "deduced from". Multiple entries and disagreement are explicitly possible, but statements can also be editorially "deprecated", so that unambiguous errors are clearly identified as such and not reproduced as fact or relied upon in any way. Uncertainty, calendar systems etc. are supported for every "date+time" mention, including "start" and "end" indications for any statement.
It's worth mentioning though that querying is done on a separate system where only the "current" version of data is imported (ignoring all "deprecated" statements); it's not a true temporal DB in any real sense. But that's enough for their use case, and the full history of entries is always available on the underlying "wiki" document store.
> the linked-data and RDF tool-chains were severely lacking.
I'm curious if you evaluated any of these solutions and if yes why you found them lacking:
* Jena
* RDF4j
* Blazegraph
Also, why not SPARQL? And if your query language is based on triples, why not Turtle for serialization instead of JSON-LD? JSON-LD is not very human friendly to write by hand, imho.
We evaluated Jena (we actually prototyped the first system in Jena) and RDF4j. At the time we didn't look into Blazegraph. In beginning the project we developed hypotheses about the kinds of features we would need: namely, the ability to check the instance data against a very complex and evolving schema, the need to do schema and instance updates in a single transaction, and the need to receive a program-interpretable "witness of failure" (a refutation proof) so that automatic strategies could be undertaken to correct failures in certain cases.
As we went forward we also acquired the collaboration and time-travel requirements to facilitate large team data curation - which aren't present in these offerings.
As for SPARQL, it's essentially a messy and truncated datalog. It just seemed a bit silly not to use an actual datalog! Then, as I've said elsewhere, we needed to be able to do fast time-window queries, which CLP(fd) is excellent at and which SPARQL doesn't support.
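To give a flavour of what that looks like, here is a minimal SWI-Prolog sketch with made-up data (not our actual query engine) of a time-window query where the interval arithmetic is handed to CLP(fd):

    :- use_module(library(clpfd)).

    %% valid(Fact, Start, End): a fact and the (integer-coded) interval over
    %% which it is asserted to hold. Data is invented for illustration.
    valid(population(rome, 450000), 100, 200).
    valid(population(rome, 500000), 150, 250).

    %% in_window(+From, +To, -Fact): facts whose validity interval overlaps
    %% the query window [From, To]. The constraints also work when the
    %% endpoints are only partially known.
    in_window(From, To, Fact) :-
        valid(Fact, Start, End),
        Start #=< To,
        End #>= From.

    %% ?- in_window(120, 160, F).
    %% F = population(rome, 450000) ;
    %% F = population(rome, 500000).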
JSON-LD is most definitely not human friendly, and turtle is certainly better. In our interface you can manipulate the schema in turtle as it's far clearer.
The advantage of JSON-LD is as an interchange format. With turtle you are pasting together strings. With JSON-LD libraries for querying the database are just functions which build JSON-LD. It also simplifies parsing on the opposite end.
But in addition JSON-LD + strong schema give you a way to treat segments of the graph as documents! We can use our get_document word and you'll get back the JSON-LD fragment of the graph associated with an ID. JSON-LD was carefully designed so that this bidirectional interpretation is possible.
Yes, we follow Dgraph pretty closely. It is a cool project, but it is distributed-first with all the challenges (and advantages) that brings. My general impression is that they are Google (or similar) engineers focused on solving Google-type problems, whereas 95 to 99% of databases or data projects don't need sharding etc., and that is the area we are in. In fact, we think that as main memory becomes larger and cheaper, the advantages of in-memory, even on big DBs, are going to become clearer.
Thanks! Building a database is a long road as the baseline for zero is so high, but we're really happy with the open source approach. Decentralized collaboration suits the team too (I actually saw an article called 'I Want Decentralized Version Control for Structured Data!' on here the other day and thought 'hey, that could be us!')
Thanks - you are definitely a terminator. You should join the community Discord to chat with the team about logic programming, graphs and anything else!
I'm a Prolog noob, but I remember seeing some of y'all's Medium blog posts on the Prolog subreddit, which led me to cliodynamics, which seems very interesting but very hard to do right. It seems y'all have experience with that and Peter Turchin. Any thoughts? I've been meaning to give his book a go.
Yes - we have worked with Peter on building the Seshat Databank of world history. His work on modelling long term dynamic processes is great - definitely worth reading any of his stuff.
aha - don't mention the war - I'm in the process of completing a release of the Seshat dataset with a TerminusDB tutorial on how to build it into a graph and query it. But I keep getting distracted when I'm just about finished. Thanks for the interest though - it will encourage me to get it finished!
This looks really great. It's interesting to see that 99% of the source code is Prolog. I'm curious to know what advantage--real or perceived--did Prolog provide over some other, more mainstream language?
Also, @ggleason, based upon your experience with other languages, would you still use Prolog if you had to do the whole thing over again?
Prolog implementations are very efficient at implementing backtracking, so if you end up using a lot of backtracking it definitely makes sense. My first prototype was started in Java and it was a nightmare. Secondly, for writing the query compiler, Prolog was just such an elegant language.
SWIPL has a large enough and nice enough library that it makes it feel similar to other dynamic languages (python, etc.) in terms of implementing run-of-the-mill glue code.
I'm very fond of Prolog as the implementation language for the constraint checking, and especially CLP(fd). I think CLP(fd) is such a killer feature that once people start using it in their queries, they're going to wonder how they got on before.
I would like prolog to be a lot more feature-mature for the current age however. It needs a bigger community to help flesh the language out! So many things could be made better - better mode analysis, better type checking and simply more libraries.
Thank you so much for your response. You may have convinced me to have another look at Prolog. I stumbled upon it 15 years ago, but never used it for any real project. I just remember really loving its declarative style.
Interesting project. Out of curiosity, why did you compare against kdb+ since the models are very different? KDB is mostly used for in-memory time-series while yours seems to be a graph-oriented DB. Also, why did you choose to build your own language instead of using an existing one [1]?
The reason is that we were positioning to deal with customers who had financial data which was stored as time-series. We aren't hoping to compete with kdb+ on speed (which would be hopeless) but we have a prototype of Constraint Logic Programming [CLP(fd)] based approaches to doing time queries which is very expressive and which we hope to roll out in the main product in the near future on hub.
The graph database space is still in its infancy and there are a lot of graph query languages about. We played around with using some that already exist (especially SPARQL) but decided that we wanted a number of features that were very non-standard (such as CLP(fd)).
Using JSON-LD as the definition, storage, and interchange format for the query language has advantages. Since we can marshal JSON-LD into and out of the graph, it is easy to store queries in the graph. It is also very simple to write query libraries for a range of languages by just building up a JSON-LD object and sending it off.
We are firmly of the belief that datalog-style query languages which favour composability will eventually win the query language wars - even if it is not our particular variety which does so. Composability was not treated as centrally as it should have been in most of the graph languages.
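As a sketch of what "queries are just JSON-LD built by functions" means in practice, here is the shape of it in SWI-Prolog, using an invented vocabulary rather than the real WOQL terms:

    :- use_module(library(http/json)).

    %% Each query word is just a function from arguments to a JSON dict,
    %% so composing queries is composing dicts. (Term names invented here.)
    triple(S, P, O, _{'@type': "Triple", subject: S, predicate: P, object: O}).
    woql_and(Q1, Q2, _{'@type': "And", clauses: [Q1, Q2]}).

    %% Build a conjunctive query and serialise it for the HTTP API.
    example_query(JSON) :-
        triple("v:Person", "rdf:type", "Employee", T1),
        triple("v:Person", "title", "Software Developer", T2),
        woql_and(T1, T2, Query),
        with_output_to(string(JSON), json_write_dict(current_output, Query)).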
I have two questions regarding the time-series aspect of TerminusDB:
1) Does TerminusDB support raw time-series data (one-dimensional), for example electrocardiogram (ECG) data? At the moment we have it, including the metadata, in column-based CSV format. FYI, the size is around 1 MB for one minute of raw ECG data.
2) For automated ECG analysis the data is transformed using a time-frequency distribution (two-dimensional), and the intermediate data must be kept in-memory for feature extraction purposes. Just wondering whether TerminusDB can support this intermediate time-frequency format/structure as well? FYI, one minute of time-frequency ECG data transformed from (1) will need around 4 GB of working memory. For real-time analysis of longer ECG durations from (1), for example 30 minutes or 30 MB of data, we need around 3 TB of working memory.
We will 100% consider it and have been engaging with the community about the best approach. Cypher is by far the biggest graph query language and they seem to have the most weight in the conversation so far, but we are going to try to represent datalog as far as possible. Even if WOQL isn't the end result, we think datalog is the best basis for graph query, so we'll keep banging the drum (especially as most people realize that composability is so important).
I have a set of JSON-LD documents which I would like to query across and for each result return a nested JSON object for display in a user interface. For example, let's imagine I was querying an employee database to find employees with the job title "Software Developer" and for each matching Employee return the following nested structure:
I can see how I could write the filter based on the pattern matching syntax but I don't see how I could gather the data to produce the result (which in reality might be much more deeply nested.)
2. Have you benchmarked against RedisGraph?
They seem to have achieved very good performance by building atop GraphBLAS.
In terms of framing - yes we do! Exactly like that.
TerminusDB has a document API endpoint and a special Document class - anything that inherits from this class is considered to be a complex document with internal structure. If you send a document id ("employee:bob") to the endpoint it will give you the full document for that id, and it will clip the document when you get to another document, which gives you a way of supporting both document and graph views of the same data.
2. We've been keeping an eye on RedisGraph and they definitely do well when it comes to graph matrix operations - which we don't do yet - but in other areas we should be very competitive with their performance, and we have more stuff going on, especially schema checking of transactions.
1. We do support framing. Our documents are defined using a special superclass called "terminus:Document". Anything in the downward closed DAG up to another "terminus:Document" is considered part of that document. You can ask for some number of unfoldings of the internal documents by passing a natural number - in which case you will frame the underlying documents. We might extend this framing to allow more sophisticated approaches later if there is interest.
2. We have not performed benchmarks against RedisGraph. We intend to do benchmarks in the future but are currently focusing on collaboration features rather than raw speed.
I had just come up with some methods of using GPUs to speed up graph search when I saw the RedisGraph whitepaper and that they had already done it. I have to admit I was more than a little jealous! It's a good idea.
We'll look at the approach again in the future - our next steps are exposing CLP(fd) in our query language.
So if I understand correctly, unfolding one level would embed the referenced documents in the result one level down, two would also embed the documents referenced in the root document's referenced documents, etc.
I think it's definitely worth supporting differing depths of reference embedding along different paths. Ideally, though, you want to select which properties are included, as when you get to several levels of embedding the resulting document is very large and often you only need a subset of properties expanded for the embedded documents.
Additionally it can be very helpful to embed along reverse references (parent.children from the reference stored as child.parent.)
While pre-generating the deeply embedded document means fetching it is fast, as the site has grown keeping them up to date becomes challenging so I've been looking at options for embedding dynamically.
"So if I understand correctly, unfolding one level would embed the referenced documents in the result one level down, two would also embed the documents referenced in the root document's referenced documents, etc."
That's correct.
"I think it's definitely worth supporting differing depths of reference embedding along different paths."
Yeah, path unfoldings were something we were thinking about but ended up on our very tall stack of things we want to do. You can generally get around it just by calling the get_document API on the client end from javascript, although it is true this will be much slower.
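To make the unfolding behaviour concrete, here is a toy Prolog sketch of the idea (illustrative names only, not the actual frame code): a reference to another document is left as an id unless there are unfoldings remaining, in which case it is embedded.

    %% Toy graph: two documents, one referencing the other.
    edge(employee_bob,   name,       "Bob").
    edge(employee_bob,   reports_to, employee_alice).
    edge(employee_alice, name,       "Alice").
    is_document(employee_bob).
    is_document(employee_alice).

    %% frame(+Node, +Unfoldings, -Doc): gather a node's properties, embedding
    %% referenced documents while the unfolding budget lasts, clipping after.
    frame(Node, Unfoldings, doc(Node, Properties)) :-
        findall(Property-Value,
                ( edge(Node, Property, Target),
                  value_of(Target, Unfoldings, Value) ),
                Properties).

    value_of(Target, Unfoldings, Value) :-
        (   is_document(Target)
        ->  (   Unfoldings > 0
            ->  Remaining is Unfoldings - 1,
                frame(Target, Remaining, Value)   % embed one level deeper
            ;   Value = ref(Target)               % clip: keep just the id
            )
        ;   Value = Target                        % literal property value
        ).

    %% ?- frame(employee_bob, 1, Doc).
    %% Doc = doc(employee_bob, [name-"Bob",
    %%                          reports_to-doc(employee_alice, [name-"Alice"])]).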
Yes, this is an unfortunately buzzwordy phrase. In this case it does have some meaning though - we generate what we call class-frames from the schema: simple logical javascript programs which know how to render documents and talk to the API. But definitely "AI code generation" is not a good phrase.
Which two? On ACID, I hear the arguments about Mongo Atlas since they integrated the WiredTiger tech (to basically become an RDBMS), but it's still not ACID. The other is cloud native, I suppose? Again, they have moved along and we gave them a balanced score.
Yeah, just lead with the plain facts. Your product is awesome, you've got nothing to prove. But then again, I'm biased to the technical side of things, and I'm imagining that your early adopters will be curious technical people (as contrasted with folks who might be moved by marketing copy alone.)
I’m using my own (hyper)graph API that I developed over the years, so I have some (albeit limited) experience in the matter. My opinion after reading the home page is that most of the points are straight-up bullshit (cloud native and AI code generation).
The second very worrying fact is the query language. Not only is it not using something existing (Cypher, Gremlin) or the new upcoming Graph Query Language that will unify them, but it uses some ad-hoc JSON-LD. I couldn’t think of a worse idea. Verbosity (reinforced by JSON-LD compared to JSON), no comments possible, usage of the RDF data model - there are many disadvantages for no benefit (except for RDF, which can be useful for interoperability in some cases).
Finally, the comparison lists Oracle Database but not MS SQL, which is rather strange given the latter has some support for graph and geo queries.
JSON-LD allows us to marshal queries into and out of the database as objects in their own right.
The verbosity isn't a problem as we don't actually write queries in JSON-LD. We use a fluent style in a programming language: currently, either Python or Javascript.
OWL as a schema language is quite rich and well developed, allowing multiple hierarchies and complex constraints. Having a well-defined, rich schema language for graphs is extremely important, and something that isn't widely available.
I checked the origin of the project and it seems to be a European grant to a few universities working on the Semantic Web. Given the track record of this technology, I’ll pass my turn. Neither the home page nor the doc provides any compelling arguments or killer feature for its use over existing solutions.
I’m not convinced either that a strong schema is really important for a graph database. In fact, even in OOP, long chains of inheritance have proven to be bad practice. I have also found that working with flat-typed (no inheritance) nodes and edges simplifies the code a lot.
You checked the origin as in you read my comment on this thread? Pretty clear when I said it started in Trinity College Dublin working with linked data. Semantic web certainly suffers from true believers that refuse any break with orthodoxy, but there is a core of wisdom & imho if you can do something practical the best ideas can emerge.
Without schemata, you end up with spaghetti pretty quickly and no way to ensure data quality. It's also impossible to treat segments of the graph as documents. With a strong schema you can move seamlessly between object and graph views.
It is currently deployed in several industrial settings.
We generate data-input forms automatically from the structure of schema definitions in the database - the same definitions which make it possible to marshal data from JSON-LD documents into a graph and back. It's AI in the symbolic sense.
> We generate data-input forms automatically from the structure of schema definitions in the database
That's a lot better and clearer than "AI code generation" :)
> It's AI in the symbolic sense.
Unfortunately that AI still seems to be caught in winter. While we wait for spring to thaw a new round of hype, maybe something simpler, like "smart data-entry forms" or "data-mapper"?
Really happy about this nicely followed-up Show HN - logic/Prolog and graph DBs seem like such an obvious complement to RDBMSs - and as demonstrated by json/jsonb in Postgres, simple document data can be rather nicely shoe-horned into heterogeneous relational systems. But proper graph DBs... not so much? Nice to see more projects in this space.
It is about trying to get something into a snappy form that is clear to as many people as possible. I like 'data-mapper' but it still needs explanation, so it's probably better to leave that stuff to the tutorials and try to highlight something else in a comparison table!
Not a stupid question! The database is in memory but we journal all transactions to disk, so it is persistent. In fact it's so persistent that it never goes away. We have an append-only storage approach allowing you to do time-travel. You can query past versions, or look at differences, or even branch from a previous version of the database.
Yes, although with databases, as against log files, you also need to be able to somehow append 'deletes' and updates to existing records as well as just adding new records. The advantage of doing so is that it greatly simplifies transactional processing and makes it much more parallelisable - because you never change any existing data, just add new records on top. Blockchains have similar 'immutable' characteristics. The other reason why append-only storage is desirable is that it allows you to time travel simply by backtracking through the append logs, and you can then do stuff like reconstructing future states by replaying all of the append events that got you there.
There are a variety of databases and database management systems that try to do this - most of them run into problems with the meta-language needed to describe updates and deletes to, for example, SQL tables or something similar. This is a hideously tricky and detailed problem because there are all sorts of ways in which an SQL table can be changed, many of which have implications for all sorts of other bits of data, and you have to capture all of this in your 'update' append log.
On the other hand, if like TerminusDB, you use RDF triples as an underlying language, then the problem is pretty trivial - every update can always be expressed as a set of deleted triples and a set of added triples.
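As a toy illustration of that last point (SWI-Prolog, invented data): the difference between any two triple-store states is just a set of deletions and a set of additions.

    %% Two versions of the same tiny triple store.
    version(v1, [t(acme, ceo, alice), t(acme, founded, 1990)]).
    version(v2, [t(acme, ceo, bob),   t(acme, founded, 1990)]).

    %% delta(+Old, +New, -Deletions, -Additions): express the update from Old
    %% to New as two sets; applying them in reverse rolls the change back.
    delta(Old, New, Deletions, Additions) :-
        version(Old, OldTriples),
        version(New, NewTriples),
        subtract(OldTriples, NewTriples, Deletions),
        subtract(NewTriples, OldTriples, Additions).

    %% ?- delta(v1, v2, D, A).
    %% D = [t(acme, ceo, alice)],
    %% A = [t(acme, ceo, bob)].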
But, wouldn't an append only database demand substantially more storage which would also increase at a faster rate? How is that handled given we won't be able to fit the database on a single disk?
Also, I wanted to understand how databases work and wanted to build one from scratch. Is there any website/tutorial/blog you think could help me out? Thanks a lot!
Looks cool, and being written in swi-prolog is a nice bonus.
This is not a criticism, but I am curious why a new query language WOQL was designed and implemented instead of just using SPARQL. It seems like it would be not too difficult to write a converter between WOQL and SPARQL.
Also, swi-prolog has excellent semantic web libraries with SPARQL support.
We debated using SPARQL initially, and even had a test implementation leveraging the version shipped with SWI-Prolog. However, we found SPARQL to have a number of shortcomings that we wanted to see addressed.
Firstly, there were features. Tighter integration with OWL and schema-awareness was something we wanted to build in at a low level. We also wanted to leverage CLP(fd) for time and other range queries. If we were going to need to fundamentally alter the semantics anyhow, it didn't seem that important to start with SPARQL. Other features that are coming down the pipes very soon are access to recursive queries and manipulation of paths through the graph.
Secondly, we wanted better composability. One of the real strengths of SQL as a language is that it is very easy to compose language elements. By contrast, SPARQL feels quite ad-hoc, mimicking some of the style of SQL but losing this feature.
Lastly, we wanted to have a language which would be easy to compose from Javascript or Python: Javascript, because we live in a web age, and Python, because it's the choice of many data scientists. JSON-LD provides a nice intermediate language in which to write queries in either of these languages. No need to paste together strings! And then, because of the choice of JSON-LD, we can naturally store our queries in the graph (and even query our queries).
Good answer, thanks for that! I am working on a commercial product to help people form SPARQL queries, and I must admit pro-SPARQL bias. Your decision makes a lot of sense.