Datomic is perfect for probably 90% of small-ish backoffice systems that never have to be web scale (i.e. most of what I do at work).
Writing in a single thread removes a whole host of problems in understanding (and implementing) how data changes over time. (And a busy MVCC sql db spends 75% of its time doing coordination, not actual writes, so a single thread applying a queue of transactions in sequence can be faster than your gut feeling might tell you.)
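The single-writer idea can be sketched in a few lines. This is a toy illustration of serializing writes through one thread (names and data model made up), not Datomic's actual transactor:

```python
import queue
import threading

class SingleWriter:
    """Applies transactions one at a time from a queue: no locks,
    no write conflicts, and a total order over all changes."""
    def __init__(self):
        self.state = {}
        self.log = []               # ordered transaction log
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            tx, done = self._q.get()
            for key, value in tx:   # apply the whole tx atomically
                self.state[key] = value
            self.log.append(tx)
            done.set()

    def transact(self, tx):
        done = threading.Event()
        self._q.put((tx, done))
        done.wait()                 # returns once the tx is applied

db = SingleWriter()
db.transact([("user/1 name", "Alice")])
db.transact([("user/1 name", "Alicia")])
print(db.state["user/1 name"])  # later tx wins: "Alicia"
```

Because every change flows through one queue, "how did the data get into this state" is always answerable by replaying the log in order.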
Transactions as first-class entities of the system mean you can easily add metadata to every change in the system recording who made it and why, so you'll never again have to wonder "hmm, why does that column have that value, and how did it happen?". Once you get used to this, doing UPDATE in SQL feels pretty weird, as the default mode of operation on your _business data_ is to delete data, with no trace of who changed it or why!
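A toy illustration of reified transactions: store facts as an append-only log of (entity, attribute, value, tx) tuples, and let the transaction itself be an entity that carries audit facts. The attribute names here are made up, and this is not Datomic's API (in Datomic you'd assert facts on the transaction via the "datomic.tx" tempid):

```python
import itertools

datoms = []                  # append-only log of (entity, attribute, value, tx)
tx_counter = itertools.count(1)

def transact(facts, who, why):
    """Record facts plus audit metadata on the transaction itself."""
    tx = next(tx_counter)
    datoms.append((f"tx/{tx}", "audit/user", who, tx))
    datoms.append((f"tx/{tx}", "audit/reason", why, tx))
    for e, a, v in facts:
        datoms.append((e, a, v, tx))
    return tx

def explain(entity, attribute):
    """Who set this attribute to its current value, and why?"""
    e, a, v, tx = [d for d in datoms if d[0] == entity and d[1] == attribute][-1]
    meta = {d[1]: d[2] for d in datoms if d[0] == f"tx/{tx}"}
    return v, meta["audit/user"], meta["audit/reason"]

transact([("order/42", "order/status", "pending")], "alice", "order created")
transact([("order/42", "order/status", "cancelled")], "bob", "customer request")
print(explain("order/42", "order/status"))
# ('cancelled', 'bob', 'customer request')
```

Nothing is ever deleted, so the "why does that column have that value" question reduces to a lookup of the transaction that last asserted it.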
Having the value of the entire database at a point in time available to your business logic as a (lazy) immutable value you can run queries on opens up completely new ways of writing code, and lets your database follow "functional core, imperative shell". Someone needs to have the working set of your database in memory, why shouldn't it be your app server and business logic?
Looking forward to seeing what this does for the adoption of Datomic!
> Someone needs to have the working set of your database in memory, why shouldn't it be your app server and business logic?
This one confused me. The obvious reason why you don't want the whole working set of your database in the app server's memory is because you have lots of app servers, whereas you only have one database[1]. This suggests that you put the working set of the database in the database, so that you still only need the one copy, not in the app servers where you'd need N copies of it.
The rest of your post makes sense to me but the thing about keeping the database's working set in your app server's memory does not. That's something we specifically work to avoid.
[1] Still talking about "non-webscale" office usage here, that's the world I live in as well. One big central database server, lots of apps and app servers strewn about.
Consider this use case - in addition to your web app, you have a reporting service that makes heavy duty reports; if you run one at a bad time, bad things might happen like users not being able to log in or do any other important work, because the database is busy with the reports.
So in a traditional DB you might have a DBA set up a reporting database so the operational one is not affected. With Datomic, the reporting service gets its own Datomic peer with its own copy of the DB, without any extra DBA work and without affecting any web services. This also works nicely with batch jobs, or in any situation where you don't want different services affecting each other's performance.
It's true that a lot more memory gets used, but memory is relatively cheap: when hosting in the cloud, the biggest cost is usually the vCPUs. And in a typical Clojure/Datomic web application you don't need to put cache services like Redis in front of your DB.
The assumption here is that the usual bottleneck for most information systems and business applications is reading and querying data.
I appreciated this insight into other people's use cases, thank you for that! This architecture brings RethinkDB to mind, which also had some ability to run your client as a cluster node that you alone get to query. (Although there it was more about receiving the live feed than about caching a local working set.)
> Client (notice not Proxy) caches uncommitted writes to support read-uncommitted-writes in the same transaction. This type of read-repair is only feasible for a simple k/v data model. Anything slightly more complicated, e.g. a graph data model, would introduce a significant amount of complexity. Caching is done on the client, so read queries can bypass the entire transaction system. Reads can be served either locally from client cache or from storage nodes.
RethinkDB user here. I've been running it in production for the last 8 years or so. It works. It doesn't lose data. It doesn't corrupt data (like most distributed databases do, read the Jepsen reports).
I am worried about it being unmaintained. I do have some issues that are more smells than anything else — like things becoming slower after an uptime of around three weeks (I now reboot my systems every 14 days). I could also do with improved performance.
I'm disappointed that the Winds of Internet Fashion haven't been kind to RethinkDB. It was always a much better proposition than, say, MongoDB, but got less publicity and sort of became marginalized. I guess correct and working are not that important.
I'm slowly rebuilding my application to use FoundationDB. This lets me implement changefeeds, is a correct distributed database with fantastic guarantees (you get strict serializability in a fully distributed db!), and lets me avoid the unneeded serialization to/from JSON as a transport format.
We've never had any issue with it on a typical three-node install in Kubernetes. It requires essentially no ongoing management. That said, it can't be ignored that the company went under and now it's in community maintenance mode. If you don't have a specific good use for Rethink's changefeed functionality, which sets it apart from the alternatives, I'm not sure I could recommend it for new development. We've chosen not to rip it out, but we're not developing new things on it.
I remember back when it came out, it was a big deal that it could easily scale with master-master nodes, and its document format was a big thing because of Mongo's popularity at the time.
That was back before k8s was a thing, and most of the failover support in other databases didn't exist yet. I'm too scared to use it now: it has a community, but it's obviously nowhere near as active as the Postgres and other communities.
I suppose I think of this the other way around. When the query engine is inside your app, the query language doesn't need loop-like constructs, so you can have a much simpler query language and mix it with plain code. It's similar to how you don't need a templating language in React (or in Clojure when you use hiccup to represent HTML).
Additionally, this laziness means your business logic can dynamically choose which data to pull in based on the results of earlier queries, and you end up pulling in less data, as you never need to fetch extra data in one huge query just in case the business logic needs it.
It's definitely a trade-off! If you have 10s or 100s of app servers that have the exact same working set in memory, it's probably not worth it.
But if you have a handful of app servers, it's much more reasonable. The relatively low scale back-office systems I tend to work with typically have 2, max 3. Also, spinning up an extra instance that does some data crunching does not affect the performance of the app servers, as they don't have to coordinate.
There's also the performance and practicality benefits you get from not having to do round-trips in order to query. You can now actually do 100 queries in a loop, instead of having to formulate everything as a single query.
And if you have many different apps that operate on the same DB, it becomes a benefit as well. Each app server will only have the _actual_ working set it queries in memory, not the sum of the working sets across all of the apps.
If this becomes a problem, you can always architect your way around it, e.g. by having two beefy API app servers that your 10s or 100s of other app servers talk to.
SQLite provides a similar benefit, with tremendous results, using its in-process database engine, although the benefit there is more muted by default because of the very small default cache size. We do have one app where we do this: there's no database server, the app server uses SQLite to talk directly to S3, and the app server itself caches its working set in memory. I can definitely see the benefit in some situations, but for us it was a pretty unusual situation that we might never encounter again.
All that said... can't Datomic also do traditional query execution on the server? I thought it had support for scale-out compute for that. AIUI, you have the option to run as either a full peer or just an RPC client on a case-by-case basis? I thought you wouldn't need to resort to writing your own API intermediary, you could just connect to Datomic via RPC, right?
AIUI, the full peer is Datomic; the RPC server is just a full peer that exposes the API over http and is mainly intended to be used with clients that don't run on the JVM (and so can't run Datomic itself in-process).
I think the point is that treating your database as an arms-length, RPC component that's independent from your "application" isn't necessarily the only pattern.
Strong agree. There are massive cost savings and performance advantages to be had if the model is that a shard of the dataset is in memory and the data-persistence problem is the part that's made external. The only reason we are where we are today is that doing that well is hard.
Is this not the case already? Database drivers (or just your application code) are allowed to cache results if they like. The problem is cache invalidation.
Understood. For a single-node or read-only system it sounds fine, but then there are a variety of ways to solve that (e.g. a preloaded in-memory sqlite).
Having the working set present on app servers means they don't put load on a precious centralized resource which becomes a bottleneck for reads. The peer model allows app servers to service reads directly, avoiding the cost of contention and an additional network hop, allowing for massive read scale.
This is true, but the tradeoff is that now your central DB is a bottleneck that is difficult to scale.
Having the applications keep a cached version of the db means that when one of them runs a complex or resource intensive query, it's not affecting everyone else.
> Datomic's is perfect for probably 90% of small-ish backoffice systems that never has to be web scale (i.e. most of what I do at work).
I don’t think I agree with this as stated. It is too squishy and subjective to say “perfect”.
More broadly, the above is not and should not be a cognitive “anchor point” for reasonable use cases for Datomic. Making that kind of claim requires a lot more analysis and persuasion.
Datomic always seemed like a really cool thing to use. However, I'm not familiar with Clojure or any other JVM based language, nor do I have the time to learn it. And I can't find any supported way to use it with other languages (I'm not even talking about popular frameworks), or am I missing something?
It doesn't feel like the people behind Datomic actually want to have users outside of the Clojure world, which will be rather limiting to adoption.
Something I've been curious about: how well (or badly) would it scale to do something similar on a normal relational DB (say, Postgres)?
You could have one or more append-only tables that store events/transactions/whatever you want to call them, and then materialized-views (or whatever) which gather that history into a "current state" of "entities", as needed
If eventual-consistency is acceptable, it seems like you could aggressively cache and/or distribute reads. Maybe you could even do clever stuff like recomputing state only from the last event you had, instead of from scratch every time
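A minimal sketch of the append-only-table idea, using Python's built-in sqlite3 as a stand-in for Postgres (the SQL is portable; in Postgres you'd likely use a real materialized view rather than the plain view shown here, and all table/column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY,   -- append-only, monotonic
        entity  TEXT NOT NULL,
        attr    TEXT NOT NULL,
        value   TEXT NOT NULL
    );
    -- "current state": the latest value per (entity, attr)
    CREATE VIEW current AS
        SELECT entity, attr, value
        FROM events e
        WHERE id = (SELECT MAX(id) FROM events
                    WHERE entity = e.entity AND attr = e.attr);
""")
db.executemany("INSERT INTO events (entity, attr, value) VALUES (?, ?, ?)", [
    ("user/1", "name", "Alice"),
    ("user/1", "email", "alice@example.com"),
    ("user/1", "name", "Alicia"),            # an "update" is a new event
])
print(db.execute("SELECT value FROM current "
                 "WHERE entity = 'user/1' AND attr = 'name'").fetchone())
# ('Alicia',)
```

Because `events.id` is monotonic, a reader that remembers the last id it saw can refresh its derived state incrementally instead of recomputing from scratch.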
Datomic already sort of does this :) You configure a storage backend (Datomic does not write to disk directly) which can be DynamoDB, Riak, or any JDBC database including Postgres. You won't get readable data in PG though, as Datomic stores opaque compressed chunks in a key/value structure. The chunks are addressable via the small handful of built-in indexes that Datomic provides for querying, and the indexes are covering, i.e. data is duplicated for each index.
Because, for example, your application is not tied to the JVM? You are uncomfortable using closed source software for such a critical piece of infra? As far as I can tell they don't even have a searchable bug report database! I'd hate to be the one debugging an issue involving datomic.
That's a pretty common pattern in event-sourcing architectures. It is a completely viable way to do things as long as "eventual-consistency is acceptable" is actually true.
Good question! I don't have any personal experience in that regard. I would probably have paid up for enterprise support (or bought the entire company ;))
I don't know how they do it, but the obvious answer is probably sharding. Their cloud costs must be no joke. Peers require tons of memory, and I can only guess they must have thousands of transactors to support that workload, and who knows how many peers. Add to this that they probably need something like Kafka for integrating/pipelining all this data.
As do most distributed databases. Even when you don't store your entire database (or working set) in memory, you'll likely still have to add quite a bit of memory to be used as I/O cache.
One thing which is quite hard to do in Datomic is simple pagination on a large sorted dataset, as one can easily do with LIMIT/OFFSET in MySQL for example. There are solutions for some of the cases, but general case is not solved, as far as I remember (it’s been a while I used it extensively)
It depends! If you want to lazily walk data, you can read directly from the index (keep in mind, the index = the data = lives in your app), or use the index-pull API which is a bit more convenient.
However, if you want to paginate data that you need to sort first, and the data isn't sorted the way you want in the index, you have to read all of the data first, and then sort it. But this is also what a database server would need to do :)
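When the index order does match, the usual workaround is keyset ("seek") pagination rather than OFFSET. A toy illustration of lazily walking a sorted covering index (the tuples and attribute names are made up; in Datomic you'd walk `d/datoms` or `d/index-pull` with a start point):

```python
import bisect

# A sorted covering index (AVET-like: attribute, value, entity),
# standing in for a Datomic index segment.
index = sorted(
    ("user/name", f"user{i:04d}", i) for i in range(10_000)
)

def page_after(last_value, n):
    """Keyset pagination: seek past the last value seen, take n.
    No OFFSET scan; cost is one binary search plus the page size."""
    start = bisect.bisect_right(index, ("user/name", last_value, float("inf")))
    return index[start:start + n]

page1 = page_after("", 3)                 # first page
page2 = page_after(page1[-1][1], 3)       # resume after page1's last value
print([e for _, v, e in page2])  # [3, 4, 5]
```

The catch the parent comment describes is exactly when no index has the order you want: then there is no "seek" position to resume from, and you're back to sorting the full result first.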
Yep, I am well aware of these specifics and workarounds, but in the general case there is no general solution to the question asked here; see for example [0].
And for big datasets with complex sorting it will take some effort to implement a seemingly simple feature.
Guess it is just one of the tradeoffs: while some features Datomic has out of the box are hard to replicate in RDBMSes, things like pagination, which are often taken for granted, are a bit of work in Datomic. So it is something to keep in mind when considering the switch.
Datomic's covering indexes are heavily based on their built-in ordering, and don't offer much flexibility in how you sort and walk data.
Personally, I'm a fan of hybrid database approaches. In the world of serverless, I really enjoy the combo of DynamoDB and Elasticsearch, for example, where Dynamo handles everything that's performance critical, and Elasticsearch handles everything where dynamic queries (and ordering, and aggregation, and ...) are required. I've never done this with Datomic, but I'd imagine mirroring the "current" value of entities without historical data is relatively easy to set up.
The main difference between event sourcing and Datomic is the indexes and the "schema", which provide full SQL-like relational query powers out of the box, as well as point-in-time snapshots for every "event" (transaction of facts).
So, "events" in Datomic are structured and Datomic uses them to give you query powers, they're not opaque blobs of data.
> doing UPDATE in SQL feels pretty weird, as the default mode of operation of your _business data_ is to delete data, with no trace of who and why!
It's a good idea to version your schema changes using something like liquibase into git, that gets rid of at least some of those pains. Liquibase works on a wide variety of databases, even graphs like Neo4j
I got the same feeling in Erlang many times: once write operations start happening in parallel you worry about atomicity, and making an Erlang process centralize writes through its message queue always feels natural and easy to reason about.