I feel like this is missing any mention of the history of KV stores. Unix came with an embedded database (dbm) from the early days (1979) [0] which was rewritten at Berkeley into the more popular bdb in the 80s. [1] Sendmail was one of the more common programs that used it. And then when djb built his replacement for sendmail, qmail, he invented cdb. [2]
This codebase shows how SSTables, WAL, memtables, recordio, skiplists, segment files, and other storage engine components work in a digestible way. Includes a demo database showing how it all comes together to make a RocksDB / LevelDB competitor (not really).
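For anyone who wants a feel for how those pieces fit together, here is a minimal, self-contained sketch in Go (not taken from the linked codebase; all names are made up) of the classic write path: append a length-prefixed record to a write-ahead log, sync it, then insert into the memtable. A real engine would back the memtable with a skiplist and flush it to SSTables; a sorted map keeps the sketch short.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"sort"
	"sync"
)

// Memtable stands in for the skiplist a real engine would use;
// a plain map plus sorted iteration keeps the example short.
type Memtable struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func NewMemtable() *Memtable { return &Memtable{data: map[string][]byte{}} }

func (m *Memtable) Put(key string, value []byte) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

// SortedKeys is what a flush to an SSTable would iterate over.
func (m *Memtable) SortedKeys() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// DB couples a WAL with the memtable: every Put is made durable in the
// log before it becomes visible in memory.
type DB struct {
	wal *os.File
	mem *Memtable
}

func Open(walPath string) (*DB, error) {
	f, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &DB{wal: f, mem: NewMemtable()}, nil
}

func (db *DB) Put(key string, value []byte) error {
	// Length-prefixed record, recordio-style: [keyLen][valLen][key][value].
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(value)))
	rec := append(append(hdr[:], key...), value...)
	if _, err := db.wal.Write(rec); err != nil {
		return err
	}
	if err := db.wal.Sync(); err != nil { // durability before visibility
		return err
	}
	db.mem.Put(key, value)
	return nil
}

func main() {
	db, err := Open("demo.wal")
	if err != nil {
		panic(err)
	}
	_ = db.Put("user:1", []byte("alice"))
	_ = db.Put("user:0", []byte("bob"))
	fmt.Println(db.mem.SortedKeys()) // [user:0 user:1]
}
```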
Judging from the precipitous decline in Badger commits since 2021 [0] and the fact that the original/primary author is no longer with dgraph [1] or working on Badger, it may be worth looking at Cockroach's Pebble [2] instead.
+100 and an upvote. Badger DB seems so under-rated to me and is a great drop-in replacement for an embedded KV store. Amazing for several simple sharded side projects!
I get that you can replace the storage engine and theoretically get more performance, but in practice compatibility and standardization are more important, because a lot of products (including third-party ones) will already use PostgreSQL/MongoDB/Redis, so it's a no-brainer to use them for your solution as well.
However, for me to pick RocksDB or some other new, shiny database/storage engine, there would have to be more compelling reasons.
Unless you are building a database, these embedded KV store libraries are less likely to be the best solution for the job. If you are considering them for an app that isn't a database, you should also take a long, hard look at SQLite first.
What's also interesting is the trend of newer distributed "database systems" like Vitess[0] or SpiceDB[1] that forego embedded KV stores and instead reuse existing SQL databases as their "embedded database". Vitess leverages MySQL, and SpiceDB leverages MySQL, PostgreSQL, CockroachDB, or Spanner. Systems built this way get to leverage many high-level features from existing database systems, which lets them focus on innovating in even higher-level functionality. In the case of Vitess, that's scaling, distributing, and managing the schema of MySQL. In the case of SpiceDB, it's building a database specifically optimized for querying access control data in a way that can coordinate causality across multiple services.
Like S3 or Redis, RocksDB is much more performant when you don't need the query engine and want to have highly compact storage with fast lookups and high write throughput.
Storage engines come in different levels of complexity based on the query requirements. Simple K/V stores can run circles around Postgres/MySQL as long as you don't need the extra features.
In your list RocksDB is most like Redis, but even faster because the data doesn't have to leave the process.
Think of it as a high performance sports car like a Ferrari. It's not good at taking the kids to school or buying groceries. But if you need to prioritise performance at the expense of all other considerations then it's exactly what you need.
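To make "the data doesn't have to leave the process" concrete, here is roughly what the write/read path looks like with an embedded store. I'm using Badger (mentioned upthread) because it's pure Go; RocksDB via its C++ or JNI bindings follows the same open/put/get pattern. The path and key are made up for the example.

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// The "server" is just a directory; there is no network hop.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes go through a transaction, similar in spirit to RocksDB's WriteBatch.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("session:abc"), []byte(`{"user":42}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reads are an in-process function call, not an RPC.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("session:abc"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Printf("%s\n", val)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```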
IMO it's just confusing to call both, say, RocksDB and MySQL "databases". They sit at different levels of the stack and it is easier to just think of them as entirely different things, your "SQL database" and your "storage engine". So your stack looks like
Application
|
MySQL
|
RocksDB
|
Filesystem
In general the MySQL layer is doing all the convenient stuff for application developers like supporting different queries and datatypes. The RocksDB layer is optimizing for performance metrics like throughput and reliability and just treats data as sequences of bytes.
Yes, that's right, InnoDB is the default MySQL storage engine and you can replace InnoDB with RocksDB. To summarize in one sentence, InnoDB is better at reads and RocksDB is better at writes, but if you were making an actual decision you should look at more detailed information than my one-sentence summary, such as:
The article misses the point. All data storage and query systems end up architected in layers. Upper layers deal with higher abstractions (objects, rows, whatever). Lower layers deal with simpler functions, closer to the hardware. The upper layers are consumers of the lower layers. This is where "embedded KV stores" like LevelDB, RocksDB, etc come from. They began as the embedded storage layer for some bigger thing. Every product you think of as a database or document store is built like this, including MySQL and PostgreSQL and Oracle. Such a storage layer, shipped as an independent library, is how you (or anyone) builds your own database-ish thing. That's what the article should say.
The list of examples is odd. For instance, MongoRocks is cited for using RocksDB, but actual stock MongoDB uses WiredTiger, which isn't mentioned.
Disclosure: I played a part in the late beginning of this space when Netscape funded Sleepycat to develop BerkeleyDB. dbm and ndbm existed beforehand, but BerkeleyDB's use in LDAP servers is, I think, the genesis point for this pattern as it exists today.
> Upper layers deal with higher abstractions (objects, rows, whatever)
Right, I'm waiting for a standard for the level above relational databases, which is object databases. I know several already exist, and there are object-relational mapping layers.
I think the key point there is that object databases are a level ABOVE relational databases. They are not "better", but they deal with the higher level of objects rather than "tables", just like relational databases can be seen as a level above key-value stores.
I would like Object databases to become better and easier to use and more standardized.
I think there is value in being able to see both levels: the objects, and the relational data that makes up the objects.
Neither objects nor relations are "above" the other. You can map them in a vacuous mathematical sense, but it's a massively leaky abstraction in either direction.
When I use the word "above" I mean "layers" of code. So if an object database was implemented by using a relational database, it would be "above" the layer of the RDBMS.
I think that is what object-to-relational mappers like Hibernate do.
I think it would seem quite natural to implement objects on top of, and with the help of, an RDBMS. But I'm not sure the opposite is true.
I do feel like there's a historical perspective missing from the article which the GP touches on. Embedded KV stores aren't new (although some of the algorithms behind the current crop certainly are). They used to dominate "backend" software development until their popularity waned as the world got obsessed with "model the domain, damn the computation cost" (because all resources were doubling or more yearly) followed by "we'll just distribute it".
The need for parallelism killed the first approach and the cost of increasingly complex reduce steps killed the second. Now we're back to "how much can we fit in RAM on a local machine" and it turns out, if you can still bang bits for smart key formats, a hell of a lot.
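As a concrete example of "banging bits for smart key formats": in a sorted KV store the only index you get is the byte order of the keys, so you encode composite keys such that lexicographic order matches the scan order you want. A rough sketch in Go, with hypothetical names:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey builds "m/<seriesID>/<big-endian ts>". Big-endian integers
// compare correctly as raw bytes, so a prefix scan over one series
// returns its samples in time order without any secondary index.
func encodeKey(seriesID string, ts uint64) []byte {
	key := make([]byte, 0, 2+len(seriesID)+1+8)
	key = append(key, "m/"...)
	key = append(key, seriesID...)
	key = append(key, '/')
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ts)
	return append(key, buf[:]...)
}

func main() {
	earlier := encodeKey("cpu.load", 1700000000)
	later := encodeKey("cpu.load", 1700000060)
	// Byte-wise comparison (what LSM trees and B-trees sort by)
	// now agrees with logical time order.
	fmt.Println(bytes.Compare(earlier, later) < 0) // true
}
```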
> They began as the embedded storage layer for some bigger thing.
I immediately thought of Kafka's streaming query stuff when I read the headline (ksqlDB). I'm not sure if that's the origin story of RocksDB, but it's the storage engine underlying that streaming query tooling in Kafka's ecosystem.
We should see a rise in embedded KV popularity in correlation with ML applications. Storing embeddings in something like LevelDB, in formats such as FlatBuffers, offers a high-performance solution for online prediction (i.e., for mapping business values to their embedding format on the fly to send off to some model for inference).
Would that be on mobile devices for offline usage? I'm thinking that for typical backend use cases one would use a dedicated key value store service, right?
This would depend on your requirements and type of inference. Say you need to compute inference across 1000's of content items/documents/images every second or so, out of some corpus of millions to billions; then having a KV store on disk/SSD (NVMe) might be far more efficient and cheaper (in terms of grabbing those embeddings to conduct a downstream ML task). How you update the corpus matters too -- a lot of embedding spaces need to be updated in aggregate.
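A rough sketch of the pattern being described, assuming an embedding is just a []float32 keyed by a business ID. goleveldb stands in for "something like leveldb", and a naive fixed-width encoding stands in for FlatBuffers; the path and key are made up.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"

	"github.com/syndtr/goleveldb/leveldb"
)

// encodeEmbedding packs a float32 vector into little-endian bytes.
func encodeEmbedding(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, f := range vec {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf
}

func decodeEmbedding(buf []byte) []float32 {
	vec := make([]float32, len(buf)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[i*4:]))
	}
	return vec
}

func main() {
	db, err := leveldb.OpenFile("embeddings.db", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Offline job: write precomputed embeddings keyed by business ID.
	_ = db.Put([]byte("item:42"), encodeEmbedding([]float32{0.12, -0.98, 0.33}), nil)

	// Online path: look up the embedding just before calling the model.
	raw, err := db.Get([]byte("item:42"), nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(decodeEmbedding(raw)) // feed this to the inference call
}
```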
I've heard this a lot recently about storing embeddings. As someone who has dabbled in ML I don't understand what it means. Can you point me to a good overview of the topic please?
I work on a storage engine at $dayJob. We have created a connector for MongoDB, although for a very ancient version. We are currently working with $cloudProvider to use our storage engine in their cloud DBaaS offerings.
This field gets pretty interesting when you're talking about the trade-offs between performance, space amplification, write amplification, and read amplification.
My team has a use-case that involves a precomputed RocksDB database saved on an AWS EFS volume that is mounted on a lambda with 100's-1000's of invocations per second. It allows for some extremely fast querying of relatively static data. Another process is responsible for periodically updating the database and writing it back to the EFS volume.
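For anyone curious what that read path can look like, here is a minimal sketch using Pebble as a stand-in for RocksDB (the EFS mount path and key are made up). It opens the precomputed store read-only and serves point lookups; the wiring to the actual Lambda runtime and to the separate rebuild process is omitted.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open the precomputed store in read-only mode; the writer process
	// that periodically rebuilds it is a separate concern.
	db, err := pebble.Open("/mnt/efs/precomputed-db", &pebble.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	value, closer, err := db.Get([]byte("feature:item-42"))
	if err == pebble.ErrNotFound {
		fmt.Println("no such key")
		return
	}
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	fmt.Printf("%s\n", value)
}
```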
I am building a general-purpose data management system called Didgets (https://didgets.com/) that extensively uses KV stores that I invented. Since it was primarily designed to be a file system replacement, I used them for attaching contextual meta-data tags to file objects.
My whole container started to look like a sparsely populated relational table where every row/column intersection could have multiple values (e.g. a photo could have a tag for every person in the picture attached). I started experimenting with using the KV stores as columns to form regular relational tables.
It turns out that it was relatively easy and was extremely fast. I started building tables with 50+ million rows and many columns and performing queries against them. Benchmarking the system against other databases revealed that it was very fast (and didn't need separate indexes to accomplish this).
Here is a video showing how it does a bunch of queries 10x faster than the same data stored in a highly indexed table in Postgres: https://www.youtube.com/watch?v=OVICKCkWMZE
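Not the actual Didgets implementation, obviously, but the "one KV store per column, multiple values per cell" idea can be sketched roughly like this, with in-memory maps standing in for the on-disk stores:

```go
package main

import "fmt"

// Column is the stand-in for an on-disk KV store; multiple values per
// row ID model the "a photo can have several person tags" case.
type Column map[uint64][]string

// Table is just a collection of named columns.
type Table struct {
	columns map[string]Column
}

func NewTable(names ...string) *Table {
	t := &Table{columns: map[string]Column{}}
	for _, n := range names {
		t.columns[n] = Column{}
	}
	return t
}

func (t *Table) Set(row uint64, column, value string) {
	t.columns[column][row] = append(t.columns[column][row], value)
}

// WhereContains scans a single column and returns matching row IDs;
// a query never has to read the columns it doesn't reference, and
// sparse cells cost nothing.
func (t *Table) WhereContains(column, value string) []uint64 {
	var rows []uint64
	for row, values := range t.columns[column] {
		for _, v := range values {
			if v == value {
				rows = append(rows, row)
				break
			}
		}
	}
	return rows
}

func main() {
	t := NewTable("person", "location")
	t.Set(1, "person", "Alice")
	t.Set(1, "person", "Bob") // multiple values in one row/column cell
	t.Set(2, "location", "Oslo")
	fmt.Println(t.WhereContains("person", "Bob")) // [1]
}
```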
Are you still using it? How is the community-supported version coming along? I stopped using it after the company folded, but I do kind of miss it. Definitely one of the more interesting designs, and light years beyond what MongoDB was at the time.
There are plenty of data explorers for other databases, especially SQL DBs. I don't think it being built into the DB should be a make-it-or-break-it feature.
I used RethinkDB back in the day because it was the first DB that had pretty good replication and sharding - it was zero effort. I found the functional programming model strange: some stuff got executed locally, other parts remotely, and it was not very straightforward when things didn't go as planned.
By the time the RethinkDB company folded, CockroachDB emerged and has been my go-to distributed DB since.
No, I don't think that's relevant. They seem to implement their own B-tree [0].
They don't use a key-value store library.
I know it's a bit of a fine line. But I'm talking about standalone libraries people embed across different applications/databases. That's what RocksDB/LevelDB/Pebble are.
RethinkDB is utterly defunct as a project, has not had a substantive release in years, and in my experience just flat out doesn't work. And let's not even discuss Mongo. Asking yourself to choose between these is like selecting your favorite brand of thumbtack to step on.
The last time I used MongoDB was when it was necessary for me to demonstrate to decision makers that it silently loses data in trivial, common failure scenarios. Then I put it away and never used it again.
I was about to defend it as having come far along but actually seems like it's still having some big issues as discussed in 2020 [0].
> Yeah, there's no workaround that I can find for 3.4 (duplicate effects), 3.5 (read skew), 3.6 (cyclic information flow), or 3.7 (read own future writes). I've arranged those in "increasingly worrying order"--duplicating writes doesn't feel as bad as allowing transactions to mutually observe each other's effects, for example. The fact that you can't even rely on a single transactions' operations taking place (or, more precisely, appearing to take place) in the order they're written is especially worrying. All of these behaviors occurred with read and write concerns set to snapshot/majority.
The article says that Consul or etcd are designed to always be up, but it’s actually quite the opposite. They both leverage Raft for maintaining consensus and thus optimize for consistency at the cost of availability in case of network partitions. See CAP theorem.
There are reasons to distribute DBs that do not need to be up constantly, e.g. distributing work (transactions or queries) across more resources than are available on one machine; or to bring a replica closer to some other service to reduce latency.
Kafka Streams is the first kind; the source-of-truth storage is HA (as HA as the Kafka topics it's backed with at least) but can only be queried with high consistency when the consumer is active, and it goes down for rebalances when you scale out or fail over (and in many operational setups also when you upgrade).
That being said, I think the etcd etc. examples are just meant to be in contrast to stock Redis or Memcache, which offer very little HA support, generally just failover with minimal consistency guarantee.
[0] https://en.wikipedia.org/wiki/DBM_(computing)
[1] https://en.wikipedia.org/wiki/Berkeley_DB
[2] https://cr.yp.to/cdb.html