I feel like this is missing any mention of the history of KV stores. Unix came with an embedded database (dbm) from the early days (1979) [0] which was rewritten at Berkeley into the more popular bdb in the 80s. [1] Sendmail was one of the more common programs that used it. And then when djb built his replacement for sendmail, qmail, he invented cdb. [2]
This codebase shows how SSTables, WAL, memtables, recordio, skiplists, segment files, and other storage engine components work in a digestible way. Includes a demo database showing how it all comes together to make a RocksDB / LevelDB competitor (not really).
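For anyone who wants a feel for how those pieces fit together, here is a minimal, self-contained sketch in Go (not taken from the linked codebase; all names are made up) of the classic write path: append a length-prefixed record to a write-ahead log, sync it, then insert into the memtable. A real engine would back the memtable with a skiplist and flush it to SSTables; a sorted map keeps the sketch short.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"sort"
	"sync"
)

// Memtable stands in for the skiplist a real engine would use;
// a plain map plus sorted iteration keeps the example short.
type Memtable struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func NewMemtable() *Memtable { return &Memtable{data: map[string][]byte{}} }

func (m *Memtable) Put(key string, value []byte) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

// SortedKeys is what a flush to an SSTable would iterate over.
func (m *Memtable) SortedKeys() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// DB couples a WAL with the memtable: every Put is made durable in the
// log before it becomes visible in memory.
type DB struct {
	wal *os.File
	mem *Memtable
}

func Open(walPath string) (*DB, error) {
	f, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &DB{wal: f, mem: NewMemtable()}, nil
}

func (db *DB) Put(key string, value []byte) error {
	// Length-prefixed record, recordio-style: [keyLen][valLen][key][value].
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(value)))
	rec := append(append(hdr[:], key...), value...)
	if _, err := db.wal.Write(rec); err != nil {
		return err
	}
	if err := db.wal.Sync(); err != nil { // durability before visibility
		return err
	}
	db.mem.Put(key, value)
	return nil
}

func main() {
	db, err := Open("demo.wal")
	if err != nil {
		panic(err)
	}
	_ = db.Put("user:1", []byte("alice"))
	_ = db.Put("user:0", []byte("bob"))
	fmt.Println(db.mem.SortedKeys()) // [user:0 user:1]
}
```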
Judging from the precipitous decline in Badger commits since 2021 [0] and the fact that the original/primary author is no longer with dgraph [1] or working on Badger, it may be worth looking at Cockroach's Pebble [2] instead.
+100 and an upvote. Badger DB seems so under-rated to me and is a great drop-in replacement for an embedded KV store. Amazing for several simple sharded side projects!
I get that you can replace the storage engine and theoretically get more performance, but in practice compatibility and standardization are more important, because a lot of products (including third-party ones) will already use PostgreSQL/MongoDB/Redis, so it's a no-brainer to use them for your solution as well.
However, for me to pick RocksDB or some other new, shiny database/storage engine, there would have to be more compelling reasons.
Unless you are building a database, these embedded KV store libraries are less likely to be the best solution for the job. If you are considering them for an app that isn't a database, you should also take a long, hard look at SQLite first.
What's also interesting is the trend of newer distributed "database systems" like Vitess[0] or SpiceDB[1] that forego embedded KV stores and instead reuse existing SQL databases as their "embedded database". Vitess leverages MySQL, and SpiceDB leverages MySQL, PostgreSQL, CockroachDB, or Spanner. Systems built this way get to leverage many high-level features from existing database systems, which lets them focus on innovating in even higher-level functionality. In the case of Vitess, that's scaling, distributing, and managing the schema of MySQL. In the case of SpiceDB, it's building a database specifically optimized for querying access control data in a way that can coordinate causality across multiple services.
Like S3 or Redis, RocksDB is much more performant when you don't need the query engine and want to have highly compact storage with fast lookups and high write throughput.
Storage engines come in different levels of complexity based on the query requirements. Simple K/V stores can run circles around Postgres/MySQL as long as you don't need the extra features.
In your list RocksDB is most like Redis, but even faster because the data doesn't have to leave the process.
Think of it as a high performance sports car like a Ferrari. It's not good at taking the kids to school or buying groceries. But if you need to prioritise performance at the expense of all other considerations then it's exactly what you need.
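To make "the data doesn't have to leave the process" concrete, here is roughly what the write/read path looks like with an embedded store. I'm using Badger (mentioned upthread) because it's pure Go; RocksDB via its C++ or JNI bindings follows the same open/put/get pattern. The path and key are made up for the example.

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// The "server" is just a directory; there is no network hop.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes go through a transaction, similar in spirit to RocksDB's WriteBatch.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("session:abc"), []byte(`{"user":42}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reads are an in-process function call, not an RPC.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("session:abc"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Printf("%s\n", val)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```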
IMO it's just confusing to call both, say, RocksDB and MySQL "databases". They sit at different levels of the stack and it is easier to just think of them as entirely different things, your "SQL database" and your "storage engine". So your stack looks like
Application
|
MySQL
|
RocksDB
|
Filesystem
In general the MySQL layer is doing all the convenient stuff for application developers like supporting different queries and datatypes. The RocksDB layer is optimizing for performance metrics like throughput and reliability and just treats data as sequences of bytes.
Yes, that's right, InnoDB is the default MySQL storage engine and you can replace InnoDB with RocksDB. To summarize in one sentence, InnoDB is better at reads and RocksDB is better at writes, but if you were making an actual decision you should look at more detailed information than my one-sentence summary, such as:
The article misses the point. All data storage and query systems end up architected in layers. Upper layers deal with higher abstractions (objects, rows, whatever). Lower layers deal with simpler functions, closer to the hardware. The upper layers are consumers of the lower layers. This is where "embedded KV stores" like LevelDB, RocksDB, etc come from. They began as the embedded storage layer for some bigger thing. Every product you think of as a database or document store is built like this, including MySQL and PostgreSQL and Oracle. Such a storage layer, shipped as an independent library, is how you (or anyone) builds your own database-ish thing. That's what the article should say.
The list of examples is odd. For instance, MongoRocks is cited for using RocksDB, but actual stock MongoDB uses WiredTiger, which isn't mentioned.
Disclosure: I played a part in the late beginning of this space when Netscape funded Sleepycat to develop BerkeleyDB. dbm and ndbm existed beforehand, but BerkeleyDB's use in LDAP servers is, I think, the genesis point for this pattern as it exists today.
> Upper layers deal with higher abstractions (objects, rows, whatever)
Right, I'm waiting for a standard for the level above relational databases, which is object databases. I know several already exist, and there are object-relational mapping layers.
I think the key point there is that object databases are a level ABOVE relational databases. They are not "better", but they deal with the higher level of objects rather than "tables", just like relational databases can be seen as a level above key-value stores.
I would like Object databases to become better and easier to use and more standardized.
I think there is value in being able to see both levels: the objects, and the relational data that makes up the objects.
Neither objects nor relations are "above" the other. You can map them in a vacuous mathematical sense, but it's a massively leaky abstraction in either direction.
When I use the word "above" I mean "layers" of code. So if an object database was implemented by using a relational database, it would be "above" the layer of the RDBMS.
I think that is what object-to-relational mappers like Hibernate do.
I think it would seem quite natural to implement objects on top of, and with the help of, an RDBMS. But I'm not sure the opposite is true.
I do feel like there's a historical perspective missing from the article which the GP touches on. Embedded KV stores aren't new (although some of the algorithms behind the current crop certainly are). They used to dominate "backend" software development until their popularity waned as the world got obsessed with "model the domain, damn the computation cost" (because all resources were doubling or more yearly) followed by "we'll just distribute it".
The need for parallelism killed the first approach and the cost of increasingly complex reduce steps killed the second. Now we're back to "how much can we fit in RAM on a local machine" and it turns out, if you can still bang bits for smart key formats, a hell of a lot.
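As a concrete example of "banging bits for smart key formats": in a sorted KV store the only index you get is the byte order of the keys, so you encode composite keys such that lexicographic order matches the scan order you want. A rough sketch in Go, with hypothetical names:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey builds "m/<seriesID>/<big-endian ts>". Big-endian integers
// compare correctly as raw bytes, so a prefix scan over one series
// returns its samples in time order without any secondary index.
func encodeKey(seriesID string, ts uint64) []byte {
	key := make([]byte, 0, 2+len(seriesID)+1+8)
	key = append(key, "m/"...)
	key = append(key, seriesID...)
	key = append(key, '/')
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ts)
	return append(key, buf[:]...)
}

func main() {
	earlier := encodeKey("cpu.load", 1700000000)
	later := encodeKey("cpu.load", 1700000060)
	// Byte-wise comparison (what LSM trees and B-trees sort by)
	// now agrees with logical time order.
	fmt.Println(bytes.Compare(earlier, later) < 0) // true
}
```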
> They began as the embedded storage layer for some bigger thing.
I immediately thought of Kafka's streaming query stuff when I read the headline (ksqlDB). I'm not sure if that's the origin story of RocksDB, but it's the storage engine underlying that streaming query tooling in Kafka's ecosystem.
We should see a rise in embedded KV popularity in correlation with ML applications. Storing embeddings in something like LevelDB, in formats such as FlatBuffers, offers a high-performance solution for online prediction (i.e., for mapping business values to their embedding format on the fly to send off to some model for inference).
Would that be on mobile devices for offline usage? I'm thinking that for typical backend use cases one would use a dedicated key value store service, right?
This would depend on your requirements and type of inference. Say you need to compute inference across 1000's of content items/documents/images every second or so, out of some corpus of millions to billions; then having a KV store on disk/SSD (NVMe) might be far more efficient and cheaper (in terms of grabbing those embeddings to conduct a downstream ML task). How you update the corpus matters too -- a lot of embedding spaces need to be updated in aggregate.
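A rough sketch of the pattern being described, assuming an embedding is just a []float32 keyed by a business ID. goleveldb stands in for "something like leveldb", and a naive fixed-width encoding stands in for FlatBuffers; the path and key are made up.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"

	"github.com/syndtr/goleveldb/leveldb"
)

// encodeEmbedding packs a float32 vector into little-endian bytes.
func encodeEmbedding(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, f := range vec {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf
}

func decodeEmbedding(buf []byte) []float32 {
	vec := make([]float32, len(buf)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[i*4:]))
	}
	return vec
}

func main() {
	db, err := leveldb.OpenFile("embeddings.db", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Offline job: write precomputed embeddings keyed by business ID.
	_ = db.Put([]byte("item:42"), encodeEmbedding([]float32{0.12, -0.98, 0.33}), nil)

	// Online path: look up the embedding just before calling the model.
	raw, err := db.Get([]byte("item:42"), nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(decodeEmbedding(raw)) // feed this to the inference call
}
```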
I've heard this a lot recently about storing embeddings. As someone who has dabbled in ML I don't understand what it means. Can you point me to a good overview of the topic please?
I work on a storage engine at $dayJob. We have created a connector for MongoDB, although for a very ancient version. We are currently working with $cloudProvider to use our storage engine in their cloud DBaaS offerings.
This field gets pretty interesting when you're talking about the trade-offs between performance, space amplification, write amplification, and read amplification.
My team has a use-case that involves a precomputed RocksDB database saved on an AWS EFS volume that is mounted on a lambda with 100's-1000's of invocations per second. It allows for some extremely fast querying of relatively static data. Another process is responsible for periodically updating the database and writing it back to the EFS volume.
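For anyone curious what that read path can look like, here is a minimal sketch using Pebble as a stand-in for RocksDB (the EFS mount path and key are made up). It opens the precomputed store read-only and serves point lookups; the wiring to the actual Lambda runtime and to the separate rebuild process is omitted.

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open the precomputed store in read-only mode; the writer process
	// that periodically rebuilds it is a separate concern.
	db, err := pebble.Open("/mnt/efs/precomputed-db", &pebble.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	value, closer, err := db.Get([]byte("feature:item-42"))
	if err == pebble.ErrNotFound {
		fmt.Println("no such key")
		return
	}
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	fmt.Printf("%s\n", value)
}
```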
I am building a general-purpose data management system called Didgets (https://didgets.com/) that extensively uses KV stores that I invented. Since it was primarily designed to be a file system replacement, I used them for attaching contextual meta-data tags to file objects.
My whole container started to look like a sparsely populated relational table where every row/column intersection could have multiple values (e.g. a photo could have a tag for every person in the picture attached). I started experimenting with using the KV stores as columns to form regular relational tables.
It turns out that it was relatively easy and was extremely fast. I started building tables with 50+ million rows and many columns and performing queries against them. Benchmarking the system against other databases revealed that it was very fast (and didn't need separate indexes to accomplish this).
Here is a video showing how it does a bunch of queries 10x faster than the same data stored in a highly indexed table in Postgres: https://www.youtube.com/watch?v=OVICKCkWMZE
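Not the actual Didgets implementation, obviously, but the "one KV store per column, multiple values per cell" idea can be sketched roughly like this, with in-memory maps standing in for the on-disk stores:

```go
package main

import "fmt"

// Column is the stand-in for an on-disk KV store; multiple values per
// row ID model the "a photo can have several person tags" case.
type Column map[uint64][]string

// Table is just a collection of named columns.
type Table struct {
	columns map[string]Column
}

func NewTable(names ...string) *Table {
	t := &Table{columns: map[string]Column{}}
	for _, n := range names {
		t.columns[n] = Column{}
	}
	return t
}

func (t *Table) Set(row uint64, column, value string) {
	t.columns[column][row] = append(t.columns[column][row], value)
}

// WhereContains scans a single column and returns matching row IDs;
// a query never has to read the columns it doesn't reference, and
// sparse cells cost nothing.
func (t *Table) WhereContains(column, value string) []uint64 {
	var rows []uint64
	for row, values := range t.columns[column] {
		for _, v := range values {
			if v == value {
				rows = append(rows, row)
				break
			}
		}
	}
	return rows
}

func main() {
	t := NewTable("person", "location")
	t.Set(1, "person", "Alice")
	t.Set(1, "person", "Bob") // multiple values in one row/column cell
	t.Set(2, "location", "Oslo")
	fmt.Println(t.WhereContains("person", "Bob")) // [1]
}
```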
Are you still using it? How is the community-supported version coming along? I stopped using it after the company folded, but I do kind of miss it. Definitely one of the more interesting designs, and light years beyond what MongoDB was at the time.
There are plenty of data explorers for other databases, especially SQL DBs. I don't think it being built into the DB should be a make-it-or-break-it feature.
I used RethinkDB back in the day because it was the first DB that had pretty good replication and sharding - it was zero effort. I found the functional programming model strange: some stuff got executed locally, other parts remotely, and it was not very straightforward when things didn't go as planned.
By the time the RethinkDB company folded, CockroachDB emerged and has been my go-to distributed DB since.
No, I don't think that's relevant. They seem to implement their own B-tree [0].
They don't use a key-value store library.
I know it's a bit of a fine line. But I'm talking about standalone libraries people embed across different applications/databases. That's what RocksDB/LevelDB/Pebble are.
RethinkDB is utterly defunct as a project, has not had a substantive release in years, and in my experience just flat out doesn't work. And let's not even discuss Mongo. Asking yourself to choose between these is like selecting your favorite brand of thumbtack to step on.
The last time I used MongoDB was when it was necessary for me to demonstrate to decision makers that it silently loses data in trivial, common failure scenarios. Then I put it away and never used it again.
I was about to defend it as having come far along but actually seems like it's still having some big issues as discussed in 2020 [0].
> Yeah, there's no workaround that I can find for 3.4 (duplicate effects), 3.5 (read skew), 3.6 (cyclic information flow), or 3.7 (read own future writes). I've arranged those in "increasingly worrying order"--duplicating writes doesn't feel as bad as allowing transactions to mutually observe each other's effects, for example. The fact that you can't even rely on a single transactions' operations taking place (or, more precisely, appearing to take place) in the order they're written is especially worrying. All of these behaviors occurred with read and write concerns set to snapshot/majority.
The article says that Consul or etcd are designed to always be up, but it’s actually quite the opposite. They both leverage Raft for maintaining consensus and thus optimize for consistency at the cost of availability in case of network partitions. See CAP theorem.
There are reasons to distribute DBs that do not need to be up constantly, e.g. distributing work (transactions or queries) across more resources than are available on one machine; or to bring a replica closer to some other service to reduce latency.
Kafka Streams is the first kind; the source-of-truth storage is HA (as HA as the Kafka topics it's backed with at least) but can only be queried with high consistency when the consumer is active, and it goes down for rebalances when you scale out or fail over (and in many operational setups also when you upgrade).
That being said, I think the etcd etc. examples are just meant to be in contrast to stock Redis or Memcache, which offer very little HA support, generally just failover with minimal consistency guarantee.
[0] https://en.wikipedia.org/wiki/DBM_(computing)
[1] https://en.wikipedia.org/wiki/Berkeley_DB
[2] https://cr.yp.to/cdb.html