SplinterDB: High performance embedded key-value store (github.com/vmware)
98 points by ridruejo on May 28, 2022 | 40 comments



Check out the limitations first; no fsync and no data recovery make this of very little use. I wonder what makes you write a KV store without this from the start.


Like 99% of the use cases redis/memcached should be used for?


Redis has optional durability

https://redis.io/docs/manual/persistence/


More like RocksDB, LevelDB or Sled - which are all just libraries instead of services with a remote API.


Why write anything to disk if you can just store it in an array?


Just one example: sharing data between processes


If only there was a standardized, robust, widely available cross-platform Message Passing Interface that could do this.

I don't grok why people outside of HPC seem to be shunning MPI. The shared-nothing memory model and asynchronous nature of MPI makes it very similar in spirit to a lot of the current web dev tech, AFAICT.


If you’re not going to store the data on a fixed medium, you can use mmap for that.
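For example, a minimal sketch of sharing an array between processes through a file-backed mapping (the path and sizes are made up, and real use would need proper synchronization of concurrent updates):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define N_SLOTS 1024

  int main(void) {
    /* Every cooperating process opens the same file and maps it MAP_SHARED. */
    int fd = open("/tmp/shared_slots", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, N_SLOTS * sizeof(uint64_t)) != 0) return 1;

    uint64_t *slots = mmap(NULL, N_SLOTS * sizeof(uint64_t),
                           PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (slots == MAP_FAILED) return 1;

    slots[0] += 1;   /* updates are visible to the other processes */

    munmap(slots, N_SLOTS * sizeof(uint64_t));
    close(fd);
    return 0;
  }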


And redis/memcached are like 1% of the use cases for a database.


Their paper (https://www.usenix.org/system/files/atc20-conway.pdf) reports a 6-10x speedup on insertions and 2x lower write amplification as compared to RocksDB, on fast NVMe hardware, for certain benchmarks. The authors found that RocksDB (and other storage engines) start being bottlenecked by CPU and have come up with some ways of tackling those bottlenecks, including a novel data structure named STB-tree. The current SplinterDB version has a list of acknowledged significant limitations and is not recommended for production use by the authors (yet).


Yet if SplinterDB can't even sync changes to disk, how can that comparison be fair?

Or is it a purely theoretical comparison?


I am assuming the authors made a good-faith attempt at comparing apples to apples. In another comment thread on here, one of the authors is answering questions about fsync, durability, etc.: https://news.ycombinator.com/item?id=31543212


Why would I pick this over SQLite?


Totally different use-cases. This is an embedded key value store, not an RDBMS. You would use this in place of e.g., LevelDB or RocksDB, potentially as the storage layer of a database.


There's always the venerable:

  CREATE TABLE kv (
    k TEXT PRIMARY KEY,
    v TEXT NOT NULL
  );
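And roughly this for access (a sketch with the sqlite3 C API; kv_put/kv_get are made-up helper names and error handling is trimmed):

  #include <sqlite3.h>
  #include <string.h>

  /* Upsert one key/value pair into the kv table above. */
  static int kv_put(sqlite3 *db, const char *k, const char *v) {
    sqlite3_stmt *st;
    sqlite3_prepare_v2(db,
        "INSERT INTO kv (k, v) VALUES (?1, ?2) "
        "ON CONFLICT (k) DO UPDATE SET v = excluded.v;", -1, &st, NULL);
    sqlite3_bind_text(st, 1, k, -1, SQLITE_STATIC);
    sqlite3_bind_text(st, 2, v, -1, SQLITE_STATIC);
    int rc = sqlite3_step(st);                 /* SQLITE_DONE on success */
    sqlite3_finalize(st);
    return rc == SQLITE_DONE ? 0 : -1;
  }

  /* Fetch the value for a key into out; returns -1 if the key is absent. */
  static int kv_get(sqlite3 *db, const char *k, char *out, size_t outlen) {
    sqlite3_stmt *st;
    sqlite3_prepare_v2(db, "SELECT v FROM kv WHERE k = ?1;", -1, &st, NULL);
    sqlite3_bind_text(st, 1, k, -1, SQLITE_STATIC);
    int found = (sqlite3_step(st) == SQLITE_ROW);
    if (found) {
      strncpy(out, (const char *)sqlite3_column_text(st, 0), outlen - 1);
      out[outlen - 1] = '\0';
    }
    sqlite3_finalize(st);
    return found ? 0 : -1;
  }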
Even if sqlite is technically an RDBMS, I think it's a legitimate comparison. Is SplinterDB worth giving up sqlite's reliability and feature set?


This is much lower-level than sqlite. In fact, you could use this as the storage layer for a SQL DB. See, e.g., MyRocks[0] which is a MySQL backend that uses RocksDB as the storage layer.

In other words, you'd use this when you just need a persistent KV store and want to build the higher level semantics according to your application's needs.

[0] http://myrocks.io/


> In other words, you'd use this when you just need a persistent KV store and want to build the higher level semantics according to your application's needs.

Why can't you use SQLite for this use case? I believe FDB uses SQLite as an embedded KV store.


What it comes down to is performance.

You can use a relational database such as SQLite in place of a low-level key-value store such as RocksDB or SplinterDB, but then you pay for the higher-level semantics with lower performance.


I'm not very familiar with foundationdb, but I'm confident they're not using sqlite as the storage layer. That would come with a tremendous performance penalty. The docs say that "The SSD storage engine stores the data in a B-tree based on SQLite" which makes me think that they're just using the storage layer from SQLite (i.e., the part of it that corresponds to splinterdb/rocksdb).


I did see a full copy of the SQLite amalgamation file in the FDB codebase, but you're probably right that they might be using internal APIs.

I'm still skeptical of the "tremendous performance penalty" you'd suffer from using SQLite. Just because you do fewer things doesn't necessarily mean you're faster at doing them. I've hit ~120,000 inserts/sec on SQLite without weakening any of its durability guarantees. If you play fast and loose with fsync and WAL, I'm sure you can squeeze out even more performance.
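In SQLite terms that means pragmas along these lines (a sketch via the C API; the helper name is mine, and WAL plus synchronous=NORMAL trades a little durability for speed):

  #include <sqlite3.h>

  /* WAL mode with relaxed syncing: commits are no longer fsynced one by one,
   * so a crash can lose the most recent transactions, but the database file
   * itself should stay uncorrupted. */
  static void kv_tune_for_speed(sqlite3 *db) {
    sqlite3_exec(db, "PRAGMA journal_mode=WAL;",   NULL, NULL, NULL);
    sqlite3_exec(db, "PRAGMA synchronous=NORMAL;", NULL, NULL, NULL);
  }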

I can also think of use-cases where you don't want the write amplification that comes with RocksDB or the memory constraints of LMDB.


I am not familiar with SplinterDB, but I do have a lot of familiarity with RocksDB. These types of k/v storage layers are designed to handle hundreds of thousands if not millions of operations per sec. Especially the way they handle writes (typically with an LSM) is very different from SQLite, and it shows in terms of throughput of e.g. random writes.

I'd say that these low-level storage engines have more in common with filesystems than SQLite, they're just not in the same ballpark at all.


Your confidence is misplaced. The storage layer is SQLite.


They are leaving sqlite

"In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine."

https://apple.github.io/foundationdb/architecture.html#stora...


You mean like dict() in Python? What's the use case for this?


It's like a dict, but persisted to disk. This means that it's both durable (so if your process/machine crashes, you don't lose data) and also can store datasets larger than memory.
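Roughly this, but backed by files on disk (a sketch using LevelDB's C API as a stand-in, since SplinterDB's own API isn't shown in this thread; error checks omitted):

  #include <leveldb/c.h>
  #include <stdio.h>

  int main(void) {
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/kvdemo", &err);

    /* "dict[key] = value", but persisted to disk */
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, "key", 3, "value", 5, &err);

    /* read it back, possibly in a later run of the program */
    size_t vlen = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, "key", 3, &vlen, &err);
    printf("%.*s\n", (int)vlen, val);

    leveldb_free(val);
    leveldb_close(db);
    return 0;
  }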


You call it "high performance" and provide no benchmarks?


The paper has benchmarks (https://www.usenix.org/system/files/atc20-conway.pdf), but yeah, if you check out the list of limitations it looks more like a research project at this stage. Pretty interesting architecture overall, though.


Ah, that's helpful, and explains why it exists:

"Three novel ideas contribute to the high performance of SplinterDB: the STB-tree, a new compaction policy that exposes more concurrency, and a concurrent memtable and user-level cache that removes scalability bottlenecks. All three components are designed to enable the CPU to drive high IOPS without wasting cycles."

"At the heart of SplinterDB is the STB-tree, a novel data structure that combines ideas from log-structured merge tree and B-trees. The STB-tree adapts the idea of size-tiering (also known as fragmentation) from key-value stores such as Cassandra and PebblesDB and applies them to B-trees to reduce write amplification by reducing the number of times a data item is re-written during compaction."


The numbers look very good actually.

I don't care if it's a research project. If it doesn't crash, doesn't corrupt data, and delivers performance, it's useful.

I'd want to see performance against Redis and KeyDB.


Well, you should read the limitations… I think they are actually cheating by not calling fsync at all, which makes writes not durable. This is different in rocks/pebble and friends.

> I'd want to see performance against Redis and KeyDB.

I think this is an apples-to-oranges comparison, as neither of these provides durability by default, and if you enable it, Redis had terrible performance last I checked. Plus, Redis needs to fit the whole dataset in memory.


Hi, research lead for SplinterDB here.

SplinterDB does make all writes durable and in fact has its own user-level cache which generally performs writes directly to disk (using O_DIRECT for example).

Like RocksDB's default behavior (no fsyncs on the log), it does not immediately sync writes to its log when they happen. It waits to sync in batches, so that writes may not be immediately durable, but logging is more efficient. This is a slightly stronger default durability guarantee, and we intend to make this configurable.


I’m a little confused. If you don’t ensure data is committed to storage (log or otherwise) before acking the write request, how can you call it durable?

If it’s not truly 100% durable by default, it’s best not to suggest that it is. Experience says people will use the default settings and then become very cross if they lose data. It undermines trust and is harmful to reputation.


With many workloads, there's a tradeoff between the granularity of durability and the overall performance.

If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write; this causes a 4KiB page write to sync, which is 40x write amplification. Suddenly a 2GiB/sec SSD can effectively only write 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs).

So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.

RocksDB offers a fairly weak guarantee by default too, writing to the write-ahead-log (WAL), but not performing fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).
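For reference, the per-write knob there looks something like this (a sketch with RocksDB's C API; sync=1 fsyncs the WAL before the write is acknowledged, while the default leaves it to the OS):

  #include <rocksdb/c.h>
  #include <stdlib.h>

  /* Per-write durability knob: sync=1 fsyncs the WAL before the write is
   * acknowledged; the default (0) only hands it to the OS buffer cache. */
  static void put_durably(rocksdb_t *db, const char *k, size_t klen,
                          const char *v, size_t vlen) {
    char *err = NULL;
    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts, 1);
    rocksdb_put(db, wopts, k, klen, v, vlen, &err);
    free(err);                       /* error handling elided for brevity */
    rocksdb_writeoptions_destroy(wopts);
  }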


You’re absolutely correct about those facts, but you’re also avoiding the thrust of my argument about improperly calling your database durable when it is decidedly not and could fail a trivial power-cut test. A database’s one job is not to lose data.

I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.

The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.


Transaction merging allows you to handle that nicely, by batching concurrent writes and merging them into a single write to the disk.
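A minimal sketch of the idea (my own illustration, not how any particular engine implements it): each writer appends its own record, but a single "leader" fsync acknowledges the whole batch.

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
  static int  log_fd;                  /* assumed open in append mode */
  static long appended, synced;        /* highest record staged / fsynced */
  static int  sync_in_progress;

  void log_commit(const void *rec, size_t len) {
    pthread_mutex_lock(&m);
    write(log_fd, rec, len);           /* stage my record (checks omitted) */
    long my_seq = ++appended;

    while (synced < my_seq) {
      if (!sync_in_progress) {
        sync_in_progress = 1;          /* I become the leader */
        long batch_end = appended;     /* my fsync will cover all of these */
        pthread_mutex_unlock(&m);
        fsync(log_fd);                 /* one fsync for the whole batch */
        pthread_mutex_lock(&m);
        synced = batch_end;
        sync_in_progress = 0;
        pthread_cond_broadcast(&cv);
      } else {
        pthread_cond_wait(&cv, &m);    /* someone else's fsync may cover me */
      }
    }
    pthread_mutex_unlock(&m);
  }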


I missed the use of direct I/O, and the comment about fsync threw me off, thanks. Very impressive then!


O_DIRECT doesn't provide power-cut durability on storage devices with write cache.

Recently written and acknowledged data can still be lost on a power cut.

You still need fsync, fdatasync or equivalent after an O_DIRECT write, to tell the storage device to commit its write cache to the non-volatile layer.

(And last time I looked, I think some filesystems even incorrectly failed to flush the device write cache on fsync after O_DIRECT writes because of no dirty page states.)
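The full sequence being described looks roughly like this (a sketch assuming Linux, a 4 KiB block size, and O_DIRECT alignment rules; the function name is made up):

  #define _GNU_SOURCE                  /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  enum { BLOCK = 4096 };

  /* Write one block with O_DIRECT, then flush the device write cache. */
  int write_block_durably(const char *path, const void *data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return -1;

    void *buf;                         /* O_DIRECT needs aligned buffer/length */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { close(fd); return -1; }
    memset(buf, 0, BLOCK);
    memcpy(buf, data, len < BLOCK ? len : BLOCK);

    ssize_t n = pwrite(fd, buf, BLOCK, 0);    /* bypasses the page cache...    */
    int rc = (n == BLOCK) ? fdatasync(fd)     /* ...but the device cache still */
                          : -1;               /* needs an explicit flush       */
    free(buf);
    close(fd);
    return rc;
  }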


There’s a ton of devices on the market that would lie to you too, saying caches are flushed while they ain’t. If you really want that data to be there, better use “server grade” HW with power-loss protection.


Yeah, I would appreciate a benchmark against its main alternative, RocksDB. I know benchmarks are typically manufactured and not too representative of real-world load, but at least a ballpark figure would be nice to know what we’re talking about here.

Their main website is at https://splinterdb.org/ by the way, for those interested. Also no benchmarks there. :)


The paper referenced in the other comment includes a benchmark against RocksDB https://news.ycombinator.com/item?id=31515765



