SplinterDB: High performance embedded key-value store (github.com/vmware)
98 points by ridruejo on May 28, 2022 | 40 comments



Check out the limitations first; no fsync and no data recovery make this of very little use. I wonder what makes you write a KV store without this from the start.


Like 99% of the use cases redis/memcached should be used for?


Redis has optional durability

https://redis.io/docs/manual/persistence/


More like RocksDB, LevelDB or Sled - which are all just libraries instead of services with a remote API.


Why write anything to disk if you can just store it in an array?


Just one example: sharing data between processes


If only there was a standardized, robust, widely available cross-platform Message Passing Interface that could do this.

I don't grok why people outside of HPC seem to be shunning MPI. The shared-nothing memory model and asynchronous nature of MPI makes it very similar in spirit to a lot of the current web dev tech, AFAICT.


If you’re not going to store the data on a fixed medium, you can use mmap for that.
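For example, a minimal sketch of sharing an array between processes through a file-backed mapping (the path and sizes are made up, and real use would need proper synchronization of concurrent updates):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define N_SLOTS 1024

  int main(void) {
    /* Every cooperating process opens the same file and maps it MAP_SHARED. */
    int fd = open("/tmp/shared_slots", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, N_SLOTS * sizeof(uint64_t)) != 0) return 1;

    uint64_t *slots = mmap(NULL, N_SLOTS * sizeof(uint64_t),
                           PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (slots == MAP_FAILED) return 1;

    slots[0] += 1;   /* updates are visible to the other processes */

    munmap(slots, N_SLOTS * sizeof(uint64_t));
    close(fd);
    return 0;
  }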


And redis/memcached are like 1% of the use cases for a database.


Their paper (https://www.usenix.org/system/files/atc20-conway.pdf) reports a 6-10x speedup on insertions and 2x lower write amplification as compared to RocksDB, on fast NVMe hardware, for certain benchmarks. The authors found that RocksDB (and other storage engines) start being bottlenecked by CPU and have come up with some ways of tackling those bottlenecks, including a novel data structure named STB-tree. The current SplinterDB version has a list of acknowledged significant limitations and is not recommended for production use by the authors (yet).


Yet if SplinterDB can't even sync changes to disk, how can that comparison be fair?

Or is it a purely theoretical comparison?


I am assuming the authors made a good-faith attempt at comparing apples to apples. In another comment thread on here, one of the authors is answering questions about fsync, durability, etc.: https://news.ycombinator.com/item?id=31543212


Why would I pick this over SQLite?


Totally different use-cases. This is an embedded key value store, not an RDBMS. You would use this in place of e.g., LevelDB or RocksDB, potentially as the storage layer of a database.


There's always the venerable:

  CREATE TABLE kv (
    k TEXT PRIMARY KEY,
    v TEXT NOT NULL
  );
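And roughly this for access (a sketch with the sqlite3 C API; kv_put/kv_get are made-up helper names and error handling is trimmed):

  #include <sqlite3.h>
  #include <string.h>

  /* Upsert one key/value pair into the kv table above. */
  static int kv_put(sqlite3 *db, const char *k, const char *v) {
    sqlite3_stmt *st;
    sqlite3_prepare_v2(db,
        "INSERT INTO kv (k, v) VALUES (?1, ?2) "
        "ON CONFLICT (k) DO UPDATE SET v = excluded.v;", -1, &st, NULL);
    sqlite3_bind_text(st, 1, k, -1, SQLITE_STATIC);
    sqlite3_bind_text(st, 2, v, -1, SQLITE_STATIC);
    int rc = sqlite3_step(st);                 /* SQLITE_DONE on success */
    sqlite3_finalize(st);
    return rc == SQLITE_DONE ? 0 : -1;
  }

  /* Fetch the value for a key into out; returns -1 if the key is absent. */
  static int kv_get(sqlite3 *db, const char *k, char *out, size_t outlen) {
    sqlite3_stmt *st;
    sqlite3_prepare_v2(db, "SELECT v FROM kv WHERE k = ?1;", -1, &st, NULL);
    sqlite3_bind_text(st, 1, k, -1, SQLITE_STATIC);
    int found = (sqlite3_step(st) == SQLITE_ROW);
    if (found) {
      strncpy(out, (const char *)sqlite3_column_text(st, 0), outlen - 1);
      out[outlen - 1] = '\0';
    }
    sqlite3_finalize(st);
    return found ? 0 : -1;
  }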
Even if sqlite is technically an RDBMS, I think it's a legitimate comparison. Is SplinterDB worth giving up sqlite's reliability and feature set?


This is much lower-level than sqlite. In fact, you could use this as the storage layer for a SQL DB. See, e.g., MyRocks[0] which is a MySQL backend that uses RocksDB as the storage layer.

In other words, you'd use this when you just need a persistent KV store and want to build the higher level semantics according to your application's needs.

[0] http://myrocks.io/


> In other words, you'd use this when you just need a persistent KV store and want to build the higher level semantics according to your application's needs.

Why can't you use SQLite for this use case? I believe FDB uses SQLite as an embedded KV store.


What it comes down to is performance.

You can use a relational database such as SQLite in place of a low-level key-value store such as RocksDB or SplinterDB, but then you pay for the higher-level semantics with lower performance.


I'm not very familiar with foundationdb, but I'm confident they're not using sqlite as the storage layer. That would come with a tremendous performance penalty. The docs say that "The SSD storage engine stores the data in a B-tree based on SQLite" which makes me think that they're just using the storage layer from SQLite (i.e., the part of it that corresponds to splinterdb/rocksdb).


I did see a full copy of the SQLite amalgamation file in the FDB codebase, but you're probably right that they might be using internal APIs.

I'm still skeptical of the "tremendous performance penalty" you'd suffer from using SQLite. Just because you do fewer things doesn't necessarily mean you're faster at doing them. I've hit ~120,000 inserts/sec on SQLite without weakening any of its durability guarantees. If you play fast and loose with fsync and WAL, I'm sure you can squeeze out even more performance.
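In SQLite terms that means pragmas along these lines (a sketch via the C API; the helper name is mine, and WAL plus synchronous=NORMAL trades a little durability for speed):

  #include <sqlite3.h>

  /* WAL mode with relaxed syncing: commits are no longer fsynced one by one,
   * so a crash can lose the most recent transactions, but the database file
   * itself should stay uncorrupted. */
  static void kv_tune_for_speed(sqlite3 *db) {
    sqlite3_exec(db, "PRAGMA journal_mode=WAL;",   NULL, NULL, NULL);
    sqlite3_exec(db, "PRAGMA synchronous=NORMAL;", NULL, NULL, NULL);
  }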

I can also think of use-cases where you don't want the write amplification that comes with RocksDB or the memory constraints of LMDB.


I am not familiar with SplinterDB, but I do have a lot of familiarity with RocksDB. These types of k/v storage layers are designed to handle hundreds of thousands if not millions of operations per sec. Especially the way they handle writes (typically with an LSM) is very different from SQLite, and it shows in terms of throughput of e.g. random writes.

I'd say that these low-level storage engines have more in common with filesystems than SQLite, they're just not in the same ballpark at all.


Your confidence is misplaced. The storage layer is SQLite.


They are leaving sqlite

"In the upcoming FoundationDB 7.0 release, the B-tree storage engine will be replaced with a brand new Redwood engine."

https://apple.github.io/foundationdb/architecture.html#stora...


You mean like dict() in Python? What's the use case for this?


It's like a dict, but persisted to disk. This means that it's both durable (so if your process/machine crashes, you don't lose data) and also can store datasets larger than memory.
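Roughly this, but backed by files on disk (a sketch using LevelDB's C API as a stand-in, since SplinterDB's own API isn't shown in this thread; error checks omitted):

  #include <leveldb/c.h>
  #include <stdio.h>

  int main(void) {
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/kvdemo", &err);

    /* "dict[key] = value", but persisted to disk */
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, "key", 3, "value", 5, &err);

    /* read it back, possibly in a later run of the program */
    size_t vlen = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, "key", 3, &vlen, &err);
    printf("%.*s\n", (int)vlen, val);

    leveldb_free(val);
    leveldb_close(db);
    return 0;
  }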


You call it "high performance" and provide no benchmarks?


The paper has benchmarks (https://www.usenix.org/system/files/atc20-conway.pdf), but yeah, if you check out the list of limitations it looks more like a research project at this stage. Pretty interesting architecture overall, though.


Ah, that's helpful, and explains why it exists:

"Three novel ideas contribute to the high performance of SplinterDB: the STB-tree, a new compaction policy that exposes more concurrency, and a concurrent memtable and user-level cache that removes scalability bottlenecks. All three components are designed to enable the CPU to drive high IOPS without wasting cycles."

"At the heart of SplinterDB is the STB-tree, a novel data structure that combines ideas from log-structured merge tree and B-trees. The STB-tree adapts the idea of size-tiering (also known as fragmentation) from key-value stores such as Cassandra and PebblesDB and applies them to B-trees to reduce write amplification by reducing the number of times a data item is re-written during compaction."


The numbers look very good actually.

I don't care if it's a research project. If it doesn't crash, doesn't corrupt data, and delivers performance, it's useful.

I'd want to see performance against Redis and KeyDB.


Well, you should read the limitations… I think they are actually cheating by not calling fsync at all, which makes writes not durable. This is different in rocks/pebble and friends.

> I'd want to see performance against Redis and KeyDB.

I think this is an apples-to-oranges comparison, as neither of these provides durability by default, and if you enable it, Redis had terrible performance last I checked. Plus, Redis needs to fit the whole dataset in memory.


Hi, research lead for SplinterDB here.

SplinterDB does make all writes durable and in fact has its own user-level cache which generally performs writes directly to disk (using O_DIRECT for example).

Like RocksDB's default behavior (no fsyncs on the log), it does not immediately sync writes to its log when they happen. It waits to sync in batches, so that writes may not be immediately durable, but logging is more efficient. This is a slightly stronger default durability guarantee, and we intend to make this configurable.


I’m a little confused. If you don’t ensure data is committed to storage (log or otherwise) before acking the write request, how can you call it durable?

If it’s not truly 100% durable by default, it’s best not to suggest that it is. Experience says people will use the default settings and then become very cross if they lose data. It undermines trust and is harmful to reputation.


With many workloads, there's a tradeoff between the granularity of durability and the overall performance.

If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write; this causes a 4KiB page write to sync, which is 40x write amplification. Suddenly a 2GiB/sec SSD can effectively only write 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs).

So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.

RocksDB offers a fairly weak guarantee by default too, writing to the write-ahead-log (WAL), but not performing fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).
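For reference, the per-write knob there looks something like this (a sketch with RocksDB's C API; sync=1 fsyncs the WAL before the write is acknowledged, while the default leaves it to the OS):

  #include <rocksdb/c.h>
  #include <stdlib.h>

  /* Per-write durability knob: sync=1 fsyncs the WAL before the write is
   * acknowledged; the default (0) only hands it to the OS buffer cache. */
  static void put_durably(rocksdb_t *db, const char *k, size_t klen,
                          const char *v, size_t vlen) {
    char *err = NULL;
    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts, 1);
    rocksdb_put(db, wopts, k, klen, v, vlen, &err);
    free(err);                       /* error handling elided for brevity */
    rocksdb_writeoptions_destroy(wopts);
  }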


You’re absolutely correct about those facts, but you’re also avoiding the thrust of my argument about improperly calling your database durable when it is decidedly not and could fail a trivial power-cut test. A database’s one job is not to lose data.

I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.

The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.


Transaction merging allows you to handle that nicely, by batching concurrent writes and merging them into a single write to the disk.
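A minimal sketch of the idea (my own illustration, not how any particular engine implements it): each writer appends its own record, but a single "leader" fsync acknowledges the whole batch.

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
  static int  log_fd;                  /* assumed open in append mode */
  static long appended, synced;        /* highest record staged / fsynced */
  static int  sync_in_progress;

  void log_commit(const void *rec, size_t len) {
    pthread_mutex_lock(&m);
    write(log_fd, rec, len);           /* stage my record (checks omitted) */
    long my_seq = ++appended;

    while (synced < my_seq) {
      if (!sync_in_progress) {
        sync_in_progress = 1;          /* I become the leader */
        long batch_end = appended;     /* my fsync will cover all of these */
        pthread_mutex_unlock(&m);
        fsync(log_fd);                 /* one fsync for the whole batch */
        pthread_mutex_lock(&m);
        synced = batch_end;
        sync_in_progress = 0;
        pthread_cond_broadcast(&cv);
      } else {
        pthread_cond_wait(&cv, &m);    /* someone else's fsync may cover me */
      }
    }
    pthread_mutex_unlock(&m);
  }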


I missed the use of direct I/O, and the comment about fsync threw me off, thanks. Very impressive then!


O_DIRECT doesn't provide power-cut durability on storage devices with write cache.

Recently written and acknowledged data can still be lost on a power cut.

You still need fsync, fdatasync or equivalent after an O_DIRECT write, to tell the storage device to commit its write cache to the non-volatile layer.

(And last time I looked, I think some filesystems even incorrectly failed to flush the device write cache on fsync after O_DIRECT writes because of no dirty page states.)
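The full sequence being described looks roughly like this (a sketch assuming Linux, a 4 KiB block size, and O_DIRECT alignment rules; the function name is made up):

  #define _GNU_SOURCE                  /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  enum { BLOCK = 4096 };

  /* Write one block with O_DIRECT, then flush the device write cache. */
  int write_block_durably(const char *path, const void *data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return -1;

    void *buf;                         /* O_DIRECT needs aligned buffer/length */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { close(fd); return -1; }
    memset(buf, 0, BLOCK);
    memcpy(buf, data, len < BLOCK ? len : BLOCK);

    ssize_t n = pwrite(fd, buf, BLOCK, 0);    /* bypasses the page cache...    */
    int rc = (n == BLOCK) ? fdatasync(fd)     /* ...but the device cache still */
                          : -1;               /* needs an explicit flush       */
    free(buf);
    close(fd);
    return rc;
  }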


There’s a ton of devices on the market that would lie to you too, saying caches are flushed while they ain’t. If you really want that data to be there, better use “server grade” HW with power-loss protection.


Yeah, I would appreciate a benchmark against its main alternative, RocksDB. I know benchmarks are typically manufactured and not too representative of real-world load, but at least a ballpark figure would be nice to know what we’re talking about here.

Their main website is at https://splinterdb.org/ by the way, for those interested. Also no benchmarks there. :)


The paper referenced in the other comment includes a benchmark against RocksDB https://news.ycombinator.com/item?id=31515765



