RocksDB – A persistent key-value store for fast storage environments (rocksdb.org)
196 points by MadeInSyria on Nov 15, 2013 | 72 comments



Very nice work, and the wiki is also quite good -- I wish more projects had a page like https://github.com/facebook/rocksdb/wiki/Rocksdb-Architectur.... It's great to see a clear, terse summary of what makes this project interesting relative to its predecessors.

At my company (scalyr.com), we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase). I'll be digging through rocksdb to see what ideas we might borrow. A few things we've implemented that might be interesting for rocksdb:

* The application can force segments to be split at specified keys. This is very helpful if you write a block of data all at once and then don't touch it for a long time. The initial memtable compaction places this data in its own segment and then we can push that segment down to the deepest level without ever compacting it again. It can also eliminate the need for bloom filters for many use cases, as you often wind up with only one segment overlapping a particular key range.

* The application can specify different compression schemes for different parts of the keyspace. This is useful if you are storing different kinds of data in the same database.

* We don't use timestamps anywhere other than the memtable. This puts some constraints on snapshot management, but streamlines get/scan operations and reduces file size for small values.

Do you have benchmarks for scan performance? This is an important area for us. I don't have exact figures handy, but we get something like 2GB/second (using 8 threads) on an EC2 h1.4xlarge, uncached (reading from SSD) and decompressing on the fly. This is an area we've focused on.

I'd enjoy getting together to compare notes -- send me an e-mail if you're interested. steve @ (the domain mentioned above).


SkyDB using LMDB gets 3GB/sec on a standalone PC. https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1...


Wow, awesome link. LMDB always seems to fly under the radar. SkyDB+LMDB -- genius. (And written in Go! I'm sold... well, I'll at least give it a bash.)


Thanks for your comments Steve.

1. RocksDB has a feature that allows an application to determine when to close a file (i.e. segment). You can write your own compaction code via the compaction_filter_factory defined in https://github.com/facebook/rocksdb/blob/master/include/rock...

2. RocksDB also has a feature that allows an application to close a block inside a segment. https://github.com/facebook/rocksdb/commit/fd075d6edd68ddbc1...

3. RocksDB can use different compression algorithms for different parts of the database. With Level Style Compaction, you can configure a different compression algorithm for each level. With Universal Style Compaction, you can specify that you want compression only for the oldest x% of data in your database. (A sketch of the per-level setting appears below this list.)

4. We have internal benchmarks for scan performance, but due to a lack of developer resources we might not be able to open-source those numbers.
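To make the per-level setting concrete, here is a minimal sketch using the compression_per_level field from include/rocksdb/options.h (the path and the particular codec choices are illustrative only):

    #include <cassert>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // Level-style compaction: one entry per level. Keep the hot, newest
      // levels uncompressed and compress the colder, deeper levels harder.
      options.compression_per_level = {
          rocksdb::kNoCompression,      // L0
          rocksdb::kNoCompression,      // L1
          rocksdb::kSnappyCompression,  // L2
          rocksdb::kSnappyCompression,  // L3
          rocksdb::kZlibCompression,    // L4
          rocksdb::kZlibCompression,    // L5
          rocksdb::kZlibCompression};   // L6

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compression_demo", &db);
      assert(s.ok());
      delete db;
      return 0;
    }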

It will be great to catch up in person.


> we've built a more-or-less clone of LevelDB in Java, with a similar goal of extracting more performance on high-powered servers (and better integration with our Java codebase).

This sounds quite interesting; have you considered open-sourcing it?


Yes, we'd like to release it someday, but it won't be any time soon unfortunately. There are a lot of dependencies on other parts of our codebase, e.g. for configuration and monitoring, which would need to be cleaned up.

We will probably at least publish a report describing the work in more detail, some time in the next few months, on our blog (http://blog.scalyr.com).


It would be helpful if this description also included a comparison with http://www.mapdb.org/; I am curious about the differences (beyond just the ecosystem around the DB).


Hi guys, I am Dhruba and I work in the Database Engineering team at Facebook. We just released RocksDB as an open source project. If anybody has any technical questions about RocksDB, please feel free to ask. Thanks.


Hi Dhruba, thanks for volunteering to answer questions.

What are the big algorithmic ideas behind RocksDB?

My understanding is that LevelDB is based on log structured merge trees. These can be deamortized using methods from Overmars's "The Design of Dynamic Data Structures" or Bender et al.'s "Cache-Oblivious Streaming B-trees". How did you reduce latency?

What else was slowing down databases larger than RAM? How did you fix that?


RocksDB has an LSM architecture, similar in nature to HBase, leveldb, etc., but the implementation is based on a theorem that we will be publishing shortly. I am working on the theorem with a colleague of mine.

Cache-Oblivious B-trees is an interesting paper, and similarly fractal trees. Most of them optimize for the case when index nodes are not in memory. However, in our use cases we typically configure the system in such a way that most index nodes are in memory.

For an LSM database, the key component is compaction: you can ingest data only as fast as you can compact it, otherwise you get an unstable system.

1. RocksDB replaced the level-style compaction of leveldb with UniversalStyleCompaction, which has reduced write amplification. This boosts performance.

2. RocksDB implemented multi-threaded compaction, which means that parallel compactions on different parts of the database can occur simultaneously. This boosts performance.

3. Bloom filters for range scans: this boosts read performance.

4. Merge-type records allow higher-level objects (counters, lists) to use a blind write instead of a read-modify-write. This improves performance. (A rough sketch follows at the end of this comment.)

5. And many more...
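To make item 4 concrete, here is a rough sketch of a read-free counter built on the associative merge interface in include/rocksdb/merge_operator.h (treat the exact signatures as an approximation of that header, not gospel):

    #include <cstdint>
    #include <cstring>
    #include <string>

    #include "rocksdb/merge_operator.h"
    #include "rocksdb/slice.h"

    // Stores a counter as a fixed-width int64 and folds deltas together,
    // so incrementing is a blind Merge() call instead of Get + Put.
    class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/,
                 const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& delta_value,
                 std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        int64_t base = 0;
        if (existing_value != nullptr && existing_value->size() == sizeof(base)) {
          std::memcpy(&base, existing_value->data(), sizeof(base));
        }
        int64_t delta = 0;
        if (delta_value.size() == sizeof(delta)) {
          std::memcpy(&delta, delta_value.data(), sizeof(delta));
        }
        const int64_t sum = base + delta;
        new_value->assign(reinterpret_cast<const char*>(&sum), sizeof(sum));
        return true;  // merge handled
      }

      const char* Name() const override { return "CounterMergeOperator"; }
    };

    // Usage (sketch): options.merge_operator = std::make_shared<CounterMergeOperator>();
    // then db->Merge(WriteOptions(), "hits", <8-byte delta>) never reads the old value.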


Can you share with us the statement of that theorem?

What is "UniversalStyleCompaction", and why is it capitalized and missing spaces?

How does a Bloom filter for range scans work? Standard Bloom filters (as you know) are for existence only.


I'm guessing this may have been an early draft of some of the statements of the theorem:

http://webcache.googleusercontent.com/search?q=cache:fTxlRmb...


How much do you think RocksDB/LevelDB performance is impacted by the use of relatively coarse-grained locking? Another LevelDB fork, HyperLevelDB [1], implemented a fine-grained scheme with performance benefits. Disclaimer: I am working on an (unreleased) fork of LevelDB that uses hardware transactional memory for synchronization, via the new TSX-NI instructions present on Haswell processors.

[1] - http://hyperdex.org/performance/leveldb/
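For anyone unfamiliar with TSX, the usual lock-elision pattern looks roughly like the sketch below, using the RTM intrinsics from <immintrin.h>. The fallback spinlock and abort handling here are my own illustration, not a description of the fork above:

    #include <immintrin.h>  // RTM intrinsics: _xbegin/_xend/_xabort (build with -mrtm)
    #include <atomic>

    // Minimal test-and-set spinlock, used only as the fallback path.
    struct SpinLock {
      std::atomic<bool> held{false};
      void lock()   { while (held.exchange(true, std::memory_order_acquire)) {} }
      void unlock() { held.store(false, std::memory_order_release); }
      bool is_held() const { return held.load(std::memory_order_relaxed); }
    };

    static SpinLock fallback;

    template <typename Fn>
    void WithLockElision(Fn critical_section) {
      if (_xbegin() == _XBEGIN_STARTED) {
        // Reading the lock word puts it in the transaction's read set: if
        // another thread grabs the fallback lock, this transaction aborts
        // instead of racing with a non-transactional critical section.
        if (fallback.is_held()) _xabort(0xff);
        critical_section();
        _xend();  // commit -- all buffered writes become visible atomically
        return;
      }
      // Aborted (conflict, capacity, syscall, ...): run under the real lock.
      fallback.lock();
      critical_section();
      fallback.unlock();
    }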


This is an excellent question. RocksDB has a lock per shard that is used to protect incrementing/decrementing refcounts on critical metadata structures for writes. It is never held during actual IO to storage. We have been able to saturate a flash device with a pure-write workload at a value size of 800 bytes without bottlenecking on this lock.


Can this be used as a drop-in replacement for LevelDB on queue technologies like ApolloMQ that use LevelDB as the default?


This can be used as a drop-in replacement. Its disk format is compatible with leveldb, but once you upgrade to RocksDB, you won't be able to switch back to leveldb (unless you restore your data from a backup).

One definite use-case for RocksDB is a message queue that has a high number of puts/gets/deletes.

Also, please look at options.h to determine which options to tune. An untuned system will not show much of a performance difference from leveldb.
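To give a feel for the kind of tuning meant here, a minimal sketch using a few of the fields in options.h plus the Env thread-pool sizing (the numbers are illustrative only; the right values depend on your hardware and workload):

    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/env.h"
    #include "rocksdb/options.h"

    rocksdb::DB* OpenTuned(const std::string& path) {
      rocksdb::Options options;
      options.create_if_missing = true;

      // A write-heavy, queue-like workload usually wants bigger memtables
      // and more background compaction threads than the defaults.
      options.write_buffer_size = 64 << 20;    // 64 MB memtable
      options.max_write_buffer_number = 4;     // allow a few immutable memtables
      options.max_background_compactions = 4;  // parallel compactions
      options.max_background_flushes = 1;
      options.target_file_size_base = 64 << 20;
      options.compression = rocksdb::kSnappyCompression;

      // Size the default Env thread pools to match the settings above.
      options.env->SetBackgroundThreads(4, rocksdb::Env::LOW);   // compactions
      options.env->SetBackgroundThreads(1, rocksdb::Env::HIGH);  // flushes

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
      return s.ok() ? db : nullptr;  // in real code, inspect s.ToString()
    }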


Thank you. My company uses ApolloMQ, but in a relatively low-data setting (10-15 queues, with fewer than 5000 messages in a queue at a time). I don't know that it would be worth it to move to a different data store, but the idea was fascinating to me so I was curious. Thanks for the response.


https://github.com/facebook/rocksdb/issues/new is a 404.

It's appreciated when a project is on GitHub and open for issues, PRs, etc.


are you not able to access the github repo?


It doesn't compile on CentOS 6 with gcc 4.6.3; I wanted to file an issue, but the 404 stopped me.


Any replication support? (Or any sort of distribution?)


The primary enhancements over LevelDB seem to be parallel compactions of disjoint ranges, to take advantage of cheap seeks on flash storage, and the ability to parameterize core algorithms and data structures to suit a particular anticipated workload. All very cool; anything else major?

Also, there aren't JNI bindings... are there?

Thanks for the contribution. Just started using LevelDB on a project, but deployment will involve fast flash storage and rocksdb looks like a worthy successor.

Jon


Thanks for your comments Jon. RocksDB shares some of its genes with LevelDB -- something like a parent-child relationship.

Please check out Universal Compaction Style, multi-threaded compaction, and pipelined memtables.

I used to have JNI bindings that I pulled in from https://github.com/fusesource/leveldbjni, but it was difficult for me to update the JNI every time we added new APIs to RocksDB. It would be great if somebody who needs Java support could implement JNI bindings for RocksDB.


Can it be configured as a distributed NoSQL database like Cassandra?


I'm surprised that the C++ code is not using the RAII idiom in some obvious places.

For example: https://github.com/facebook/rocksdb/blob/master/db/db_impl.c

There are many places with bracketed calls to mutex_.Lock() and mutex_.Unlock().

An example:

      mutex_.Unlock();
      LogFlush(options_.info_log);
      env_->SleepForMicroseconds(1000000);
      mutex_.Lock();
Why didn't the authors use the RAII idiom here? Even if there are no exceptions expected, the code would still be simpler and less error prone by using a guard object.
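For concreteness, the kind of guard object I have in mind -- here a release-and-reacquire guard matching the snippet above -- might look like this sketch (not actual RocksDB code, just an illustration):

    // A scope guard that releases a held mutex on construction and
    // reacquires it on destruction -- the inverse of the usual lock guard.
    // Sketch only; "Mutex" is any class with Lock()/Unlock(), e.g. the
    // port::Mutex used in db_impl.cc.
    template <typename Mutex>
    class ScopedUnlock {
     public:
      explicit ScopedUnlock(Mutex* mu) : mu_(mu) { mu_->Unlock(); }
      ~ScopedUnlock() { mu_->Lock(); }

      // Non-copyable: copying would unlock/relock the mutex twice.
      ScopedUnlock(const ScopedUnlock&) = delete;
      ScopedUnlock& operator=(const ScopedUnlock&) = delete;

     private:
      Mutex* const mu_;
    };

    // The quoted block would then read:
    //
    //   {
    //     ScopedUnlock<port::Mutex> unlock(&mutex_);
    //     LogFlush(options_.info_log);
    //     env_->SleepForMicroseconds(1000000);
    //   }  // mutex_ is reacquired here even if LogFlush throws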


fixed your link: https://github.com/facebook/rocksdb/blob/master/db/db_impl.c...

Take another look! There's a guard object used at function scope to ensure the lock is released, and this block is bracketed to release and reacquire the lock, not acquire and release. There may be a case for a guard object that does the release/reacquire, but it's definitely not a slam dunk like acquire/release.


Still, that's not exception-safe, correct? If LogFlush or SleepForMicroseconds throws an exception the mutex will be unlocked twice, which pthreads disallows for normal mutexes...


You know, for a second I thought you were wrong, but I changed my mind. This does look like a bug, and a simple one to avoid at that.

It's tough, because Rocks is still heavily based on LevelDB, which conforms to Google's coding style guidelines, and that makes RAII more than a bit tricky to do right.


That link should be:

https://github.com/facebook/rocksdb/blob/master/db/db_impl.c...

(Not db_impl.c but .cc.) I was wondering for a while whether it was possible to do RAII in idiomatic C99 -- and it appears it isn't (or doesn't make as much sense, anyway).


A lot of C++ code essentially is 'C with classes': no RAII, no exception handling.


Well LevelDB is already good. And if this improves on it, that's great.

I was looking at embedded key-value stores and also found HyperLevelDB (from the creators of the HyperDex database). They also improved on LevelDB with respect to compaction and locking:

http://hyperdex.org/performance/leveldb/

So now I am curious how it would compare.

Another interesting case, optimized for reads, is LMDB. It is a small but very fast embedded database that sits at the core of OpenLDAP, and it has impressive benchmarks:

http://symas.com/mdb/microbench/

(Note: LMDB used to be called MDB, you might know it by that name).


The LMDB statistics are very strange - why is synchronous SSD performance worse on most figures than HDD performance? Something seems very wrong with these benchmarks:

    Section 5 (SSD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB              342 ops/sec	
    Kyoto TreeDB          67 ops/sec	
    SQLite3              114 ops/sec	
    MDB                  148 ops/sec	
    MDB, no MetaSync     322 ops/sec	
    BerkeleyDB           291 ops/sec	
    
    Section 8 (HDD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB             1291 ops/sec	
    Kyoto TreeDB          28 ops/sec	
    SQLite3              112 ops/sec	
    MDB                  297 ops/sec	
    BerkeleyDB           704 ops/sec	
    
Really? LevelDB is four times faster on an HDD than an SSD with synchronous writes? BerkeleyDB is over twice as fast?

This smells.


I would guess the answer is SSD write amplification. In order to write, SSDs have to erase first. They also try to minimize wear, so internally they spread the data around as it gets written. Maybe someone else with more experience can explain further, but that's my guess.


Keep in mind, the HDD was using ext2 and the SSD was using reiserfs. Synchronous writes on ext2 are faster than all journaling filesystems.


Not three orders of magnitude faster, which is the difference between HDD and SSD random writes.


All of the source code is available on that page; you're welcome to rerun it on your own hardware configuration.


Three orders of magnitude faster would mean 1000x faster. You probably meant 3 times faster.


SSDs really are 1000x faster at random writes (~200,000 iops vs ~200 iops)


> The LMDB statistics are very strange - why is synchronous SSD performance worse on most figures than HDD performance?

Could it be that most database engines are based on algorithms that were developed before SSDs were significant, and were extremely optimized for HDD performance?


No. Most database engines are actually so old they were developed even before HDD hit the seek wall.


Paging hyc_symas. Howard Chu isn't shy about talking about benchmark results and he can be found here and on twitter.


Hello hello! Pretty sure what actually happened in these is that the HDD's internal cache was still active, while the Crucial M4 SSD has no internal cache. The only other explanation is that I screwed up my partition offset on the SSD but I already double-checked that and the partitions were all 2MB aligned.


I'm going to compile up LMDB and bench it on a 96GB DL380g8 with quad 3TB ioDrive-2s. Should be interesting to see how various database sizes play out, and what the write amp looks like. I am not seeing much about LMDB's NUMA awareness -- guess I need to keep digging.


For reads we get linear scaling out to 64 cores. Using cache-aligned data structures plays a big part in that for NUMA. (At the moment that's the largest machine we have in our lab.) For writes, there's basically no scaling. Write amplification is logN, proportional to tree height.


If the HDD's internal cache was active, how are synchronous writes still faster? Shouldn't the flush/sync ensure persistence of the data?

If the cache was being used, the HDD results are not actually synchronous, a power loss event would result in data loss.


I'm not sure, but I think HDDs are better at sequential operations than SSDs (which perform better at random operations). Some people say MySQL performs better on a high-speed HDD than on an SSD.


LSMs have a long long way to go to catch up to LMDB. http://symas.com/mdb/hyperdex/


For reads, sure. LSMs are optimized for writes, while LMDB, which is a nice B-tree implementation, is optimized for reads.

LSMs are getting popular because it's harder to scale durable writes than reads, which can be handled (in many cases independently) by caching.


LSMs are solving a problem that is rapidly becoming irrelevant due to the multiple NVRAM technologies entering the market. With NVDIMMs, MRAM, FeRAM, PRAM, etc., all your writes can be durable for free. And if you'll notice in that HyperDex benchmark I posted, the LSM write performance was still worse than LMDB, while wasting several times more CPU.

Going forward, power efficiency will still be crucial - for as long as civilization persists. But optimizing durable writes will be about as useful as optimizing delay loops.


Howard, is LMDB effectively limited to 128TB (on 64-bit machines, and 2GB on 32-bit ones -- not that one should be running large databases on 32-bit machines)?

Also what about concurrent writes? Does it have a database wide writer lock or is it per key (per page?) ?


It is limited to the logical address space. Since most current x86-64 machines have only a 48-bit address space (256TB), and assuming the kernel keeps half of that space for itself, then yes, the current limit is 128TB. But I suspect we'll be seeing 56-bit address spaces fairly soon.

It is a single-writer DB, one DB-wide writer lock. Fine-grained locking is a tar pit.


Fine-grained locking is hard, but "tar pit" is unfair and honestly a bad attitude. It's crucial for modern applications, and it can be done if you're careful, and it can be done really well.

We (Tokutek) tried for a long time to get by with a big monolithic lock, and a) competing with InnoDB was really hard since they do concurrent writers really really well, and b) when we did decide to break up the lock, it wasn't as hard as we thought it would be and it worked really really well.

Don't get discouraged, break up that lock!


In our own workloads, writers are always going after the same pages in their index updates, which inevitably led to deadlocks in BerkeleyDB. As a result, we get much higher throughput with fully serialized writers than with "concurrent" writers. A microbench might show greater concurrency on simple write tasks, but in a real live system with elaborate schema, there's no payoff for us.

As always, you have to profile your workload and see where the delays and bottlenecks really are. Taking a single mutex instead of continuously locking/unlocking all over the place was a win for us.


Is this the reason for your observation that LMDB is oriented towards read workloads?

I can see how the extra locking/concurrency code would expand the library size out of the CPU cache, though.


Yes, since readers don't require any locks at all and don't issue any blocking calls of any kind - syscalls, malloc, whatever - they run completely unimpeded. The moment you introduce fine-grained locks of any kind the overall performance (reads and writes) will decrease by at least an order of magnitude because readers will have to deal with lock contention.


Makes sense.

Most impressive about LMDB to me is the zero-copy model for readers -- no extra memcpy is needed. Maybe that is obvious to database gurus, but it's a pretty clever trick, I think.


It's pretty significant, yes. Eliminating multiple copies of everything got us a 4:1 reduction in memory footprint in OpenLDAP slapd (compared to our BerkeleyDB-based backend). This is another reason we don't spend too much time worrying about data compression and I/O bound workloads - when you've essentially expanded your available space by a factor of 4, you get the same benefits of compression, without wasting any of the memory or CPU time. And when you can fit a 4x larger working set into your space, you find that you need a lot less actual I/Os.


If I can pick your brain for a little, do you think LMDB would be a good option as a back end for time series analysis?


I'm sorry, I'm not familiar enough with the workload to answer that. If you're primarily doing sequential writes, it seems like it could work well for it.


this is cool, though I'd wonder how it compares to Kyoto Cabinet. another big issue I've run into personally is the fact that both LevelDB and KC don't explicitly support multiple processes reading the db at once. (KC's API allows this but advises against it, LevelDB afaik doesn't even allow it.) I wonder if RocksDB gets past this.


RocksDB allows only one process to open the DB in read-write mode, but other processes can access the DB read-only (with a few configuration settings).
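A sketch of the read-only path, assuming the DB::OpenForReadOnly entry point in include/rocksdb/db.h (error handling abbreviated, path illustrative):

    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      // One process owns the DB read-write; this one attaches read-only and
      // can Get/NewIterator but never writes or triggers compactions.
      rocksdb::Options options;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s =
          rocksdb::DB::OpenForReadOnly(options, "/path/to/shared/db", &db);
      if (!s.ok()) {
        return 1;  // e.g. the DB does not exist yet
      }

      std::string value;
      s = db->Get(rocksdb::ReadOptions(), "some-key", &value);
      // s.ok() on a hit, s.IsNotFound() if the key is absent.

      delete db;
      return 0;
    }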


If you just need concurrent reading and only update the database from a single process, you can use CDB. It has always served me well.


Kyoto Cabinet will self-corrupt if you use it that way. LMDB supports multi-process explicitly.


can you explain how this happens? if it's just a read-only process, how can it corrupt anything?


HyperDex?


A very minor point:

The illustrative code snippet on the home page has a spurious semicolon on the first line:

    #include <assert>;


fixed! - thanks


The benchmark at https://github.com/facebook/rocksdb/wiki/Performance-Benchma... states that for LevelDB, "in 24 hours it inserted only 2 million key-values", and that "each key is of size 10 bytes, each value is of size 800 bytes".

I might be missing something, but that took just a few minutes on my ~2 year old desktop machine. Sample code: https://gist.github.com/wbolster/7487225


There was a typo, the 2 million should have been 200 million keys. I fixed the wiki page. Thanks again for pointing it out.


Node.js bindings (compatible with levelup) have already been released by rvagg: https://npmjs.org/package/rocksdb


Thanks for all the comments! Feel free to continue the discussion at https://www.facebook.com/groups/rocksdb.dev/


Looking forward to seeing this in Riak.



