TKVDB: radix tree persistent key/value storage engine written in ANSI C

noahdesu · on Aug 25, 2018

This is really cool. We are doing something similar with CruzDB [0], but we are using red-black trees. What made you choose the radix tree? We've looked at several persistent data structures, and continue to look for alternatives.

[0]: https://nwat.xyz/blog/2018/02/15/cruzdb-architecture-part-1-...

vmxdev · on Aug 26, 2018

> What made you choose the radix tree?

Library was initially written for use with Netflow collector. More precisely for storing filtered and aggregated traffic information. Typical load for such applications is: a lot of updates in "hot" dataset, fewer inserts, deletions are rare and mostly batched, no needs for fast sequential scan (user can wait for a few seconds or even more to get report). Radix trees are perfectly fits this requirements.

Later we made some tests and decided to use it for IP (plus some other information) lookups without underlying file ("RAM-only" mode).

We have to support some old network hardware (for example almost abandoned Tilera TILE-Gx), that's why we're trying to avoid using OS and platform-specific features in library.

Radix trees has disadvantages, the most notable is that it's hard to implement user-defined sorting order for keys and they required more memory for sparse nodes compared to B-trees (however, this statement is not true for on-disk TKVDB database, we don't store empty pointers)

koverstreet · on Aug 25, 2018

Red black trees have crap performance due to pointer chasing.

rurban · on Aug 26, 2018

For small maps (compressed) radix trees (patricia, judy) are even faster than hash maps, esp. concurrent.

marknadal · on Aug 25, 2018

This is brilliant!!!! I hadn't heard of any Radix based storage engines and was wondering why there weren't a ton out there. So I built one (in pathetic Javascript https://github.com/amark/gun/blob/master/lib/radisk.js ) last year, and super happy to see this one in C, I might consider switching over to TKVDB! Thank you!!!

Can people link me to any other Radix based high performant storage engine? Radix seems like an obvious choice, and I was able to squeeze 2K/ack-to-disk over 1M tweet-like write load tests out of mine, I can't wait to try a C based version. I'm highly interested in this area. Very excited to try out TKVDB!

mypalmike · on Aug 25, 2018

Radix trees are great for IP address databases. With a couple of additional operations, this could be used as the basis of such a database.

Making certain assumptions about the implementation here...

Keys would be the string representation of the binary version of the CIDR (e.g. "10.0.0.0/12" -> "000010100000"). Searching for best fit involves traversing down the tree until finding the deepest terminating node that matches. Parent prefix is same as equality while keeping track of previous terminating node. Immediate child prefixes requires depth-first traversal of child nodes until hitting a terminal node.

Spiritus · on Aug 26, 2018

What's the deal with the `while ... while(0)` and `do ... while(0)` macros? Is it some scoping trick?

    #define TKVDB_EXEC(FUNC)           \
    do {                               \
        TKVDB_RES r = FUNC;            \
        if (r != TKVDB_OK) {           \
            return r;                  \
        }                              \
    } while (0)

    #define TKVDB_SKIP_RNODES(NODE)    \
    while (NODE->c.replaced_by) {      \
        NODE = NODE->c.replaced_by;    \
    } while (0)

jwilk · on Aug 26, 2018

do … while(0) is a well-know method of writing multi-statement macros:

http://c-faq.com/cpp/multistmt.html

while … while(0) looks like a mistake?

vmxdev · on Aug 26, 2018

Yes, the first macro is the common trick, and the second is the mistake (wondering, but it works). I update the code. Thanks a lot for pointing it out!

zambal · on Aug 26, 2018

You are correct. At least the do... while statement is a common c macro trick to make sure a macro has its own scope. The idea is that a compiler will optimize the do....while statement away, so it does not have any runtime overhead.

The wile...while(0) statement is new for me (didn't know this is actually legal syntax), but I suspect it's done for similar reasons.

jwilk · on Aug 26, 2018

> didn't know this is actually legal syntax

It looks like some unusual looping syntax; but in fact these are just two normal while loops one after another.

lixtra · on Aug 25, 2018

> It is similar to Berkeley DB, LevelDB or SQLite4 LSM.

Can you point out the pros and cons vs above projects?

> reading from transaction while resetting it can lead to unpredicatable consequences.

I read this as: only use it single threaded.

vmxdev · on Aug 25, 2018

> Can you point out the pros and cons vs above projects?

Briefly speaking, the above projects are way more mature, full-featured and use different data structures for storing data. BDB, LMDB are B-tree-based, LevelDB (and forks), SQLite4 LSM are using log-structured merge-trees.

TKVDB is a lot simpler, a bit more low-level, uses Radix trees and not tested as well as other key-value engines.

> I read this as: only use it single threaded.

If you protect access to data with your OS locks, everything will be fine.

We're using TKVDB in multi threaded environment (in RAM-only mode) for IP lookups with Netmap and DPDK. On 40Gbe full line rate there is only few hundred CPU cycles for decision on network packet, so the locking tricks becomes important.

Intel has guarantees about writing properly aligned data - if one core is writing 1, 2, 4 or 8 bytes, it will be written atomically. We're exploiting this feature when updating our tree - final update operation is 8 byte (4 byte for 32 bit CPU's) aligned pointer update.

So, in some cases you don't need to use locks at all (you may add HW memory barrier (__sync_synchronize()) after tree update, but it will slow down writer). If only one core is writing data, another cores can safely (well, almost) read it. Warning in docs was about this lockless case.

combatentropy · on Aug 25, 2018

I was wondering why the API makes you write transaction twice:

    transaction->begin(transaction)

Why not:

    transaction->begin()

glenda · on Aug 25, 2018

It looks like transaction is just a C struct with function pointer fields, not an actual C++ class, so there is no implicit `this` argument.

vyodaiken · on Aug 25, 2018

no "self" in C.

baq · on Aug 25, 2018

on one hand, it's very cool to see such projects.

on the other hand, writing anything that's supposed to reliably store data in C in 2018 raises eyebrows.

jnurmine · on Aug 25, 2018

But but but... every day, many people use implementations written in C to reliably store data. Just look at, say, Linux. The drivers, filesystems and networking stack are all written in C (and maybe some assembler). And once the data reaches the firmware controllers of the mass storage device... Well, the firmware is done in C or assembler.

It is not like implementing a rather thin layer for reliable data storage on the userspace in language x somehow invalidates the contribution of the C language, which the underlying system, all the way from the OS to the device firmware is likely written with.

That said, I do understand the arguments of e.g. Rust being superior in preventing memory bugs, and so on, but whatever gets implemented on the userspace is simply not the whole story regarding reliable data storage. Without system services no data will get replicated, distributed, stored and so on...

CamperBob2 · on Aug 25, 2018

If you intend to export the API to the widest possible variety of other languages, C is a good way to go for the initial implementation. The lack of a universally accepted, language-independent "Rosetta stone" ABI means that this is the one job that C still does better than anything else.

Nothing wrong with writing targeted, specialized components in C. It can be done properly, it just frequently isn't.

vortico · on Aug 25, 2018

It's a tradeoff between adoptability, readability, portability, ease of embedding, average familiarity of toolsets, ABI managability, and strictness of rules to access arrays and references in the implementation. C likely wins in all of the above except the last, with C++ in second place, and virtually the only other options are Rust and Go, which only wins in the last point.