LevelDB is really cool but I have to admit I still struggle with the variety of datastores where the design of your keys has such a huge influence down the line. I remember using the Node variety (levelup/leveldown) and seeing all kinds of weird patterns when dealing with ranges (using \xff as a marker for the end of the range) and how to access "groups" of related data.
How simple it is to set up and get going is still extremely appealing though.
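For anyone who hasn't seen the \xff pattern mentioned above, here is a minimal sketch of it. It assumes a toy in-memory ordered store (`OrderedKV` and `prefix_scan` are hypothetical names made up for illustration, not the levelup/leveldown API) and a made-up `user:<id>:<field>` key layout. The idea is that keys are sorted bytewise, so a "group" is just every key between the prefix and the prefix followed by \xff.

```python
import bisect


class OrderedKV:
    """Toy stand-in for an ordered key-value store (hypothetical, not LevelDB)."""

    def __init__(self):
        self._keys = []   # keys kept in bytewise sorted order
        self._vals = {}

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan(self, gte: bytes, lte: bytes):
        """Return (key, value) pairs with gte <= key <= lte, in key order."""
        lo = bisect.bisect_left(self._keys, gte)
        hi = bisect.bisect_right(self._keys, lte)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]


def prefix_scan(db: OrderedKV, prefix: bytes):
    # 0xff sorts after every ASCII byte, so prefix + b"\xff" acts as an
    # upper bound for "everything that starts with prefix".
    # Caveat: keys that themselves contain 0xff right after the prefix
    # can fall outside this bound.
    return db.scan(prefix, prefix + b"\xff")


db = OrderedKV()
db.put(b"user:1:name", b"alice")
db.put(b"user:1:email", b"alice@example.com")
db.put(b"user:10:name", b"carol")
db.put(b"user:2:name", b"bob")

# The trailing ":" in the prefix matters: without it, user:10 would match too.
user1 = prefix_scan(db, b"user:1:")
```

Here `user1` holds only the two `user:1:` entries. Dropping the trailing delimiter from the prefix would also pull in `user:10:name`, which is the kind of thing you only find out after the keys are already written.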
Agreed. Reaching back into my memory banks: when Bigtable first became available for App Engine, I was fascinated by the prospect of working with a database capable of Google scale. But there were so many odd, hard-to-reason-about vagaries in squeezing that performance out of the database by means of key structure. The developer ergonomics seemed simple at first, but in the end were not at all what I expected.
To me, it's not about magic, it's about the abstraction. Most implementations of this style of database demand that you consider internals of the engine before storing anything. If you don't design your keys the right way, then you're screwed down the road, and there's no way to fix it without just copying the data to a new table (and losing metadata like TTL). When I use something like Redis, or any SQL database, I don't need to consider the inner workings of it unless I'm trying to absolutely max out my performance (one slight exception being atomicity of Redis commands).
But with LSM style databases, you are going to have a really bad time if you don't have a team member who has dedicated serious time to understanding the internal workings of LSM itself, and the details of your chosen implementation. That's a real mark against it, IMO.
tldr; LSM databases are like a colander in the world of leaky abstractions.
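A small illustration of the "design your keys the right way or you're screwed" point, as a sketch only: the key layout and the `event_key` helper below are hypothetical, and plain `sorted()` on bytes stands in for the store's bytewise key order. Numeric IDs written without zero-padding sort lexicographically, not numerically, and once data is stored that way the only fix is rewriting every key.

```python
def event_key(user_id: int, seq: int, width: int = 10) -> bytes:
    """Hypothetical key layout: zero-pad the sequence number so that
    bytewise order matches numeric order."""
    return f"user:{user_id}:event:{seq:0{width}d}".encode()


# Naive keys: bytewise order puts 10 before 2.
naive = sorted(f"event:{n}".encode() for n in (1, 2, 10))

# Zero-padded keys: bytewise order matches numeric order.
padded = sorted(f"event:{n:04d}".encode() for n in (1, 2, 10))
```

With the naive layout a range scan over "events 1 through 2" would also return event 10. Note that the padding width is itself a decision baked into every key, so it has to be chosen up front too.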
There is MyRocks, where you work as you would with regular MySQL. When you want distributed-db, yeah you need to design. But comparing RocksDB vs LMDB, the second is easier, but not by much?
Think of CitusDB: without that kind of consideration, are you going to expect good performance in a sharded environment?
> When you want distributed-db, yeah you need to design
But one shouldn't be designing for implementation details. Usually, we start technologies out with leaky abstractions, and gradually get better at it. A good example is game development, where it used to be that you always used the drawing method of the display and the clock speed of the cpu to your advantage. Nowadays, we've moved past that, because it was working on horrible abstractions, and because the technology underneath improved.
I'm not saying I have a solution, and I agree that this problem rears its head the most when you start bringing in distributed storage. But my point stands: these databases run on a highly leaky abstraction, and that's a big problem going forward.
IMO, while still cool, LSM trees don't feel particularly novel at this point. It seems like every database and their dog has adopted/made available some kind of LSM tree storage engine, down to traditional relational databases (e.g. MySQL with MyRocks).
They were described in the Bigtable paper, and were widely used before LevelDB was published (Cassandra, for example, used SSTables and LSM concepts well before 2012).
I just wanted to point out that there is a good explanation of SSTables, B-trees and other database structures in a whole chapter of the great book Designing Data-Intensive Applications, by Martin Kleppmann.
In the middle of this book right now, specifically the unreliable clocks part. The explanations have the right amount of depth. It's a great read so far.
Not to mention some awesome references at the end of each chapter. I had made it a point after each chapter to randomly pick an interesting reference and read that as well.