Time Structured Merge Tree: From LSM Tree to B+Tree and Back Again (influxdb.com)
97 points by pauldix on Oct 7, 2015 | 28 comments



IMHO, if your volume isn't on the order of 300K data points per second, there is still plain old PostgreSQL, which is perfectly suitable for storing time series, is time-tested and very reliable, and lets you keep your TS data alongside your other (business) data so that you can analyze it all together without having to extract it from unfamiliar storage. I've hacked together a project that demonstrates how it can be done, see https://github.com/grisha/timeriver and my blog explaining the technique http://grisha.org/blog/2015/09/23/storing-time-series-in-pos...
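
As a rough illustration of keeping TS data next to business data (just a hedged sketch with made-up table names, not timeriver's actual schema):

    -- Hypothetical "device" (business data) and "measurement" (time series) tables.
    CREATE TABLE device (
        id    SERIAL PRIMARY KEY,
        name  TEXT NOT NULL,
        site  TEXT NOT NULL
    );

    CREATE TABLE measurement (
        device_id  INT REFERENCES device(id),
        ts         TIMESTAMPTZ NOT NULL,
        value      DOUBLE PRECISION NOT NULL
    );

    -- One query can analyze the series together with the business data.
    SELECT d.site, avg(m.value)
    FROM measurement m JOIN device d ON d.id = m.device_id
    WHERE m.ts >= now() - interval '1 day'
    GROUP BY d.site;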


Yeah, but...

Need replication? Gotta write your own sharding logic or set up pg_shard.

Need aggregations? Gotta write your own logic. Will you do them on the fly? Use triggers? On demand?

Need to remove old data? Gotta set up a cron job. But wait, what if I want to age different series at different rates? Now you need a policy system. Sigh.

Not saying Postgres can't be the storage engine, but there's a lot of work to do on top of that.


I believe that what you're referring to as "a lot of work to do on top of that" is the correct solution, the ultimate OSS project.

The fallacy of the many time-series DBs out there is that they prematurely discarded the relational database as a viable storage option, and are now trapped solving the very hard problem of horizontally scalable distributed storage, which takes many years, instead of focusing on the time-series aspect of the problem.

Sooner or later something along the lines of pg_shard will become standard in PostgreSQL and other databases, so you don't really need to write your own sharding logic, you just have to wait. Or you can write it if you want. You have options.

Aggregation is what GROUP BY is for. Removing old data is a non-issue if you're using a round-robin approach (see my blog link), which also makes aging different series at different rates easy.
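
For example, a windowed aggregate (say, an average per ten-minute bucket) is just a GROUP BY on a bucketed timestamp. A minimal sketch in Postgres, assuming the hypothetical measurement(device_id, ts, value) table above:

    -- Ten-minute buckets: round the timestamp down to the nearest 600 seconds.
    SELECT device_id,
           to_timestamp(floor(extract(epoch FROM ts) / 600) * 600) AS bucket,
           avg(value) AS avg_value
    FROM measurement
    WHERE ts >= now() - interval '1 day'
    GROUP BY device_id, bucket
    ORDER BY device_id, bucket;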


It's good to have competition in this area. Influx is giving me whiplash (it's on its third database engine in six months!)

The circular buffer approach is fine, but it does have the drawback of being unable to represent variable-length data (like key-value pairs). It's also harder to compress the data.

Can a GROUP-BY do windowed aggregations? Like, take an average over ten-minute windows? My SQL knowledge is not great.


These guys seem to think relational databases are a valid alternative in the process automation world, and seem to have made a successful business around the idea:

https://inductiveautomation.com/news/process-historians-vs-s...

[I have no data on how well their product performs in practice]


SCADA systems usually handle small amounts of data (the data comes from real sensors, those sensors are expensive, and you simply can't have millions of them).


Yeah. I actually suggested that to someone who sells an industrial historian, and his response (believe it or not!) was that no, there is too much data, a relational database solution is no good (and, by the way, the business model of Inductive Automation will never work!).

I'm sure writing would be no problem; the only thing I wonder about is complex (aggregating) queries in situations where you care about delays (interactive visualization, perhaps).


It's not just about inserting data fast enough; for maximum write performance you could buffer the data in memory and batch-append it to a text file. But time-series databases exist to serve particular needs, most commonly aggregation operations over time windows.


You can't compare this solution to InfluxDB or other time-series databases out there, because you're losing information here. You can't store timestamps with arbitrary precision, and all the data is resampled; there is no way to store raw data. That is not acceptable in many cases.


Even at 300k points per second, PostgreSQL or MySQL can easily service your needs. The scheme that this article presents is really about batching writes to disk to get high throughput.

    mysql> CREATE TABLE test.polls_memory ( id INT(11) PRIMARY KEY AUTO_INCREMENT, poll_id INT(11), time DOUBLE, value INT(11)) ENGINE MEMORY;
    mysql> CREATE TABLE test.polls_disk (id INT(11) PRIMARY KEY AUTO_INCREMENT, poll_id INT(11), time DOUBLE, value INT(11)) ENGINE MyISAM;
     
    #> for i in `seq 1 500000`; do echo "INSERT INTO test.polls_memory (poll_id,time,value) VALUES ($i,UNIX_TIMESTAMP(),$RANDOM);" ; done | mysql -uroot
   
    mysql> INSERT INTO test.polls_disk SELECT * from test.polls_memory;
    Query OK, 510001 rows affected (0.95 sec)
    Records: 510001  Duplicates: 0  Warnings: 0
And this was just on my laptop. I know that on enterprise-grade hardware I can get that write rate up to millions per second. The question isn't getting it to disk in batches. It's the read patterns you can support once you decide you are going to batch your writes.


Yet your measurement includes no networking, no parsing of statements, no concurrency...

Repeat the experiment with 200 different clients hitting the server via networking with 1500 insert statements, each second. Then you can compare MySQL with the InfluxDB use case cited as an example in the article.

And don't forget the index on the timestamp.


The whole point was that if you are going to batch writes to disk to get high throughput, then PostgreSQL or MySQL can easily sustain the same rates as InfluxDB (which my little example shows). I wouldn't have MySQL parse 500,000 insert statements per second, but I would have a proxy in front of it that could take in 500,000 data packets per second and then, at some defined interval, write them to MySQL.

I know it's not as sexy as InfluxDB, but the old workhorses are tried and true, and if you use them right they are as good as, if not better than, some of the shiny new toys.
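
(To be concrete, that proxy would mostly just accumulate rows and flush them as multi-row inserts. A hedged sketch of the kind of statement it might hand to MySQL on each flush, reusing the polls_disk table from above, with placeholder values:)

    -- One round trip per flush interval instead of one per data point.
    INSERT INTO test.polls_disk (poll_id, time, value)
    VALUES (1, UNIX_TIMESTAMP(), 123),
           (2, UNIX_TIMESTAMP(), 456),
           (3, UNIX_TIMESTAMP(), 789);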


> I would have a proxy in front of it that could take in 500,000 data packets per second

Which is pretty much exactly what timeriver is :) (That, and it can also do Graphite-like functions, e.g. movingAverage(), etc.)


Can you time the copy again, with an index on the timestamp this time? Just out of curiosity.

Also, I'm not sure it's wise to build system software (a proxy) just to log data instead of using what already exists, because that proxy you write won't be an old workhorse either.


So I added an index on (poll_id, time), since that's a more likely index for various reasons.

    mysql> ALTER TABLE polls_disk ADD INDEX `timestamp` (poll_id,time);
    mysql> INSERT INTO test.polls_disk SELECT * from test.polls_memory;
    Query OK, 510001 rows affected (1.78 sec)
    Records: 510001  Duplicates: 0  Warnings: 0
So the index takes me down to about 286k points per second. Again, this is just on my laptop, but it illustrates the point well: I can write stuff to disk super fast (especially if I can batch it), but reading it back out is where the problems are.

As far as a proxy in front of the datastore goes, batching up data packets to hand them off to the data store is several orders of magnitude less complex than a really well designed data store, so the question is where do I want to roll custom software and where do I want to leverage the broader community. I'm not saying there isn't a place for the new tools, but I think we jump to them much more quickly than we should in a lot of situations.


Creating a time-series table is easy. However, you will quickly experience several challenges:

(1) Table scans to extract just a few columns will be extremely inefficient. It's better to create one table per column, each being a pair [time, value].

(2) B-tree indexes will be inefficient for large extracts. For example, aggregating by month over a whole year will basically not use the indexes. The index will mostly only be useful for looking up individual metrics, which is not that useful. The new BRIN indexes in 9.5 should help, however.

(3) You'll end up reaching for something like Postgres' partitioned tables, which let you create child tables that are visible through the top-level table. Each child table needs to hold a slice of the data, split by something like the current day, week, or month (see the sketch after this list).

(4) You say you need a proxy to batch data packets. So it's a queue, and it needs persistence because you don't want to lose data if Postgres is lagging behind: your proxy is now a time-series database! And if Postgres is lagging behind, you have a problem. If you're writing at 600k/s and Postgres can only handle 500k/s, you have a never-ending queue.

(5) PostgreSQL can only have one master, and its write capability has a fixed ceiling (for HDDs you can calculate the max possible tps from the rotational speed, not sure about SSDs). At some point you'll end up sharding (by column, probably) the master into multiple concurrent masters.
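
To make (2) and (3) concrete, here is a rough sketch (Postgres inheritance-style partitioning with a BRIN index per child; table and index names are made up, not a full recipe):

    -- Parent table; each child holds one month (inheritance-style partitioning).
    CREATE TABLE metrics (
        metric_id  INT NOT NULL,
        ts         TIMESTAMPTZ NOT NULL,
        value      DOUBLE PRECISION NOT NULL
    );

    CREATE TABLE metrics_2015_10 (
        CHECK (ts >= '2015-10-01' AND ts < '2015-11-01')
    ) INHERITS (metrics);

    -- BRIN (9.5+) stays tiny for data that arrives roughly in time order.
    CREATE INDEX metrics_2015_10_ts_brin ON metrics_2015_10 USING brin (ts);

    -- Queries against the parent see all children; constraint exclusion prunes by month.
    SELECT date_trunc('month', ts) AS month, avg(value)
    FROM metrics
    WHERE ts >= '2015-01-01' AND ts < '2016-01-01'
    GROUP BY month;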

Much of the point of something like InfluxDB is that there's no single bottleneck. You scale linearly by adding more nodes, each of which can write data at top speed. You don't need a "proxy" because it's designed to perform that very same function.


Read-only or write-only workloads are really simple compared to sustaining heavy read and write load simultaneously, especially while also flushing to disk in a way that doesn't lose acknowledged data on power loss.

That doesn't mean most people can't do well with relational databases, since most people simply don't need both at the same time.


Try running that benchmark again on a table that already contains 500M records; I'm quite sure you'll see different numbers. Another problem you'll run into is reading those records back and performing aggregations and window functions on your data. In my experience PostgreSQL is not a bad solution for time-series data, but it's certainly not a perfect fit.
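
(For instance, even a basic moving average via a window function has to read and sort a large slice of the table. A hedged sketch, assuming a hypothetical metrics(metric_id, ts, value) table:)

    -- Moving average over the previous 9 points plus the current one, per metric.
    SELECT metric_id, ts,
           avg(value) OVER (PARTITION BY metric_id
                            ORDER BY ts
                            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM metrics
    WHERE ts >= now() - interval '7 days';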


If I take the index off, the benchmark will run just about as fast at 500M rows as it did at 0 rows. At 500M rows MySQL's B-tree index starts to fall over, but I have plenty of options to mitigate that. I don't think anything is a perfect fit for time-series data, but again, my point was that the traditional relational databases are _really_ good, and they do in fact scale up well beyond what most people think they will.


So, we were using InfluxDB on v0.8.x but had two data corruption incidents. Both times I tried the recovery utilities with no luck. Even if it meant some data loss, it was time to move to another DB.

We were only getting ~1k tps, with occasional spikes to 10k tps. I hope, for the rest of the users, that some work has been done on that front. Maybe in the future we'll revisit it out of curiosity, since we have no issues at the moment.


Curious - what db did you move to?


kdb+. Pricey for sure, so it's not a fair fight, but you gain the power of q and amazing performance.


I'm wondering if a new low-level db will come out of this as an alternative to LevelDB, BoltDB, and so on? Or do you have to use InfluxDB?


We definitely need more and better low level key-value stores, especially write-optimized ones. It sounds like InfluxDB has correctly identified the pain points with LevelDB and LMDB, but it's yet to be seen whether they've succeeded in addressing them. The article kind of petered out without a proper conclusion.

The people who are talking about Postgres in this thread are thoroughly confused. This isn't a competitor to Postgres, this is a platform for building a competitor to Postgres (or, hypothetically, something Postgres itself could use). I guess maybe they mean InfluxDB rather than the subject of the article.


Seriously? I'm always confused by the large number of different low-level key-value stores on offer, all claiming to be the fastest, with benchmarks to back it up.

Reading the article, it seems they could have contributed some development to RocksDB and saved themselves a lot of time, rather than implementing their own system.


RocksDB has the same problems as LevelDB (uses large numbers of file descriptors, considerable read overhead compared to LMDB), plus it's even more complex and harder to configure. I had high hopes for the LSM tree from SQLite4[1], but last I checked development was stalled.

[1] https://www.sqlite.org/src4/doc/trunk/www/lsmusr.wiki


Woot! Nice throughput improvement over traditional LSM designs, with both reads and writes accelerated.


Typo in the article's headline: Strucutred -> Structured



