Scaling Out PostgreSQL for CloudFlare Analytics Using CitusDB (YC S11)

gane5h · on April 9, 2015

Really cool write up – thanks! First time I’m hearing about CitusDB. They appear to be building a columnar, distributed database while preserving the Postgres frontend (similar to redshift, aster, greenplum, etc.)

It’s all in the details. I’m planning to investigate the following during my next weekend hack. Hope somebody can answer some pre-sales questions for me:

  - how complete is the postgres functionality (e.g.: lateral joins)
  - can you set a sharding key to control the shard distribution
  - does the database do multiple passes for queries with subselects
  - usually one increases the replication factor (limited by budget) to improve query times, with the limitation that it slows down loading time. does the DB stage intermediate writes to batch them, or does the user need to do this? batching works really well for append-only, timestamped event data.
  - do you have a job manager or scheduler, needed when you have multiple views that need to be updated without melting your infrastructure
  - how easy is it to operate? does the database expose operational metrics so that you can see the load on each shard to potentially detect unbalanced shards?
  - tips on hardware configuration (big advantage of redshift here is that you don’t have to run your own warehouse.) maybe partner with MongoHQ?
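
To make the sharding-key and unbalanced-shard questions above concrete, here is a minimal sketch of hash sharding in general. It is not CitusDB's actual hash function or API: `shard_for`, `shard_loads`, and `imbalance_ratio` are hypothetical helper names, and MD5 is used only as a convenient stable hash.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a sharding-key value to a shard index with a stable hash.

    Assumption: plain modulo hashing, chosen only for illustration.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_loads(keys, shard_count: int) -> list:
    """Count how many rows land on each shard for the given key values."""
    counts = Counter(shard_for(k, shard_count) for k in keys)
    return [counts.get(i, 0) for i in range(shard_count)]


def imbalance_ratio(loads) -> float:
    """Largest shard load divided by the mean load; 1.0 is perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


if __name__ == "__main__":
    # A high-cardinality key (e.g. an event id) tends to spread rows out.
    spread = shard_loads([f"event-{i}" for i in range(10_000)], 4)
    # A single hot key (e.g. one dominant customer) puts every row on one shard.
    hot = shard_loads(["customer-1"] * 10_000, 4)
    print("event-id key:", spread, imbalance_ratio(spread))
    print("hot key:     ", hot, imbalance_ratio(hot))
```

With a single hot key the ratio equals the shard count, which is the kind of signal a per-shard operational metric would need to surface.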

It’ll be nice to see some sample query plans graphically visualized.

jaytaylor · on April 9, 2015

This was a great read.

Question/thoughts regarding CitusDB:

It looks to be really cool, and also proprietary.

On the heels of the recent abrupt FoundationDB shutdown after being acquired by Apple, I'm apprehensive and reluctant to even consider investing more energy into proprietary datastores.

I'm torn because I love shiny future tech from outer space, but the FDB burn felt horrible to me.

I'm keen to hear thoughts on other perspectives which might help me figure out a better balance or attitude on these matters.

EdwardDiego · on April 9, 2015

I can't think of any columnar DBs that are FOSS aside from Cloudera's Impala (under the Apache licence), and IMO while it presents itself as a DB, it's a sometimes leaky abstraction over the Hadoop ecosystem, and the columnar format it uses, Parquet, has some data-type limitations compared to other products.

But I agree - we've had some "fun" with our columnar DB's vendor, and the support we're paying so much for has been rather useless.

teraflop · on April 9, 2015

Impala is cool but it's a data warehouse, not a full-featured DBMS. The biggest difference is that it only supports batch inserts, so forget about UPDATE/DELETE queries.

In any case, CitusDB's home page doesn't say anything about it being a columnar database, and this blog post says it uses the same storage engine as PostgreSQL.

EdwardDiego · on April 9, 2015

> Impala is cool but it's a data warehouse, not a full-featured DBMS.

I wouldn't even call it that to be honest, you spend a lot of time thinking about Hadoop and HDFS files when working with Impala.

> In any case, CitusDB's home page doesn't say anything about it being a columnar database

It does, if you dig deep enough. :) https://www.citusdata.com/citus-products/cstore-fdw

But you're right that it's completely optional and not needed to access their distributed query processing.

cbsmith · on April 10, 2015

> I can't think of any columnar DBs that are FOSS aside from Cloudera's Impala

Cassandra isn't strictly a columnar store, but it has much the same properties.

MonetDB.

(O)RCFile.

InfiniDB (which actually did go out of business, but is still available as open source).

Druid.

I'm sure I missed a bunch of others.

ddorian43 · on April 10, 2015

Monetdb (although not distributed).

ddorian43 · on April 9, 2015

They have opensourced pg_shard (oltp sharding, citusdb is olap) and column_store extension for postgresql.

jaytaylor · on April 9, 2015

I went through their GitHub and I saw those projects. I think they're good stuff, and I'm still concerned about proprietary solutions like theirs.

Open-sourcing pieces doesn't put me at ease here, because in the event that their core mission changes I'll still be left out in the cold, since future development and support will still be infeasible.

cbsmith · on April 10, 2015

You'd be left out in the cold without an ability to do OLAP queries efficiently. Fortunately, most OLAP use cases where you'd use CitusDB would be ones where the latency penalty of extracting from CitusDB to process in another system, while painful, would be acceptable as an interim solution.

cbsmith · on April 10, 2015

> It looks to be really cool, and also proprietary.

They open sourced the OLTP portion of the distributed postgres store.