Really cool write-up, thanks! First time I’m hearing about CitusDB. They appear to be building a columnar, distributed database while preserving the Postgres frontend (similar to Redshift, Aster, Greenplum, etc.)
It’s all in the details. I’m planning to investigate the following during my next weekend hack. Hope somebody can answer some pre-sales questions for me:
- how complete is the postgres functionality (e.g.: lateral joins)
- can you set a sharding key to control the shard distribution
- does the database do multiple passes for queries with subselects
- usually one increases the replication factor (limited by budget) to improve query times, with the limitation that it slows down loading time. does the DB stage intermediate writes to batch them, or does the user need to do this? batching works really well for append-only, timestamped event data.
- do you have a job manager or scheduler, needed when you have multiple views that need to be updated without melting your infrastructure
- how easy is it to operate? does the database expose operational metrics so that you can see the load on each shard to potentially detect unbalanced shards?
- tips on hardware configuration (big advantage of redshift here is that you don’t have to run your own warehouse.) maybe partner with MongoHQ?
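To make the batching question concrete, here is a rough sketch of what I'd otherwise end up doing on the client side for append-only event data. Everything in it is hypothetical: `BatchingWriter`, `flush_fn` and `batch_size` are names I made up for illustration, and `flush_fn` stands in for whatever bulk-load call the database actually offers. It is not CitusDB's API.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class BatchingWriter:
    """Buffers append-only events and hands them to flush_fn in batches.

    Hypothetical helper: flush_fn is a stand-in for a real bulk-load call.
    """

    def __init__(self, flush_fn: Callable[[List[Event]], None],
                 batch_size: int = 1000) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer: List[Event] = []

    def append(self, event: Event) -> None:
        # Stage the write locally; only hit the database once a batch is full.
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is staged, even if it is a partial batch.
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._flush_fn(batch)


# Usage: collect batches in a list instead of writing to a real database.
batches: List[List[Event]] = []
writer = BatchingWriter(batches.append, batch_size=3)
for i in range(7):
    writer.append({"ts": i, "value": i * 10})
writer.flush()  # push the final partial batch
```

The point of the question is whether the database does this staging for me, so that each replica sees a few large writes rather than many small ones.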
It’ll be nice to see some sample query plans graphically visualized.
On the heels of the recent abrupt FoundationDB shutdown after being acquired by Apple, I'm apprehensive and reluctant to even consider investing more energy into proprietary datastores.
I'm torn because I love shiny future tech from outer space, but the FDB burn felt horrible to me.
I'm keen to hear thoughts on other perspectives which might help me figure out a better balance or attitude on these matters.
I can't think of any columnar DBs that are FOSS aside from Cloudera's Impala (under the Apache licence), and IMO while it presents as a DB, it's a sometimes leaky abstraction over the Hadoop ecosystem, and the columnar format it uses, Parquet, has some data-type limitations compared to other products.
But I agree - we've had some "fun" with our columnar DB's vendor, and the support we're paying so much for has been rather useless.
Impala is cool but it's a data warehouse, not a full-featured DBMS. The biggest difference is that it only supports batch inserts, so forget about UPDATE/DELETE queries.
In any case, CitusDB's home page doesn't say anything about it being a columnar database, and this blog post says it uses the same storage engine as PostgreSQL.
I went through their GitHub and I saw those projects. I think they're good stuff, but I'm still concerned about proprietary solutions like theirs.
Open-sourcing pieces doesn't put me at ease here, because in the event that their core mission changes I'll still be left out in the cold, since future development and support will still be infeasible.
You'd be left out in the cold without an ability to do OLAP queries efficiently. Fortunately, most OLAP use cases where you'd use CitusDB would be ones where the latency penalty of extracting from CitusDB to process in another system, while painful, would be acceptable as an interim solution.