Hello! For the past year I’ve been working on a fully-managed data warehouse built on Clickhouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: rewriting my app to do batch inserts, managing clusters, and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity planning wrong.
I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn’t know where to start. I eventually built abstractions around Clickhouse: my application code called an insert() function, but behind the scenes I had to stand up Kafka for streaming, handle bulk loading, DB drivers, and Clickhouse configs, and manage schema changes.
This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source: https://github.com/scratchdata/ScratchDB and the paid offering is a hosted version: https://www.scratchdb.com/.
It's called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of Clickhouse.
ScratchDB provides two endpoints [1]: one to insert data and one to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can start sending data right away and the system will just work [3]. It also means you can point any webhook at Scratch without prior setup [4,5]. When you query, you just pass SQL as a query param and get JSON back.
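Roughly, using the two endpoints looks something like the Go sketch below. The host, paths (/data, /query), query params (table, q), and API key header are illustrative placeholders - check the docs [1] for the real names.

    package main

    import (
    	"fmt"
    	"io"
    	"net/http"
    	"net/url"
    	"strings"
    )

    func main() {
    	base := "https://api.scratchdb.com" // placeholder host

    	// Insert: POST arbitrary JSON; tables and columns are created automatically.
    	payload := `{"user": "alice", "event": "signup", "plan": "pro"}`
    	req, _ := http.NewRequest("POST", base+"/data?table=events", strings.NewReader(payload))
    	req.Header.Set("Content-Type", "application/json")
    	req.Header.Set("X-API-Key", "YOUR_KEY") // placeholder auth header
    	if _, err := http.DefaultClient.Do(req); err != nil {
    		panic(err)
    	}

    	// Query: pass SQL as a query param, get JSON back.
    	q := url.QueryEscape("SELECT plan, count(*) FROM events GROUP BY plan")
    	resp, err := http.Get(base + "/query?q=" + q)
    	if err != nil {
    		panic(err)
    	}
    	defer resp.Body.Close()
    	body, _ := io.ReadAll(resp.Body)
    	fmt.Println(string(body))
    }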
It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into Clickhouse. The overall goal is for the platform to automatically manage shards and replicas.
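Conceptually the buffering step looks something like this simplified Go sketch - the type names, file format, and flush strategy here are made up for illustration, not the actual implementation:

    package main

    import (
    	"bufio"
    	"os"
    	"sync"
    )

    // diskBuffer accumulates newline-delimited JSON on disk so ClickHouse sees
    // large batch inserts instead of row-at-a-time writes.
    type diskBuffer struct {
    	mu sync.Mutex
    	f  *os.File
    	w  *bufio.Writer
    }

    func newDiskBuffer(path string) (*diskBuffer, error) {
    	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
    	if err != nil {
    		return nil, err
    	}
    	return &diskBuffer{f: f, w: bufio.NewWriter(f)}, nil
    }

    // Append writes one JSON record to the on-disk buffer.
    func (b *diskBuffer) Append(jsonLine string) error {
    	b.mu.Lock()
    	defer b.mu.Unlock()
    	_, err := b.w.WriteString(jsonLine + "\n")
    	return err
    }

    // Flush persists the buffer; a background job would then bulk insert the
    // file into ClickHouse (e.g. INSERT ... FORMAT JSONEachRow) and rotate it.
    func (b *diskBuffer) Flush() error {
    	b.mu.Lock()
    	defer b.mu.Unlock()
    	if err := b.w.Flush(); err != nil {
    		return err
    	}
    	return b.f.Sync()
    }

    func main() {
    	buf, err := newDiskBuffer("events.ndjson")
    	if err != nil {
    		panic(err)
    	}
    	// In ScratchDB these records would arrive over the insert endpoint.
    	_ = buf.Append(`{"user":"alice","event":"signup"}`)
    	_ = buf.Append(`{"user":"bob","event":"signup"}`)
    	if err := buf.Flush(); err != nil {
    		panic(err)
    	}
    }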
The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - I want folks to be able to make economical use of their servers.
So far ScratchDB has ingested about 2 TB of data and handled 4,000 requests/second, on about $100 worth of monthly server costs.
Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!
[1] https://scratchdb.com/docs
[2] https://scratchdb.com/blog/flatten-json/
[3] https://scratchdb.com/blog/scratchdb-email-signups/
[4] https://scratchdb.com/blog/stripe-data-ingest/
[5] https://scratchdb.com/blog/shopify-data-ingest/