Show HN: ScratchDB – Open-Source Snowflake on ClickHouse (github.com/scratchdata)
261 points by memset on Oct 27, 2023 | 56 comments
Hello! For the past year I’ve been working on a fully-managed data warehouse built on Clickhouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: re-writing my app to do batch inserts, managing clusters and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity-planning wrong.

I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn’t know where to start. I eventually built abstractions around Clickhouse: my application code called an insert() function, but in the background I had to stand up Kafka for streaming, handle bulk loading, DB drivers, and Clickhouse configs, and manage schema changes.

This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source: https://github.com/scratchdata/ScratchDB and the paid offering is a hosted version: https://www.scratchdb.com/.

It's called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of Clickhouse.

ScratchDB provides two endpoints [1]: one to insert data and another to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can just start sending data and the system will just work [3]. It also means you can use Scratch as a destination for any webhook without prior setup [4,5]. When you query, just pass SQL as a query param and it returns JSON.
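
Roughly, using the two endpoints looks like this - a minimal sketch in Go, where the host, paths, table parameter, and API-key header are placeholders (the real details are in the docs [1]):

  package main

  import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strings"
  )

  func main() {
    base := "https://api.scratchdb.com" // placeholder host; see the docs for the real endpoints and auth

    // Insert: POST arbitrary JSON; tables and columns are created from its shape.
    body := `{"user": "alice", "event": "signup", "plan": {"name": "pro", "seats": 5}}`
    req, _ := http.NewRequest("POST", base+"/data?table=events", strings.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("X-API-KEY", "YOUR_KEY") // illustrative auth header
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
      panic(err)
    }
    resp.Body.Close()

    // Query: pass SQL as a query param, get JSON back.
    q := url.QueryEscape("SELECT event, count() FROM events GROUP BY event")
    resp, err = http.Get(base + "/query?q=" + q)
    if err != nil {
      panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
  }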

It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into Clickhouse. The overall goal is for the platform to automatically manage shards and replicas.
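
In (heavily simplified) Go terms, the ingestion path is roughly this shape - a sketch, not the actual implementation, with the bulk-load step stubbed out:

  package ingest

  import (
    "os"
    "sync"
    "time"
  )

  // Batcher appends incoming JSON rows to an on-disk file and periodically
  // hands the batch off to a bulk loader (which, in ScratchDB, targets Clickhouse).
  type Batcher struct {
    mu   sync.Mutex
    file *os.File
  }

  func NewBatcher(path string, every time.Duration, load func(path string)) (*Batcher, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
    if err != nil {
      return nil, err
    }
    b := &Batcher{file: f}
    go func() {
      for range time.Tick(every) {
        b.mu.Lock()
        b.file.Sync()      // make appended rows durable before loading
        load(path)         // bulk load this batch (stubbed here)
        b.file.Truncate(0) // start the next batch
        b.mu.Unlock()
      }
    }()
    return b, nil
  }

  // Insert appends one JSON document as a line; callers never touch the database directly.
  func (b *Batcher) Insert(jsonRow []byte) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    _, err := b.file.Write(append(jsonRow, '\n'))
    return err
  }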

The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - I want folks to be able to make economical use of their servers.

So far ScratchDB has ingested about 2 TB of data, at up to 4,000 requests/second, on about $100 worth of monthly server costs.

Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!

[1] https://scratchdb.com/docs

[2] https://scratchdb.com/blog/flatten-json/

[3] https://scratchdb.com/blog/scratchdb-email-signups/

[4] https://scratchdb.com/blog/stripe-data-ingest/

[5] https://scratchdb.com/blog/shopify-data-ingest/




Can you explain what "open-source Snowflake" means, since you don't explain it in this description, in the repo, or on the site?

Is your goal explicitly to replicate all Snowflake capabilities? https://docs.snowflake.com/en/user-guide/intro-supported-fea...


This is good feedback on writing a clear message - I appreciate it.

The goal of this project is to build a superb developer experience around analytical databases. I think that is one of Snowflake's (many) value propositions. It is also a goal for users to have full control of their data and how it is processed, and to make economical use of their compute.

This project does not have a goal of matching anyone else's features. But of course, as we grow, we'll end up building things which are important to enterprises.


Calling it "Open-Source Snowflake" implies you are trying to be a drop-in replacement for Snowflake for at least a subset of features, SQL syntax, etc.


I too was confused by the title. If it doesn't support feature-parity, nor approach the problem space from the same perspective as Snowflake, I don't think it's the open source version.


Snowflake started as a cloud-first (elastic) data warehouse. DX came later.


Disclaimer: I work at ClickHouse.

Thank you! Looks really interesting!

I personally agree that real-time OLAP databases have potential to better serve workloads currently in Postgres or cloud data warehouses that need real-time ingest and analytical queries. And simplifying developer experience on top of that, so you don't have to learn about all the details of a powerful database, really speeds up developer velocity.

I'm curious how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?

Congratulations again on the launch!


Good to meet you! I hadn't seen graphjson before, I'll check that out. I'm also a fan of tinybird, and I think we have similar goals of wanting to make it easier for people to adopt OLAP.

On the technical side, I've made different design decisions when it comes to ingesting and processing data. For example, after a table is created, you can post new JSON, with different columns, and we ingest it without needing a manual migration step. I also have a different treatment for JSON arrays, where we can split them into multiple Clickhouse rows (rather than using a clickhouse array.)

Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience. I've been writing OSS for years and basically think this is the best way to build successful developer tools.


> Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience.

+1 to that! Welcome to the ClickHouse community!


I would like to be an OSS dev but how do you pay the bills?


> how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?

Not OP, but neither of these appear to be open source?


Great product! Thanks for sharing it!

Question: I thought Clickhouse already has native support for flattening JSON [1], although it was released recently (version 22.3.1). Did you start working on it [2] before that? Or is it a different take? I'm curious about the pros and cons of each one.

[1] https://clickhouse.com/docs/en/integrations/data-formats/jso... [2] https://scratchdb.com/blog/flatten-json/


This is a really good question. My snarky answer is "the way we ingest JSON doesn't require 50 pages and n configuration settings to explain."

Here's a more factual one:

- We don't use tuples for nested JSON. We just use underscores to show parent_child relationships in keys.

- We don't use arrays. Instead, we make it easy to split into multiple rows so you can use regular SQL.

I haven't directly compared all of Clickhouse's different JSON treatments to what I've implemented, but my goal was to build something that you could run and would "just work the way it should."
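
Conceptually, the flattening works something like this (a simplified sketch, not the production code): nested objects become underscore-joined keys, and arrays fan out into one row per element.

  package flatten

  // Flatten turns one decoded JSON document into one or more flat rows:
  // nested objects join keys with "_", arrays become one row per element.
  func Flatten(doc map[string]any) []map[string]any {
    rows := []map[string]any{{}}
    for k, v := range doc {
      rows = merge(rows, k, v)
    }
    return rows
  }

  func merge(rows []map[string]any, key string, val any) []map[string]any {
    switch v := val.(type) {
    case map[string]any: // nested object: recurse with a prefixed key
      for ck, cv := range v {
        rows = merge(rows, key+"_"+ck, cv)
      }
    case []any: // array: duplicate every existing row once per element
      var out []map[string]any
      for _, elem := range v {
        for _, r := range rows {
          dup := map[string]any{}
          for rk, rv := range r {
            dup[rk] = rv
          }
          out = append(out, merge([]map[string]any{dup}, key, elem)...)
        }
      }
      rows = out
    default: // scalar: set the column on every row built so far
      for _, r := range rows {
        r[key] = v
      }
    }
    return rows
  }

So {"order": "A1", "items": [{"sku": "x"}, {"sku": "y"}]} ends up as two rows, each with an order column and an items_sku column.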


ClickHouse JSON certainly has some sharp edges because they do a lot of schema inference to transparently column store the component parts, which makes it tricky with anything that could be irregular.

Big fan of CH on my end. Will follow your project closely.


What happens with this?

  {
    "some_key": "1"
    "some": {
      "key": "2"
    }
  }


Probably an error! Should create a ticket for that. What would we want the behavior to be?

(This hasn't come up with existing customers, but theoretically it could and we should handle it.)


maybe double the underscore on conflict?

some_key = 1

some__key = 2


The convention when flattening is exactly that, I thought - an underscore per level of nesting. This is what I would expect to see too.


Agree. (What if the original user's key has a double underscore?) Maybe this can be defined as a param?


> What if the original user's key has a double underscore?

Double it again.

some_key = 1

some__key = 2

some____key = 3

This is analogous to what C# 11 does to allow raw string literals to contain those same quote characters. https://learn.microsoft.com/en-us/dotnet/csharp/language-ref...


escape the underscore. this is the only sane one, I think.


I tested it with the "explode" flattening mode, and it resulted in { "some_key": "2" }


Feel free to create a gh issue - really interested in what the right solution would be here!


as a huge clickhouse fan, i have to agree json is simply not a great experience even with all the changes in the last year


maybe use the json flattening RFC instead of inventing your own?


Schema inference for nested JSON objects is a feature of ClickHouse 23.9. I've made a video about it: https://www.youtube.com/watch?v=yS8YU-rBpMM&t=1846s


AGPL-3.0 license, for those wondering.



Congrats on the release! Can this be used for log data? How long is ingested data kept?


You can use it for logs. I have a basic example here: https://scratchdb.com/blog/fly-logs-to-clickhouse/

It's just storing data in a database so it persists as long as you want. For the hosted version you just pay per GB of compressed data, so if you have tons of logs then you can keep them, or you can clear out old data to save on disk space.


Thanks for sharing. Looks very clean and simple to use.

Do you plan on supporting non-JSON data types for insertion? For example, inserting CSV files, parquet files, Avro or Protobuf messages?


Yes! I have an issue for that: https://github.com/scratchdata/ScratchDB/issues/19

What would you want it to look like?


Not the parent poster, but I wouldn't race to solve the problem via supporting many more connectors that require lots of config options. You'll just spend time supporting lots of connectors.

Instead, recommend that users use another (connector heavy) tool to get data in -- my mind jumped to using Benthos to convert a CSV or parquet file (or any other input stream) into a series of JSON calls -- and just ask users to hammer ingestion requests at your server. From there, your job is "just" to handle JSON ingestion as fast as you can, rather than maintain many connectors.
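
To make the idea concrete, the glue can be tiny - e.g. a throwaway sketch like this (Benthos or similar would do the same job properly; the endpoint, table param, and file name here are placeholders):

  package main

  import (
    "bytes"
    "encoding/csv"
    "encoding/json"
    "net/http"
    "os"
  )

  // Stream a CSV file into the JSON ingest endpoint, one row per request.
  func main() {
    f, err := os.Open("data.csv")
    if err != nil {
      panic(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    header, err := r.Read() // first line holds the column names
    if err != nil {
      panic(err)
    }
    for {
      rec, err := r.Read()
      if err != nil {
        break // io.EOF (or a parse error) ends the loop in this sketch
      }
      row := map[string]string{}
      for i, col := range header {
        row[col] = rec[i]
      }
      body, _ := json.Marshal(row)
      resp, err := http.Post("https://api.scratchdb.com/data?table=imported_csv",
        "application/json", bytes.NewReader(body))
      if err != nil {
        panic(err)
      }
      resp.Body.Close()
    }
  }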

If JSON becomes a problem, then find exactly one other well-defined file format for bulk data loads (parquet, perhaps?), and support that.

When I saw this submission, I think I fell in love. JSON to get data in; SQL to get data out; what could be simpler?


Agreed. I guess it depends on the target market. If your target is small to medium sized businesses, this is great to start doing analytics. But for large organisations, they generally have ETL/ELT jobs that do all the extraction from data sources to push to an analytics store or save in some format that is performant for analytics and storage (i.e. parquet). I'm also not sure how many data visualisation tools support API endpoints for query serving.


You should submit your benchmarks to ClickBench.


What does the license mean, if I don't change any of the code you provide, but use it to provide a public-facing service? Like if I use it for a forum, for instance, but am using a separate bit of code to push data into and retrieve out of ScratchDB?


Why is your storage 10X that of bigquery? How does your compute price compare to bigtable?

Edit: bigtable->bigquery


I don't bill separately for the compute vs storage the way bigtable does. The pricing per GB of data is inclusive of compute. The goal is for pricing to be modeled similarly to DynamoDB - just pay for what you use. The other way I charge is by query wall time - so a 30s query will cost you more than a 500ms one.

I haven't used bigtable but it seems like the minimum charge is on the order of $300 before you have any data. With ScratchDB, the minimum charge is $10 for 30 GB.

Additionally, on average, data compresses to about 25% of its original size. So if your 1 TB of data only takes up 250 GB, you only pay for that.
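
(Rough, illustrative math, assuming the $10-per-30-GB figure above scales linearly: that's about $0.33 per compressed GB, so 1 TB of raw data compressing to ~250 GB would be on the order of $83/month for storage, before query wall-time charges.)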

Bigtable isn't OLAP, so you would not use them for the same data. This competes more directly with GCP's BigQuery.

Finally, I'm interested in pricing feedback! The goal is to be able to sustain the development of this, so I want to do what makes sense.


I wrote it wrong, I meant BigQuery, sorry. It's 9x BigQuery's price.


I haven't calculated the break-even point for BigQuery vs ScratchDB. It would be hard to do so in advance, as I'd need to know how many rows BigQuery will scan in order to make an estimate. Also, what is a "slot-hour"?

That is why I chose pricing in units of "GB" and "hours" for storage and compute - those are things you can more easily observe.

It is a good question, though, and perhaps I can do an experiment and write a blog post using example data showing the differences. I might be surprised at how efficient bigquery is!


Congrats on the launch. This looks great. Inferring schemas on the fly is awesome for getting started quickly, but are there ways to explicitly define a schema if I wanted to? For example, I'm thinking of setting column-specific compression.


Currently no, but I’m open to the idea of being able to set this. Perhaps the ability to set it on the fly (or better yet, inform the user of which type of compression to use based on their actual data!) would be useful. Happy to discuss in a gh issue too.


This looks great. I have one question: when you are automatically creating tables, how do you choose the primary key order for the Clickhouse table?


Great question. Each new table is given a __row_id column which ScratchDB populates and that is the default primary key.

Then, as data is added, I look at the cardinality of each column and set an ORDER BY appropriately. The goal (implementation in progress, not launched yet) is for ScratchDB to do this automatically - to look at both the data in your table and your queries and analytically come up with an appropriate sorting key.
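
As a hypothetical sketch of that heuristic (not the shipping code), the simplest version is just ordering candidate columns from lowest to highest cardinality, which tends to compress and filter well in Clickhouse:

  package sortkey

  import "sort"

  // ColumnStats is a hypothetical per-column summary kept during ingestion.
  type ColumnStats struct {
    Name        string
    Cardinality uint64 // approximate distinct values seen so far
  }

  // PickOrderBy returns a candidate sorting key: low-cardinality columns first,
  // capped at a few columns.
  func PickOrderBy(cols []ColumnStats, max int) []string {
    sort.Slice(cols, func(i, j int) bool {
      return cols[i].Cardinality < cols[j].Cardinality
    })
    var key []string
    for _, c := range cols {
      if len(key) == max {
        break
      }
      if c.Cardinality > 1 { // constant columns add nothing to the sort key
        key = append(key, c.Name)
      }
    }
    return key
  }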


This sounds like a very interesting problem. How costly is it to modify the ORDER BY after data has been loaded into the table? It sounds expensive, since it has to reorder the data after the ORDER BY changes.


It is not a cheap operation - I basically create a new table and then "INSERT INTO SELECT..." Then again, it isn't something you typically need to do a lot.


Thank you for the answers, and best of luck with this product.


Just signed up but didn't receive a confirmation email. Are you currently accepting new sign-ups for the managed service?


Yes! I should set up confirmation emails - I'm reaching out to the folks who have signed up so you should hear back soon! If I miss it (apologies!) then feel free to email directly.


I love everything about your story and what you built. In the process of doing something similar.

Nice work!


> The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.

Cool, glad to see Hetzner, at least presumably for compute, rather than the almost routine, absurdly expensive, mega cloud providers.

I have a few questions if you've got time.

1. What made you pick Hetzner in particular, and did you evaluate any of their primary competitors? (e.g., OVH, etc)

2. In your $100/month figure, did you decide to go with dedicated servers or the "cloud" VPS line? If the latter, was there any particular reason over going with the bare-metal offerings?

3. Are you making use of Hetzner's U.S. servers as well or is everything currently in Europe (or vice-versa)?

4. Was there any particular reason for choosing B2 and SQS as opposed to self-hosting object-storage on the SX servers?

Normally, I wouldn't even wonder why someone wouldn't want the burden of more infrastructure. But given the choice of going with relatively unmanaged Hetzner servers, presumably self-hosting clickhouse, etc, and then with your compute provider also happening to offer fairly large storage servers on the cheap, I might've been tempted to cut out the additional providers and DIY it:

- less costly for large amounts of data

- zero lock-in [1]

- fewer companies to deal with

  - likely better negotiating power with Hetzner when the time comes if a bigger percentage of your overhead is with them as opposed to spread out across three providers

  - fewer points of failure; if the Hetzner servers are down, I would assume you're in trouble anyway, so perhaps keeping [most] of your eggs on the same network might not be as bad as it sounds

  - presumably better latency and bandwidth + the ability to communicate over a private network [2]

5. I see the license is AGPL. But I don't see the usual "you must dual-license all contributions under MIT/BSD/ISC as well [so that only we can re-license the project]" nor "before contributing, sign this agreement transferring copyright [and your first born child]".

Was this just an oversight, or do you intend to be one of the few SaaS companies that really truly is open-source rather than "open-source" [until people are locked-in] and then going "open"-core? If the latter, then awesome -- cool to see.

6. Any regrets, disasters, or lessons learned so far? Usually, I find these stories the most interesting but unfortunately too few are willing to share.

---

[1]: I know B2 provides a relatively standard, at this point, S3-compatible API and everything as well. But I think there is also still something to be said about a somewhat Juche-esque approach to infrastructure, wherein should prices rise, contracts change, service degrades, or whatever else, you'd have the ability to almost immediately switch at a moment's notice to literally anyone else who can lease you a box with some hard drives or any colo provider.

[2]: This goes out the window somewhat if you're using the VPS line and American servers, though.


Really interesting question(s)! Will try to answer the gist.

First, the system is designed for users to be able to drop in whatever hardware they want. So your questions are around my initial deployment options. These will certainly change in the future.

I chose Hetzner because of cost. I will probably end up using other providers in the future but Hetzner let me begin this experiment without burning my runway.

I use metal servers for Clickhouse, and small cloud boxes and LBs for the API. It happens I'm using US-based cloud servers because that's where my users are.

I'm using B2 + SQS because I did not want to take on sysadmin for those components. They are not performance-sensitive. Using B2 at $0.006/GB vs managing minio on an SX server at $0.001/GB was acceptable to me :)

Why use regular servers for the API instead of fly.io? Because the API writes data to disk, and then bulk loads it to clickhouse. This means I needed durable and reliable disk, which you can only have with actual VMs. I didn't want the process to be randomly SIGKILL'd since the process shuts down safely to avoid data loss, and I didn't want the risk of ephemeral storage. So to control all this I have to run the HTTP servers myself.
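
Concretely, the thing I need is to catch an orderly shutdown signal and flush in-flight batches before exiting - roughly like this sketch (a SIGKILL, by contrast, can't be caught at all):

  package main

  import (
    "net/http"
    "os"
    "os/signal"
    "syscall"
  )

  func main() {
    srv := &http.Server{Addr: ":8080"}
    go srv.ListenAndServe() // ingest API

    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

    <-stop             // orderly shutdown requested
    srv.Close()        // stop accepting new inserts
    flushPendingData() // stand-in for draining on-disk batches into Clickhouse
  }

  func flushPendingData() {}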

I might experiment with a PaaS for this, but it was easy enough to just set up an init script to run the daemon.

re: negotiating power - that conversation only happens when I have volume, at which point I will be able to negotiate with any provider.

re: points of failure/latency - this will continually change: the deployment strategy for today will be different than when I'm managing 100s of TB of data across thousands of users. The main focus is to make the system flexible to handle different topologies and to be able to change providers by updating a configuration.

Thank you!


TiDB is an HTAP database whose OLAP component (TiFlash) was based on Clickhouse: https://news.ycombinator.com/item?id=23584022

If you have analyzed the competition, what are your selling points? Benchmarks welcome. Thank you!


I haven't seen TiDB before - thanks for sharing. I honestly haven't spent tons of time analyzing other databases (there are so many!) but I haven't seen much out there which tries to be "Firebase for OLAP" - that is, make it so easy that a developer who maybe doesn't have time to learn Clickhouse can be effective with it. That's the most important goal of this project: lower the barrier for people who need to use a columnar data structure to be able to do so [without thinking too much about it] and get back to solving their engineering problems.


TiFlash is designed for TiDB users, seamlessly moving between inserting row data and querying column storage.


ScratchDB has saved my business and it's awesome. I think if you need a columnar store, you should really try these guys.


Thank you - really excited about what you're building too!



