Show HN: ScratchDB – Open-Source Snowflake on ClickHouse (github.com/scratchdata)
261 points by memset on Oct 27, 2023 | 56 comments
Hello! For the past year I’ve been working on a fully-managed data warehouse built on Clickhouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: re-writing my app to do batch inserts, managing clusters and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity-planning wrong.

I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn’t know where to start. I eventually built abstractions around Clickhouse: my application code called an insert() function, but in the background I had to stand up Kafka for streaming, handle bulk loading, DB drivers, and Clickhouse configs, and manage schema changes.

This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source: https://github.com/scratchdata/ScratchDB and the paid offering is a hosted version: https://www.scratchdb.com/.

It's called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of Clickhouse.

ScratchDB provides two endpoints [1]: one to insert data and another to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can just start sending data and the system will just work [3]. It also means you can use Scratch as a destination for any webhook without prior setup [4,5]. When you query, just pass SQL as a query param and it returns JSON.
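
Roughly, using the two endpoints looks like this - a minimal sketch in Go, where the host, paths, table parameter, and API-key header are placeholders (the real details are in the docs [1]):

  package main

  import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strings"
  )

  func main() {
    base := "https://api.scratchdb.com" // placeholder host; see the docs for the real endpoints and auth

    // Insert: POST arbitrary JSON; tables and columns are created from its shape.
    body := `{"user": "alice", "event": "signup", "plan": {"name": "pro", "seats": 5}}`
    req, _ := http.NewRequest("POST", base+"/data?table=events", strings.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("X-API-KEY", "YOUR_KEY") // illustrative auth header
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
      panic(err)
    }
    resp.Body.Close()

    // Query: pass SQL as a query param, get JSON back.
    q := url.QueryEscape("SELECT event, count() FROM events GROUP BY event")
    resp, err = http.Get(base + "/query?q=" + q)
    if err != nil {
      panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
  }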

It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into Clickhouse. The overall goal is for the platform to automatically manage shards and replicas.
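
In (heavily simplified) Go terms, the ingestion path is roughly this shape - a sketch, not the actual implementation, with the bulk-load step stubbed out:

  package ingest

  import (
    "os"
    "sync"
    "time"
  )

  // Batcher appends incoming JSON rows to an on-disk file and periodically
  // hands the batch off to a bulk loader (which, in ScratchDB, targets Clickhouse).
  type Batcher struct {
    mu   sync.Mutex
    file *os.File
  }

  func NewBatcher(path string, every time.Duration, load func(path string)) (*Batcher, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
    if err != nil {
      return nil, err
    }
    b := &Batcher{file: f}
    go func() {
      for range time.Tick(every) {
        b.mu.Lock()
        b.file.Sync()      // make appended rows durable before loading
        load(path)         // bulk load this batch (stubbed here)
        b.file.Truncate(0) // start the next batch
        b.mu.Unlock()
      }
    }()
    return b, nil
  }

  // Insert appends one JSON document as a line; callers never touch the database directly.
  func (b *Batcher) Insert(jsonRow []byte) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    _, err := b.file.Write(append(jsonRow, '\n'))
    return err
  }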

The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - I want folks to be able to make economical use of their servers.

So far ScratchDB has ingested about 2 TB of data, at up to 4,000 requests/second, on about $100 worth of monthly server costs.

Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!

[1] https://scratchdb.com/docs

[2] https://scratchdb.com/blog/flatten-json/

[3] https://scratchdb.com/blog/scratchdb-email-signups/

[4] https://scratchdb.com/blog/stripe-data-ingest/

[5] https://scratchdb.com/blog/shopify-data-ingest/




Can you explain what "open-source Snowflake" means, since you don't explain it in this description, in the repo, or on the site?

Is your goal explicitly to replicate all Snowflake capabilities? https://docs.snowflake.com/en/user-guide/intro-supported-fea...


This is good feedback on writing a clear message - I appreciate it.

The goal of this project is to build a superb developer experience around analytical databases. I think that is one of Snowflake's (many) value propositions. It is also a goal for users to have full control of their data and how it is processed, and to make economical use of their compute.

This project does not have a goal of matching anyone else's features. But of course, as we grow, we'll end up building things which are important to enterprises.


Calling it "Open-Source Snowflake" implies you are trying to be a drop-in replacement for Snowflake for at least a subset of features, SQL syntax, etc.


I too was confused by the title. If it doesn't support feature-parity, nor approach the problem space from the same perspective as Snowflake, I don't think it's the open source version.


Snowflake started as a cloud-first (elastic) data warehouse. DX came later.


Disclaimer: I work at ClickHouse.

Thank you! Looks really interesting!

I personally agree that real-time OLAP databases have potential to better serve workloads currently in Postgres or cloud data warehouses that need real-time ingest and analytical queries. And simplifying developer experience on top of that, so you don't have to learn about all the details of a powerful database, really speeds up developer velocity.

I'm curious how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?

Congratulations again on the launch!


Good to meet you! I hadn't seen graphjson before, I'll check that out. I'm also a fan of tinybird, and I think we have similar goals of wanting to make it easier for people to adopt OLAP.

On the technical side, I've made different design decisions when it comes to ingesting and processing data. For example, after a table is created, you can post new JSON, with different columns, and we ingest it without needing a manual migration step. I also have a different treatment for JSON arrays, where we can split them into multiple Clickhouse rows (rather than using a clickhouse array.)

Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience. I've been writing OSS for years and basically think this is the best way to build successful developer tools.


> Philosophically, I think there is a lot of room for open-source software that also has an amazing UI and developer experience.

+1 to that! Welcome to the ClickHouse community!


I would like to be an OSS dev but how do you pay the bills?


> how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/)?

Not OP, but neither of these appear to be open source?


Great product! Thanks for sharing it!

Question: I thought Clickhouse already has native support for flattening JSON [1], although it was released recently (version 22.3.1). Did you start working on it [2] before that? Or is it a different take? I'm curious about the pros and cons of each one.

[1] https://clickhouse.com/docs/en/integrations/data-formats/jso... [2] https://scratchdb.com/blog/flatten-json/


This is a really good question. My snarky answer is "the way we ingest JSON doesn't require 50 pages and n configuration settings to explain."

Here's a more factual one:

- We don't use tuples for nested JSON. We just use underscores to show parent_child relationships in keys.

- We don't use arrays. Instead, we make it easy to split into multiple rows so you can use regular SQL.

I haven't directly compared all of Clickhouse's different JSON treatments to what I've implemented, but my goal was to build something that you could run and would "just work the way it should."
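
Conceptually, the flattening works something like this (a simplified sketch, not the production code): nested objects become underscore-joined keys, and arrays fan out into one row per element.

  package flatten

  // Flatten turns one decoded JSON document into one or more flat rows:
  // nested objects join keys with "_", arrays become one row per element.
  func Flatten(doc map[string]any) []map[string]any {
    rows := []map[string]any{{}}
    for k, v := range doc {
      rows = merge(rows, k, v)
    }
    return rows
  }

  func merge(rows []map[string]any, key string, val any) []map[string]any {
    switch v := val.(type) {
    case map[string]any: // nested object: recurse with a prefixed key
      for ck, cv := range v {
        rows = merge(rows, key+"_"+ck, cv)
      }
    case []any: // array: duplicate every existing row once per element
      var out []map[string]any
      for _, elem := range v {
        for _, r := range rows {
          dup := map[string]any{}
          for rk, rv := range r {
            dup[rk] = rv
          }
          out = append(out, merge([]map[string]any{dup}, key, elem)...)
        }
      }
      rows = out
    default: // scalar: set the column on every row built so far
      for _, r := range rows {
        r[key] = v
      }
    }
    return rows
  }

So {"order": "A1", "items": [{"sku": "x"}, {"sku": "y"}]} ends up as two rows, each with an order column and an items_sku column.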


ClickHouse JSON certainly has some sharp edges because they do a lot of schema inference to transparently column store the component parts, which makes it tricky with anything that could be irregular.

Big fan of CH on my end. Will follow your project closely.


What happens with this?

  {
    "some_key": "1"
    "some": {
      "key": "2"
    }
  }


Probably an error! Should create a ticket for that. What would we want the behavior to be?

(This hasn't come up with existing customers, but theoretically it could and we should handle it.)


maybe double the underscore on conflict?

some_key = 1

some__key = 2


The convention when flattening is exactly that, I thought - an underscore per level of nesting. This is what I would expect to see too.


Agree. (What if the original user's key has a double underscore?) Maybe this can be defined as a param?


> What if the original user's key has a double underscore?

Double it again.

some_key = 1

some__key = 2

some____key = 3

This is analogous to what C# 11 does to allow raw string literals to contain those same quote characters. https://learn.microsoft.com/en-us/dotnet/csharp/language-ref...


escape the underscore. this is the only sane one, I think.


I tested it with the "explode" flattening mode, and it resulted in { "some_key": "2" }


Feel free to create a gh issue - really interested in what the right solution would be here!


as a huge clickhouse fan, i have to agree json is simply not a great experience even with all the changes in the last year


maybe use the json flattening RFC instead of inventing your own?


Schema inference for nested JSON objects is a feature of ClickHouse 23.9. I've made a video about it: https://www.youtube.com/watch?v=yS8YU-rBpMM&t=1846s


AGPL-3.0 license, for those wondering.



Congrats on the release! Can this be used for log data? How long is ingested data kept?


You can use it for logs. I have a basic example here: https://scratchdb.com/blog/fly-logs-to-clickhouse/

It's just storing data in a database so it persists as long as you want. For the hosted version you just pay per GB of compressed data, so if you have tons of logs then you can keep them, or you can clear out old data to save on disk space.


Thanks for sharing. Looks very clean and simple to use.

Do you plan on supporting non-JSON data types for insertion? For example, inserting CSV files, parquet files, Avro or Protobuf messages?


Yes! I have an issue for that: https://github.com/scratchdata/ScratchDB/issues/19

What would you want it to look like?


Not the parent poster, but I wouldn't race to solve the problem via supporting many more connectors that require lots of config options. You'll just spend time supporting lots of connectors.

Instead, recommend that users use another (connector heavy) tool to get data in -- my mind jumped to using Benthos to convert a CSV or parquet file (or any other input stream) into a series of JSON calls -- and just ask users to hammer ingestion requests at your server. From there, your job is "just" to handle JSON ingestion as fast as you can, rather than maintain many connectors.
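
To make the idea concrete, the glue can be tiny - e.g. a throwaway sketch like this (Benthos or similar would do the same job properly; the endpoint, table param, and file name here are placeholders):

  package main

  import (
    "bytes"
    "encoding/csv"
    "encoding/json"
    "net/http"
    "os"
  )

  // Stream a CSV file into the JSON ingest endpoint, one row per request.
  func main() {
    f, err := os.Open("data.csv")
    if err != nil {
      panic(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    header, err := r.Read() // first line holds the column names
    if err != nil {
      panic(err)
    }
    for {
      rec, err := r.Read()
      if err != nil {
        break // io.EOF (or a parse error) ends the loop in this sketch
      }
      row := map[string]string{}
      for i, col := range header {
        row[col] = rec[i]
      }
      body, _ := json.Marshal(row)
      resp, err := http.Post("https://api.scratchdb.com/data?table=imported_csv",
        "application/json", bytes.NewReader(body))
      if err != nil {
        panic(err)
      }
      resp.Body.Close()
    }
  }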

If JSON becomes a problem, then find exactly one other well-defined file format for bulk data loads (parquet, perhaps?), and support that.

When I saw this submission, I think I fell in love. JSON to get data in; SQL to get data out; what could be simpler?


Agreed. I guess it depends on the target market. If your target is small to medium sized businesses, this is great to start doing analytics. But for large organisations, they generally have ETL/ELT jobs that do all the extraction from data sources to push to an analytics store or save in some format that is performant for analytics and storage (i.e. parquet). I'm also not sure how many data visualisation tools support API endpoints for query serving.


You should submit your benchmarks to ClickBench.


What does the license mean, if I don't change any of the code you provide, but use it to provide a public-facing service? Like if I use it for a forum, for instance, but am using a separate bit of code to push data into and retrieve out of ScratchDB?


Why is your storage 10X that of bigquery? How does your compute price compare to bigtable?

Edit: bigtable->bigquery


I don't bill separately for the compute vs storage the way bigtable does. The pricing per GB of data is inclusive of compute. The goal is for pricing to be modeled similarly to DynamoDB - just pay for what you use. The other way I charge is by query wall time - so a 30s query will cost you more than a 500ms one.

I haven't used bigtable but it seems like the minimum charge is on the order of $300 before you have any data. With ScratchDB, the minimum charge is $10 for 30 GB.

Additionally, on average, data compresses to about 25% of its original size. So if your 1 TB of data only takes up 250 GB, you only pay for that.
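
(Rough, illustrative math, assuming the $10-per-30-GB figure above scales linearly: that's about $0.33 per compressed GB, so 1 TB of raw data compressing to ~250 GB would be on the order of $83/month for storage, before query wall-time charges.)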

Bigtable isn't OLAP, so you would not use them for the same data. This competes more directly with GCP's BigQuery.

Finally, I'm interested in pricing feedback! The goal is to be able to sustain the development of this, so I want to do what makes sense.


I wrote it wrong, I meant BigQuery, sorry. It's 9x BigQuery's price.


I haven't calculated the break-even point for BigQuery vs ScratchDB. It would be hard to do so in advance, as I'd need to know how many rows BigQuery will scan in order to make an estimate. Also, what is a "slot-hour"?

That is why I chose pricing in units of "GB" and "hours" for storage and compute - those are things you can more easily observe.

It is a good question, though, and perhaps I can do an experiment and write a blog post using example data showing the differences. I might be surprised at how efficient bigquery is!


Congrats on the launch. This looks great. Inferring schemas on the fly is awesome for getting started quickly, but are there ways to explicitly define a schema if I wanted to? For example, I'm thinking of setting column-specific compression.


Currently no, but I’m open to the idea of being able to set this. Perhaps the ability to set it on the fly (or better yet, inform the user of which type of compression to use based on their actual data!) would be useful. Happy to discuss in a gh issue too.


This looks great. I have one question: when you are automatically creating tables, how do you choose the primary key order for the Clickhouse table?


Great question. Each new table is given a __row_id column which ScratchDB populates and that is the default primary key.

Then, as data is added, I look at the cardinality of each column and set an ORDER BY appropriately. The goal (implementation in progress, not launched yet) is for ScratchDB to do this automatically - to look at both the data in your table and your queries and analytically come up with an appropriate sorting key.
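
As a hypothetical sketch of that heuristic (not the shipping code), the simplest version is just ordering candidate columns from lowest to highest cardinality, which tends to compress and filter well in Clickhouse:

  package sortkey

  import "sort"

  // ColumnStats is a hypothetical per-column summary kept during ingestion.
  type ColumnStats struct {
    Name        string
    Cardinality uint64 // approximate distinct values seen so far
  }

  // PickOrderBy returns a candidate sorting key: low-cardinality columns first,
  // capped at a few columns.
  func PickOrderBy(cols []ColumnStats, max int) []string {
    sort.Slice(cols, func(i, j int) bool {
      return cols[i].Cardinality < cols[j].Cardinality
    })
    var key []string
    for _, c := range cols {
      if len(key) == max {
        break
      }
      if c.Cardinality > 1 { // constant columns add nothing to the sort key
        key = append(key, c.Name)
      }
    }
    return key
  }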


This sounds like a very interesting problem. How costly is it to modify the ORDER BY after data has been loaded into the table? It sounds expensive, since it has to reorder the data after the ORDER BY changes.


It is not a cheap operation - I basically create a new table and then "INSERT INTO SELECT..." Then again, it isn't something you typically need to do a lot.


Thank you for the answers, and best of luck with this product.


Just signed up but didn't receive a confirmation email. Are you currently accepting new sign-ups for the managed service?


Yes! I should set up confirmation emails - I'm reaching out to the folks who have signed up so you should hear back soon! If I miss it (apologies!) then feel free to email directly.


I love everything about your story and what you built. In the process of doing something similar.

Nice work!


> The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.

Cool, glad to see Hetzner, at least presumably for compute, rather than the almost routine, absurdly expensive, mega cloud providers.

I have a few questions if you've got time.

1. What made you pick Hetzner in particular, and did you evaluate any of their primary competitors? (e.g., OVH, etc)

2. In your $100/month figure, did you decide to go with dedicated servers or the "cloud" VPS line? If the latter, was there any particular reason over going with the bare-metal offerings?

3. Are you making use of Hetzner's U.S. servers as well or is everything currently in Europe (or vice-versa)?

4. Was there any particular reason for choosing B2 and SQS as opposed to self-hosting object-storage on the SX servers?

Normally, I wouldn't even wonder why someone wouldn't want the burden of more infrastructure. But given the choice of going with relatively unmanaged Hetzner servers, presumably self-hosting clickhouse, etc, and then with your compute provider also happening to offer fairly large storage servers on the cheap, I might've been tempted to cut out the additional providers and DIY it:

- less costly for large amounts of data

- zero lock-in [1]

- fewer companies to deal with

  - likely better negotiating power with Hetzner when the time comes if a bigger percentage of your overhead is with them as opposed to spread out across three providers

  - fewer points of failure; if the Hetzner servers are down, I would assume you're in trouble anyway, so perhaps keeping [most] of your eggs on the same network might not be as bad as it sounds

  - presumably better latency and bandwidth + the ability to communicate over a private network [2]

5. I see the license is AGPL. But I don't see the usual "you must dual-license all contributions under MIT/BSD/ISC as well [so that only we can re-license the project]" nor "before contributing, sign this agreement transferring copyright [and your first born child]".

Was this just an oversight, or do you intend to be one of the few SaaS companies that really truly is open-source rather than "open-source" [until people are locked-in] and then going "open"-core? If the latter, then awesome -- cool to see.

6. Any regrets, disasters, or lessons learned so far? Usually, I find these stories the most interesting but unfortunately too few are willing to share.

---

[1]: I know B2 provides a relatively standard, at this point, S3-compatible API and everything as well. But I think there is also still something to be said about a somewhat Juche-esque approach to infrastructure, wherein should prices rise, contracts change, service degrades, or whatever else, you'd have the ability to almost immediately switch at a moment's notice to literally anyone else who can lease you a box with some hard drives or any colo provider.

[2]: This goes out the window somewhat if you're using the VPS line and American servers, though.


Really interesting question(s)! Will try to answer the gist.

First, the system is designed for users to be able to drop in whatever hardware they want. So your questions are around my initial deployment options. These will certainly change in the future.

I chose Hetzner because of cost. I will probably end up using other providers in the future but Hetzner let me begin this experiment without burning my runway.

I use metal servers for Clickhouse, and small cloud boxes and LBs for the API. It happens I'm using US-based cloud servers because that's where my users are.

I'm using B2 + SQS because I did not want to take on sysadmin for those components. They are not performance-sensitive. Using B2 at $0.006/GB vs managing minio on an SX server at $0.001/GB was acceptable to me :)

Why use regular servers for the API instead of fly.io? Because the API writes data to disk, and then bulk loads it to clickhouse. This means I needed durable and reliable disk, which you can only have with actual VMs. I didn't want the process to be randomly SIGKILL'd since the process shuts down safely to avoid data loss, and I didn't want the risk of ephemeral storage. So to control all this I have to run the HTTP servers myself.
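
Concretely, the thing I need is to catch an orderly shutdown signal and flush in-flight batches before exiting - roughly like this sketch (a SIGKILL, by contrast, can't be caught at all):

  package main

  import (
    "net/http"
    "os"
    "os/signal"
    "syscall"
  )

  func main() {
    srv := &http.Server{Addr: ":8080"}
    go srv.ListenAndServe() // ingest API

    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

    <-stop             // orderly shutdown requested
    srv.Close()        // stop accepting new inserts
    flushPendingData() // stand-in for draining on-disk batches into Clickhouse
  }

  func flushPendingData() {}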

I might experiment with a PaaS for this, but it was easy enough to just set up an init script to run the daemon.

re: negotiating power - that conversation only happens when I have volume, at which point I will be able to negotiate with any provider.

re: points of failure/latency - this will continually change: the deployment strategy for today will be different than when I'm managing 100s of TB of data across thousands of users. The main focus is to make the system flexible to handle different topologies and to be able to change providers by updating a configuration.

Thank you!


TiDB is an HTAP database whose OLAP component (TiFlash) was based on Clickhouse: https://news.ycombinator.com/item?id=23584022

If you have analyzed the competition, what are your selling points? Benchmarks welcome. Thank you!


I haven't seen TiDB before - thanks for sharing. I honestly haven't spent tons of time analyzing other databases (there are so many!) but I haven't seen much out there which tries to be "Firebase for OLAP" - that is, make it so easy that a developer who maybe doesn't have time to learn Clickhouse can be effective with it. That's the most important goal of this project: lower the barrier for people who need to use a columnar data structure to be able to do so [without thinking too much about it] and get back to solving their engineering problems.


TiFlash is designed for TiDB users, seamlessly moving between inserting row data and querying column storage.


ScratchDB has saved my business and it's awesome. I think if you need a columnar store, you should really try these guys.


Thank you - really excited about what you're building too!



