Show HN: Melchi – Open-Source Snowflake to DuckDB Replication with CDC Support (github.com/ryanwith)
6 points by ryanwaldorf 12 hours ago | 12 comments
Hey Hacker News! I built Melchi, an open-source tool that handles Snowflake to DuckDB replication with proper CDC support. I'd love your feedback on the approach and potential use cases.

*Why I built it:* When I worked at Redshift, I saw two common scenarios that were painfully difficult to solve: teams needed to query and join data from other organizations' Snowflake instances with their own data stored in a different warehouse type, or they wanted to experiment with different warehouse technologies but found the overhead of building and maintaining data pipelines too high. With DuckDB's growing popularity for local analytics, I built Melchi to make warehouse-to-warehouse data movement simpler.

*How it works:*

* Supports three CDC strategies: standard streams (full change tracking), append-only streams (insert-only tables), and full refresh
* Handles schema matching and type conversion automatically
* Manages all the change tracking metadata
* Uses DataFrames for efficient data movement instead of CSV dumps
* Provides transactional consistency with automatic rollback on failures
* Processes data in configurable batch sizes for memory efficiency
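
To make the standard-stream path concrete, here's a rough sketch of what a single replication cycle could look like. This is illustrative only, not Melchi's actual code; the table, stream, and key names (orders, orders_stream, order_id) are placeholders:

```python
# Hypothetical sketch of one standard-stream cycle -- not Melchi's actual code.
# Table/stream/key names (orders, orders_stream, order_id) are placeholders.
import duckdb
import snowflake.connector

sf = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="YOUR_WAREHOUSE",
)
duck = duckdb.connect("output/local.duckdb")

# Pull pending changes from the Snowflake stream into a DataFrame.
changes = sf.cursor().execute(
    "select * from streams.orders_stream"
).fetch_pandas_all()

duck.register("changes", changes)
duck.execute("begin transaction")
try:
    # A stream reports updates as DELETE + INSERT pairs, so apply deletes first.
    duck.execute("""
        delete from orders where order_id in (
            select order_id from changes where "METADATA$ACTION" = 'DELETE')
    """)
    duck.execute("""
        insert into orders
        select * exclude ("METADATA$ACTION", "METADATA$ISUPDATE", "METADATA$ROW_ID")
        from changes where "METADATA$ACTION" = 'INSERT'
    """)
    duck.execute("commit")
except Exception:
    duck.execute("rollback")
    raise
```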

Quick setup example:

```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT}
  warehouse: YOUR_WAREHOUSE
  change_tracking_schema: streams

target:
  type: duckdb
  database: output/local.duckdb
```

*Current limitations:*

* Geography/Geometry columns are not supported with standard streams (a Snowflake limitation)
* Primary keys must be defined in Snowflake, or a row ID will be auto-generated (see the sketch below for declaring one)
* All tables must be replaced together when modifying the transfer configuration
* Tables with identical schema and column names cannot be replicated into DuckDB, even from different Snowflake databases
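
On the primary-key point, declaring one in Snowflake is a single (informational) DDL statement. A hedged example via the Python connector, with placeholder account, table, and column names:

```python
# Hypothetical example: declare an informational primary key in Snowflake so
# replication can key on it instead of an auto-generated row ID.
# Account, table, and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
)
conn.cursor().execute(
    "alter table analytics.public.orders add primary key (order_id)"
)
```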

*Questions for the community:*

1. What use cases do you see for this kind of tool?
2. What features would make this more useful for your workflow?
3. Any concerns about the approach to CDC?
4. What other source/target databases would be valuable to support?

GitHub: https://github.com/ryanwith/melchi
Discord: https://discord.gg/bTg9kJ92

Looking forward to your thoughts and feedback!






Really awesome project. This would be even better if some of the differential storage functionality made its way into duckdb.

https://motherduck.com/blog/differential-storage-building-bl...

Unsure if it's even possible, but if there was a way to write out (simulate) the streams to WAL files, you might be able to accomplish the same without having to consolidate the duckdb file every time.

A couple of other ideas that may or may not diverge from your project's purpose or simplicity...

It's also too bad that duckdb isn't included in the out-of-the-box Snowpark packages; that would potentially allow you to run all of this in a Snowflake procedure/DAG and persist the duckdb file to object storage. You could of course achieve this with Snowflake containers. (But this would probably ruin any plans to support Redshift/BigQuery/etc. as sources.)

If the source tables were Iceberg, then you could have an even simpler duckdb file to produce/maintain that would just wrap views around the Iceberg tables using the duckdb iceberg extension.

    create view x as
    select *
    from iceberg_scan('metadata_file');
If you're going to enable change tracking on tables, then perhaps another solution is to use CHANGES to select the data from the table instead of reading streams. That way you could use the same timestamp across all the selects and get better consistency between your materializations to duckdb.

    set ts = {last_materialization_ts};
    select * from X CHANGES(INFORMATION => DEFAULT) AT(timestamp => $ts);
    select * from Y CHANGES(INFORMATION => DEFAULT) AT(timestamp => $ts);

    -- triggered task to kick off materialization to duckdb
    create task M
    when
        SYSTEM$STREAM_HAS_DATA('X_STREAM') AND
        SYSTEM$STREAM_HAS_DATA('Y_STREAM') AND ...
    AS
        CALL MY_MATERIALIZE(); -- send SNS, invoke ext-func, etc.

Here's another similar (but different) project that I've been tracking: https://github.com/buremba/universql

This looks interesting. Are there any plans to support Snowflake external browser auth?

Would that be to support SSO? If so, I haven't planned it yet but I could. Would that make this more useful to you?

It should be just as easy as adding:

    authenticator="externalbrowser"
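
For context, a minimal sketch with the Python connector (account and user are placeholders); the SSO login then opens in the default browser:

    # Minimal sketch; account and user values are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="me@example.com",
        authenticator="externalbrowser",
    )
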
Adding a Snowflake connection configuration option that allows for standard Snowflake connection conventions might be a good approach. That way you could connect to Snowflake with your existing configurations (.snowsql/, .snowflake/), or explicitly specify one by adding matching args/params to your project's config.

    # myconf.toml
    [test-connection]
    account = "mysfaccount"
    authenticator = "externalbrowser"
    ...

    # config/config.yaml
    source:
        type: snowflake
        connection:
            name: test-connection
            file_path: ./myconf.toml
        change_tracking_database: melchi_cdc_db
        change_tracking_schema: streams

    sf.connect(connection_name="test-connection", connections_file_path=Path("./myconf.toml").resolve())

Thank you! Will take a look at this over the next few days

What other local databases did you consider, and why did you choose DuckDB? Google tells me a lot of other people stream CDC pipelines to DuckDB, but I'm not familiar enough with it to know what makes it such a compelling choice.

I wanted to start with duckdb since it's an incredibly powerful tool that people should try out. The performance you can get on analytical queries running on your local compute is really impressive, and with snowflake streams you can stream live data into it without changing anything about your existing data. As for why not other databases: I wanted to focus on OLAP to start, since there are already great tools like DLT that help you load data from OLTP sources like postgres and mysql into OLAP targets, but OLAP to OLAP is pretty rare.

Have you run into a use case for streaming data between data warehouses yourself yet? If so which warehouses?


The whole idea is pretty nuts! I can imagine it being used as the dev env for teams that use SQLMesh and can thus port the SQL. Might be worth investigating with them.

That's really interesting! Could you tell me a bit more about what you're thinking? I'm not the most familiar with SQLMesh and the typical workflows there.

Perhaps similar to https://github.com/duckdb/dbt-duckdb , but SQLMesh instead of DBT obviously.

Ah gotcha! Do you have a use case where you'd look to remodel/transform the data between warehouses?

Not the original parent, so unsure of their use-case. But I've seen the approach where some/basic development can be done on duckdb, before making its way to dev/qa/prd.

Something like your project might enable grabbing data (subsets of data) from a dev environment (seed) for offline, cheap (no SF warehouse cost) development/unit-testing, etc.



