Show HN: Melchi – Open-Source Snowflake to DuckDB Replication with CDC Support (github.com/ryanwith)
6 points by ryanwaldorf 12 hours ago | 12 comments
Hey Hacker News! I built Melchi, an open-source tool that handles Snowflake to DuckDB replication with proper CDC support. I'd love your feedback on the approach and potential use cases.

*Why I built it:* When I worked at Redshift, I saw two common scenarios that were painfully difficult to solve: teams needed to query and join data from other organizations' Snowflake instances with their own data stored in a different warehouse type, or they wanted to experiment with different warehouse technologies but found the overhead of building and maintaining data pipelines too high. With DuckDB's growing popularity for local analytics, I built Melchi to make warehouse-to-warehouse data movement simpler.

*How it works:*

* Supports three CDC strategies: standard streams (full change tracking), append-only streams (insert-only tables), and full refresh
* Handles schema matching and type conversion automatically
* Manages all the change tracking metadata
* Uses DataFrames for efficient data movement instead of CSV dumps
* Provides transactional consistency with automatic rollback on failures
* Processes data in configurable batch sizes for memory efficiency
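
To make the standard-stream path concrete, here's a rough sketch of what a single replication cycle could look like. This is illustrative only, not Melchi's actual code; the table, stream, and key names (orders, orders_stream, order_id) are placeholders:

```python
# Hypothetical sketch of one standard-stream cycle -- not Melchi's actual code.
# Table/stream/key names (orders, orders_stream, order_id) are placeholders.
import duckdb
import snowflake.connector

sf = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="YOUR_WAREHOUSE",
)
duck = duckdb.connect("output/local.duckdb")

# Pull pending changes from the Snowflake stream into a DataFrame.
changes = sf.cursor().execute(
    "select * from streams.orders_stream"
).fetch_pandas_all()

duck.register("changes", changes)
duck.execute("begin transaction")
try:
    # A stream reports updates as DELETE + INSERT pairs, so apply deletes first.
    duck.execute("""
        delete from orders where order_id in (
            select order_id from changes where "METADATA$ACTION" = 'DELETE')
    """)
    duck.execute("""
        insert into orders
        select * exclude ("METADATA$ACTION", "METADATA$ISUPDATE", "METADATA$ROW_ID")
        from changes where "METADATA$ACTION" = 'INSERT'
    """)
    duck.execute("commit")
except Exception:
    duck.execute("rollback")
    raise
```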

Quick setup example:

```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT}
  warehouse: YOUR_WAREHOUSE
  change_tracking_schema: streams

target:
  type: duckdb
  database: output/local.duckdb
```

*Current limitations:*

* Geography/Geometry columns are not supported with standard streams (a Snowflake limitation)
* Primary keys must be defined in Snowflake, or a row ID will be auto-generated (see the sketch below for declaring one)
* All tables must be replaced together when modifying the transfer configuration
* Tables with identical schema and column names cannot be replicated into DuckDB, even from different Snowflake databases
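
On the primary-key point, declaring one in Snowflake is a single (informational) DDL statement. A hedged example via the Python connector, with placeholder account, table, and column names:

```python
# Hypothetical example: declare an informational primary key in Snowflake so
# replication can key on it instead of an auto-generated row ID.
# Account, table, and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
)
conn.cursor().execute(
    "alter table analytics.public.orders add primary key (order_id)"
)
```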

*Questions for the community:*

1. What use cases do you see for this kind of tool?
2. What features would make this more useful for your workflow?
3. Any concerns about the approach to CDC?
4. What other source/target databases would be valuable to support?

GitHub: https://github.com/ryanwith/melchi
Discord: https://discord.gg/bTg9kJ92

Looking forward to your thoughts and feedback!






Really awesome project. This would be even better if some of the differential storage functionality made its way into duckdb.

https://motherduck.com/blog/differential-storage-building-bl...

Unsure if it's even possible, but if there was a way to write out (simulate) the streams to WAL files, you might be able to accomplish the same without having to consolidate the duckdb file every time.

A couple of other ideas that may or may not diverge from your project's purpose or simplicity...

It's also too bad that duckdb isn't included in the out-of-the-box Snowpark packages; that would potentially allow you to run all of this in a Snowflake procedure/DAG and persist the duckdb file to object storage. You could of course achieve this with Snowflake containers. (But this would probably ruin any plans to support Redshift/BigQuery/etc. as sources.)

If the source tables were Iceberg, then you could have an even simpler duckdb file to produce/maintain that would just wrap views around the Iceberg tables using the duckdb iceberg extension.

    create view x as
    select *
    from iceberg_scan('metadata_file');
If you're going to enable change tracking on tables, then perhaps another solution is to use CHANGES to select the data from the table instead of reading streams. That way you could use the same timestamp across all the selects and get better consistency between your materializations to duckdb.

    set ts = {last_materialization_ts};
    select * from X CHANGES(INFORMATION => DEFAULT) AT(timestamp => $ts);
    select * from Y CHANGES(INFORMATION => DEFAULT) AT(timestamp => $ts);

    -- triggered task to kick off materialization to duckdb
    create task M
    when
        SYSTEM$STREAM_HAS_DATA('X_STREAM') AND
        SYSTEM$STREAM_HAS_DATA('Y_STREAM') AND ...
    AS
        CALL MY_MATERIALIZE(); -- send SNS, invoke ext-func, etc.

Here's another similar (but different) project that I've been tracking: https://github.com/buremba/universql

This looks interesting. Are there any plans to support Snowflake external browser auth?

Would that be to support SSO? If so, I haven't planned it yet but I could. Would that make this more useful to you?

It should be just as easy as adding:

    authenticator="externalbrowser"
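
For context, a minimal sketch with the Python connector (account and user are placeholders); the SSO login then opens in the default browser:

    # Minimal sketch; account and user values are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="me@example.com",
        authenticator="externalbrowser",
    )
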
Adding a Snowflake connection configuration option that allows for standard Snowflake connection conventions might be a good approach. That way you could connect to Snowflake with your existing configurations (.snowsql/, .snowflake/), or explicitly specify one by adding matching args/params to your project's config.

    # myconf.toml
    [test-connection]
    account = "mysfaccount"
    authenticator = "externalbrowser"
    ...

    # config/config.yaml
    source:
        type: snowflake
        connection:
            name: test-connection
            file_path: ./myconf.toml
        change_tracking_database: melchi_cdc_db
        change_tracking_schema: streams

    sf.connect(connection_name="test-connection", connections_file_path=Path("./myconf.toml").resolve())

Thank you! Will take a look at this over the next few days

What other local databases did you consider, and why did you choose DuckDB? Google tells me a lot of other people stream CDC pipelines to DuckDB, but I'm not familiar enough with it to know what makes it such a compelling choice.

I wanted to start with duckdb since it's an incredibly powerful tool that people should try out. The performance you can get on analytical queries running on your local compute is really impressive, and with snowflake streams you can stream live data into it without changing anything about your existing data. As for why not other databases: I wanted to focus on OLAP to start, since there are already great tools like DLT that help you load data from OLTP sources like postgres and mysql into OLAP targets, but OLAP to OLAP is pretty rare.

Have you run into a use case for streaming data between data warehouses yourself yet? If so which warehouses?


The whole idea is pretty nuts! I can imagine it being used as the dev env for teams that use SQLMesh and can thus port the SQL. Might be worth investigating with them.

That's really interesting! Could you tell me a bit more about what you're thinking? I'm not the most familiar with SQLMesh and the typical workflows there.

Perhaps similar to https://github.com/duckdb/dbt-duckdb , but SQLMesh instead of DBT obviously.

Ah gotcha! Do you have a use case where you'd look to remodel/transform the data between warehouses?

Not the original parent, so unsure of their use-case. But I've seen the approach where some/basic development can be done on duckdb, before making its way to dev/qa/prd.

Something like your project might enable grabbing data (subsets of data) from a dev environment (seed) for offline, cheap (no SF warehouse cost) development/unit-testing, etc.



