Hey hacker news! I built Melchi, an open-source tool that handles Snowflake to DuckDB replication with proper CDC support. I'd love your feedback on the approach and potential use cases.
*Why I built it:*
When I worked on Redshift I saw two common scenarios that were painfully difficult to solve: teams needed to query and join data from other organizations' Snowflake instances with their own data stored in a different type of warehouse, or they wanted to experiment with different warehouse technologies but the overhead of building and maintaining data pipelines was too high. With DuckDB's growing popularity for local analytics, I built this to make warehouse-to-warehouse data movement simpler.
*How it works:*
* Supports three CDC strategies: standard streams (full change tracking), append-only streams (insert-only tables), and full refresh
* Handles schema matching and type conversion automatically
* Manages all the change tracking metadata
* Uses DataFrames for efficient data movement instead of CSV dumps
* Provides transactional consistency with automatic rollback on failures
* Processes data in configurable batch sizes for memory efficiency (rough sketch of a single sync below)
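To make those bullets concrete, here's a minimal, hypothetical sketch of what one stream-based sync batch roughly looks like: read pending changes from a Snowflake stream into pandas DataFrame batches and apply them to DuckDB inside a transaction. The table, stream, and column handling are placeholders, not Melchi's actual internals.
```python
# Hypothetical sketch only -- names and SQL are illustrative, not Melchi's code.
import duckdb
import snowflake.connector

sf = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="YOUR_WAREHOUSE", database="YOUR_DB", schema="YOUR_SCHEMA",
)
duck = duckdb.connect("output/local.duckdb")

cur = sf.cursor()
# A plain SELECT does not advance the stream's offset; a real sync consumes
# the stream by using it in a DML statement inside a Snowflake transaction.
cur.execute("SELECT * FROM my_table_stream")

duck.execute("BEGIN TRANSACTION")
try:
    # fetch_pandas_batches() yields DataFrames in chunks, keeping memory bounded.
    for batch in cur.fetch_pandas_batches():
        # METADATA$ACTION / METADATA$ISUPDATE mark each row as an insert,
        # a delete, or one half of an update; a real sync applies deletes
        # (keyed on the primary key) before inserts. Inserts only, for brevity.
        inserts = batch[batch["METADATA$ACTION"] == "INSERT"]
        duck.register("inserts_df", inserts)
        # Assumes my_table already exists in DuckDB with a matching schema.
        duck.execute("""
            INSERT INTO my_table
            SELECT * EXCLUDE ("METADATA$ACTION", "METADATA$ISUPDATE", "METADATA$ROW_ID")
            FROM inserts_df
        """)
    duck.execute("COMMIT")
except Exception:
    duck.execute("ROLLBACK")
    raise
```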
Quick setup example:
```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT}
  warehouse: YOUR_WAREHOUSE
  change_tracking_schema: streams
target:
  type: duckdb
  database: output/local.duckdb
```
*Current limitations:*
* Geography/Geometry columns not supported with standard streams (Snowflake limitation)
* Primary keys must be defined in Snowflake (or a row ID will be auto-generated)
* All tables must be replaced together when modifying transfer configuration
* Cannot replicate tables with identical schema/column names into DuckDB, even from different Snowflake databases
*Questions for the community:*
1. What use cases do you see for this kind of tool?
2. What features would make this more useful for your workflow?
3. Any concerns about the approach to CDC?
4. What other source/target databases would be valuable to support?
GitHub: https://github.com/ryanwith/melchi
Discord: https://discord.gg/bTg9kJ92
Looking forward to your thoughts and feedback!
https://motherduck.com/blog/differential-storage-building-bl...
Unsure if it's even possible, but if there were a way to write out (simulate) the streams to WAL files, you might be able to accomplish the same thing without having to consolidate the duckdb file every time.
A couple of other ideas that may or may not diverge from your project's purpose or simplicity...
It's also too bad that duckdb isn't included in the out-of-the-box Snowpark packages; that would potentially let you run all of this in a Snowflake procedure/DAG and persist the duckdb file to object storage. You could of course achieve this with Snowflake containers. (But this would probably ruin any plans to support Redshift/BigQuery/etc. as sources.)
If the source tables were Iceberg, you could produce/maintain an even simpler duckdb file that just wraps views around the Iceberg tables using the duckdb iceberg extension.
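For what it's worth, a rough sketch of that view-wrapping approach with DuckDB's iceberg extension (the S3 path is a placeholder and httpfs/credential setup is omitted):
```python
# Rough sketch: the "replica" becomes a view over the Iceberg table rather
# than copied rows. Path is a placeholder; httpfs/credential setup omitted.
import duckdb

con = duckdb.connect("output/local.duckdb")
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("""
    CREATE OR REPLACE VIEW orders AS
    SELECT * FROM iceberg_scan('s3://your-bucket/warehouse/orders');
""")
print(con.execute("SELECT count(*) FROM orders").fetchall())
```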
If you're going to enable change tracking on tables anyway, then perhaps another option is to use CHANGES to select the data from the table instead of reading streams. That way you could use the same timestamp across all the selects and get better consistency between your materializations to duckdb. Here's another similar (but different) project that I've been tracking: https://github.com/buremba/universql
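To illustrate that suggestion (not something Melchi does today, as far as I know): the CHANGES clause lets you pin every table's delta to the same point in time. Table names and the offset are placeholders, and change tracking has to be enabled on each table first.
```python
# Illustrative only: select change sets for several tables at one shared
# timestamp via Snowflake's CHANGES clause. Names/offsets are placeholders.
import snowflake.connector

sf = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="YOUR_WAREHOUSE", database="YOUR_DB", schema="YOUR_SCHEMA",
)
cur = sf.cursor()

# One offset shared by every table keeps the DuckDB materializations
# consistent with each other.
last_sync = "2024-01-01 00:00:00"

for table in ["orders", "customers", "line_items"]:
    cur.execute(f"""
        SELECT *
        FROM {table}
          CHANGES (INFORMATION => DEFAULT)
          AT (TIMESTAMP => '{last_sync}'::timestamp_tz)
    """)
    changes = cur.fetch_pandas_all()
    # ...apply `changes` to DuckDB as in the sync sketch above...
```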