Meltano: ELT for the DataOps era (meltano.com)
112 points by bargavas on March 1, 2021 | 33 comments



They have the right idea; it's a real need and a major opportunity. Airbyte[1] is a few steps ahead. I also expect the Airflow community to wade into these waters at some point.

1. https://airbyte.io/


Another one on the market is pipelinewise.

Meltano and pipelinewise are both ways to orchestrate Singer.io taps and targets, while Airbyte is its own thing.
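For anyone unfamiliar with the mechanics: a Singer tap writes JSON messages to stdout and a target reads them from stdin, so at its core a pipeline is just a Unix pipe that an orchestrator wires up and schedules. A minimal sketch in Python (the tap/target names and config paths are hypothetical examples):

    import subprocess

    # A Singer pipeline is fundamentally: tap stdout -> target stdin.
    tap = subprocess.Popen(
        ["tap-postgres", "--config", "tap_config.json"],
        stdout=subprocess.PIPE,
    )
    target = subprocess.Popen(
        ["target-jsonl", "--config", "target_config.json"],
        stdin=tap.stdout,
        stdout=subprocess.PIPE,
    )
    tap.stdout.close()  # let the tap get SIGPIPE if the target exits early
    state = target.communicate()[0]  # targets emit updated STATE to stdout
    print(state.decode())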

EDIT: I also don't know how the just-released Airbyte can be ahead of something that's surely been in production for a while.


Singer.io seems to run the data plane on json -- is there an Apache Arrow equiv for this ecosystem/problem, or something like it?


I'm not aware of any. I did just open this issue[0] in the Meltano project to open discussion with the team/community. It could be an interesting iteration on the Singer Spec[1] if we find that users are interested in it and it helps solve some bottleneck challenges.

[0] https://gitlab.com/meltano/meltano/-/issues/2616 [1] https://github.com/singer-io/getting-started/blob/master/doc...


Yeah, we don't push an ETL pipeline through JSON unless we have to (and generally can't). Most ETL-scale data engineering we do is Arrow/Parquet/ORC/Protobuf etc., plus slow legacy ODBC/JSON streams that we turn into typed, compact data. I think JSON is fine for command/metadata layers, especially early on, but out-of-the-box foundations for the data plane are pretty core to what I look for in an ETL/streaming tool.

The good news is the implicitly typed JSON examples look Arrow-friendly, so users who don't care about data speed/quality (e.g., when prototyping) can to/from_json and not think about it. There may be other data-engineering-friendly formats that would work too.
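For example, a quick sketch of that prototyping path, assuming a recent pyarrow (the types here are inferred from the JSON values; a real implementation would derive an explicit Arrow schema from the Singer SCHEMA message instead):

    import json
    import sys

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Batch Singer RECORD messages into an Arrow table; pyarrow infers
    # column types from the JSON values, which is fine for prototyping.
    records = []
    for line in sys.stdin:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"])

    table = pa.Table.from_pylist(records)
    pq.write_table(table, "stream.parquet")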

Prefect, Dask, and friends solve it by abstracting over it: you can send whatever you want, and it happens to be friendly to dataframes (PyData) / compact, typed data. But these projects seem to be more about source/sink, so encouraging structure by default would be helpful...


AFAIK Meltano uses JSON only in the interface between a tap (source) and a target, to communicate schema, state and records.

It's up to the target what it does with the JSON messages it receives, so you can for example have a target-avro that takes JSON records and outputs them as an Avro file and translates the JSON schema to the corresponding Avro schema.
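Such a target-avro is hypothetical, but a rough sketch of its core loop could look like this, assuming the fastavro package and handling only flat records with primitive types:

    import json
    import sys

    from fastavro import parse_schema, writer

    # Map JSON Schema primitive types to Avro types.
    JSON_TO_AVRO = {"string": "string", "integer": "long",
                    "number": "double", "boolean": "boolean"}

    avro_schema, records = None, []
    for line in sys.stdin:
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            fields = []
            for name, prop in msg["schema"]["properties"].items():
                t = prop.get("type", "string")
                if isinstance(t, list):  # taps often emit ["null", "string"]
                    t = next((x for x in t if x != "null"), "string")
                fields.append({"name": name,
                               "type": JSON_TO_AVRO.get(t, "string")})
            avro_schema = parse_schema(
                {"type": "record", "name": msg["stream"], "fields": fields})
        elif msg["type"] == "RECORD":
            records.append(msg["record"])

    with open("out.avro", "wb") as fo:
        writer(fo, avro_schema, records)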


Yep, definitely.

Then the holy grail would be to have a bunch of taps-targets running in parallel for a single pipeline, each working on a subset of streams.
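As a rough sketch of what that could look like today, under the assumption that writing a per-stream catalog and running one tap|target pair per stream is an acceptable way to partition the work (tap/target names and stream names are hypothetical):

    import concurrent.futures
    import json
    import subprocess

    def run_stream(stream_name):
        # Write a per-stream catalog so each worker syncs one stream only.
        with open("catalog.json") as f:
            catalog = json.load(f)
        catalog["streams"] = [s for s in catalog["streams"]
                              if s["stream"] == stream_name]
        path = f"catalog_{stream_name}.json"
        with open(path, "w") as f:
            json.dump(catalog, f)
        # Each worker is its own Unix pipe: tap stdout -> target stdin.
        cmd = (f"tap-postgres --config tap.json --catalog {path}"
               f" | target-jsonl --config target.json")
        return subprocess.run(cmd, shell=True).returncode

    streams = ["users", "orders", "events"]  # hypothetical stream names
    with concurrent.futures.ThreadPoolExecutor() as pool:
        print(list(pool.map(run_stream, streams)))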


I'm currently using Stitch (https://www.stitchdata.com/) to move data from a few sources to Redshift.

Since Stitch, Meltano and Pipelinewise all use Singer.io taps and targets under the hood, I wonder if there's any reason to choose one over the other?


It depends on your use case, but the three examples you've listed here each have a slightly different approach to the market. Stitch (as mentioned in a comment below) is SaaS only. Pipelinewise is open source, but as far as I know there are no plans to build a company around it. With Meltano, we're aiming to grow the project and community and eventually build a business around it, in a similar manner to what GitLab has done. Our docs[0] have more information about our current focus and roadmap if you're curious.

[0] https://meltano.com/docs/#focus


Stitch is SaaS only. That by nature makes it suitable for some uses and unsuitable for others (like when you have a provision that your data can't leave your network, or when you're in a company where adding a new vendor isn't a quick process).

Meltano and Pipelinewise are open-source projects that someone built for themselves and is now sharing. You can just start playing with them and change the code or whatever, but there's no support to pay for.

From where I'm standing, the best option would be "Stitch I can self-host for free for a PoC, then eventually engage the vendor about a support contract while still self-hosting for security reasons", but there doesn't seem to be anything like that.


Easy -- Meltano has not been in production. They are both relatively new tools.


The GitLab Data Team is running Meltano in production[0]. We're currently extracting Zoom data with it and have plans for several more extractors (Slack, GMail, PTO by Roots, EdCast, and a few more). I just made this MR[1] to update the list of Extractors to include Zoom too.

[0] https://about.gitlab.com/handbook/business-ops/data-team/pla... [1] https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_request...


And the GitLab Data Team is not alone! The Meltano Slack community (link on the homepage) is about 800 strong right now, and every day we've got people discussing their production deployments and helping new users set up their own.

PS. Like Taylor, I'm on the Meltano team at GitLab.


This has been called "ETL" for as long as I can remember. Could someone explain why they're using "ELT"?


While there are technical differences with transforming the data within its storage location/database, I've interacted with my former employer's board of directors enough to know there's a bigger, rather obnoxious reason for this:

VC valuations in Silicon Valley for traditional "ETL" tools aren't great. The term "ETL" carries the baggage of Talend and other big enterprise tools that have an aura of "not being user friendly".

"ELT" is intended to signal that a tool is trying really hard to not be like those sad, difficult to use, dirty "ETL" tools.

That's the real reason. There really isn't an actual difference in how the tools work. Traditional ETL tools are perfectly capable of loading the data and initiating the transform in the database it's being stored in. This has been done for decades.

I have been repeatedly lectured by senior strategy team executives about using the proper vernacular to ensure perception of value is maximized in the trendy, buzzword driven VC world. So fucking stupid.


"Transform" step is happening after the load to data warehouse here. https://blog.getdbt.com/what--exactly--is-dbt-/ is a pretty good overview.


ELT means: let's focus on moving data from system A to system B and skip the transform part, which is the business of whoever has access to the target system. That's usually a SQL database, so you can do arbitrary data transforms there.

In other words, just get the data in house into a queryable form and worry about the T part later.
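A toy illustration of that split, using SQLite as a stand-in warehouse (assumes a SQLite build with the JSON1 extension): land the raw payloads first, then express the T as SQL that can change any time without re-extracting from the source.

    import json
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_events (payload TEXT)")
    rows = [
        {"id": 1, "amount": "19.99", "country": "de"},
        {"id": 2, "amount": "5.00", "country": "us"},
    ]
    # The EL part: load the data as-is, no transforms yet.
    con.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [(json.dumps(r),) for r in rows])

    # The T part: just a query over the raw data, done in the warehouse.
    con.execute("""
        CREATE TABLE events AS
        SELECT json_extract(payload, '$.id')                    AS id,
               CAST(json_extract(payload, '$.amount') AS REAL)  AS amount,
               UPPER(json_extract(payload, '$.country'))        AS country
        FROM raw_events
    """)
    print(con.execute("SELECT * FROM events").fetchall())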


Marketing. There's been a huge marketing push of tools designed to automate the SQL portion of the data pipeline.

It doesn't make much sense to me, because anyone who's been in the data space quickly sees the rampant marketing of different tools that all do the same thing but simply change the order of the pipeline.


Keeping a raw version of the data makes it easier to change in the future. If you use S3 as a data lake, you can transform it later into whatever shape you want. You can also use tools like Presto, where the schema is applied on read instead of hardcoded, so the transform step moves to another time.


This article[0] from Census is a good primer on the difference and what's driving it. But the TL;DR is that storage is cheap and it's easier to change SQL-based transformations using something like dbt[1] than to re-extract the data when the business logic changes.

[0] https://blog.getcensus.com/dbt-the-etl-elt-disrupter/ [1] https://www.getdbt.com/


Looks like it will integrate with Airflow and dbt. I like to see this kind of synergy in the open-source community.


You're correct, it currently does integrate with Airflow and dbt. I'm a big fan of both, having used them basically every day for the past 3 years. The dbt community has done wonders for me personally and professionally, and we're hoping to build upon the dbt community (and others) to include open source data integration.


Any plans to integrate with dagster for pipeline orchestration (instead of airflow)?


Yep! We have this issue[0] open. I'm also interested in adding Prefect and GitLab CI as orchestrators.

[0] https://gitlab.com/meltano/meltano/-/issues/2393


Before I try yet another ETL tool: how does this work with datasets that don't come from third-party providers like Salesforce etc.? I've had to build ETL pipelines for highly customized datasets, either row-level or XML, with what I'd call tricky code, since the nesting and flows were not so simple and a lot of data was missing.

How would Meltano or the other mentioned tools handle this?

Examples are EDIFACT, FHIR, or BDT.


We're working on an SDK[0] for building taps that should make it much easier to build to the Singer spec with all of the features out of the box. In theory, if you can write some Python against whatever you're pulling data out of, then it can work within the Singer ecosystem and Meltano. It's nearly ready to go, but we'd love feedback if you decide to test it out!

[0] https://gitlab.com/meltano/singer-sdk
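In the meantime, a hand-rolled tap under the plain Singer spec is also pretty small. A minimal sketch using the singer-python package, where fetch_rows is a hypothetical stand-in for your custom EDIFACT/FHIR/XML extraction logic:

    import singer

    # Declare the stream's shape once, as JSON Schema.
    schema = {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        }
    }

    def fetch_rows():
        # Hypothetical stand-in for your custom extraction logic;
        # anything that yields dicts matching the schema works.
        yield {"id": 1, "name": "alpha"}
        yield {"id": 2, "name": "beta"}

    # Emit SCHEMA then RECORD messages to stdout, per the Singer spec.
    singer.write_schema("my_stream", schema, key_properties=["id"])
    singer.write_records("my_stream", list(fetch_rows()))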


Okay, fair enough. But then I guess it doesn't offer anything extra for us, as we already have a self-made ETL platform where we add a new process with just an extra file. Thanks!


I can speak for the Singer taps this is based on for the EL bits. Our Postgres ETL works quite well. Logical replication broke down quite quickly, but primary keys + updated_at keys are working very well for us. I can't speak about XML though.
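For the curious, the key-based incremental pattern boils down to something like this sketch (psycopg2, with hypothetical table/column names): select only rows past the last bookmark, and persist the new bookmark for the next run.

    import psycopg2

    def emit(row):
        print(row)  # stand-in for handing the row to a target

    def sync(conn, bookmark):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload, updated_at FROM orders"
                " WHERE updated_at > %s ORDER BY updated_at",
                (bookmark,))
            for row in cur:
                emit(row)
                bookmark = row[2]  # advance the bookmark as rows stream in
        return bookmark  # persist (e.g. as Singer STATE) for the next run

    conn = psycopg2.connect("dbname=app")
    print(sync(conn, "2021-01-01T00:00:00Z"))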


Why did logical replication break down for you?


Great question... I don't really know! We were using Stitch and couldn't really get a good answer.


There's a good intro to this project from last summer or so on the Data Engineering Podcast. It's actually a project within GitLab, IIRC.


Yep - here's the link to the podcast [0]. And you're correct, it is a project within GitLab. For the past year or so there's been one person on the project, but there are now two (today is when I switch from the GitLab Data team to Meltano), and a third will be joining us in a week!

[0] https://www.dataengineeringpodcast.com/meltano-data-integrat...


I read through the homepage and I'm not sure what value-add this has over Airflow. I must be completely misinterpreting the software.



