Meltano: ELT for the DataOps era (meltano.com)
112 points by bargavas on March 1, 2021 | 33 comments



They have the right idea; it's a real need and a major opportunity. Airbyte[1] is a few steps ahead. I also expect the Airflow community to wade into these waters at some point.

1. https://airbyte.io/


Another one on the market is pipelinewise.

Meltano and pipelinewise are both ways to orchestrate Singer.io taps and targets, while Airbyte is its own thing.
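For anyone unfamiliar with the mechanics: a Singer tap writes JSON messages to stdout and a target reads them from stdin, so at its core a pipeline is just a Unix pipe that an orchestrator wires up and schedules. A minimal sketch in Python (the tap/target names and config paths are hypothetical examples):

    import subprocess

    # A Singer pipeline is fundamentally: tap stdout -> target stdin.
    tap = subprocess.Popen(
        ["tap-postgres", "--config", "tap_config.json"],
        stdout=subprocess.PIPE,
    )
    target = subprocess.Popen(
        ["target-jsonl", "--config", "target_config.json"],
        stdin=tap.stdout,
        stdout=subprocess.PIPE,
    )
    tap.stdout.close()  # let the tap get SIGPIPE if the target exits early
    state = target.communicate()[0]  # targets emit updated STATE to stdout
    print(state.decode())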

EDIT: I also don't know how the just-released Airbyte can be ahead of something that's surely been in production for a while.


Singer.io seems to run the data plane on json -- is there an Apache Arrow equiv for this ecosystem/problem, or something like it?


I'm not aware of any. I did just open this issue[0] in the Meltano project to open discussion with the team/community. It could be an interesting iteration on the Singer Spec[1] if we find that users are interested in it and it helps solve some bottleneck challenges.

[0] https://gitlab.com/meltano/meltano/-/issues/2616 [1] https://github.com/singer-io/getting-started/blob/master/doc...


Yeah, we don't push an ETL pipeline through JSON unless we have to (and generally can't). Most ETL-scale data engineering we do is Arrow/Parquet/ORC/Protobuf etc., plus slow legacy ODBC/JSON streams that we turn into typed, compact data. I think JSON is fine for command/metadata layers, especially early on, but out-of-the-box foundations for the data plane are pretty core to what I look for in an ETL/streaming tool.

The good news is the implicitly typed JSON examples look Arrow-friendly, so users who don't care about data speed/quality (e.g., when prototyping) can to/from_json and not think about it. There may be other data-engineering-friendly formats that would work too.
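For example, a quick sketch of that prototyping path, assuming a recent pyarrow (the types here are inferred from the JSON values; a real implementation would derive an explicit Arrow schema from the Singer SCHEMA message instead):

    import json
    import sys

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Batch Singer RECORD messages into an Arrow table; pyarrow infers
    # column types from the JSON values, which is fine for prototyping.
    records = []
    for line in sys.stdin:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"])

    table = pa.Table.from_pylist(records)
    pq.write_table(table, "stream.parquet")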

Prefect, Dask, and friends solve it by abstracting over it: you can send whatever you want, and it happens to be friendly to dataframes (PyData) / compact, typed data. But these projects seem to be more about source/sink, so encouraging structure by default would be helpful...


AFAIK Meltano uses JSON only in the interface between a tap (source) and a target, to communicate schema, state and records.

It's up to the target what it does with the JSON messages it receives, so you can for example have a target-avro that takes JSON records and outputs them as an Avro file and translates the JSON schema to the corresponding Avro schema.
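Such a target-avro is hypothetical, but a rough sketch of its core loop could look like this, assuming the fastavro package and handling only flat records with primitive types:

    import json
    import sys

    from fastavro import parse_schema, writer

    # Map JSON Schema primitive types to Avro types.
    JSON_TO_AVRO = {"string": "string", "integer": "long",
                    "number": "double", "boolean": "boolean"}

    avro_schema, records = None, []
    for line in sys.stdin:
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            fields = []
            for name, prop in msg["schema"]["properties"].items():
                t = prop.get("type", "string")
                if isinstance(t, list):  # taps often emit ["null", "string"]
                    t = next((x for x in t if x != "null"), "string")
                fields.append({"name": name,
                               "type": JSON_TO_AVRO.get(t, "string")})
            avro_schema = parse_schema(
                {"type": "record", "name": msg["stream"], "fields": fields})
        elif msg["type"] == "RECORD":
            records.append(msg["record"])

    with open("out.avro", "wb") as fo:
        writer(fo, avro_schema, records)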


Yep, definitely.

Then the holy grail would be to have a bunch of taps-targets running in parallel for a single pipeline, each working on a subset of streams.
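As a rough sketch of what that could look like today, under the assumption that writing a per-stream catalog and running one tap|target pair per stream is an acceptable way to partition the work (tap/target names and stream names are hypothetical):

    import concurrent.futures
    import json
    import subprocess

    def run_stream(stream_name):
        # Write a per-stream catalog so each worker syncs one stream only.
        with open("catalog.json") as f:
            catalog = json.load(f)
        catalog["streams"] = [s for s in catalog["streams"]
                              if s["stream"] == stream_name]
        path = f"catalog_{stream_name}.json"
        with open(path, "w") as f:
            json.dump(catalog, f)
        # Each worker is its own Unix pipe: tap stdout -> target stdin.
        cmd = (f"tap-postgres --config tap.json --catalog {path}"
               f" | target-jsonl --config target.json")
        return subprocess.run(cmd, shell=True).returncode

    streams = ["users", "orders", "events"]  # hypothetical stream names
    with concurrent.futures.ThreadPoolExecutor() as pool:
        print(list(pool.map(run_stream, streams)))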


I'm currently using Stitch (https://www.stitchdata.com/) to move data from a few sources to Redshift.

Since Stitch, Meltano and Pipelinewise all use Singer.io taps and targets under the hood, I wonder if there's any reason to choose one over the other?


It depends on your use case, but the three examples you've listed here each have a slightly different approach to the market. Stitch (as mentioned in a comment below) is SaaS only. Pipelinewise is open source, but as far as I know there are no plans to build a company around it. With Meltano, we're aiming to grow the project and community and eventually build a business around it, in a similar manner to what GitLab has done. Our docs[0] have more information about our current focus and roadmap if you're curious.

[0] https://meltano.com/docs/#focus


Stitch is SaaS only. That by nature makes it suitable for some uses and unsuitable for others (like when you have a provision that your data can't leave your network, or when you're in a company where adding a new vendor isn't a quick process).

Meltano and Pipelinewise are open-source projects that someone built for themselves and is now sharing. You can just start playing with them and change the code or whatever, but there's no support to pay for.

From where I'm standing, the best option would be "Stitch I can self-host for free for a PoC, then eventually engage the vendor about a support contract while still self-hosting for security reasons", but there doesn't seem to be anything like that.


Easy -- Meltano has not been in production. They are both relatively new tools.


The GitLab Data Team is running Meltano in production[0]. We're currently extracting Zoom data with it and have plans for several more extractors (Slack, GMail, PTO by Roots, EdCast, and a few more). I just made this MR[1] to update the list of Extractors to include Zoom too.

[0] https://about.gitlab.com/handbook/business-ops/data-team/pla... [1] https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_request...


And the GitLab Data Team is not alone! The Meltano Slack community (link on the homepage) is about 800 strong right now, and every day we've got people discussing their production deployments and helping new users set up their own.

PS. Like Taylor, I'm on the Meltano team at GitLab.


This has been called "ETL" for as long as I can remember. Could someone explain why they're using "ELT"?


While there are technical differences with transforming the data within its storage location/database, I've interacted with my former employer's board of directors enough to know there's a bigger, rather obnoxious reason for this:

VC valuations in Silicon Valley for traditional "ETL" tools aren't great. The term "ETL" carries the baggage of Talend and other big enterprise tools that have an aura of "not being user friendly".

"ELT" is intended to signal that a tool is trying really hard to not be like those sad, difficult to use, dirty "ETL" tools.

That's the real reason. There really isn't an actual difference in how the tools work. Traditional ETL tools are perfectly capable of loading the data and initiating the transform in the database it's being stored in. This has been done for decades.

I have been repeatedly lectured by senior strategy team executives about using the proper vernacular to ensure perception of value is maximized in the trendy, buzzword driven VC world. So fucking stupid.


"Transform" step is happening after the load to data warehouse here. https://blog.getdbt.com/what--exactly--is-dbt-/ is a pretty good overview.


ELT means: let's focus on moving data from system A to system B and skip the transform part, which is the business of whoever has access to the target system. That's usually a SQL database, so you can do arbitrary data transforms there.

In other words, just get the data in house into a queryable form and worry about the T part later.
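A toy illustration of that split, using SQLite as a stand-in warehouse (assumes a SQLite build with the JSON1 extension): land the raw payloads first, then express the T as SQL that can change any time without re-extracting from the source.

    import json
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_events (payload TEXT)")
    rows = [
        {"id": 1, "amount": "19.99", "country": "de"},
        {"id": 2, "amount": "5.00", "country": "us"},
    ]
    # The EL part: load the data as-is, no transforms yet.
    con.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [(json.dumps(r),) for r in rows])

    # The T part: just a query over the raw data, done in the warehouse.
    con.execute("""
        CREATE TABLE events AS
        SELECT json_extract(payload, '$.id')                    AS id,
               CAST(json_extract(payload, '$.amount') AS REAL)  AS amount,
               UPPER(json_extract(payload, '$.country'))        AS country
        FROM raw_events
    """)
    print(con.execute("SELECT * FROM events").fetchall())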


Marketing. There's been a huge marketing push of tools designed to automate the SQL portion of the data pipeline.

It doesn't make much sense to me, because anyone who's been in the data space quickly sees the rampant marketing of different tools that all do the same thing but simply change the order of the pipeline.


Keeping a raw version of the data makes it easier to change in the future. If you use S3 as a data lake, you can transform it later into whatever shape you want. You can also use tools like Presto, where the schema is applied on read instead of hardcoded, so the transform step moves to another time.


This article[0] from Census is a good primer on the difference and what's driving it. But the TL;DR is that storage is cheap and it's easier to change SQL-based transformations using something like dbt[1] than to re-extract the data when the business logic changes.

[0] https://blog.getcensus.com/dbt-the-etl-elt-disrupter/ [1] https://www.getdbt.com/


Looks like it will integrate with Airflow and dbt. I like to see this kind of synergy in the open-source community.


You're correct, it currently does integrate with Airflow and dbt. I'm a big fan of both, having used them basically every day for the past 3 years. The dbt community has done wonders for me personally and professionally, and we're hoping to build upon the dbt community (and others) to include open source data integration.


Any plans to integrate with dagster for pipeline orchestration (instead of airflow)?


Yep! We have this issue[0] open. I'm also interested in adding Prefect and GitLab CI as orchestrators.

[0] https://gitlab.com/meltano/meltano/-/issues/2393


Before I try yet another ETL tool: how does this work with datasets that don't come from third-party providers like Salesforce etc.? I've had to build ETL pipelines for highly customized datasets, either row-level or XML, with what I'd call tricky code, since the nesting and flows were not so simple and a lot of data was missing.

How would Meltano or the other mentioned tools handle this?

Examples are EDIFACT, FHIR, or BDT.


We're working on an SDK[0] for building taps that should make it much easier to build to the Singer spec with all of the features out of the box. In theory, if you can write some Python against whatever you're pulling data out of, then it can work within the Singer ecosystem and Meltano. It's nearly ready to go, but we'd love feedback if you decide to test it out!

[0] https://gitlab.com/meltano/singer-sdk
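In the meantime, a hand-rolled tap under the plain Singer spec is also pretty small. A minimal sketch using the singer-python package, where fetch_rows is a hypothetical stand-in for your custom EDIFACT/FHIR/XML extraction logic:

    import singer

    # Declare the stream's shape once, as JSON Schema.
    schema = {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        }
    }

    def fetch_rows():
        # Hypothetical stand-in for your custom extraction logic;
        # anything that yields dicts matching the schema works.
        yield {"id": 1, "name": "alpha"}
        yield {"id": 2, "name": "beta"}

    # Emit SCHEMA then RECORD messages to stdout, per the Singer spec.
    singer.write_schema("my_stream", schema, key_properties=["id"])
    singer.write_records("my_stream", list(fetch_rows()))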


Okay, fair enough. But then I guess it doesn't offer anything extra for us, as we already have a self-made ETL platform where we add a new process with just an extra file. Thanks!


I can speak for the Singer taps this is based on for the EL bits. Our Postgres ETL works quite well. Logical replication broke down quite quickly, but primary keys + updated_at keys are working very well for us. I can't speak about XML though.
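For the curious, the key-based incremental pattern boils down to something like this sketch (psycopg2, with hypothetical table/column names): select only rows past the last bookmark, and persist the new bookmark for the next run.

    import psycopg2

    def emit(row):
        print(row)  # stand-in for handing the row to a target

    def sync(conn, bookmark):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload, updated_at FROM orders"
                " WHERE updated_at > %s ORDER BY updated_at",
                (bookmark,))
            for row in cur:
                emit(row)
                bookmark = row[2]  # advance the bookmark as rows stream in
        return bookmark  # persist (e.g. as Singer STATE) for the next run

    conn = psycopg2.connect("dbname=app")
    print(sync(conn, "2021-01-01T00:00:00Z"))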


Why did logical replication break down for you?


Great question... I don't really know! We were using Stitch and couldn't really get a good answer.


There's a good intro to this project from last summer or so on the Data Engineering Podcast. It's actually a project within GitLab, IIRC.


Yep - here's the link to the podcast [0]. And you're correct, it is a project within GitLab. For the past year or so there's been one person on the project, but there are now two (today is when I switch from the GitLab Data team to Meltano), and a third will be joining us in a week!

[0] https://www.dataengineeringpodcast.com/meltano-data-integrat...


I read through the homepage and I'm not sure what value-add this has over Airflow. I must be completely misinterpreting the software.



