Mara: A lightweight ETL framework, halfway between plain scripts and Airflow

cosmie · on May 9, 2018

I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company with a combination of Bash, Python, and Node scripts duct-taped together. Super fragile, but effective[3]. It wasn't until about 3 years in (and 5x the initial revenue and volume) that it started having growing pains. Every time I tried to evaluate solutions like Airflow[1] or Luigi[2], there was just so much involved with getting it going reliably and migrating things over that it just wasn't worth the effort[4].

This seems like a refreshingly opinionated solution that would have fit my use case perfectly.

[1] https://airflow.apache.org/

[2] https://github.com/spotify/luigi

[3] The operational complexity of real-time, distributed architectures is non-trivial. You'd be amazed how far some basic bash scripts running on cron jobs will take you.

[4] I was a one man data management/analytics/BI team for the first two years, not a dedicated ETL resource with time to spend weeks getting a PoC based on Airflow or Luigi running. When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.

vijucat · on May 9, 2018

> When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.

"An engineer is one who, when asked to make a cup of tea, come up with a device to boil the ocean".

Source: unknown. I think it was in Grady Booch's OO book, or some such book.

Also: "architect astronaut" and "Better is the enemy of good" come to mind...

IncRnd · on May 9, 2018

I think the reference to impossible things is far older than Grady Booch.

  To talk of many things:
      Of shoes--and ships--and sealing-wax--
  Of cabbages--and kings--
  And why the sea is boiling hot--
  And whether pigs have wings
  ~ Lewis Caroll, 1832-1898

SmirkingRevenge · on May 9, 2018

You might take a look at Luigi again, it is pretty simple, once you realize its "scheduler" is really just a global task lock to prevent the same task from running on multiple machines or processes, at the same time.

The most onerous configuration you might have to do for a minimal setup, is to make sure your workers know the scheduler url, and that the scheduler is accessible to the workers. And of course, you have to do the actual scheduling yourself (e.g. cron).

That aside, there are few bits of infrastructure that are so quick and painless to stand up, IMHO. A database isn't even required (though desirable for task history).

This Mara project looks really cool too, though.

mattbillenstein · on May 9, 2018

I started on Luigi and loved the simplicity of it (checkpointing completed tasks on the filesystem instead of a db), but the telemetry in the Airflow webui is more or less essential imho.

SmirkingRevenge · on May 10, 2018

The simplicity of Luigi is great, but I did find myself in a spot fairly quickly, where the features of the airflow scheduler/webui were really desirable, over the rather ad hoc nature of Luigi. But after using Airflow a bit, I found myself really missing some of Luigi's simple niceties. I became pretty annoyed with Airflows operational complexity and its overall lack of emphasis on idempotent/atomic jobs, at least when compared with Luigi.

To me, Luigi wins when it comes to atomic/idempotent operations and simplicity. It loses on scheduling and visibility. Today I'm still trying to figure out what the better trade off is.

I think I really want something with Luigi like abstractions (tasks + targets) but with the global scheduling and visibility of airflow.

andscoop · on May 10, 2018

What sort of operational complexities did you run into when using Airflow?

In regards to the idempotency of workflows, so much of that comes down to how you develop your DAG files. Having read through both sets of the docs, they both pay lip service to idempotent workflows, but doing the heavy lifting of making your workflows idempotent is up to you.

SmirkingRevenge · on May 10, 2018

Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs. All those workers need every library or app that any of your dags require. Each of these things are not necessarily big deals on their own, but it still all adds up to a sizable investment in time, complexity and cost to get it up and running.

You'll probably also want some sort of shared storage to deploy dags. And then you have to come up with a deployment procedure/practice to make sure that dags aren't upgraded piecemeal while they are running. To be fair though, that is a problem with luigi (or any distributed workflow system, probably).

Luigi, IMHO, is more "fire and forget" when it comes to atomicity/idempotency, because wherever possible, it relies on medium of the output target for those guarantees. The target class, ideally, abstracts all that stuff away, and the logic of a task can often be re-used with a variety of input/output targets. I can easily write one base task, and not care whether the output target is going to be a local filesystem, a remote host (via ssh/scp/sftp/ftp) or google cloud storage or s3. With airflow I always feel like I'm invoking operators like "CopyLocalFileToABigQueryTableButFTPItFirstAndAlsoWriteItToS3" (I'm exaggerating, but still... :P).

reinhardt · on May 10, 2018

> Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs.

All these are supported but the scheduler is pretty much the only requirement.

Source: been running Airflow for the last two years without a worker cluster, without having celery/rabbitmq installed and sometimes without even an external database (i.e. a plan sqlite file).

swalsh · on May 9, 2018

In the environment I work in, our ETL is all over the place. We have steps on SQL Server, steps in hadoop, steps that need to be reviewed by a real person, some SAP, and a bunch of other technologies.

The solution we came up with is a decentralized solution with a centralized log. Basically the entire graph is in the SQL server, which every technology knows how to talk too. Then we have multiple "runners" (one runner per technology we use) which just ask the sql server for what's next. It also has a system for storing system states caused by events not in the system. It's very simple, and has been very robust.

BerislavLopac · on May 9, 2018

This might be useful: https://cloud.google.com/blog/big-data/2018/05/cloud-compose...

cosmie · on May 9, 2018

That looks like a nifty service, but we were an AWS shop and the bandwidth intensity of ETL would have made a GCP-hosted service cost prohibitive. That said, at least once a month I tried to convince our CTO to let me move our data workloads over to GCP due to all the nifty managed services they have available.

I was primarily munging data between FTP drops, S3, RDS, and Redshift, which mainly fell into free buckets for internal data transfer.

chimerasaurus · on May 9, 2018

Disclaimer: I'm the PM for Composer. :)

If the cost of Composer is an issue, ping me. Running a static environment _does_ have a cost, but for serious ETL it should be pretty inexpensive all things considered. You _should_ be able to use Airflow (in GCP or anywhere else) to call on other services, like S3/Redshift to operate without moving the data through Airflow, keeping network tx low.

If it's network traffic for the actual data moving to and from, that's unfortunately an artifact of how public clouds price.

occams_chainsaw · on May 9, 2018

QQ, why is it called Google cloud composer? I totally would've named it Gooflow

chimerasaurus · on May 9, 2018

My key advice for any new PM:

Every Dilbert cartoon about naming a product is true.

andscoop · on May 10, 2018

Engineer at Astronomer.io here. We offer Airflow as a managed cloud service as well as an affordably priced Enterprise Edition to run on your own infrastructure wherever you'd like. Check us out - and feel free to reach out to me personally if you have any questions.

mattbillenstein · on May 9, 2018

I did multi-cloud doing the data stuff in GCP (mostly GCS and BigQuery) and the rest in EC2 -- costs weren't really an issue, but I guess if you're moving TB's around daily, that's the problem?

cosmie · on May 9, 2018

Not quite TB's daily, but close. We were in B2B lead generation, and a lot of my ETL workloads involved heavy text normalization and standardization, then source layering to ultimately stitch together as complete and accurate of a record as possible based on the heuristics we had available.

Providers of that type of data essentially live in a world of "dump dataset to csv[1] periodically, place csv onto the FTP account for whomever is paying us for it currently". No deltas for changed or net new records, no per customer formatting requests, nothing. So the entire thing had to be re-processed every single time from every single vendor and then upserted into our master data.

[1] Hell, usually not even technical information was provided like the character encoding the data was stored or exported at or whether it's using database style character escapes (any potential special character is escaped with a backslash) or csv-style escapes (everything is interpreted as a literal except for a double-quote, which is escaped with a second double-quote).

dragonwriter · on May 9, 2018

Tangentially, I've usually seen “multicloud" for that and “hybrid cloud” for combining remote cloud services with on-premises resources in a blended system.

mattbillenstein · on May 9, 2018

Agree, multicloud is a better term for this.

jameslk · on May 9, 2018

Did you consider using Singer by chance (https://www.singer.io)? I'm wondering how that compares

mattbillenstein · on May 9, 2018

re points 3 and 4, you can use airflow in a non-distributed manner to just run bash jobs (bash running python in my case). The telemetry it collects and web interface give you a lot of visibility that you don't get with cron and plain bash jobs.

I started with Airflow in this capacity -- maybe taking a day or two to set it up figuring I'd add celery when we needed it. It's been maybe 1.5 years and task queues may never be needed; we can always just buy a bigger VM on ec2 as it's only 8-core/15GB atm.

antoncohen · on May 9, 2018

I'm sure this is right for someone, everyone has different requirements, but I don't really want a lighter-weight Airflow. I want an Airflow that runs and scales in the cloud, has extensive observability (monitoring, tracing), has a full API, and maybe some clear way to test workflows.

I was looking into how Google's Cloud Composer is run, which is a managed Airflow service. They use gcsfuse to mount a directory for logs, because Airflow insists on writing logs to local disk with no cleanup system, even if you configure logs to be sent to S3/GCS. To health check the scheduler they query Stackdriver Logging to see if has logged anything in the last five minutes, because the scheduler has no /healthz or other way to check health. There is no built it way to monitor workflows, so you can't easily do something like graph failures by workflow, email on failure is about all you get. A GUI-first app that requires local storage is not what I expect these days.

mariusae · on May 9, 2018

> I want an Airflow that runs and scales in the cloud

I'd encourage you to look at Reflow [1] which takes a different approach: it's entirely self-managing: you run Reflow like you would a normal programming language interpreter ("reflow run myjob.rf") and Reflow creates ephemeral nodes that scale elastically and that tear themselves down, only for the purpose of running the program.

> has extensive observability (monitoring, tracing)

Reflow includes a good amount of observability tools out of the box; we're also working on integrating tracing facilities (e.g., reporting progress to Amazon x-ray).

> has a full API, and maybe some clear way to test workflows.

Reflow's approach to testing is exactly like any other programming language: you write modules that can either be used in a "main" program, or else be used in tests.

[1] https://github.com/grailbio/reflow

KirinDave · on May 9, 2018

I just want an airflow that runs reliably. My experience with it is extremely unreliable results.

phunge · on May 9, 2018

+1 to this, I've had multiple teams that have shied away from airflow because of its operability & deployment story, despite needing something that has all its features.

In terms of radically different takes on workflow engines, I'm very interested in reflow. I haven't used it enough to know if the rough edges are a deal breaker.

andscoop · on May 10, 2018

> I want an Airflow that runs and scales in the cloud, has extensive observability (monitoring, tracing), has a full API, and maybe some clear way to test workflows.

I am an Engineer over at Astronomer.io and we are working on making Airflow the best piece of software it can be. We offer both Enterprise and Cloud editions. We are solving exactly the problems you describe above.

Take a look and feel free to reach out if you have any questions or feedback.

KHPatel · on May 9, 2018

if I may, have a look at Streamsets (https://streamsets.com/) We looked at their open source version in our research and it seems pretty comprehensive. Not sure if it's a simple single click deploy on your cloud provider, but if it's deployed on a Spark cluster or with Mesos, I'm sure you've got what you need.

occams_chainsaw · on May 9, 2018

There is no built it way to monitor workflows, so you can't easily do something like graph failures by workflow

Not sure if this meets your criteria, but the tree view allows you to see a historical graph of job runs and status

notamy · on May 9, 2018

What exactly IS an "ETL framework"? I looked at both this project and Apache Airflow, and I'm not quite sure I understand...

shadowmint · on May 9, 2018

Basically software for batch processing a tonne of data.

eg. importing a SAP feed into a database, or loading a bunch of csv files, or like processing a bunch of images...

...anything where you have to convert data from some source through a series of steps (typically a DAG) into some useful output.

However, its often misused.

For example, if you have a trivial amount of data, or trivial process an ETL is over engineered cruft where a simple script would do.

So there are many (rubbish) ‘simple ETL frameworks’ which offer zero value and technical complexity for no benefit.

Unless you need multiple servers processing data through multiple steps and you need the auditing and process control... you probably don’t need an ETL, just a simple script.

cosmie · on May 9, 2018

> you probably don’t need an ETL, just a simple script.

+1

> Unless you need multiple servers processing data through multiple steps and you need the auditing and process control

I'll stress the "multiple servers" part. You can add in a substantial amount of multiple, sequential steps and auditing and process control in a simple script. The part that adds orders of magnitude worth of complexity and operational overhead and points of failure is being able to distribute it to multiple servers. Distributed architectures are operationally and architecturally expensive. And far more often than not, completely unnecessary for a given use case.

bduerst · on May 9, 2018

Some ETLs these days are simple scripts - e.g. Spark, Dataflow - with their config files.

cosmie · on May 9, 2018

Being able to define an ETL workload within a simple script is not the same as your ETL system itself being a simple script.

While I love both Spark and Dataflow, both of them are incredibly complex distributed systems with very high operational costs. Someone, somewhere is paying a lot of money to have an operational resource maintain that complexity. Whether you have an internal devops resource doing so or you're using a managed service, you're paying for that complexity somehow. And, for a lot of workloads, you aren't actually getting any more value than you would from standing up a ~$50/month standard Debian/Ubuntu server and a set of simple scripts on it.

Mironor · on May 10, 2018

You don't have to have a cluster to run spark scripts, setting master to `local` (and running it on one machine) is often enough for small anounts of data.

bduerst · on May 10, 2018

They don't have high operational costs - you can run them as a script on your local machine. You're making them out to be more complex than they really are.

debacle · on May 9, 2018

> batch processing

Not necessary. ETL these days can be streamed, realtime, etc.

abd12 · on May 9, 2018

To me, it's easiest to think of it as "Make, for data". Your ETL pipelines are often a complex graph of dependencies. If some step halfway down the chain fails, you don't want to restart at the very beginning -- you want to restart where it failed and keep moving.

The ETL frameworks (Airflow, Luigi, now Mara) help with this, allowing you to build dependency graphs in code, determine which dependencies are already satisfied, and process those which are not. They'll usually contain helper code for common ETL tasks, such as interacting with a database, writing to/reading from S3, or running shell scripts.

martin_loetzsch · on May 9, 2018

(author here).

Mara data integration is indeed a glorified version of Make (with cost based scheduling and lots of visualizations)

Diederich · on May 9, 2018

https://en.wikipedia.org/wiki/Extract,_transform,_load

thibaut_barrere · on May 9, 2018

(I’m the author of a popular Ruby ETL framework named Kiba ETL)

An « ETL » framework is for me something that will allow you to express and run maintainable data pipelines, as a coder.

Kiba is exactly that, in the sense that it does not include built-in sources/transforms/destinations, but rather defines a set of guidelines you can follow to easily implement reusable ETL components with high quality & unit tests.

More information in the README: https://github.com/thbar/kiba/blob/master/README.md

me_bx · on May 11, 2018

+1 for the maintainability of an ETL framework brings, compared to "handmade" scripting...

slenk · on May 9, 2018

See the other comments also, but ETL stands for "Extract, Transform, Load"

KirinDave · on May 9, 2018

Let me give you a quippy answer:

Cron is awful for cloud-based deployments trying to keep data from multiple sources in sync and compute business-specific metrics over them.

ETL frameworks seek to address this pain by building a large stack of software around scheduling the running of and unifying the writing of these data munging tasks.

As such, they tend to be as awful as the sum of their dependencies minus some small constant quality factor. So, more awful than any one part but less awful than the storm you'd be howling into if you didn't take some sort of infrastructural solution.

collyw · on May 9, 2018

Extract Transform Load.

Getting data from one place to another - usually the end place being a structured database (or it ought to be).

If you have been in the industry over 10 years you probably used Perl or Python scripts.

porker · on May 9, 2018

What is there in the ETL space with bi-directional sync?

I don't usually run into problems where "transfer data from X to Y" is it. Usually it's "there's data in CRM X and data in Event system Y, merge the two keeping X as the master source"

There's Mulesoft et al but they seem overkill for small deployments, as well as being stupidly expensive [1].

1. I'm sure they're good value if you're an enterprise company. But if you don't need the UI builder and only 2-3 sources kept in sync, they are expensive. And with my reading of the documentation, conflict resolution isn't great either.

ptrott2017 · on May 9, 2018

Kettle - the open source component of Hitachi Pentaho Data Integration is worth looking at, has some functionality for this (you can join sources and insert joined data back into master) and its pretty easy to extend to meet requirements. Its Apache licensed, with great commercial support if you need it, and can be found here:

https://github.com/pentaho/pentaho-kettle

We are small shop but it is a key part of our data workflows. We also found spoon the UI workflow builder to be very helpful for building workflows - since its allowed team members who were not strong in Java to build workflows that they need. Last but not least and a key decider in our adoption is the community is super friendly and very helpful. Something we thought would take months got implemented in a weekend thanks to quick feedback. Perhaps worth a look.

karambahh · on May 9, 2018

As a long time Kettle user (probably close to 10 years) I must warn potential users that the learning curve is steep and that (as any large body of code) it contains code that sometimes can run unpredictably. I got good at diagnosing user induced bugs in PDI transformation via reading the stacktrace but it is not to everyone liking.

To me, a very strong regression is the "new" UI which switched from meaningful icons to a blue & white scheme that makes reading/discovering new transformation a real pain: all is a blur of blue without the past color cues that you learned in the past ("ok, this is the icon for a merge from a source file & a database sent to an ES cluster" became "some stuff is read from blue sources and sent to some blue output")

I recently learned about the capability to run transformation into a spark cluster that replace the original engine by a new spark implementation, bring obvious compute optimization for large enough dataset but I don't have enough experience with it to speak of it positively or negatively.

ptrott2017 · on May 9, 2018

@karmbahh - good to know. I've used Kettle for about 9 months in production and so far its been pretty solid - but we are not going that far off the beaten path for most things. Its a big app, but at least there is documentation and some great users who have been very helpful and the codebase is by and large very logically laid out.

We do use the Adaptive Execution Layer - but so far not with spark (we use it with our own processing engine) - its working well for us and its great we can switch engines as needed.

re: UI. I like a lot of new look and feel but I can see how it did lose some visual semantics and i can imagine any long term user would find the changes frustrating. I guess with coming to the tool much later, this has been less of an issue for me and we teak the presentation for our own workflows and plugins anyhow.

For us Kettle/Pentaho PDI is a great open source project but it will definitely be interesting to see how things evolve now Hitachi has acquired Pentaho.

dgudkov · on May 9, 2018

Try my EasyMorph (https://easymorph.com). It's pretty simple to use (we aim it at non-technical users) and has a decent free edition without limitations in time or data volume.

Lord_Zero · on May 10, 2018

We just paid for mulesoft, can confirm very expensive. But they worked with us and came down to what we could afford for what it's worth.

Mandatum · on May 11, 2018

I assume you work for a non-profit or educational organisation?

Have you used Mule in anger? Would love to hear how you're using it at the moment. There's not a lot of information about it around HN and my regular communities, so it's nice to see something I work with every day mentioned.

groodt · on May 9, 2018

I'm interested in hearing thoughts from people who've used digdag (https://www.digdag.io/) or pachyderm (http://www.pachyderm.io/). Pachyderm is the most interesting to me. It seems to be focussing on the data as well as the data processing.

jdoliner · on May 9, 2018

I'm the founder and a core developer of Pachyderm so I can weigh in on how it compares (there is, of course, some potential bias here). I also was at Airbnb around the time we released Airflow, so I got to see it being built up close and used the system it was replacing quite a bit as well.

I think it's fair to say that Mara and Airflow are both in the same category of DAG (directed acyclic graph) schedulers for Python; Python makes a ton of sense as the language to focus on as it's the de facto lingua franca for data science. I'd also put Luigi in that bucket, although I think Airflow has degraded its mind-share quite a bit. All of them are targeting the data pipeline use case, which is very well represented as a DAG, but the actual management of the data is left up to the user. They (Mara, Airflow, or Luigi) schedule tasks for you after all the tasks they depended on have completed, but you have to figure out where to store your data so that downstream tasks can find the data their upstream tasks outputted. At Airbnb we used HDFS as this storage layer, often with Hive or Presto on top. Storing in s3 is also a common pattern.

Pachyderm is also a DAG scheduler but we're a lot more prescriptive about where you store the data, and a lot less prescriptive about what languages and tools you use. Pachyderm ships with its own distributed filesystem (pfs) that we use for storage, it does a few things that other storage solutions can't do. In particular, it version controls your data and records "provenance" i.e. where data comes from. For example if you train a machine learning model then its provenance will be the data you used to train it. In terms of processing we're much less prescriptive, because we let users express their code as a Docker container rather than only having bindings for one language. So you can use anything that you can fit in a Docker container. Data is exposed to your code via the local filesystem, so regardless of language you have a very natural interface to your data: system calls on files.

Hope this helps understand the differences between the various systems and thanks for your interest in Pachyderm. Swing by our users slack channel [0] if you'd like some help getting started with it.

[0] http://slack.pachyderm.io/

Eridrus · on May 12, 2018

Random question:

I want to automate some workflows on my local machine, and besides the obvious of just writing a script, I am interested in a system where I could describe my workflow as a DAG and then have an easy (web?) UI where I could specify which DAG nodes have changed (e.g. my data pre-processing code) and have it automatically run all of the nodes that (recursively) depend on it as an input, while not executing those whose inputs have not changed.

I am passing around very little actual data between these jobs; they are mostly writing data to a (distributed) file system, so at most I need to pass some paths around.

Some of the stages require launching a remote job and polling to find out if it has completed.

Is there a good system for doing this? Now that I've described it I could probably hack it together with a command-line UI without too much difficulty, but having a pretty UI for launching and monitoring jobs would be great.

stadeschuldt · on May 9, 2018

Here is a conference talk presenting the framework and the ideas behind: https://youtu.be/GdtFuOah-5c

samuell · on May 9, 2018

Send a pull request to add it to https://github.com/pditommaso/awesome-pipeline/blob/master/R...

stadeschuldt · on May 9, 2018

occams_chainsaw · on May 9, 2018

Seems interesting, but I'll have to dismiss it for now due the total lack of tests

endlessvoid94 · on May 9, 2018

In my experience one of the places where every framework breaks down is when you must combine / reconcile multiple rows from multiple data sources to produce one row in a fact table.

Does there exist a "framework" that lets me do this simply?

martin_loetzsch · on May 9, 2018

(author here)

The mara example project [1] does exactly that. It combines PuPI download stats with Github repo activity data.

[1] https://github.com/mara/mara-example-project

endlessvoid94 · on May 9, 2018

Thanks! Just took a look.

The file directory structure is a bit confusing -- could you point me to the file that performs this transformation?

martin_loetzsch · on May 9, 2018

For example the PyPI download stats pipeline is here: https://github.com/mara/mara-example-project/tree/master/app...

The __init__.py contains the pipeline, and the rest is the SQL files that do the transformations

endlessvoid94 · on May 10, 2018

Thank you!

thalesmello · on May 9, 2018

As the one who implemented Airflow at my company, I understand how overwhelming it can be, with the DAGs, Operators, Hooks and other terminologies.

This looks like a good enough mid-term alternative. However, I have a few questions (which I couldn't find easily in the homepage, sorry if I skipped something):

- Do you have a way of persisting connection information? I saw an example of how to create a connection, but it isn't clear if the piece of code has to be loaded every time you execute the ETL

- How easy it is to implement new computation engines?

- Plans of creating a command line to make it easier to execute operations?

martin_loetzsch · on May 9, 2018

(author here)

Connection information is configured in code through [1], see [2] for an example.

It's very easy to run other workloads. Either by directly invoking Python functions from tasks or by writing own commands (operators)[3].

There is a command line. It's the interface for running from external schedulers (jenkins, cron)[4] & [5]

[1] https://github.com/mara/mara-db

[2] https://github.com/mara/mara-example-project/blob/master/app...

[3] https://github.com/mara/data-integration/blob/master/data_in...

[4] https://github.com/mara/data-integration/raw/master/docs/exa...

[5] https://github.com/mara/data-integration/raw/master/docs/exa...

neuromantik8086 · on May 9, 2018

Perhaps this is addressed elsewhere, but do you have any plans to support Common Workflow Language?

mooreds · on May 9, 2018

I'd love this, except for ruby. Anyone know of anything like this?

thibaut_barrere · on May 9, 2018

I wrote Kiba specifically for this.

I will share a good bunch of current use cases at Kaigi 2018 if you are interested!

kakoni · on May 9, 2018

Embulk (https://www.embulk.org)

- Define your datapipelines with yaml + ruby (using ruby proc filter)

- Functionality can be extended with ruby(or java) based plugins.

Been using this for few years, works beautifully.

busterarm · on May 9, 2018

https://github.com/thbar/kiba

This has been the standard for years

shadowmint · on May 9, 2018

...but why?

Just use airflow.

Things I want in an ETL:

[x] works at scale.

[x] simple to use.

[x] not written in python (eg. in go or rust)

[x] easy to scale (eg. in docker)

[ ] this.

busterarm · on May 9, 2018

https://github.com/thbar/kiba This is still my workhorse and has never let me down. I do a ton of ETL.

https://github.com/thbar/kiba-ex This looks interesting though.

thibaut_barrere · on May 9, 2018

Kiba author here - your comment made my day, so thanks!

busterarm · on May 9, 2018

:D Thank you!!!

samuell · on May 9, 2018

There are a few pipeline tools/frameworks written in Go: http://gopherdata.io/post/more_go_based_workflow_tools_in_bi...

(Full disclosure: I'm developing SciPipe)

liveoneggs · on May 9, 2018

[ ] has a way to test data connections before trying to use them

[ ] has a clean way to do releases

[ ] can handle more complex network setups (like found in docker envs)

stadeschuldt · on May 9, 2018

Python is more a glue in this case. Most operations are basically unix pipes pushing data in and out of Postgres

lasermike026 · on May 9, 2018

Go is nice, Rust is interesting, and Python is fine.

shadowmint · on May 9, 2018

python is slow and single threaded, which is a tangible problem for an ETL, and airflow specifically.

...airflow is good enough for other reasons, but that doesn’t mean ‘not python’ gets off the wish list.

SmirkingRevenge · on May 9, 2018

Generally the workflow framework ought to be doing very little compared to the jobs it orchestrates and executes. In the common use cases, airflow is often only a little more than an (distributed) executor/scheduler for bash commands.

Wouldn't the value of optimization at the framework level, be generally... very little?

samuell · on May 9, 2018

Not always. When the workflow tool is a central scheduling point for thousands of tasks fired off to other compute nodes (via some distribution mechanism), you can start to experience pain.

There is also a problem with python that it does not do thread-based concurrency, but only process-based ditto, where the processes need to communicate in a less robust way with each other, e.g. via HTTP.

Because of this, in our work (scheduling machine learning workflows on HPC clusters), we started see lost HTTP-connections between workers with Luigi when going over 64 workers, even though these workers were doing nothing else than keeping track of a job running on another compute node.

This lead us to use Go instead, with SciPipe, and it has none of these problems. Go being compiled also means we're catching a lot more stupid typos and such at compile time rather than at run time (like 7 days into a 7 day HPC job), which also makes our life easier.

Python is definitely quite a lot easier to write and read, but the robustness of compiled Go code is hard to beat, and I don't wish to go back to a python-based tool.

baq · on May 9, 2018

> There is also a problem with python that it does not do thread-based concurrency, but only process-based ditto, where the processes need to communicate in a less robust way with each other, e.g. via HTTP.

this is half true, and importantly, in a workflow tool, the part where python is perfectly capable of being multithreaded while waiting on IO (which releases the GIL) should be quite sufficient.

there are problems with python like its memory usage and with airflow specifically like its at times flaky scheduler (maybe that's fixed in newer versions), but multithreading shouldn't be one in this particular context.

artwr · on May 9, 2018

The scheduler has seen a lot of improvements in the past couple of releases, and release 1.10 should be coming out soon. (full disclosure: I work on Airflow)

SmirkingRevenge · on May 10, 2018

Great points. Scaling indefinitely will inevitably expose inefficiencies, even tiny ones. And even tiny inefficiencies will place a hard limit on your ability to scale.

cdancette · on May 9, 2018

python in airflow is just used to define your graph and trigger your jobs. I think it's a wonderful language to do this.

You can (and should) program your batch jobs in other languages than python if you have large amount of data of course.

lasermike026 · on May 9, 2018

Performance issues can be overcome with Cython or Pypy.

"Python is slow."

What do you mean by "slow" and what have you done to make Python code run faster? I find for most things Python is fast enough or performance can be improved to meet most compute and data demands.

In the future we will all be using Javascript. Language performance debates are moot.

riku_iki · on May 9, 2018

some other languages (e.g. java) don't need much untrivial effort from you to work fast.

occams_chainsaw · on May 9, 2018

Are you not using celery with airflow?

kenhwang · on May 9, 2018

You hit my wishlist almost exactly. A strongly typed, not java, ETL framework would be my dream.

Plus, works at scale and easy to scale almost becomes a non-issue because we wouldn't be choked by the slowness and bloat of python.

CMCDragonkai · on May 10, 2018

Funflow is written in Haskell https://www.tweag.io/posts/2018-04-25-funflow.html

JustSomeNobody · on May 9, 2018

> ...but why?

You ask this generally, but then you give a bunch of specific reasons why you don't want to use it.

wokwokwok · on May 9, 2018

yes, but it doesn't do any of the things on the list.

nostalgeek · on May 9, 2018

> [x] not written in python (eg. in go or rust)

Don't you need some kind of scripting capabilities though to create custom pipelines without having to recompile all the app each time you had a new one?

mariusae · on May 9, 2018

Reflow [1] is also well-suited for ETL workloads. It takes a different tack: it presents a DSL with data-flow semantics and first-class integration with Docker. The result is that you don't write graphs, instead you just write programs that, due to their semantics, can be automatically parallelized and distributed widely, all intermediate evaluations are memoized, and programs are evaluated in a fully incremental fashion:

[1] https://github.com/grailbio/reflow

leblancfg · on May 9, 2018

That looks very interesting indeed, and love to see some more development in that space.

Would anyone care to explain how it differs from Airflow? I had dismissed Airflow some months ago for my use case (Windows, and large number of dependencies were an issue at the time), but would still like to eventually migrate my ETL scripts to a solid framework sometime in the future.

artwr · on May 9, 2018

With a little bit of work you could probably trigger tasks on a Windows Machine, in particular if the work you need to do is mostly script and/or SQL based. The main limit is that Gunicorn for the Airflow webserver is not currently compatible with Windows, but the scheduler should work fine. I believe it might be able to run in the Windows Subsystem for Linux, but I don't think anyone has tested it as of yet. (Source: I am an Apache Airflow committer)

leblancfg · on May 9, 2018

Thanks! To be fair, I was looking at it from a single-user perspective, and was optimizing for:

    time it takes to setup and move my ETL to Airflow + time I will save in the future by using the framework

vs

    time I will spend fixing and maintaining my already-existing ETL scripts

so I dropped the notion after I hit the first major snag. But I should put a reminder to take a look back at it in a few months!

stadeschuldt · on May 9, 2018

Mara is designed to be simpler than Airflow. It positions itself between plain scripts and Airflow.

huac · on May 9, 2018

What features does it forgo to be 'simpler'?

artwr · on May 9, 2018

The main feature which jumps out at me, I think it schedules on a single node, where Airflow can orchestrate across multiple nodes. Airflow is also starting to implement more enterprise-y features, like role-based access control, which should be coming to the next release.

martin_loetzsch · on May 9, 2018

(author here)

That's absolutely correct. Mara uses Python's multiprocessing [1] to parallelize pipeline execution [2] on a single node so it doesn't need a distributed task queue. Beyond that (and visualization) it can't do much. In fact it doesn't even have a scheduler (you can use jenkins or cron for that).

[1] https://docs.python.org/3.6/library/multiprocessing.html

[2] https://github.com/mara/data-integration/blob/master/data_in...

frugalmail · on May 9, 2018

Just thought I would add:

If you want something more serious that supports better scale, realtime/streaming, is written in a statically typed language check out https://kylo.io/ it's added on to Apache NiFi (which was developed by our friends at the NSA)

reinhardt · on May 10, 2018

Why PostgreSQL only? The mara-DB dependency [1] claims to support more.

[1] https://github.com/mara/mara-db

martin_loetzsch · on May 10, 2018

(author here)

Currently there is a hard dependency to Postgres for the bookkeeping tables of mara. I'm working on dockerizing the example project to make the setup easier.

For ETL, Mysql, Postgres & SQL Server are supported (and it's easy to add more).

random4369 · on May 10, 2018

I'm a bit confused about this. What if the target is HDFS? Why this dependency on SQL databases for ETL?

samuell · on May 9, 2018

I really like the interactive terminal-based menu, that's neat!

tjr225 · on May 10, 2018

This is slightly weird...but I named my dog Mara: https://flic.kr/p/FtdRNX

martin_loetzsch · on May 10, 2018

Also weird: it's the name of a giant ugly guinea pig: https://en.wikipedia.org/wiki/Mara_(mammal)

soobrosa · on May 10, 2018

What's the least verbose/boilerplate-heavy tool in your experience?

We couldn't make it leaner than this (works well in production in scale). https://github.com/wunderlist/night-shift If we could get rid of Ruby in there (super useful for scripting) and fly with only Python I'd be the happiest person on Earth.

Also we started to go cloud agnostic, it handles both AWS and Azure. Do you something that does AWS, Azure _and_ Google Cloud also?

martin_loetzsch · on May 10, 2018

GNU Make is indeed the least verbose/ boilerplate-heavy tool and I use it for a lot of things.

The problem with Make is lacking acceptance amongst younger programmers who always want to work with the latest technologies.

nikolay · on May 10, 2018

How is this better than Singer [0]?

[0]: https://www.singer.io/

mistrial9 · on May 9, 2018

the docs animations inline add to casual-browsing fun https://github.com/mara/mara-example-project

pgwhalen · on May 9, 2018

Is anyone out there doing (or considering) streaming ETL as opposed to batch?

woqe · on May 9, 2018

Apache nifi (https://nifi.apache.org/) is a software which handles streaming ETL processing.

From a previous post I made about it: "We have HTTP endpoints set up to receive data from our ERP's accounting system to send data to Concur and to update customers' Lawson punchout ordering systems with shipment information. The 'E' is an HTTP post with an XML payload. The 'T' consists of using the payload to query other databases to build the 'L' payload, and the 'L' is an HTTP post to the consumer's endpoints."

chrisjc · on May 9, 2018

There's often a misconception about Nifi being batch or file based and not streaming. This probably originates as result of data in Nifi being represented as a FlowFile, which is kind of a misnomer. As woqe stated above, Nifi can certainly doing streaming.

artwr · on May 9, 2018

I think a lot of people are considering it. The ideas behind Apache Beam are supposed to make the transition easier, with the ability to run in either mode. I think in practice it really depends on your use case. For a large enough amount of data, especially on the ingest side, running it streaming makes sense. Now for that daily report from aggregated table, do you really need streaming ETL/aggregation is another question. The operational cost is non trivial with streaming.

pgwhalen · on May 9, 2018

Can you expand on that last sentence? Not that I’d disagree, I’m just curious about what specifically you’re referring to.

My team is currently underwater supporting a decades old system of batch ETL, where it’s a major challenge to understand dependencies between jobs, and therefore what to rerun when something fails. We’re exploring a move to streaming ETL (Kafka streams being the leading candidate) as a way of making the dependencies explicit: rather than having to rerun stored procs or Talend jobs that we depend on certain data, we run a fleet of 24/7 microservices that transform and move data downstream whenever it is available.

thibaut_barrere · on May 9, 2018

I do both depending on the project needs!

jmartrican · on May 9, 2018

Does not seem lightweight. Maybe its light compared to the other solutions?

foolinaround · on May 9, 2018

if this had a scheduler and an alerting mechanism when SLAs are breached, that would be great! maybe future features?

martin_loetzsch · on May 9, 2018

(author here)

It intentionally doesn't have a scheduler, just definition and parallel execution of pipelines. For scheduling, use Jenkins, cron or Airflow.

Currently you can get notifications for failed runs in slack. Alerting itself is not really in the scope of this project, but it should be easy to implement in a project.

larsf · on May 9, 2018

Composable is amazing at ETL - https://composableanalytics.com

It blows things like Alteryx, NiFi, Airflow out of the water.

mattbillenstein · on May 9, 2018

Please add a disclaimer if you work for these guys... I personally generally dislike enterprise solutions because I hate talking to sales people - but that's just me.

larsf · on May 9, 2018

Yes - I'm the founder. And you don't have to talk to me if you don't want to :)