This is maybe a silly question, but what's the difference between the timely dataflow that Materialize uses and Spark's execution engine? From my understanding they're doing very similar things - break down a sequence of functions on a stream of data, parallelize them on several machines, and then gather the results.
I understand that the feature set of timely dataflow is more flexible than Spark - I just don't understand why (I couldn't figure it out from the paper, academic papers really go over my head).
There are a few differences, the main one between Spark and timely dataflow is that TD operators can be stateful, and so can respond to new rounds of input data in time proportional to the new input data, rather than that plus accumulated state.
So, streaming one new record in and seeing how this changes the results of a multi-way join with many other large relations can happen in milliseconds in TD, vs batch systems which will re-read the large inputs as well.
This isn't a fundamentally new difference; Flink had this difference from Spark as far back as 2014. There are other differences between Flink and TD that have to do with state sharing and iteration, but I'd crack open the papers and check out the obligatory "related work" sections each should have.
For example, here's the first para of the Related Work section from the Naiad paper:
> Dataflow Recent systems such as CIEL [30], Spark [42], Spark Streaming [43], and Optimus [19] extend acyclic batch dataflow [15, 18] to allow dynamic modification of the dataflow graph, and thus support iteration and incremental computation without adding cycles to the dataflow. By adopting a batch-computation model, these systems inherit powerful existing techniques including fault tolerance with parallel recovery; in exchange each requires centralized modifications to the dataflow graph, which introduce substantial overhead that Naiad avoids. For example, Spark Streaming can process incremental updates in around one second, while in Section 6 we show that Naiad can iterate and perform incremental updates in tens of milliseconds.
That's very helpful, thanks! I think I still have to wrap my head around _why_ being stateful allows TD to respond faster, but maybe I just gotta dig deeper and see for myself.
It just comes down to something as simple as: "if I have shown you 1M different things, and now show you one more thing, what do you have to do to tell me whether that one thing is new or not?"
If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you don't have a lot of options to efficiently respond to the new thing, and most likely need to re-read the 1M things.
That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on TD) are effectively deterministic.
Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years.
I see. To my small brain it sounds like TD can intelligently memoize or cache the outputs of each "step" so that it only recalculates when it needs to as the inputs change.
I think Spark does that sometimes these days, but I don't know much about the specifics of how and when Spark does it.
Does TD have to keep _everything_ in memory, or can it be strategic in what it keeps and what it evicts?
TD lets you write whatever logic you want (it is fairly unopinionated on your logic and state).
Differential dataflow plugs in certain logic there, and it does indeed maintain a synopsis of what data have gone past, sufficient to respond to future updates but not necessarily the entirety of data that it has seen.
It would be tricky to implement DD over classic Spark, as DD relies on these synopses for its performance. There are some changes to Spark proposed in recent papers where it can pull in immutable LSM layers w/o reading them (e.g. just mmapping them) that might improve things, but until that happens there will be a gap.
You should check out Apache Flink. It does a bunch of those things that Spark doesn't, though it's also missing a few things that Spark has. https://flink.apache.org/
There's no difference really. All "Big Data" (tm) tools are trying to capitalize on the hype, so Kafka adds database capabilities, while Spark adds Streaming. At some point they will reach feature parity.
I understand that the feature set of timely dataflow is more flexible than Spark - I just don't understand why (I couldn't figure it out from the paper, academic papers really go over my head).