I could really do with a brief rundown of what a data processing engine does. I have briefly scanned https://www.wallaroolabs.com/ but the descriptions are both too general and too specific. Let's take the machine learning scenario: what does Wallaroo do for me here specifically?
We currently do event-by-event data processing. We are working on adding microbatching features.
Wallaroo provides the infrastructure plumbing to write resilient, scalable applications. You write your code to our scale-independent APIs (currently Python 2.7 is supported; Go and Python 3 are coming later this year). Wallaroo handles laying out your data and computations across X number of processes, which can change as the application is running. We have built-in message replay and are working on completing exactly-once message processing (at-least-once delivery with deduplication). Wallaroo handles the infrastructure concerns so application programmers can worry about the business logic.
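To make "at-least-once delivery with deduplication" concrete, here is a minimal sketch in plain Python. It is not Wallaroo's API or implementation; the class name, the message ids and the running total are all hypothetical, and it assumes every message carries a stable id that is reused on replay.

```python
class DedupingProcessor(object):
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()  # ids of messages that were already applied
        self.total = 0         # example application state

    def handle(self, msg_id, amount):
        # A replayed message carries the same id as its original delivery.
        if msg_id in self.seen_ids:
            return False       # duplicate: skip it, state is unchanged
        self.seen_ids.add(msg_id)
        self.total += amount
        return True


processor = DedupingProcessor()
# Message 2 is delivered twice, as it might be after a replay.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 7)]
applied = [processor.handle(msg_id, amount) for msg_id, amount in deliveries]
```

The replayed delivery is ignored, so the state ends up as if each message had been processed exactly once. A real system would also have to persist the seen ids alongside the state and bound their growth, which this sketch leaves out.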
"the machine learning scenario" is rather broad so I can't answer that specifically. We just started working with a company that is interested in the ability to take the Python their data scientists are writing on their laptops and move it into production with no logic changes. Currently they end up rewriting in a JVM based language and often have to switch machine learning libraries.
There's a 15 minute video I did about scale-independence that includes code and a demo that you might be interested in:
The vision here is to make it much easier to build and scale distributed data applications in general, not just event-by-event. We are in the process of putting together functionality we need to support microbatch type workloads and jobs that have a beginning and end.
re: Too general/too specific. My personal email is in my profile. I'd love to hear from you on what was too general and what was too specific. We did our best with explaining but we knew that we would still end up being off the mark in some cases. If you could spare the time to drop me some notes, I'd appreciate it.
Sorry if that answer is a little rambling. We had a number of long hours while cleaning everything up for the first open source release and I'm exhausted.
1. What does it mean "scalable"? When would Wallaroo not be "scalable"?
2. How does Wallaroo differ from just running a bunch of AMQP clients for data
processing? And I ask this as somebody who's in the middle of writing his own
stream processing engine.
And a few comments:
- Installation procedure is so unnecessarily long. It would be much shorter if
you provided source package and not require anything from outside of the
distribution.
- Installation procedure involves downloading Pony compiler from non-canonical
source. It's a warning of sorts. I don't know if the compiler source was
changed from the upstream, so I'm wary about Wallaroo's compilation with
official Pony compiler.
- Installation procedure only ever mentions Ubuntu. You do know that people
run other distributions, right? Does Wallaroo even compile under Debian or
CentOS, or was it written for Ubuntu specifically?
- Requiring Docker to run examples is bad. Your software should be easy to run
without garbage from containers.
A lot of how Wallaroo handles state mirrors what Pat Helland wrote about in "Beyond Distributed Transactions". It ends up having a number of nice features for stateful applications. You can see it in action in the 15 minute "get to know scale-independence" video that I did:
- We are paring down the installation procedure. We couldn't do everything we wanted before we did our first release.
- The compiler was changed from upstream. We are in the process of getting back onto mainline ponyc. That should be happening in the next couple of weeks, at which point we will be taking the first step in paring down the instructions. When we were first developing Wallaroo, we wanted to be able to rapidly evolve the Pony runtime as well as Wallaroo itself. To do that, we needed to fork the project. We've gotten almost all of our changes back into the mainline ponyc.
- It works on other distributions. We currently only provide instructions for Trusty and Xenial because as you said, it's a somewhat involved process at the moment that we are working on simplifying.
- We went back and forth on whether we should use Docker for the UI. It's written in Elixir and requires that Erlang be installed. After working with some clients, we decided it was easier to provide a Docker container.
BTW, several of the core Pony people are working for Wallaroo labs. The fork sounds temporary, and this is probably a good thing for those following Pony.
I just watched the video and here is what I think could be improved. As someone who knows nothing about your project, the first thing I’d like to know is how it runs.
So I think you could get out more from your demonstration and place it before the API usage tutorial part.
Right now the demonstration feels a bit hasty and nervous. It would be nice to have more detail on what's going on in the demo.
That's a pared down version from 30 minutes. The decision we faced was:
Do we think asking people to watch a 30 minute intro video will cause far more people to drop off than a 15 minute video? If yes, can we convey a lot of the value of Wallaroo in 15 minutes and get them interested in learning more?
We decided the answer to both questions was yes. I'm planning on doing a couple longer intro videos for those with an appetite to learn more at a slightly more relaxed and less hurried pace.
My question to folks would be, what's the longest you'd be interested in watching? I have enough to cover that I could do:
30 minute, 60 minute or 90 minute videos, each providing more depth. How much time are folks generally interested in investing in videos of this sort?
The C++ API that we currently aren't supporting and the Go language bindings that we are working on create native binaries; yes.
The Python version embeds a Python 2.7 interpreter inside a running application. We have C code that acts as an interface between our Pony runtime and the developer's Python code, calling the required framework functions at the correct time.
The event-by-event features in Wallaroo are somewhat analogous to Storm but we have management of in-memory state that allows you to grow the size of the cluster and have the application state redistributed across the cluster. This can be done while an application is running. In the case of Storm, this would be like:
1- keeping your application state in objects in your bolts (this helps with performance compared to an external store)
2- having a means to add more workers to the running Storm topology and have that in-bolt state redistributed across the new number of machines.
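As a rough illustration of what "redistributing in-memory state across a new number of machines" involves, here is a plain Python sketch. It is not how Wallaroo or Storm does it; the function names and the modulo-on-a-hash placement rule are assumptions made for the example, and each "worker" is just a dict.

```python
import zlib


def owner(key, num_workers):
    # Stable hash, so every process agrees on which worker owns a key.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def place(state, num_workers):
    """Lay out {key: value} state across num_workers in-memory dicts."""
    workers = [dict() for _ in range(num_workers)]
    for key, value in state.items():
        workers[owner(key, num_workers)][key] = value
    return workers


def resize(workers, new_num_workers):
    """Redistribute all existing state onto a cluster of a different size."""
    merged = {}
    for worker_state in workers:
        merged.update(worker_state)
    return place(merged, new_num_workers)


counts = {"apple": 3, "banana": 1, "cherry": 7, "date": 2}
two_workers = place(counts, 2)
three_workers = resize(two_workers, 3)
```

After the resize, every key lives on exactly one of the three workers and no counts are lost. Plain modulo placement moves many keys on every resize; a production system would want a scheme that moves fewer, and would have to do the move while messages keep arriving.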
One of our engineers is giving a talk at Velocity conf NY next week. He's going to be discussing Pat's paper and how Wallaroo exhibits a number of characteristics he describes.
I think that one of the great benefits to open source platforms such as PostgreSQL, Hadoop, Kafka and others is that their open license guarantees no vendor lock-in. This means users can relax in the knowledge that the platform won't disappear if the original company disappears (or hikes prices), and means that it's possible to support the platform forever - even if that means bringing it in-house.
Those are certainly legitimate concerns. In the end, we are positioning ourselves as an open-core business. Some of what we offer with Wallaroo is always going to be commercial (even if the source is available). That part of what you raise isn't something we can really address. It's the business model we are comfortable with and think from talking to other companies that are the primary developers of an open source project is sustainable as a business.
Business model aside, I think you raise some interesting points about price hikes, the company disappearing, etc. This is something we'll be discussing over the next couple of weeks. It's definitely an area of ambiguity that we need to address. Thank you for raising it.
> The Wallaroo license seems to be non-free, which I think is very unfortunate and misses an opportunity to be more widely used: (...)
Maybe. But then, if they require you to buy a license when you're using it to make money yourself (and don't try to tell me that you can afford to run your hobby project on more than 24 CPUs but can't throw some money at the devs of your tools) then it is no longer "unfortunate" - it's just business.
And don't forget that programmers need to eat too. "Widely used" does not directly translate to "making tons of money".
Bonus points to them for doing it from the beginning, instead of changing the licensing mode mid-flight.
I absolutely agree that Wallaroo Labs wrote the code, and they get to decide what license is on it. It is, as you say, their business.
> And don't forget that programmers need to eat too. "Widely used" does not directly translate to "making tons of money".
I hadn't forgotten that; I wish every success to Wallaroo Labs, and hope that they are soon "making tons of money". I think their product could be very exciting.
Personally, I think their product would be even more exciting if it were fully open source. I think more people and companies would get involved, and that this would accelerate development and adoption. There are other business models which could support this (paid support, hosted instances, etc...) which are not incompatible with the software being fully open source.
Ultimately, the decision belongs to Wallaroo Labs. Their assessment of the advantages and disadvantages of the available business models is clearly different from mine.
This is so cool. I see you used Erlang for various components like the UI, was your primary reason for using Pony instead of Erlang for the data processing performance? That seems like a safe assumption given Erlang's frustrating performance characteristics. There isn't a lot of information out there about Pony vs Erlang in production, hopefully you guys will write up more in the future about the differences you have encountered.
1. Where is the latency variance coming from? (Garbage collection? Network stack?)
2. Between which two points are you measuring it?
3. What would the variance look like if using two processing nodes on different machines?
4. Is it possible to use something like LTTng with Pony to understand latency distribution better?
1. I'm not sure what you are referring to in terms of latency variance. What variances are you referring to?
2. We measure from the moment that we ingest the message to when the message is done being processed. If a Wallaroo application was receiving data over TCP and sending out new data also over TCP, it would be from the time that we read in the data to when we write the message out onto the outgoing socket. If a message were to be filtered and not create any output, then the end point would be once it's filtered.
3. We currently use TCP between nodes. The latency that gets added with a network hop in part depends on how much traffic is going over that hop. If you were processing a couple hundred thousand messages or so a second, that additional latency is generally 500 microseconds to 2 milliseconds.
4. I suppose. I don't believe anyone has tried.
The metrics that we provide include not just end-to-end latencies but also the amount of time spent in a given computation, going over a given network hop, and similar insight.
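The measurement points described above can be sketched in plain Python. This is only an illustration, not Wallaroo's metrics code: `process`, the step names and the sample computations are hypothetical, and the sketch stops the clock when the last computation returns rather than at a real socket write.

```python
import time


def process(message, computations, clock=time.monotonic):
    """Run a message through named computations, recording timings in seconds."""
    ingested_at = clock()          # start point: the message has been read in
    per_step = {}
    value = message
    for name, fn in computations:
        started = clock()
        value = fn(value)
        per_step[name] = clock() - started
        if value is None:
            # Filtered: the end point is the moment the message is dropped.
            break
    end_to_end = clock() - ingested_at
    return value, end_to_end, per_step


pipeline = [
    ("parse", int),
    ("keep_even", lambda n: n if n % 2 == 0 else None),
    ("double", lambda n: n * 2),
]
kept, kept_latency, kept_steps = process("4", pipeline)
dropped, dropped_latency, dropped_steps = process("3", pipeline)
```

The first message passes through all three steps; the second is filtered at `keep_even`, so it has an end-to-end latency but no timing entry for `double`.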
Wallaroo has a fairly low overhead so most time in an application is spent in user computations (the latency of which Wallaroo can't control) and any network hops.
We read in large chunks of messages at a time and start processing them. Some messages wait longer in internal queues for CPU time to become available to process them.
The benchmark we use is intentionally very light computationally. We did that because, if we had a heavy computation then the overhead from Wallaroo would be hidden.
Because the people we are working with as commercial clients wanted 2.7 supported; they weren't interested in Python 3. We have Python 3 support slated for the end of the year.
Basically, we establish a roadmap for ourselves and then adjust it based on interest from folks who will pay us.
We are now adjusting that to include feedback from the open source community as well.
There are many many things we could be building. Getting feedback from folks who will help us grow the business is an important part of our process.
I'll preface this with I'm one of the authors of Storm Applied. A book about Storm from Manning.
There's a number of performance tips, best practices and what not in Storm Applied. One of the things we did when building Wallaroo was to build such best practices and tips into Wallaroo itself so that application programmers didn't have to think about them.
Many of the systems I've built using Storm needed to keep state in memory in order to achieve the performance we needed. Trying to rapidly calculate new results and save those results to an external store (for recovery in case of failure) was problematic:
1) performance was almost always bounded by the speed of the external data store
2) We had to roll our own idempotence mechanisms (which invariably had bugs)
Wallaroo features integrated state management. Rather than creating fields inside of bolts and managing state that way, Wallaroo's APIs allow you to set up state objects that Wallaroo manages. You define the objects, the basic partitioning scheme and how messages get routed to them, and Wallaroo manages them for you.
When you have a message that needs to operate on state, your computation is given both the message and the data to operate on. This allows a number of interesting features:
- You can scale state across a running cluster, adding more servers or removing some and Wallaroo can redistribute your state for you. Our 15-minute demo video shows this in action: https://vimeo.com/234753585
- You can much more tightly integrate failure handling and resilience and avoid a number of potentially nasty edge cases.
- (Coming soon, not yet in Wallaroo). You can create new state objects and change partitioning on the fly. For example, in Storm if you are doing word count, you use a fields grouping. You set up a number of tasks for counting words (fixed) and each bolt has a map of words to count that it handles. If you created 128 tasks then that is the limit of what you can parallelize to while running. We are working on "dynamic partition keys" in Wallaroo. What this means is that I can have a state object per word in Wallaroo. Rather than having a fixed number of buckets to store word counts in (which limits parallelization), you can have 1 for every word you encounter. In addition to allowing for parallelization that can grow with the shape of your data, it also allows for some interesting characteristics for simplifying business logic (which I'm not going to cover now as it's a bit involved to explain without a lot of code samples).
Lastly, this is going to allow us to create replicas of data across a cluster of Wallaroo workers. Which should end up allowing for a number of nice characteristics.
Wallaroo's approach to state is very much in line with Pat Helland's paper "Beyond Distributed Transactions". You can find a link to it (and other papers/talks that we hope more people are exposed to) at https://www.wallaroolabs.com/community/talks
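For readers who want something concrete, here is a small plain Python sketch of the two ideas above: a computation that is handed both the message and its state object, and one state object per key versus a fixed number of buckets. None of this is Wallaroo's API. `WordCount`, `count_word`, `PerKeyState` and `fixed_bucket` are hypothetical names, and the 128 simply mirrors the task count in the Storm example.

```python
import zlib


class WordCount(object):
    """State object for a single word."""

    def __init__(self):
        self.count = 0


def count_word(word, state):
    # The computation receives both the message and the state it operates on.
    state.count += 1
    return (word, state.count)


class PerKeyState(object):
    """One state object per key, created the first time the key is seen."""

    def __init__(self, state_factory):
        self.state_factory = state_factory
        self.objects = {}

    def run(self, key, message, computation):
        if key not in self.objects:
            self.objects[key] = self.state_factory()
        return computation(message, self.objects[key])


def fixed_bucket(word, num_tasks=128):
    # Fixed-partition alternative: parallelism is capped at num_tasks.
    return zlib.crc32(word.encode("utf-8")) % num_tasks


states = PerKeyState(WordCount)
outputs = [states.run(w, w, count_word) for w in ["a", "b", "a", "c", "a"]]
```

With per-key state, the number of independently placeable state objects grows with the number of distinct words seen (three here), whereas `fixed_bucket` can never spread work over more than `num_tasks` partitions.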
---
Storm is JVM based. If you want to use a non-JVM language, you need to use its multi-language support, which involves marshalling data between the JVM and an external process via Thrift. There are pros and cons to this approach. It's great for making it easier to support new languages. It has performance and operational overhead. If you use a JVM language with Storm (Java, Scala, Clojure) there is no such overhead as Thrift isn't involved.
Wallaroo is not JVM based. Our cost of adding support for new languages is higher but we have better performance and operational stories. In the case of a language like C++ or Go, your application code is compiled together with the Wallaroo framework code into a single binary that you distribute. Your application is that binary. You set up a cluster by having the different running instances communicate with one another. This means that all state and messages remain in a single process (except when moving from one cluster member to another), which would provide less overhead if you wanted to do stream processing with a language like Go or C++ that isn't JVM native.
When considering how our Python implementation works, it's similar, except that we embed a Python interpreter along with the Wallaroo framework code. It works in the same fashion as described above for C++ and Go, except that you need to distribute that runner application (Wallaroo + Python), your Python code and any required Python libraries together. It's slightly less self-contained than the single binary you get with a compiled language like C++.
---
Storm generally provides minimum latencies that are measured in milliseconds. Wallaroo applications can be microseconds.
---
Storm provides at-least-once message processing or at-most-once message processing. You can do microbatching with Trident to get "exactly-once" processing but the performance is quite a lot worse than you get with Storm's event-by-event APIs.
Wallaroo provides exactly-once processing while doing event-by-event processing. There's a small amount of latency overhead and, at the moment, a distinct drop in throughput. Our reference application, which has minimal business logic (allowing us to more easily measure the overhead that Wallaroo is adding), has a 10x drop in throughput when running with all our resilience features on. It should be noted however that even with that drop, we are processing hundreds of thousands of events per second with tail latencies measured in single digit milliseconds. And we have plenty more tuning to do there and expect it to get better.
---
There are other differences but those are a couple I'd like to highlight. My basic statement to Storm users would be:
If you:
- use a JVM based language
- are generally happy with Storm
then Wallaroo isn't for you. But if that doesn't describe you, then you should keep an eye on Wallaroo. We are actively developing and will be releasing a number of improvements over the coming months. For some use-cases, Wallaroo is usable now. For others, it isn't yet but we are actively working to expand the types of applications that Wallaroo is good for. We are happy to chat over email, IRC etc with folks to learn more about their use case.
One of the big reasons that we released Wallaroo now rather than waiting is we wanted to expand the pool of people whose use cases would be presented to us. We want to make building data processing applications much easier than it currently is. A big part of that is hearing the types of applications you are building, the types you can't currently build because it's too hard, and the problems you are having with existing tools.
Improving the installation process is near the top of our list of things to do.
edit
One of my co-workers thinks that I might have misinterpreted your comment. Can you clarify the 7 clicks you are referring to? I took it to mean the process you have to go through to get to running the first application.
I’m going to be really honest, and I mean this critique in the most constructive way possible: this release feels very rushed in terms of documentation, presentation and technical limitations.
A few observations:
1. You wrote a C++ API for Wallaroo but you’re not “supporting it.” What does this mean? Why provide a C++ API (and therefore appeal to some subset of potential clientele) if you’re not going to support it at all? Further, why do you say you’re “happy to work with you” to C++ users then?
2. Why are you emphasizing Pony at all? You’re not documenting it much and you explicitly state that you only really intend to use that binding internally. That detracts from your core content.
3. Your current limitations are a showstopper for a lot of serious use cases. For example: you can only recover from one worker failing at a time, and you don’t support logging other than print statements?
With that out of the way, I want to congratulate you on the release. It’s nice to see more activity in this space, especially without the JVM.
re: emphasizing Pony, the moderators here on HN changed the title of the post, adding Pony, and also redirected it to the GitHub repo, so we don't have a ton of control over that. The blog post that announces the release specifically talks about the Python API, which is what we are initially supporting.
re: C++. C++ was our first API but the folks we have gotten traction with aren't those that are interested in using C++. We have limited resources and a lot of work to do, so, the C++ is there and folks can use it if they want but we don't want to mislead people into thinking that it will get the same attention as the Python in terms of features or bug fixes in the near future. In the end with C++, we will pick it up and start working on it more if it can bring in revenue and allow us to grow the team and further enhance the product. At the moment, most of the interest that we are getting is for Python, Python 3, and Go so we are putting an emphasis there. We had to make the hard decision in August that C++ wasn't getting traction and we shouldn't be investing time in that API at this time.
Handling multiple concurrent failures will be addressed by the end of the year as will logging.
We've put a lot of work into some areas and less into others. You have nicely highlighted a couple that we de-emphasized initially.
Thank you for your kind words and congratulations.
edit to add a bit more info
re: timing on a release. I think that is hard for any project. Whether it has commercial backing or not. In the end, we want to get Wallaroo out there with people, warts and all and get feedback from folks on what features they think are most important. We want to hear from potential users in the space and have them help us prioritize functionality (even more so if they pay us money so we can hire more engineers to speed development).
Pony is a compile to native code actor-based language. It's designed to make it easy to write fast, safe applications. It has a unique type system that enforces data race safety at the type system level while not requiring all data to be copied when sending messages from one actor to another.
I'm going to be doing a post within the next couple of months about why we selected Pony, and the pains and advantages that we got from using it.
I gave a talk at QCon NY in June that covers some of the ground that the forthcoming post will:
In an open-core model like Nginx there's an open-source edition and there's a commercial edition and there's clear documentation on what features are available on which version. There doesn't seem to be anything similar for Wallaroo.
You are right that information is a little buried right now. You'd have to look at the code.
At the moment, core stream processing and state management are 100% free.
The forthcoming exactly-once and autoscaling features (they mostly exist, just not general release yet) are enterprise features.
There are build time options to decide what features get compiled in.
I hope that helps a little. We will be improving the information around that over the coming weeks. There are some issues around how installation etc should work, what are the defaults and we punted on those. Our assumption for punting on those is that it's unlikely anyone is going to want to put Wallaroo into production during the next month or so and that if they do, they could contact us to get more information.
It's not the best answer but we decided to invest time in other bits of polish while we figure that out.
We are working towards adding data replication as a feature in Wallaroo. To make that work, you need to allow for at least 3 servers so that was the bare minimum. We added the 24 cpu limit because we are trying to build a business and allowing someone to load up 3 massive servers and use all the enterprise features for free isn't good business.
We started with "low" numbers because we aren't sure what the right cut off will be and we would rather give people more in the future than realize we gave them too much.
If we decide that 5 servers/40 cpus is the right level then everyone who gets an increase will be happy. If we started at 5/40 and realized that the vast majority of use cases were fine at that level and wanted to lower that number, we'd be screwed and would ruin trust. So, we started with what we thought was a nice low minimum and as we go forward, we'll potentially raise the limits.
It was at least partially informed by a POC we did for one of the major US banks where we met their targets with 5 servers and 30 cpus. It was the sort of application that we think is reasonable for us to get paid for (so we can keep building Wallaroo) so that also acted as an upper bound for our early discussions.
In the end, we want people to be able to use Wallaroo and get value out of it. And we want to be able to build a business and keep building this thing that we love. And as I said, we want to be cautious and make sure we are always in the position to give away more rather than having to take things away.
There are features that might eventually become 100% free and open source as we move forward. What we want to avoid is ever having to close the source on a feature.
I hope that answers your question. Sorry, it's been a long few weeks polishing up the first release and it's been a very long day today so I'm a little tired and rambly.
> Wallaroo is an open source project. All of the source code is available to you. However, not all of the Wallaroo source code is available under an "open source" license.
You're just mincing words. The author(s) is upfront about the source distribution and licensing and is not being misleading, as the GP implied. I'm really not interested in yet another discussion about definitions when the author's meaning is perfectly clear.
If you ever feel you need a phrase and "a phrase" in the same sentence, chances are you won't come across perfectly clear. You are then either implicitly contradicting yourself, or bringing in a statement about your perceived validity of the very same statement you choose to use into your writing. Whatever it is, it's not clear writing.
Hence, I would say that the meaning really isn't clear, especially not in the context where it's stated. This lack of clarity is further bolstered by the blog post linked from the project webpage where they mention it's going to become open source, there they spend a fair amount talking about exactly once delivery and how important it is. It turns out that exactly once is not a feature covered by the Apache license.
Almost every project containing both for-payment parts and open source parts split them into separate repositories, it's a good practice when there are different legal agreements involved, and I see no real reason for not splitting it up, except possibly that pony might be one of the opinionated languages where you can't build things if not everything is in the same folder?
If that is the case, those constraints are - as in go - rather shortsighted. The world is a complicated place, simple solutions rarely work as well as one might hope.
Yeah, we decided our original plan for April was too aggressive. We believe in "put it out there before you are comfortable releasing" but in the end, we decided April was too soon. We have a long roadmap of features, improvements and what not that we want to do with Wallaroo. When we looked at where we were in April, we decided that we needed to execute on more of that roadmap before putting it out and getting feedback from folks.
The release blog post also has some crucial licensing information spelled out:
> Parts of Wallaroo are licensed under the Wallaroo Community License Agreement. The Wallaroo Community License is based on Apache version 2. However, you should read it for yourself. Here we provide a summary of the main points of the Wallaroo Community License Agreement.
> You can run all Wallaroo code in a non-production environment without restriction.
> You can run all Wallaroo code in a production environment for free on up to 3 servers or 24 CPUs.
> If you want to run Wallaroo Enterprise version features in production above 3 servers or 24 CPUs, you have to obtain a license.
> You can modify and redistribute any Wallaroo code
> Anyone who uses your modified or redistributed code is bound by the same license and needs to obtain a Wallaroo Enterprise license to run on more than 3 servers or 24 CPUs in a production environment.
Not sure I'm too happy about a custom licence - it'd be nice to hear more about this choice vs a more traditional AGPL / commercial license.
Immediately I sense that if wallaroo labs goes bankrupt / is bought - the open source system can potentially become impossible to run legally in production [ed: for new installations] ...
> Immediately I sense that if wallaroo labs goes bankrupt / is bought - the open source system can potentially become impossible to run legally in production [ed: for new installations] ...
You raise an interesting point. We'd be dealing with that with clients on a case by case basis of providing them with whatever assurances that they need should Wallaroo Labs disappear. However, we don't have anything stating that anywhere.
The expectation that our lawyers had was that no company would sign a boilerplate contract, so issues of "what happens if the company goes away" would be handled in the agreement between us and that company.
The basic story behind the custom license:
1- Outside of RedHat, there aren't a lot of companies who have shown you can build a successful growing business selling support.
2- We didn't want to have an open source product and a closed source commercial offering. We wanted anyone to be able to modify any of the code as they needed. We wanted to grant more rights than you normally see with open core projects.
3- We want to allow smaller companies to be able to use Wallaroo and all its features, thus the "free up to X" for enterprise features.
This boils down to:
1) we don't think you can have a sustainable business as the primary developers of an infrastructure project where the model is selling support.
2) we want to make all source available and grant rights beyond what you would see with most open-core projects.
From a business point of view, I understand the reasoning, but everyone wants everything for free ;-)
My concern is more that such a closed core/custom license doesn't bring with it the peace of mind that I usually associate with Free/OS software: that any work put into it, and any deployments (including future deployments) will be able to at least run on current code, regardless of licensing. And that was mostly an academic worry, until Oracle bought Sun, and demonstrated that companies would indeed do silly things to open software. Note that if it was Wallaroo that was "closed up", it would be much more work, and/or impossible, to fork the system and continue using it.
Sun with its CDDL is also interesting for the problem wrt GPL incompatibility (which I think affects your license as well; see "Further Restrictions"). While I don't expect any of this code/system to run in the kernel, or inside some other system, like GCC - it is a (small) concern in terms of future use.
It'll be interesting to see how this works out for you - I absolutely understand the need to monetize, and even if this is more of a "source available" license, than "Free Software" - it's still better than a completely closed system. But it'll never be distributed as part of eg: Debian/Main - like for example the jdk, or apache or spark is.
I guess my main take-away is that this isn't "open source" but rather "source available" - because there's no way I can modify your software and sell/give/distribute it to someone and have them run it as and how they wish (Freedom zero).
I don't mean this as criticism, but as an observation.
[ed: On a more encouraging note, I know some big clients of Oracle/Solaris that were frustrated with the closed source offering, because there existed patches for problems on the open source OpenSolaris side for serious issues with big zfs deployments that took months to show up in the closed/supported offering from Oracle. So at least having a "transitive" license (if you have a license you can run community-improved code, not be bound to wait for upstream) is an improvement.]
[ed2: Now, what happens when company X implements a similar license for a compiler for language Z that is incorporated in Wallaroo (eg: APL/J, something new?) - but perhaps chooses a different number of CPUs to trigger the need for a license? Could my system X-with-wallaroo require license fees from me to run at 12+ cpus, and from Wallaroo at 32+ cpus? Just a thought.]