Joining a billion rows 20x faster than Apache Spark (snappydata.io)
153 points by plamb on Feb 9, 2017 | 83 comments



Am I reading this correctly? The testbed was a single laptop? A big part of Spark is the distributed in-memory aspect, so I'm not sure I understand why any of these numbers mean anything.


This paper is a must read: https://pdfs.semanticscholar.org/6753/959eed800e9fad9e330daa...

People keep stumbling upon the same thing over and over, which is that the ability to scale comes with significant overhead.


Back in '10, I needed a three or four node Hadoop cluster just to match the performance I was getting using a spare Mac mini in development mode when I was doing a lot of work in Cascalog, which is based on Cascading.

Most problems are not Big Data problems. The size a problem must be before it qualifies as a Big Data problem grows larger every day with the availability of machines with ever more cores and memory. `sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.

People want to think they have Big Data problems but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavy-weight, Big Data solutions to normal-data problems that "kids today" love.

If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.


I'm mostly convinced that most companies interested in Big Data stuff are not so much interested in the scale of the problem as in creating "data lakes" to unite thousands of different forms of data that exist in their organization under a federated, centralized database of some sort. But what most of us with experience in either enterprise companies or machine learning have learned is that data quality is the primary problem, and almost nobody can actually solve it without brute-force human eyeballs, which simply won't scale with the amount of data pouring in. So now there's interest in machine learning primarily to try to do that instead of people.


At the risk of sounding cynical, there are also companies out there that want to _appear_ to be interested in all the things you just mentioned, so they'll hire a few people to do their [wave hands] data science, machine learning "thing" and those few people then go down the rabbit hole, untethered from reality. The C-suite people will then have their message they can deliver externally—and internally—about their commitment to [wave hands again] all that stuff. The tragedy when this happens is that sometimes that small team of people is not aware of what their function at the company is—being human props for the sales and marketing teams. The alternative, I suppose, is far more depressing—that the "data scientists" are fully aware of their role in the grand scheme of things.


Woah, that was exactly my previous position.

Upside: I had one of the best-paying positions among the technical people. I also got to play with expensive stuff.

Downside: it was soul crushing. I was delivering no value whatsoever and had a really hard time looking my colleagues in the eye, as they were making a third of my salary (at best).

I got out, joined a new company with that in mind, and now have a very exciting job. They do have a dedicated R&D / Data Science team, which is a shit show: absolutely brilliant people completely wasted because they can't build anything for lack of programming/technology experience, in an environment where their theoretical skills are mostly useless. I'm genuinely sad for them.

[edit: one of the companies still has a Hadoop/Spark cluster for their whopping 500 MB of data]


You really just described me. I have to do such a showcase project to complete my MSc thesis, for minimum pay, and produce all the proofs so that the C-suite guys can use them in their presentations to sell what I made for a lot of money. (I don't even know whether they'll charge their customers in the hundreds-of-thousands or millions range.)

However, I'm really glad to find out about SnappyData.io; that's going to save me a lot of time spent waiting. It would truly be my dream if they allowed running any programming language inside an environment like Jupyter.org or BeakerNotebook.com, but with Pandoc.org Markdown, so that I could essentially work full time programming while documenting as I go, and then export that documentation to a good-looking LaTeX thesis.


SnappyData has a Zeppelin interpreter and the code is open source, so if a Jupyter interpreter is something that can be easily added, I am sure someone from the community would find it an interesting project to undertake. Agreed that it would be useful.


This is so true. For what it's worth, there are probably a lot of people who would give anything to be a human prop earning a salary significantly above $100k (even far outside of SF). For someone who's actually interested in data science, this would be miserable.


This is excruciatingly accurate.


+1


>If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.

I disagree here. I have worked across multiple scenarios that warranted big data solutions, and such solutions were not feasible before Apache Spark and the like were available. Even our current startup (www.aihello.com) has 8.7 million products, and calculating LDA + cosine similarity involves matrices with trillions of entries, which is simply not feasible with traditional tools.

Telstra/Sensis, the telecom company in Australia that I consulted for, went from month-delayed reporting to near-real-time reporting thanks to Apache Spark.

Also keep in mind that the scale of data is growing exponentially for all of us, since storage is getting cheaper and big data analysis is proving to be a game changer in many scenarios.


Being able to do things like churn prediction and net promoter score in real time was one of the motivations for creating SnappyData. You get the ability to mutate data (think KPI maintenance in memory without having to jump across products) and to do joins etc. on streams, which makes things a lot simpler.


Amen. Also said as: too big for excel is not big data. See also https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html


Amen right back at ya! (I love the O'Reilly book cover.) I highly recommend people read your blog post. And there's also this classic:

https://aadrake.com/command-line-tools-can-be-235x-faster-th...

I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:

https://bost.ocks.org/mike/make/

https://github.com/Factual/drake

If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.


Just for clarification, I'm not the original blogger. +10 for the other link and for using make! I don't know Drake, however.


The picture has now gotten a little fuzzier, as this blog post conflates MapReduce and YARN and calls them both Hadoop. The Scala pseudo-code is just about exactly what you'd use with Spark, which runs on YARN.


I think his point is that bloated, over-engineered Big Data systems—whether batch or streaming—are overkill for the vast majority of problems.


There are just many points that don't really apply to stuff like Spark or Tez that run on YARN:

ex: Hadoop << SQL, Python Scripts

I completely agree with

MapReduce << SQL, Python scripts

I do a lot of my processing in Spark SQL and through RDD transformations, as opposed to MapReduce's limiting, slow KV-style processing.
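To make that contrast concrete, here's a rough PySpark sketch (the input file and column names are made up) of the same aggregation written declaratively versus in the key-value style:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby-demo").getOrCreate()

    # Hypothetical input: a CSV with a string column "k" and an integer column "v".
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # Declarative Spark SQL / DataFrame version: the optimizer picks the plan.
    df.createOrReplaceTempView("events")
    totals_sql = spark.sql("SELECT k, SUM(v) AS total FROM events GROUP BY k")

    # MapReduce-flavored key-value version on the underlying RDD: you spell out
    # the shuffle yourself with map + reduceByKey.
    totals_rdd = (df.rdd
                  .map(lambda row: (row["k"], row["v"]))
                  .reduceByKey(lambda a, b: a + b))

    totals_sql.show()
    print(totals_rdd.take(10))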


Thanks for the link. The replies in it were hilariously obvious.


There used to be a little web site where you'd fill out a little form that asked you how much data you had, and it would provide a list of commercially available hardware that could be bought or configured to handle it on a single machine. I think it would even give you a link to someplace you could order that piece of hardware.

This reminds me of the time, way back when, that a coworker told me about how our customer was filling a rack with a terabyte of hard drives. My eyes bulged a little bit to think of it. Now I chuckle to think that the laptop I had two laptops ago had a terabyte drive in it.


https://twitter.com/garybernhardt/status/600783770925420546

> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.

Considering that https://www.supermicro.com/products/system/4U/8048/SYS-8048B... — which is a plain old 4U server, not some fancy, super-expensive NUMA machine — can take up to 12TB of memory, this quip and the parent have quite some merit.

6TB is not even horrible at $57,504 (https://memory.net/product/s26361-f3843-e618-fujitsu-1x-64gb...). That's about 48 engineering days ($57,504 / ($150 × 8 hours) ≈ 48) if your engineer-related expenses are $150 an hour (and they are likely more).


Note: https://www.sgi.com/products/servers/uv/uv_300_30ex.html

> SGI UV 300 now scales up to 64 CPU sockets and 64TB of cache-coherent shared memory in a single system.

This is the current limit of Linux hardware memory support so going above it is tricky. But still, 64TB.


One nice thing about Hadoop is that you get distributed apps for free. On my current project, we only read about 50 TB of data per run across 70-ish fraud models. Some read 1.5 TB, others only 20 GB. On a single system, that kind of data reading would require some smart I/O partitioning across the various models (multiple models read the 1.5 TB data, all read the same 20 GB [after 20 GB you're looking at history that expands to the 1.5 TB]). With Hadoop, even just MapReduce and Cascading, you can spin all of that work out to multiple computers. Since the data is copied across multiple drives on those multiple computers, the I/O and general scheduling are handled for us. In the end, it makes everything simpler. If something fails due to network hiccups or disk failures, Hadoop moves the job and starts it again.


I think this is only half the story.

There are other use cases other than mere size that can necessitate "big data" solutions. E.g. timeliness, resiliency, maintainability...

If you are building production data processing systems that have constraints on data size, latency, resiliency, scheduling, dependency management, etc., you might be better off with a "big data" system. Even if the data could all fit on a beefy box. This was a painful lesson for me to learn.


Hm. Size is mere size. Latency will never improve with a "big data" solution over one machine with in-RAM data. Dependency management? You're going to declare it once and impose it everywhere anyway. Scheduling? Again, one machine with in-RAM data will always win.

That leaves resiliency and etc. I can't answer etc., but—how is resilience helped with a big data solution? That seems like Lampson's distributed system: more machines, but you need k-of-n, k>1. Better to just mirror to two machines with the data in RAM.


Latency does improve if you download your data in parallel across a cluster. Or if you're running many iterations of an algorithm over GBs of data thousands of times, and each iteration is independent--you can save hours or days by performing them in parallel on a cluster.
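For that second case, the pattern is roughly this (a PySpark sketch; `run_once` and the parameter grid are hypothetical stand-ins for one independent iteration):

    from pyspark.sql import SparkSession

    def run_once(params):
        # Hypothetical: one independent iteration of the algorithm, no shared state.
        seed, learning_rate = params
        # ... expensive computation would go here ...
        return (seed, learning_rate, 0.0)  # placeholder result

    spark = SparkSession.builder.appName("param-sweep").getOrCreate()
    sc = spark.sparkContext

    # Thousands of independent iterations, fanned out across the cluster.
    param_grid = [(seed, lr) for seed in range(1000) for lr in (0.01, 0.1, 1.0)]
    results = sc.parallelize(param_grid, numSlices=200).map(run_once).collect()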

If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.

At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.

> how is resilience helped with a big data solution?

The "R" in Spark's RDD abstraction is for "Resilient". Node failures and replication failures can be recovered without you even knowing it.

Sure, you can write all this stuff from scratch every time you encounter them (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.


Is it even possible to get 10TB of RAM on a single commodity server?


The Super Micro SuperServer 8048B-TR4FT lists that it supports up to 12TB DDR4 ECC RAM (which could have 4xE7-8890v4 for 96 cores / 192 threads). Close to a commodity server, but probably doesn't quite count. Taking a wild guess on the price - $250k-$350k?

The SuperServer 7088B-TR4FT lists that it supports 24TB DDR4 ECC RAM (with 8xE78890v4 for 192 cores / 384 threads).


This opinion is stated frequently. What size of data is big data, in your opinion?


My rule: If it can fit on a single hard drive, it's not big data.


There are WAY more than 100 companies that meet that requirement.


The correct way to compare software A and software B is to benchmark both on the target platform/hardware they were respectively written for. Afterwards, do a cost-benefit analysis.

edit: (accidentally hit the submit button early).

I don't think people should leverage highly distributed software for small workloads, for the same reason they shouldn't write highly parallelized code for things that run perfectly fine on one thread. But the test, while well-intentioned, seems to miss the mark.


> The correct way to compare software A and software B is to benchmark both on the target platform/hardware they were respectively written for. Afterwards, do a cost-benefit analysis.

Well, ideally, yes, if we had infinite time. In reality we don't, which means that we have to choose what to do without the benefit of being able to implement-thrice-deploy-N-times[0]. In practice, what happens is that we (as "engineers"[1]) form rules and patterns in our heads which we use as guidance. I think the point being made is that "use a cluster" is almost never good guidance.

[0] How can you know what the performance is without actually giving your product to a bazillion users? This hints at why just-deliver-it-now-bugs-be-damned and continuous feedback is so valuable. There's no point optimizing a product used by 1000 people, but if your platform ends up being used by 1e9 people (e.g. Facebook), then you'll make ADJUSTMENTS ALONG THE WAY. This is a GOOD PROBLEM TO HAVE.

[1] A laughable term for most of the programmer crowd, myself included. Engineering is about tradeoffs and we still have basically no idea about tradeoffs in software development.


I'm not sure why you got downvoted. It's a valid point and it makes intuitive sense.

Is there a clear cut answer, as to whether one should choose a distributed solution or not? It seems to me that if you're at the Terabyte scale, choosing non-distributing seems to be asking for trouble. A quick search indicates the largest HDD you can buy is around 8TB.


The question is more: what do you want as a result? Suppose you search your 8TB database of molecules for the 1000 molecules most similar to a given one. You have 16 cores, you cut the 8TB into 500GB chunks, continuously preload 1GB of molecules per core, accumulate 16*1000 candidates, and merge at the end. You can do it on a single system while working with a TB-size dataset.
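A rough single-machine sketch of that recipe (hypothetical load_chunk and similarity functions, 16 worker processes, top 1000 kept per chunk and merged at the end):

    import heapq
    from multiprocessing import Pool

    TOP_K = 1000
    N_CORES = 16

    def similarity(query, molecule):
        # Hypothetical scoring function; a real one would compare fingerprints.
        return 0.0

    def load_chunk(chunk_path):
        # Hypothetical generator that streams molecules from one ~500 GB slice
        # in small pieces, so a whole chunk is never held in memory.
        yield from ()

    def best_in_chunk(chunk_path, query):
        top = []  # min-heap of (score, tiebreaker, molecule)
        for i, molecule in enumerate(load_chunk(chunk_path)):
            entry = (similarity(query, molecule), i, molecule)
            if len(top) < TOP_K:
                heapq.heappush(top, entry)
            else:
                heapq.heappushpop(top, entry)
        return top

    def top_k_overall(chunk_paths, query):
        with Pool(processes=N_CORES) as pool:
            per_chunk = pool.starmap(best_in_chunk, [(p, query) for p in chunk_paths])
        # Merge the 16 * 1000 candidates down to the global top 1000.
        merged = [entry for part in per_chunk for entry in part]
        return heapq.nlargest(TOP_K, merged, key=lambda e: e[0])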

It means that the size of the dataset is not the only factor; you also need to take into account the operations performed on each element/document, the size of the intermediate datasets, the size of the final results, and some more stuff (encoding, etc.).


I see your point -- size alone doesn't matter, it's how you use it :-).

But how do things change when the dataset grows to 9TB? Now we need more than one HDD. Hadoop + Spark is built for this exact use case...


I think the conundrum comes from the "donut hole" of medium-sized data. For 1TB, use a script and a laptop; for 100TB, use Spark running on dozens to hundreds of machines.

The problem is exactly that 8-9TB range, because running Spark on just two or three machines will be slower than on a laptop with an extra external drive. You need to scale up to potentially dozens of machines just to get the same performance you were getting on a laptop. You were fine with a laptop; add more data and now you have a not-insignificant AWS bill, unless you are OK puttering around on a few machines much more slowly than on the laptop.

There is no middle-ground solution, so everyone starts with an overkill solution that scales, out of fear of getting stuck on one machine when the dataset grows. But most of these systems never grow enough to need to scale this way. So we are wasting resources running toy clusters on problems that would fit on a laptop.

Maybe I am becoming a cranky old man who yells at clouds, but I miss MPI. It has no frills, but it runs with next to no overhead and scales up to supercomputers, with no donut hole in between.


If distributed software running on multiple high-end, expensive servers cannot beat another solution running on a single laptop with a cheap external hard drive, the issue is not distributed systems; the issue is that that specific software is crap.


There will always be some overhead, but yes it seems like some of these frameworks are pretty bloated.


Processing data wouldn't be the problem with 2-socket Xeons, nor would putting 3 or 5 HDDs in a RAID 5. Getting the 32TB in, however, would take at least 8 hours at a saturated 10Gbps, if your disks can even write that fast.
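Back-of-the-envelope for that estimate (pure arithmetic; assumes a fully saturated link and ignores protocol overhead):

    link_gbps = 10                        # 10 GigE
    bytes_per_sec = link_gbps / 8 * 1e9   # 1.25 GB/s
    total_bytes = 32e12                   # 32 TB
    hours = total_bytes / bytes_per_sec / 3600
    print(round(hours, 1))                # ~7.1 hours of raw transfer, before overhead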


10GigE seems fast, but in reality it's only 1.25GB/s in the ideal case. One enterprise PCIe SSD will saturate that, or 5 of the old-style 3.5 inch 7,200 RPM drives (you can fit 12 of those in a dense 1U case).

That's why you see 40GigE or 56GigE used in HPC.


It's 25G, 50G, or 100G today.


Different example: a simple GROUP BY Spark SQL query on only about 20 million rows in a distributed Phoenix/HBase table couldn't even complete, because Spark dumbly shuffled all the data around the cluster. The Spark/Phoenix RDD drivers apparently had no GROUP BY push-down support for Phoenix, so they shuffled all the data amazingly inefficiently. Running the same query directly on Phoenix took all of about a minute to finish.

My point is, these 'on a laptop/single machine memory' examples don't really give me an indicator of scenarios where I might actually want to use spark/etc.


Hey, I'm the phoenix-spark author here. You're totally right: right now there is a lot of dumb shuffling around for certain operations. Hopefully some of that will get fixed up in the next release [1].

[1] https://issues.apache.org/jira/browse/PHOENIX-3600


You're trying to GROUP BY on a distributed data store; your code is the problem, not Spark SQL. Use CLUSTER BY - its distributed sibling.

Query languages like HiveQL and Spark SQL were designed to look like SQL, but they're not.


Edit: correct me if I'm wrong, but it doesn't appear that CLUSTER BY avoids a costly shuffle first. I'd rather just push down to the database engine when I'm using a database engine. GROUP BY worked fine on Phoenix, so saying my code is the problem really means it's only a problem when using Spark SQL with the Phoenix RDD driver.
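For what it's worth, one generic way to force the push-down when a connector won't do it is to hand the database the aggregate as a subquery through Spark's plain JDBC source. This is only a sketch: the URL, table, and column names are made up, and whether the generic JDBC source behaves well with the Phoenix driver is a separate question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

    # The database runs the GROUP BY inside the derived table; Spark only
    # pulls back the (small) aggregated result instead of the raw rows.
    aggregated = (spark.read.format("jdbc")
        .option("url", "jdbc:phoenix:zk-host:2181")  # hypothetical connection string
        .option("dbtable", "(SELECT k, COUNT(*) AS cnt FROM events GROUP BY k) agg")
        .load())

    aggregated.show()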


From what I know, this test, originally written by Databricks (and expanded here), is meant to tease out the optimizations in the Tungsten engine. Of course, a distributed query that is dominated by shuffle costs will produce a very different result.


The test uses 4 partitions (4 threads) and is not single-threaded. Of course, distributing over the network has network costs, which can be significant for sub-second queries but become insignificant for larger data and more involved queries. The Spark execution plan will be exactly the same on a laptop or in a cluster, and these specific queries will scale about linearly.
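For context, the query under discussion is roughly this shape (a PySpark paraphrase of the billion-row join as described in the article and this thread, not the exact benchmark code):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("billion-row-join-sketch")
             .master("local[4]")                            # 4 partitions / threads, as above
             .config("spark.sql.shuffle.partitions", "4")
             .getOrCreate())

    big = spark.range(10**9).selectExpr("id", "id % 1000 + 1 AS k")    # billion-row left side
    small = spark.range(1, 1001).selectExpr("id AS k", "id * 2 AS v")  # right side: values 1..1000

    result = big.join(small, "k").selectExpr("sum(v) AS total")
    result.show()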


I apologize in advance, but when people claim to use an in-memory big data system, how exactly does this end up working?

You can only stuff so much into memory. You can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, you end up scaling out horizontally. And with each of these in-memory appliances, what happens when you need to spill to disk?

In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?


I think there are many use cases. Fraud detection, risk analysis in finance, weather simulations, etc. These don't need to spill out to disk and are a perfect use case for these systems.

A friend of mine works for a company that does high speed weather analysis to make predictions for energy brokers, to predict prices of wind / solar energy on the market. They use these kind of systems extensively, because of the speed and volatility of the data. Fascinating stuff.


You can also measure cloud oktas from satellite imagery if you want to get fancy with solar energy supply-side forecasting: https://axibase.com/calculating-cloud-oktas/


Maybe I'm misunderstanding the problem, but why can't you scale out horizontally?

If the problem is that queries or sets of data might have to jump nodes, couldn't the data be designed in such a way where an assumption is made about what sorts of queries will happen at write?

Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.


My lab works with multi-terabyte datasets on a regular basis. We have big machines to do machine learning on, and when they're not in use, I can tell you that it's way easier to provision and write a single- or multi-threaded script that just loads everything into memory than to deal with networking and partitioned data.

Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for reasons, someone created a 1TB HDF5 file. Luckily, we had a computer with 500GB+ of RAM and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.
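The "just load it" path really is about this short (an h5py sketch; the file and dataset names are made up, and a 2-D dataset is assumed):

    import h5py

    # Hypothetical 1 TB HDF5 file with one big 2-D dataset named "features".
    with h5py.File("huge_dataset.h5", "r") as f:
        data = f["features"][:]      # slurp the whole dataset into RAM (and swap)

    print(data.shape, data.dtype)
    print(data.mean(axis=0)[:10])    # whatever one-off batch computation you needed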


It's not big data if it fits in memory... This article is demonstrating an architecture that may scale well with big data.


Lol was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though ... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.



Pardon my ignorance, but what would you use Jenkins for? Scheduling?


Jenkins in such a setting gives you two good things: (i) scheduling, and (ii) access control: the ability to give random dude X the ability to trigger computations Y, Z, and A without the ability to change said computations.


Triggering jobs more generally (schedule or push notification) and splitting things into jobs and/or pipelines.


This seems like an impressive set of stats about a relational database technology. But the scrolling on their website doesn't work on mobile, so in grand HN tradition I left, and now I'm telling you all about that here instead of about the main point of their invention :)


It worked for me, but the browser nav didn't hide, which I recognize as a symptom of messing around with absolute/fixed positioning and/or overflows. I'd recommend using media queries to show a simple site on mobile and leaving all the fancy stuff they are surely doing on desktop to the desktop only.

Edit: on a second check, it might have to do with that nav that moves the whole page down.


Appreciate these comments; the site did not go through much testing before being deployed. The overflow was modified to eliminate horizontal scrolling on mobile, but it looks like there were some vertical issues as well. We will get this fixed.


What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?


Yes, it is a hash join.
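For anyone curious about the mechanics: a hash join builds an in-memory hash table on the smaller side's join key, then streams the larger side and probes the table for matches. A toy Python sketch of the general algorithm (nothing to do with SnappyData's actual implementation):

    def hash_join(small, large):
        """Toy equi-join: 'small' and 'large' are iterables of (key, value) pairs."""
        # Build phase: hash table over the smaller relation's keys.
        table = {}
        for k, v in small:
            table.setdefault(k, []).append(v)

        # Probe phase: stream the larger relation and look each key up.
        for k, v in large:
            for small_v in table.get(k, ()):
                yield (k, small_v, v)

    # Example: right side keyed 1..1000, left side a (scaled-down) stream of rows.
    right = [(k, k * 2) for k in range(1, 1001)]
    left = ((i % 1000 + 1, i) for i in range(10**6))
    print(sum(1 for _ in hash_join(right, left)))   # 1,000,000 matched rows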


I will need to dig into the implementation of the hash function; it must be a nice read, as the speed shows that it is definitely well optimized. Thank you.


Python 2.7 can do it in 0.0867 usec (Intel i7):

    $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
    10000000 loops, best of 3: 0.0867 usec per loop
(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)


Great article, actually. Typical HN comments on performance optimizations are complaints like "this isn't a real-world use case" or things like that. Most of them miss that comparing baseline performance metrics between two systems is still genuinely interesting in and of itself, and acts as a huge learning catalyst for understanding what is going on. I think this article did a great job of making an honest comparison and discussing what is going on, so props to the team! (We did something similar, where we compared cached read performance against Redis and were 50x faster: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2... ).


The problem is what "baseline" means. For example, a multithreaded program will always run slower than a single-threaded one on one thread, by definition: it has to do extra work to coordinate the threads. Obviously, this doesn't mean we avoid multithreaded code.

In this case, the software being tested was explicitly written to manage the coordination of data on many nodes, so why is the definition of "baseline" a single laptop? Seems specious.


Yes, but that is exactly why I think these types of articles and discussions are useful: people who understand what is going on often assume that others do too, but for many people it all looks like magic.

How many people out there (genuine question here) assume the opposite of what you know, or are ignorant of it? How many people, when they hear "multithreaded", associate that with being faster?

Now consider the people who do know there is overhead in splitting up and dividing work across threads: because they have this knowledge, do they also "see everything as a nail because they have a hammer"? Sometimes the right solution is to simply run a single-threaded operation, not to parallelize everything.

I think there are interesting merits to all of that, even if it means "hyperbolic" articles or cliché, not-realistic tests. They challenge our thinking, our assumptions, our approach. And then, separately, there should be articles and discussions on real-world tests and use cases.


You guys should realize that this is a commercial company promoting its own product. They're just doing the test that makes them look good.


I know it's just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually save me time on real-world use cases?


Our impression was that when Databricks released the billion-rows-in-one-second-on-a-laptop benchmark, readers were pretty awed by that result. We wanted to show that when you combine an in-memory database with Spark so it shares the same JVM/block manager, you can squeeze even more performance out of Spark workloads (over and above Spark's internal columnar storage). Any analytics that require multiple trips to a database will be impacted by this design; e.g. workloads on a Spark + Cassandra analytics cluster will be significantly slower, barring some fundamental changes to Cassandra.


Good questions. One answer: speed matters when doing analytics over large data sets. A lot. Think about this: several vendors seem to be claiming support for interactive analytics, i.e. I can ask an ad-hoc question over a large volume of data and get some sort of insight in "interactive" time. Really? Maybe with a thousand cores? In a competitive industry, say investment banking, if one can discern a pattern before the competition, it simply provides an edge. Ask the folks involved in detecting fraud, or online portals trying to place an appropriate ad, or a manufacturing plant trying to anticipate/predict faults. It isn't so much about trying to prove we are better than Spark (well, other than grabbing some attention :-) ) but rather the potential to live in a world where batch processing is a thing of the past, like working with mainframes. The hope is that we can gain insight instantly. FWIW, we love Spark and expect some of these optimizations to simply become part of Apache Spark.


It's literally impossible to test every possible workload for a tool like Spark, so what's the point of asking? Stand up a testing cluster, run some of your jobs, and you'll get the answer.


I am not asking about every workload. I was just curious about an example workload where this benchmark matters.


why would you choose values between 1 and 1000 for the right side? why not 1000 values between 1 and 1 billion?


In case the author reads this: I can't read that font well unless I zoom in all the way. This doesn't happen with anything else (Win10, 14in laptop, Chrome).


The font in the embedded gists or the font on the page?


Likely the font on the page.

A web design QA note for all: thin fonts (e.g. 300-400 weight) as a body font work fine on macOS due to better font rendering, but do not work well on Windows.


Is it better on Mac? Whenever I boot into Win10 I'm struck by how crisp the text looks compared to the Mac.


Probably the Retina display; the high PPI makes fonts much easier to read.


Will look into this



