How to Analyze Billions of Records per Second on a Single Desktop PC

nkurz · on July 9, 2018

I have encountered multiple claims that in-memory analytics databases are often constrained by memory bandwidth, and I myself held that misconception for longer than I want to admit.

So if not memory-bandwidth, what is the constraining factor? Or specifically, what's the limiting factor that causes the observed benchmark speeds for LocustDB?

My guess would be that a well designed in-memory database should be limited by memory-bandwidth, but that real-world memory-bandwidth shouldn't be thought of as a single number. Instead, it depends on the size and pattern of requests, the latency of different cache hierarchies, and (crucially but often forgotten) on the number of requests in flight.

Is there a better terminology that distinguishes these two uses of memory-bandwidth?

cwinter · on July 10, 2018

The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory.

Of course you are right that this is not the whole picture, it is possible to be bottlenecked on neither memory bandwidth nor CPU throughput.

With respect to memory, two additional considerations are read amplification and latency.

Read amplification: Read amplification basically means that you are reading data but not actually using it. Suppose we have a table stored in row format with four columns C1, C2, C3, C4 which store 8 byte integers. Row format means that the values for the columns are interleaved, so the memory layout would look like this, where cn refers to some value for column Cn:

  0x0000 c1 c2 c3 c4 c1 c2 c3 c4
  0x0040 c1 c2 c3 c4 c1 c2 c3 c4
  ...

One property of conventional memory is that we can only read memory in blocks of 64bytes (the size of a cache line). So if we have a loop that sums of all of the values in C1, we will actually also load all of the values for columns C2, C3, C4. This means we have a read amplification of factor 4, and our usable bandwidth will be a quarter of the theoretical maximum.

Latency: Suppose we also don't have any read amplification and make use of all values for each loaded cache line. Then after we finish processing a cache line, and load the next cache line, we might have to wait a long time (100s of cycles) before the new cache line actually arrives! So then we would actually be stalled because of memory latency during some parts of program execution.

One of the big advantages of using columnar layout (which stores all values for a column sequentially) is that it all but eliminates read amplification. Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.

Of course there's a number of reasons why things might not work out quite so perfectly in practice, and those may why performance in this benchmark starts to degrade before the (usable) memory bandwidth reaches the theoretical maximum memory bandwidth. But as you can see it still possible to get very close, so it's safe to assume that queries which read much less data than the maximum bandwidth are constrained on CPU.

nkurz · on July 10, 2018

Excellent explanation of the advantages of a columnar layout! I'm still a little doubtful about the exact diagnosis, though.

The constraining factor in this case is CPU throughput.

When you looked at it with VTune, did you find clear evidence of this? That is, were you consistently executing 4 instructions per cycle, or were you maxed out at 100 percent utilization for a particular execution port? While it's possible to be at these limits (say if you were hashing each input) in most cases I think it's rare to get to that point when reading from RAM.

Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.

One important caveat to this (at least on Intel) is that the hardware prefetcher does not cross 4KB page boundaries. So if you are doing an 8B read every cycle, and if you are able to do your other processing in the other 3 µops available that cycle, it's taking you ~500 cycles to process each page. And then you hit a ~200 cycle penalty when you finish each page, which means you are taking a ~25% performance hit because of memory latency, and probably have a CPI closer to 3 rather than 4! So if you (or your compiler) haven't already done so, you might see a useful performance boost by adding in explicit prefetch statements (one per cacheline) for the next page ahead.

Anyway, thanks for your response. I like your approach, and I wish you luck in making it even faster!

cwinter · on July 10, 2018

Yes, excellent points. No doubt there is still much room for optimization.

taeric · on July 10, 2018

Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.

Your other points make sense. And those would surface as constraints in a similar fashion as a CPU throughput limitation would. I just expect those well before a CPU constraint.

I confess I have not researched CPU speeds in quite a while. Nor RAM speeds. Just surprised to see that the CPU's throughput would ever really be the bottleneck.

beagle3 · on July 10, 2018

> Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.

Sort of. Random main memory access speeds aren't there, but L1 is; as long as your memory accesses are sequential, proper coding can stream sequential main memory into L1 and keep the CPU working.

But it's effectively a lost art - the vast majority of programs spend a significant part of their time stalling on memory access.

taeric · on July 10, 2018

That doesn't follow. Yes, the cache is fast enough, but assuming you really are using every cycle, eventually the CPU will catch up, no?

My metaphor is a tub. If the drain is faster than the faucet, filling the tub before opening the drain will still see an empty tub.

Only way you can prevent that is if the faucet is the same speed as the drain. Right?

Now, again, the cache can help you see more time where you are getting max utilization. Hard to believe it would be a full workload, though.

mmt · on July 10, 2018

To stretch the tub metaphor to include latency, you have to imagine someone turning the faucet off and on (or plugging and unplugging the drain) depending on how full the tub is.. and, with memory, there may be one spigot (the bandwidth), but a tremendous number of valve knobs (addresses), such that the someone has to decide the correct knob to turn before the tub empties out completely. If the wrong valve is opened, no water comes out (the wrong data is loaded into cache, which is the same as no data being loaded), so merely guessing is no good.

beagle3 · on July 10, 2018

Main memory has enough bandwidth but not enough latency. If your code is written to account for that, then - yes, your CPU will be your bottleneck. If it isn’t then your CPU will be waiting for memory.

DDR4 can do >20GB/s but the latency is >10ns. If you prefetch enough cycles in advance, data will be in L1 by the time you need it. If you don’t, it’s won’t.

taeric · on July 10, 2018

And that is still surprising to me. Quickly checking the internet, I see [1] which lists the throughput of an i7 is about 25 GB/s. That said, I see the newer specs report a GT/s, which is new to me. Looks like those are much lower. Though, I suspect I'm just not looking at the right specs.

Regardless, I've learned that RAM has gotten much faster than I would expect. Cool to see.

[1] https://ark.intel.com/products/83503/Intel-Core-i7-4980HQ-Pr...

beagle3 · on July 10, 2018

The 25GB/s corresponds to the fastest speed a DDR4 can send memory.

That said, any nontrivial computation is unlikely to keep up with that speed (e.g. 64-bit data will be ~3GD/s, so whatever computation you do has to fit in amortized 1 cycle on a 3Ghz clock, or the CPU is the bottleneck)

GD=GigaDatum

taeric · on July 11, 2018

I'm curious how much of a contributing factor this is to why many deep learning pipelines use smaller floats. Specifically, I "knew" that the throughput some math operations was actually greater than 1. That said, I have not kept up with which operations. I would not have expected the floating point units to be the fast ones.

Again, thanks for sticking with me in this thread. I learned a bit. Hoping to come up with ways to play with this knowledge. :)

srean · on July 11, 2018

Curious whether memory controllers can stripe bytes across slots much like RAID can do for disks.

mmt · on July 11, 2018

I don't think it's bytewise (just like nobody does RAID3), but, yes, they can and do. The keyword you're looking for is "interleaving".

I think it's also not even by DIMM slot but by channel, so if the channels are unevenly populated (e.g. 2 slots filled on one channel but only 1 on another), interleaving is disabled.

bitcoinmoney · on July 10, 2018

[The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory. ]

If you take any basic computer architecture class you'd know that this is not the case. Unless there are breakthroughs within the past 5 years i have not read about. DRAM is at least 10x slower than CPU. Unless your prefetcher is so accurate it can feed the i-cache/d-cache without any stalls on the CPU side.

blattimwind · on July 9, 2018

> there a better terminology that distinguishes these two uses of memory-bandwidth?

theoretical peak bandwidth

effective/observed bandwidth

nkurz · on July 9, 2018

That's not the distinction I'm looking for. I'm looking for a better term to describe the throughput the application would achieve if all non-memory constraints were removed. In the context of the article, the author is comparing the ratio of the "effective/observed bandwidth" and the "theoretical peak bandwidth", noting that the ratio is large, and concluding that he is not constrained by memory bandwidth. I don't know that this is a reasonable conclusion.

I'm looking for a different denominator, which is also theoretical. Maybe call it the "theoretically achievable bandwidth", which takes into account all the details of the requests. If the "observed bandwidth" equals the "theoretically achievable bandwidth", your only path to improvement is to change your data layout or increase the parallelism of your requests. If the "observed bandwidth" is less than this theoretical, there should still be room for implementation optimizations.

Falling short of the "theoretical peak bandwidth" (even by a lot) doesn't tell you which of these is the case.

squeed · on July 10, 2018

Hmm. I'm not exactly sure if it's what you mean, but in the networking world, we use the term "Goodput" as a shorthand for "actual user-observed bandwidth".

minimaxir · on July 9, 2018

Note: there are multiple pages to the post, which have the benchmarks.

Aissen · on July 10, 2018

Thanks, I missed this at first. I've always found pagination to be disrespectful of the reader, I wonder what was the motivation here.

paulsutter · on July 9, 2018

Correct title is "How to Analyze Billions of Records in 20 Minutes and One Second"

All of these "fast" databases have a fatal flaw - it takes forever to load the data in the first place. In this case loading takes >1000x longer than the query so loading is all that matters.

t0mbstone · on July 9, 2018

Loading data from cold storage (HDD/SSD, etc) into RAM only has to be done once on startup, though. After that single slow startup, all of the additional queries happen super fast.

Your complaint only makes sense if you only intend to perform one single query against a dataset.

paulsutter · on July 9, 2018

My complaint makes sense because I might have 100B records per day to process, and the delay to load them up is significant

wmf · on July 9, 2018

Could you stream the data into memory as it is created so that loading data isn't in the critical path?

AboutTheWhisles · on July 10, 2018

Does this software do that?

gricardo99 · on July 10, 2018

Yes, that’s common use case for kdb+

edf13 · on July 10, 2018

Common for any analytical setup... you'd integrate with many of the other stream processing engines available (Storm, Flink, Kafka, Spark... all depending on what you want to do).

skgoa · on July 10, 2018

That assumes that we have enough RAM to hold all of the data we could ever want to analyse and that we have this data at startup. Neither of which has ever been the case as far as I have personally experienced.

xgbi · on July 10, 2018

My friend works for $financial_subcontractor and he told me they have an XmX of 1,46TB on their servers so that they can load the data needed for market analysis.

The response time is quite fantastic (less than a minute), but there are drawbacks: - GC takes in the order of minutes to complete - loading the data takes 90 min from spinning rust

So yeah, it's possible, but your bottleneck becomes the storage driver.

neuro · on July 10, 2018

I've seen that setup in a number of hedge funds

majewsky · on July 10, 2018

It's not uncommon for in-memory database servers to have installed RAM in the order of multiple terabytes.

Source: I work at SAP's internal cloud team. For our internal customers running SAP HANA, we offer baremetal machines with up to 6 TiB RAM.

bduerst · on July 9, 2018

Isn't this how all indexing works?

Heavy fixed processing costs up front with the benefit of light variable processing moving forward.

edf13 · on July 10, 2018

I'd down vote this if I could...

It's far from the correct title. A good part of the optimisation of all modern analysis DBs is in-memory... this is no different and as such he's compared his work against 2 of the major other systems in the space.

I see a lot of critics here... I think the guy should be congratulated for his work - which if you've read through all pages, appears to be done single handed!

zeroxfe · on July 9, 2018

That isn't a flaw, it's a constraint, which applies to practically all databases. Good databases are the ones that can optimize around these constraints for arbitrary data analysis and manipulation (e.g., by caching, read-ahead, auto-indexing, deduplication, aggregation, batching, contention-avoidance etc. etc.)

There's a lot more to data analysis than loading data off disk.

placebo · on July 9, 2018

Have you experienced this issue with the other databases mentioned in the article?

chiph · on July 9, 2018

I think he was implying that if you access the disk at all, your performance is hidden behind that cost.

ducreux · on July 9, 2018

The author had no idea what they were doing with kdb.

They even admit that they couldn't be bothered to modify their ingestion scripts to not partition their data.

frankmcsherry · on July 9, 2018

This seems like a very unfair reading of what the author actually wrote:

> One note about the results for kdb+: The ingestion scripts I used for kdb+ partition/index the data on the year and passenger_count columns. This may give it a somewhat unfair advantage over ClickHouse and LocustDB on all queries that group or filter on these columns (queries 2, 3, 4, 5 and 7). I was going to figure out how to remove that partitioning and report those results as well, but didn’t manage before my self-imposed deadline.

gricardo99 · on July 9, 2018

Yup.

>The pickup_ntaname column is stored as varchars by ClickHouse and kdb+, and as dictionary encoded single byte values by LocustDB.

But it would be trivial to convert to enum/sym type in kdb+. It's silly to query and group by strings.

inteleng · on July 9, 2018

This is one of the frustrating parts of database software. Where can information about the ways to optimize these parameters be found (outside of random posts scattered around StackExchange)?

gricardo99 · on July 10, 2018

so true. Either you have lots of experience working with specific databases to setup/optimize queries, and you know what works best through personal blood, sweat and tears, and/or you have intimate knowledge of the inner workings of the database architecture/implementation and know the theoretical best approach to structure your schema/queries.

But even then, hardware/networking performance and tuning can throw a wrench in the most seasoned/knowledgeable approaches. Users can further bring otherwise solid setups to a grinding halt with unanticipated use-cases.

The only hope when you hit these inevitable road-blocks is that you're working for someone that appreciates the difficulty of the problem.

geocar · on July 10, 2018

The first red flag is that Mark's benchmarks look very different for kdb, even though his ClickHouse times are similar to Clemens.

Looking over the queries, he made some... interesting changes that have him benchmarking oranges to apples.

anonu · on July 10, 2018

Completely moot point though as he demonstrates that even when kdb+ is advantaged by having data be indexed, LocustDB is still faster in 4 of the 5 queries he runs...

So yeah, maybe the guy has no idea what to do with kdb - but ultimately having a fast, free and open-source database & query language beats a fast and expensive piece of commercial software.

emmelaich · on July 9, 2018

It'd be interesting to compare this against traildb.

http://traildb.io/

ryanworl · on July 9, 2018

This is a very interesting project! Thank you for posting it.

Having read both this post and the documentation for TrailDB, I don’t think they are comparable past very simple use cases. This post seems to be covering a complete database system with a query parser, planner, and (previously?) a storage engine. TrailDB is more like a storage engine you can push certain kinds of filters down into, but performs neither query parsing nor planning for you.

emmelaich · on July 10, 2018

True. But the latest blog post for TrailDB mentions two new query interfaces, trck and reel.

I also wouldn't be surprised if someone has done a sqlite backend for it.

mkarlsch · on July 10, 2018

Interesting project, however it looks like the last commit is from September 2017 so either it is just very stable or not maintained anymore?

bcaa7f3a8bbc · on July 9, 2018

Past:

How to Analyze Billions of Records per Second on a Supercomputer

Present:

How to Analyze Billions of Records per Second on a Single Desktop PC

Future:

How to Analyze 5 Records per Second on Thousands of Machines Distributed Across the Globe Over the Blockchain.

taneq · on July 9, 2018

I can't find it now but there was a great blog post about parallel processing of large data sets just using Linux streams from the command line. They ended up comparing a single laptop (iirc) against a Hadoop cluster and coming out ahead.

wcrichton · on July 9, 2018

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

potatoyogurt · on July 10, 2018

in other words, "I used a chainsaw to cut an apple and it SUCKED at it."

If you're processing an amount of data that comfortably fits in memory on a single machine, then obviously Hadoop is going to perform poorly in comparison. The costs of scheduling a job onto N mappers/reducers, transferring code to each node, waiting for the slowest mapper/reducer to finish, transferring data from mappers -> reducers, replicating output on HDFS, etc. are well-understood. It's true that many people try to use Hadoop when they'd be better served with simpler solutions, but that does not justify the amount of shade that the author throws at it.

mmt · on July 10, 2018

> are well-understood. It's true that many people try to use Hadoop when they'd be better served with simpler solutions

I posit that these two assertions are contradictory.

My own understanding of the term "well understood" is that it is synonymous with "widely understood". If many people are still making the mistake of using Hadoop when those costs outweight the benefits, it seems that understanding isn't quite wide enough.

That said, although I have a grasp of when the tradeoff is so loopsided as to be obvious, I don't know where to go (or where to point other people to go) for a better understanding of where the boundary is.

Where should we go to better learn that understanding of those costs?

potatoyogurt · on July 10, 2018

They're well understood by anyone who has used these technologies professionally. I probably should have been ore precise with my language, though. It's really a grey area as to where the boundary is and it depends a lot on your specific application. But as a general rule of thumb, my feeling is that for tens to hundreds of GB, you should consider it. And for TBs or more, you almost certainly want to be doing something distributed. Hadoop isn't necessarily the best option then, but it's a powerful tool. I don't know if there's any resource out there that really goes deep into the tradeoffs involved though. There probably is, given how popular the subject is, but I'm not aware of one.

The problem with the article is that if it's for a general audience that doesn't understand the tradeoffs of a system like Hadoop, it really paints a picture that it is just a bad, slow tool. It barely acknowledge just how rigged the comparison is at all, aside from mentioning that you might need something like Hadoop for really big data in the conclusion, while it is peppered with unnecessarily snide comments about Hadoop that will probably be more memorable. I think it is liable to leave readers more confused about the tradeoffs involved after reading than before.

TeMPOraL · on July 10, 2018

> They're well understood by anyone who has used these technologies professionally.

We need to substitute the word "professionally" with more precise terms when talking about our industry. Because if one was to read "professionally" as "at work", then your statement is absolutely false - both the lower bound and average amounts of critical thinking and caring in this industry are extremely low. Even ignoring people who obviously have no clue, I can still imagine Hadoop and other big data stacks being sanctioned by "professionals" in management for buzzword-generating reasons, and implemented by "professional" "engineers" for CV padding reason.

potatoyogurt · on July 10, 2018

You're probably right. I'm thinking of specific coworkers when I think of the level of knowledge that should be expected, and generally the standard I would hold coworkers to would include understanding this. But it is probably not as widely understood as I would hope.

mmt · on July 10, 2018

> They're well understood by anyone who has used these technologies professionally. I probably should have been ore precise with my language, though

So.. It's well understood by True Scotsmen? :) I certainly understood that there are experts in the field who have a deep, even intuitive understanding of how and when to use which tools. To the extent that those experts don't communicate that knowledge to a broader audience while at the same time may be advocating for the use of the tools, they bear some responsibility for the misuse.

My point wasn't so much that you used imprecise, but rather, that the statement about how well it's understood was inaccurate or irrelevant (depending on which definition you were going for).

> But as a general rule of thumb, my feeling is that for tens to hundreds of GB, you should consider it. And for TBs or more, you almost certainly want to be doing something distributed

While a 3-4TB cutoff makes some sense if ones workload has to remain in-memory for performance reasons, that can't be anywhere near the cutoff for any kind of workload that could stand to read from SSDs.

> I don't know if there's any resource out there that really goes deep into the tradeoffs involved though. There probably is, given how popular the subject is, but I'm not aware of one.

I would hope so, but I'm not so sure. It may not even need to be very deep, something akin to the "5 minute rule" for memory/disk caching. Mostly, I'm not convinced that the subject of tradeoffs is actually popular, so much as just using the tool without considering them is.

> The problem with the article is that if it's for a general audience that doesn't understand the tradeoffs of a system like Hadoop, it really paints a picture that it is just a bad, slow tool.

If we're talking about the adamdrak.com 233x article, I have to disagree, as my read of it was that it focused on evangelizing the "under-used approach for data processing" of "standard shell tools and commands".

> it is peppered with unnecessarily snide comments about Hadoop that will probably be more memorable

That's certainly not a charitable interpretation, and I would hazard that it's not even fair or factual (as to "peppered", at least). It's mentioned only a handful of times:

> Command-line Tools can be 235x Faster than your Hadoop Cluster

I agree that click-bait can be considered snide.

> I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR

To me, this comment, this very first mention of Hadoop in the intro, made it clear that this was "rigged" in that the "competition" was neither competing nor truly concerned about performance.

> while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec).

Merely a factual summary. Nothing snide that I could detect.

> Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

This seems like just a restatement of the admission in the introduction plus the assertion (that I believe even you agree with) that many people mis-use Hadoop when it's not called for.

> The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.

Again, just another factual summary, with no snideness I could detect.

> While we can certainly do better, assuming linear scaling this would have taken the Hadoop cluster approximately 52 minutes to process.

This next mention is after at least half of the bulk of the article. It may be an assuming-spherical-cows estimation, but it doesn't strike me as grossly misleading on its face, and there's no editorializing.

> This gets us up to approximately 77 times faster than the Hadoop implementation.

> about 174 times faster than the Hadoop implementation.

> gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

The next three mentions, near the end, are just comparisons of the evolving demonstration implementation to the reference implementation. No detectable snideness.

> Hopefully this has illustrated some points about using and abusing tools like Hadoop

> but more often than not these days I see Hadoop used

Here in the conclusion paragraph is where I agree that there is both snideness and where a reader may be confused about tradeoffs, if that's what they were expecting to be enlightened about.

However, because that's not what the introduction promised, my criticism would be merely that the conclusion doesn't match the introduction (and maybe goes too far into inflammatory territory with "abuses").

Pretend that section isn't even in the article, and the article still reads OK.

potatoyogurt · on July 10, 2018

I agree with your first two points. It's definitely possible to efficiently process quite a bit of data on a single machine, although I think that past a terabyte, there begin to be strong arguments to a distributed approach even if you can theoretically handle it on one machine (scalability if requirements change, resilience to machine failures, etc.).

I still disagree about the article itself (even outside of the conclusion), but perhaps I am reading it uncharitably and other people are not getting the same impression. I do feel that it would be easy for someone who is not very familiar with these technologies to get the wrong impression. Misusage does probably go mainly in the other direction (of people overusing Hadoop rather than underusing), though, so maybe that is not so important a concern.

mmt · on July 10, 2018

> I think that past a terabyte, there begin to be strong arguments to a distributed approach even if you can theoretically handle it on one machine

I've elaborated on these in another comment, as well.

Today, drawing the line at a single terabyte is way too early, even for all-in-memory workloads, if only because there exists an almost 4TB AWS instance now. Any smaller than 3.5TB (or whatever RAM is available to applications) is, at best, living in the past.

> scalability if requirements change

This reads as premature optimization, which turns the strong argument into either a weak argument or even an argument against.

Now, if you know or have reasonable certainty that your requirements will change (and will do so faster than, say Moore's Law) and change soon, then that's different. I suspect there are people who think this, but that it's little more than wishful thinking or a delusion as to how large their slice of "web scale" actually is.

> resilience to machine failures

Machine failures just aren't a legitimate consideration for modern, high-end (but still commodity) hardware. You wouldn't bet your whole business on it, of course, but a 1% chance every year of losing an hour or two of batch processing? Sure.

Sadly, the flip side of this is that I see Hadoop clusters being built with such reliable servers, including redudant PSUs and fans, instead of taking full advantage of the resilience at the software level in order to save as much as possible at the hardware level. The original company behind map-reduce is certainly not splurging on hardware.

potatoyogurt · on July 10, 2018

I'm not saying that past a terabyte is a point where you definitely want to use distributed processing, just that at that point, you should really strongly consider it. There is usually a lot of fuzziness around estimates you get about what sort of data volume you'll need to deal with, and it's not uncommon for it to vary by integral factors between days. If you're pushing the limits of what your system can handle without needing a dramatic rearchitecting, then that's a big risk, and it's not necessarily premature to build in the flexibility to have the option of scaling in the future if you need to. If you hit that 4TB and you still need more, it will be a big headache.

I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.

mmt · on July 11, 2018

Fair enough, for in-memory only, if 1TB is your raw data, by the time it's indexed, it's going to be bigger.

Surely, though, workloads that require in-memory performance are fairly niche, and jumping from there to distributed (even in-memory) seems non-obvious, at best. Why aren't large arrays of fast SSDs a better alternative? The bandwidth is comparable, but the latency is terrible (still comparable to ethernet to a remote node, though?)

What about workloads that don't require fully-in-memory in the first place? If the cutoff is, then hundreds of TB, wouldn't that cover the vast majority of common use cases?

> I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.

That sort of anecdata isn't very useful, because a human can cause any failure at any layer, including someone stop a whole cluster, which I've seen happen before.

My point about it not being a legitimate concern is that what is now common practice with what is now common equipment means it's uncommon. These practices and equipment had to evolve, but that evolution happened on the order of over a decade ago.

Also, be wary of selection bias. It's very easy to remember the "fire drill" because of the one machine failure, and it makes a much more interesting story to tell that gets passed around and modified enough, eventually sounding like multiple stories and therefore multiple machines. The hundreds of servers that operated unheard and unseen for years, sometimes beyond their specs (e.g. with only only blower out of four still turning and only half-speed at that), get nary a thought, let alone mention.

bartread · on July 10, 2018

> If many people are still making the mistake of using Hadoop when those costs outweight the benefits, it seems that understanding isn't quite wide enough.

That's probably true but, cynically, I tend to think a lot of the time when people use Hadoop they're not doing it because Hadoop is the best solution: they're doing it because the solution allows them to use Hadoop.

mmt · on July 10, 2018

I'd venture that's a little too cynical, as it violates the "never attribute to malice what can adequately be explained by ineptitude" rule :)

I'm not suggesting that something like resume-padding is never a motivation, just that it seems unlikely to be the sole or even primary motivation.

Now, I may be wrong, but my reasoning behind this is that, for all its purported power as a tool, as potatoyogurt has alluded [1], it's really only the true experts who can wield that power effectively. The ones who merely succeeded at getting on their resume would only be able to use it where its power isn't truly needed (as was the case before).

If the technology falls out of fashion, then those resume-holders will be in a position of needing to deliver the cargo, rather than carved wooden headphones.

As such, I believe their motivation is actually to attempt to learn the true power of the tool (i.e. true resume building, rather than mere padding) and that they're grossly underestimating the costs through ignorance and a desire to learn by doing.

The trouble is, without well-published non-distributed reference implementations and, instead, ultra-popularity of the distributed tools instead, they never end up learning those costs, and we're in a state of perpetuated ignorance and perpetuated over-use.

[1] by saying that the cost trade-offs are well understood by the experts, which strongly implies that the experts have a pretty deep understanding of the actual mechanics of distributed computing in general.

avifreedman · on July 10, 2018

Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills. It's tragic to think of all of the computrons and watts wasted with Hadoop-ish stuff (map-reducing without filters, Java itself for most implementations) - but still I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

Both because of familiarity with querying and the solidity of running a multi-tenant system.

But I do recommend that they switch to MapR [c++ core and a passable central FS for unix-based super fast queries] if they're concerned with efficiency.

[For context, in my day job we do multiple clusters of millions of network traffic summaries/sec and are often replacing Hadoop, or more recently, ELK, as people tried to use them for that use case. All well >>> will fit in ram. We have our own in-house column-store + streaming combo db done in go/c/c++ that started as clustering fastbit.]

mmt · on July 11, 2018

> Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills.

I doubt the point of the article was to suggest that linux cli stuff would scale to hundreds of users on the same host, but, if each of those users has a host of their very own, such as a laptop, the model could scale very well indeed, for small enough datasets.

> I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

Well, as you point out later, regarding familiarity, once it's in, it's probably too late. What about for a new implementation?

In answering this question, don't get too hung up on a literal interpretation of "single" server being exactly one. For example, a traditional RDBMS with one or more replicas (for performance, redundancy, or both) would still fall under the single server model. Really, it's about the non-distributed-computing option.

> if they're concerned with efficiency.

The fact that this is an "if" (and I do know that it is, even for startups) is bewildering to me, even more so in the context of distributed architectures where scaling is less linear the more data that has to be shared.

avifreedman · on July 11, 2018

Absolutely a believer in the power of single machine analytics on GB or TB vs PB datasets. But I've found that whether it's a new data system or just showing csv/tsv tool magic, if people aren't into data systems (for the former) or unix cli (for the latter), it's a whole culture/training thing vs. most technically efficient wins.

So I think the latter issues dominate (culture/training) for new implementations as well as existing.

Re: efficiency overall, and of distributed systems, it's interesting to see both that MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

Unfortunately, it's hard to see a startup focusing on single-machine solutions for GB->TB sets (even scaled out with replicas) as the way many in the space get started is with an open core model and the thing they charge for is the clustering and/or monitoring needed to become a distributed system.

But... I am optimistic we'll see a generational effect over the next 10 years in openness/interest in composable ad-hoc analytics tools, especially with Windows incorporating unix cli components.

mmt · on July 12, 2018

> MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

I'm only very slightly familiar with their features/value-add and not at all with their pricing. Could the pricing model be particularly unpalatable for some reason?

Not that I expect there has to be a deeper reason beyond simply not caring about cost/efficiency. I've certainly both seen and heard described plenty of Hadoop installations that seemed to have missed the "cheap" point in Google's M-R paper and subsequent Hadoop hardware selection advice from, for example, Hortonworks, or misunderstood what it meant. There may also be some misunderstanding of "commodity" or "industry standard" to mean server hardware of a certain "class" (such as brand name or with redundancy features), even if it conflicts with cheapness.

Some of it may be that the hardware selection advice articles (e.g. Hortonworks, Cloudera) are very old, with excellent general advice, but potentially misleading specific numbers. Even extrapolating from those numbers in a naive way can easily lead to needless expense and/or sub-optimal performance (that time some Xeons had 3, not 2, not 4, memory channels).

The latest article I found in an (admittedly quick) search was https://hadoopoopadoop.com/2015/09/22/hadoop-hardware/ from late 2015, which is still remarkably long ago and is rather verbose.

flukus · on July 10, 2018

> My own understanding of the term "well understood" is that it is synonymous with "widely understood". If many people are still making the mistake of using Hadoop when those costs outweight the benefits, it seems that understanding isn't quite wide enough.

I's entirely possible that 9 out of 10 people on a project using Hadoop know it's a waste but there are non-technical reasons for doing so. Resume padding and PHB demanding some technology of the week would be the two most common ones.

That said it probably is contradictory much of the time. I'd say the majority of current developers don't know about the simpler tools.

qaq · on July 10, 2018

It's actually an interesting question you can get 192 cores and 12 TB RAM in a single x86 box. At what point does it actually make sense to go for Hadoop.

obelix_ · on July 10, 2018

It makes sense if you are handling millions of requests from all over the world per second and need failovers if machines go down.

But...if you just want to run your own personal search engine say...

Then for Wikipedia/Stackoverflow/Quora size datasets (50GB with with 10GB worth of every kind of index(regex/geo/full text etc) ) you can run real time indexing on live updates with all the advanced search options you see under their "advanced search" page one any random Dell or HP Desktop with about 6-8GB of RAM.

Lots of people do this on Wall Street. People don't get what is possible on desktop cause so much of it has moved to the cloud. It will come back to desktop imho.

nl · on July 10, 2018

It won't come back to desktops because as they get cheaper the costs of people to maintain them increases.

There will always be people who need them and use them but that proportion is going to keep decreasing (I'm somewhat sad about this, but the math is hard to argue with).

obelix_ · on July 10, 2018

Just temporary. Nature did not need to invent cloud computing to perform massive computation. The speed and data involved in computation going on in a cell or an ants brain show us how far we can still go on desktop. The breakthroughs will come.

In the meantime (unless you are dealing with video) most text and image datasets out there that avg Joe needs can easily be stored/processed entirely locally thanks to cheap terabyte drives/multicore chips these days. People just haven't realized there isn't that much useable textual data OR that local computing doesn't require all the overhead of handling millions of requests a second. This is Google problem not an avg Joe problem that is being solved with cloud compute.

nl · on July 10, 2018

You seem to have missed everything the comment you are replying to says.

There is no dispute with the size or amount of compute available on desktops.

obelix_ · on July 10, 2018

No idea what cost that comment refers too. If you are Facebook or YouTube or Twitter or Amazon sure you need your five thousand man PAID army of content reviewers to keep the data clean cause the data is mostly useless. But if you are running Wikipedia search or Stackoverflow's search or the team at the library of congress please go and take a look at the size of these teams and their budgets and their growth rates.

nl · on July 10, 2018

Are you conflating "single server" and "desktop computer" here? It sounds a lot like you might be?

Because almost everyone already uses the model of co-locating the search index and query code on a single computer (both Wikipedia and Stackoverflow use Elastic Search which does this).

They use multiple physical servers because of the number of simultaneous requests they serve.

This has never been the use-case for Hadoop.

I've built Hadoop based infrastructure for redundantly storing multiple PB of unstructured data with Spark on top for analysis. This is completely different to search.

That's very different to the Wall St analyst running desktop analysis in Matlab, or the oil/gas exploration team doing the same thing.

potatoyogurt · on July 10, 2018

That's a good point. It's really hard to give a clear cutoff because there is quite a bit that you can potentially do on a single machine now. But there are disadvantages to doing your processing on a single really big machine even when you can find a way to handle it. If you have inconsistent workloads, a single machine may be harder to schedule to be close to fully allocated, and if it goes down, then your workflow is dead, possibly until you take some sort of manual action, whereas a cluster of machines is already built to be resilient to individual node failures. It all depends on the use case, which makes it really tough to give hard and fast rules.

mmt · on July 10, 2018

> a single machine may be harder to schedule to be close to fully allocated

That seems remarkably unlikely if there truly is only the one. Maybe I don't understand what you mean, though. How is that any different from having a single distributed cluster, instead?

> if it goes down

I hear (read) this quite a bit, especially recently, and I'm a bit mystified that it's even brought up in this day and age.

Firstly, having worked with (what is now) commodity hardware for over a decade, I believe that people who haven't grossly over-estimate how often "it goes down", especially today. This overestimation is trotted out as a reason that operating ones own hardware is such a "nightmare" and therefore one must always use cloud or a VPS.

With a "large" enough server, the risk goes up, of course. More DIMMs means more memory can fail, but we're still talking about low single digit percent for all errors (including correctable). IIRC, CPUs have even lower failure rates. Everything else tends to be redundant.

Even then, your workflow isn't dead. It's just missing a DIMM or a CPU, possibly after a reboot (which won't be, if you configured it right).

In many cases, downtime isn't actually caused by hardware but by software (or humans). That's not going to be unique to centralized processing versus distributed.

Also, if the single machine is on a cloud provider, many hardware and utilization issues can be abstracted anyway, for a huge price premium.

potatoyogurt · on July 10, 2018

Just replying to the top bit, since we're exchanging multiple comment chains. Hadoop, Spark, etc. are built on top of a job scheduler, so if you have multiple competing jobs, it will allocate tasks to resources in a fairly reasonable way. If there is too much contention at some point in time, your jobs will all still run, and portions that can't be scheduled immediately will be delayed while they wait for free nodes. The processing model also generally leads to independence between jobs. You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution. And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.

edit: if you're running on a cloud, you also may be able to autoscale to deal with spiky usage patterns.

mmt · on July 11, 2018

> You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution.

I'd venture to say that "for free" is a bit of a stretch, other than under the "already paid for" definition of "free". Being built on top of a job scheduler means one always has to exert the effort/overhead of dealing with a scheduler, even for the simplest case.

On a single machine, you can decide if you want to add the complexity of something like LSF or not. (I would call a custom single-node solution a strawman in the face of pre-Hadoop schedulers. Batch processing isn't exactly new).

> And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.

I'm pretty sure I'm still missing something. Since my goal is to gain a better understanding, I'll put forward where I believe the disconnect is (but definitely, anyone, correct me if my assumptions are off):

You're coming from a distributed model, where the assumption is that "resources" are divisible and allocatable in such a way that, for any given job, there's an effective maximum number of nodes it can use (e.g. network-i/o-bound job that would waste any additional CPUs). Any leftover nodes need other jobs allocated to them to reach full utilization.

I, however, am considering that the single-server model assumes that there won't be any obvious bottlenecks or limiting resources, instead prefering to run a single job at one time, as fast as possible. This would completely obviate the need for any complex scheduling and avoids accidental resource exhaustion (which, I suppose, is really only disk or memory).

potatoyogurt · on July 13, 2018

hah, you're right, I definitely am looking at it with distributed system goggles on. With just one machine, yes, it will probably make sense to fully process one job before moving onto another (although there are some cases where this may not be true, such as if you're limited by having to wait for external APIs. But this isn't a strong argument because calling external APIs probably isn't something you want to be doing in Hadoop anyway). If your workload scales, though, you will eventually have to spread out onto multiple machines anyway. At that point, even if you process each job on a single machine, you either need a system to coordinate job allocation across machines, or you have to accept some slack if you only schedule jobs on individual machines. In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them, but my point is just that the more steps we take towards the complexity of a distributed computing model, the more it starts to seem reasonable to just say "okay, let's just use hadoop/spark/etc. and get the extra benefits they offer as well (such as smooth scaling without having to think about hardware), even if it's more expensive than a setup with individual nodes optimized for our purpose." It's now really easy to spin up a cluster, and with small scale, costs are not that big a deal. With large scale, your system is distributed in some way anyway.

I don't think it's an obvious decision, and you're absolutely right that people are usually too quick to jump to distributed systems when they really don't need them. But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary. I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain."

mmt · on July 13, 2018

Sorry.. I forgot to respond to the last part of your message:

> I don't think it's an obvious decision

Certainly, which is why I even bother with discussion like this, in the hopes of making the decision clearer (of not obvious) to me in the future.

A response to my footnoted (in my other comment) comment pointed out how oversimplified my understanding of distributed databases was. Well, I knew it was an oversimplification, but not in which way.

There's plenty of computer science research from the 70s and 80s covering these topics, but they're both tough to translate to practical considerations, and they're woefully out of date (e.g. don't account for SSDs or cheap commodity hardware).

> But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary.

Well, philisophically, I would disagree with such an assertion on the grounds of premature optimization, absent the "strictly".

I would advocate for switching from scaling "up" (aka "vertically", larger single machines) to scaling "out" (aka "horizontally" or distributed) around the point of cost parity, not at the point it is no longer possible to scale up a single machine (unless that point can reasonably be expected to occur first, I suppose).

> I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain."

That would account for any overestimation of how difficult it is to work with hardware or how complex Hadoop is set up, administer, or use. Those are just initial conditions and may well unduly influence decision making that has far longer-lasting consequences.

However, I'd like to think I'm not often guilty of the latter overestimation when discussing solutions (and even advocating single-server), as I tend to assume that it can it least become easy enough for anyone out there, so long as the technology is popular enough (like Hadoop) or traditional/mature enough (like the tools in the original comment, or PBS) that plenty of documentation and/or experts exist.

My background also includes having seen, first-hand, over decades, various attempts at distributed processing and databases in practice, with varying degrees of success. This has included early "universal" filesystems like AFS, "sharding" MySQL to give it "web scale" performance [1], Glustre and its ilk, some NoSQLs, and of course Hadoop.

If anything, I'd say that, with most popular, new technologies, especially ones predicated on performance or scale, "it's a pain" is not the knee-jerk skeptical reaction my experience has ingrained in me. Rather, it's more like "sure, it's easy now, but you'll pay." TANSTAAFL.

[1] This worked well enough but did have a high up-front engineering cost and a high on-going operating cost for the large number of medium-small servers plus larger than otherwise needed app servers to do DB operations that could no longer be done inside the database because each one had incomplete data. Due to effort overestimation, it was unthinkable to move from a VPS to a colo so as to get a medium-large single DB server with enough attached storage to break the "web scale" I/O bottleneck for years to come.

mmt · on July 13, 2018

> calling external APIs probably isn't something you want to be doing in Hadoop anyway

And yet I've seen it done. At least they weren't truly external, just external to Hadoop.

> In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them

There's those goggles again :) Batch schedulers have existed for decades (e.g. PBS started in '91).

> (such as smooth scaling without having to think about hardware),

This neatly embodies what I believe is the primary fallacy in most of the decision making, including the fact that it's often parenthetical (i.e. a throway assumption).

Does anyone really value the "smoothness" of scaling? I'd expect the important feature to be, instead, that the slope of the curve doesn't decrease too fast.

The notion that Hadoop someone frees one from having to think about hardware flies in the face of hardware sizing advice from, Cloudera, Hortonworks, and others that discuss sizing node resources based on workload (mostly i/o and ram) expectations and heterogenous clusters. It does, however, explain my observation, in the wild of clusters built out of nodes that seem undersized in all respects for any workload.

>It's now really easy to spin up a cluster,

It's really easy if it's already there? That borders on the tautological. Or are you talking about an additional cluster in an organization where there already is one?

>and with small scale, costs are not that big a deal.

That's just too broad a generalization, just as "cost is a big deal" would be. Cost is always a factor, just not always the biggest one.

Small scale is often (though not always) associated with limited funds. Doubling, or even tacking on 50% to, the cost could be catastrophic to a seed startup or a scientific researcher.

>With large scale, your system is distributed in some way anyway.

This strikes me as little more than a version of the slippery slope fallacy. Even some web or app servers behind a load balancers could be considered distributed [1], but that doesn't make them a good candidate for anything that's actually considered a distribute framework.

It also hand-waves away the problem that, even if costs weren't a big deal at small scale, they don't somehow magically become less of an issue at large scale. Paying a 50% "tax" on $40k is one thing. At $400k, you could have hired someone to think about hardware for a year, instead.

[1] I just recently pointed out, slightly tongue-in-cheeck, an architectural similarity, one layer down, between app server and database https://news.ycombinator.com/item?id=17521817

threeseed · on July 10, 2018

Your comment is actually pretty funny as the entire point of Hadoop was to be able to use commodity, cheap, off the shell PC hardware as opposed to the exotic specifications you mention there.

Except that of course nowadays such hardware is just a couple of clicks away in AWS.

mmt · on July 10, 2018

Today's "exotic" (which is actually just high-end commodity) is tomorrow's middling.

I'm not sure it's fair to summarize any one thing as "the entire point" of Hadoop, but, as I recall, it was, originally, an open source implementation of Google's Map-Reduce paper. Put another way, it was a way to bring Google's compute strategy to the masses.

That said, the notion that there is "commodity, cheap, off the shelf PC hardware" and "exotic specifications" is completely a false dichotomy, especially in the face of what, for example, Google actually does.

Google goes cheap. Very cheap. It's custom and exotic, just optimized for cost, but not absolute cost per nod, rather the ratio of cost for performance.

That last part is what's missing from every single Hadoop installation I've personally seen (or that anyone I know has personally seen), the maximization of performance for cost. Instead, there's an inexplicable desire to increase node count by using cheaper nodes, no matter the performance.

> Except that of course nowadays such hardware is just a couple of clicks away in AWS.

I'm a bit unclear what the "except" means here. I don't believe AWS has the truly high-end specs available (and never has, historically, so we can reasonably assume it never will). It's also very not-cheap.

beagle3 · on July 10, 2018

The point of hadoop might have been that, but it never actually delievered any real value to most users - it's an abysmal failure from a computing efficiency point of view; here's an example http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

threeseed · on July 10, 2018

I've been using Spark for many years going back to 1.0.

It is the foundation technology of almost every data science team around the world. And your misguided post (which for some weird reason only focuses on graph algorithms) doesn't change that. And not sure why you think it's inefficient. We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

mmt · on July 10, 2018

> We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

I have exactly zero knowledge on Spark's efficiency as well as zero on how representative graph algorithms are, but I can confidently say that the above statement fails to refute the referenced article's thesis (which, arguably, criticizes assertions just like that).

Just because your implementation scales (even autoscales) to use more compute resources says nothing about its efficiency (overall or even marginal when adding more nodes, i.e. the shape of the curve).

Computer science has struggled with achieving even near linear-scalability ever since the advent of SMP.

beagle3 · on July 10, 2018

Spark is significantly more efficient than Hadoop.

I don’t know about your specific workload, but i’ve seen quite a few Hadoop setups that were at 100% load most of the time, and were replaced by relatively simple non Hadoop based code that used 2% to 10% of the hardware and ran about as fast.

I didn’t spend much time evaluating the “pre”, but at least one workload spent 90% of the 100% on [de]serialization.

It’s not my link, it is Frank McSherry who is commenting in this thread - I hope he can chime in on why he chose this specific example - but it correlates very well with my experience.

smw · on July 11, 2018

No, the point was to be able to process workloads much larger than would fit in memory on a single machine.

mmt · on July 11, 2018

Citation needed! :)

Joking snark aside, I'm actually doubtful this is true. Specifically, I don't recall the impetus for Hadoop (or Google's original Map-Reduce, as described in the '04 paper) being an all-in-memory workload.

Despite it being repeatedly brought up in this sub-thread, I maintain that it's a niche use case and that disk-based data processing workloads are far more common.

ETA: Does anyone know of a canonical or early/initial document outlining the purpose, or at least design goals, of Hadoop?

pmorici · on July 10, 2018

That's kind of the point of that post it's pointing out that you should consider weather you actually need to use something like Hadoop and that most people aren't actually working with data sets large enough for it to make sense.

potatoyogurt · on July 10, 2018

> Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

"Big Data™" is unnecessary shade.

> One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine.

It is entirely unlike having a Storm cluster on your machine, and trying to do your data processing as chained shell commands will rapidly become cumbersome if you try to do actual complex processing.

Yes, I get that the author is trying to point out that simpler tools can work for many cases, but the tone of the article makes it seem like that author is just generally saying that EMR/Hadoop is bad. He does not acknowledge just how weighted the test he did is against Hadoop or give any indication of what the tipping point is where you actually want to start considering something distributed. This paints a really misleading picture for anyone who does not already know a fair amount about these technologies.

itronitron · on July 10, 2018

Hadoop has a very narrow best fit use case but it has been oversold as the best solution for big data.

potatoyogurt · on July 10, 2018

Best fit is something you can argue a lot about. There are a lot of data processing tools out there now, many of which have come out after Hadoop. But if the comparison is against some process running on a single machine, then the use case is not narrow. It includes basically anything where you're processing more than 1TB of data in non-trivial ways (i.e. not just a map operation) and are okay with batch processing.

PeterisP · on July 10, 2018

The issue with that is that for most organizations their key business data (and all its recorded history) fits in RAM of a sufficiently beefy workstation. They want to call it Big Data to stroke their egos, and properly acquiring, cleaning and integrating that data can take a LOT of effort so that data can be quite expensive and worthy of any glorious label they can think of; but my experience is that processing more than 1TB of meaningful data actually is a narrow use case, which matters in two specific categories: the (relatively few) very large multinational companies, and processing of raw video/audio/image data; and the majority of people working on data analysis end up with business needs that can be satisfied by relatively simple methods on relatively small datasets.

potatoyogurt · on July 10, 2018

I agree in general. But I think you underestimate how large the set of use cases are where people are processing > 1TB of data. This includes quite a bit of the adtech industry, for instance, even many startups. It also includes data warehouses for other industries, such as in health tech. Of course, these people generally are experts, since it is their core business, so they know well what tools they need. I agree that for some analytics department in a random company whose core business isn't processing data, Hadoop is more likely to be a resume item than something that's really needed.

itronitron · on July 10, 2018

A lot of those > 1TB data sources are very standardized, they can be mapped to a schema, in which case indexing the data supports interactive queries and analytics. Hadoop seems well suited for data that needs to be processed in very different and changing ways.

mmt · on July 10, 2018

I think this is the other point that's so often missed/ignored in the "big data" discussion: there's a middle ground between everything-fits-in-memory and must-be-distributed.

The Adam Drake article alludes to it only in the last sentence by mentioning traditional relational databases as an alternative.

For workloads that are relatively I/O-heavy and CPU-light, it's very hard to beat local SSDs (or even HDDs in enough quantity) attached to a single [1] node, if the competition is distributed storage attached by ethernet. It only takes a couple 600MB/s SSDs to saturate a 10GE. A server with 48 lanes of PCIe 3 slots could take the I/O of 78 of them.

100Gb/s networks are getting close. For upwards of $1k per server (NIC and switch port) one can bring that ratio closer to 4:1 from 39:1. I'd expect this is the attractive route for anyone with CPU needs that can't be met by a 4-socket server.

[1] Yes, there can be more than one node with copies for redundancy, as has been complained about elsewhere in the sub-thread, or even scalability

potatoyogurt · on July 10, 2018

You're absolutely right. Probably my estimate of "anything greater than 1TB" was not quite the right number. At that point, though, careful indexing is probably the more expert option. It's easy to throw data into EMR now if you're not super concerned about performance/cost, and it doesn't require you to think through how your data will need to be indexed to support your future needs.

fiddlerwoaroof · on July 10, 2018

In my experience, people generally go the other way: they say “we need spark/Hadoop/Cassandra because we have Big Data” when they have a 30GB dataset that is best handled on a beefy EC2 instance with boring tools.

ljw1001 · on July 10, 2018

but the chainsaw analogy was a winner on many levels.

malisper · on July 10, 2018

Hmm... I don't think the comparison says anything about Hadoop vs Bash. The programs they are running are completely different.

In that article, the author runs a regular expression that looks lines that start with "Result" in the pgn files. Then they do a tiny bit of processing with awk.

In the original article[0] the author completely parses the pgn files. Then they extract the result from the parsed file.

I wouldn't be surprised if the difference came completely from that.

I would love to see someone run benchmark the exact same program and see how much of a difference there really is between Hadoop and Bash.

[0] https://web.archive.org/web/20140119221101/http://tomhayden3...

mmt · on July 10, 2018

It's probably easy enough to simulate the "bash" implementation done in a distributed way. Doing that python library full parsing locally, 10,000 lines (or however much memory one can configure up to) at a time might also be easy enough.

Come to think of it, just benchmarking a single-threaded run of enough lines of each against each other should give you an idea, especially with even basic profiling from time(1) to estimate how much was CPU and how much was I/O.

That said, I don't think that such a discrepancy, if it exists, takes away from what I took to be the author's overall point, which is that old, simple tools, are underused, despite being much faster and good enough, i.e. worse is better [1].

[1] This an example of sacrificing correctness for simplicity in implementation, something detailed in the "rise of worse is better" essay, not merely an excuse for sloppiness in engineering.

taneq · on July 9, 2018

That's the one, thanks!

adamdrake · on July 9, 2018

Author here. If I can answer any questions, feel free to ask away!

taneq · on July 11, 2018

No real questions, just that I love reading about this sort of optimization - Abrash's Black Book is a favourite that still gets pulled out every now and then. Thanks for the fun post!

It always astounds me these days when someone manages to release slow software for a desktop computer despite modern systems being orders of magnitude faster than the first desktop computers while often being no more responsive.

Edit: Also I'm amused by how much of a nerve this seems to have hit. I guess some people are defensive about their high performance computing approaches... :)

mmt · on July 11, 2018

> I guess some people are defensive about their high performance computing approaches...

Fortunately, when articles about real HPC clusters (aka Supercomputers) make it to the front page, they need no defense, only an occasional explanation that what makes them (extra) special are the high-bandwidth, low-latency interconnects. (Those interconnects make a very strong effort at Fallacies of Distributed Computing numbers 2 and 3[1] and tend to neatly take care of the remaining 6).

To be fair, though, I don't think the claim for the more common distributed systems is that they're "high performance" so much as that they're scalable (and have other benefits of being multi-node like resilience).

[1] The bandwidth may well be indistinguishable from infinite if it exceeds local (e.g. CPU-memory, CPU-CPU, CPU-GPU) bandwidths. I don't think that's yet true for something like multi-socket Intel systems, but it might be possible with enough interface cards. I didn't look at the specs on those POWER9 chips.

itronitron · on July 10, 2018

I'm interested to know if you have any pointers on learning resources for awk.

adamdrake · on July 10, 2018

I use it from time to time, but I certainly wouldn't consider myself an expert. When I need to do something beyond my current knowledge I search around a bit until I find the awk syntax I need to get the job done. Sorry I can't be more helpful!

mmt · on July 10, 2018

I'll second the sibling comment about the original AWK book (The AWK Programming Language).

I'd also suggest just using it more. If you find yourself wanting to use grep, cut or sed (especially if you need more than one!) for a one-liner, try using awk instead, if only for the practice. Once you're accustomed to some of the idioms, simplicity, and built-in looping, it will feel like a more natural, casual text-processing tool that you'd reach for automatically.

Lastly, search for "cookbooks" or libraries or other collections of useful scripts to see how others have used the language, in more advanced ways. Personally, I don't necessarily find it as something to aspire to (i.e. use of the language for complex, multi-line programs) as much as a way to better understand some of the power of the quirkier features that may, otherwise, be less obvious.

One of my favorite non-obvious behaviors is bypassing the input/loop functionality by having the only pattern be BEGIN. This results in an output-only program that makes for a convenient way of playing around with things like printf or just language syntax features (or differences between awk/mawk/gawk), without having to redirect input from /dev/null or worrying about hitting control-D twice and exiting the shell (for those of us who abhor ignoreeof):

  awk 'BEGIN{print "Hello, world!"}'

pedro84 · on July 10, 2018

I learned from The AWK Programming Language:

https://news.ycombinator.com/item?id=13451454

adamdrake · on July 9, 2018

If this kind of stuff is of interest for you, I wrote up some similar ideas (albeit in Go instead of standard command-line tools).

366,000 RPS on a laptop: https://adamdrake.com/big-data-small-machine.html

yzmtf2008 · on July 10, 2018

Scalability! But at what COST?: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

segmondy · on July 10, 2018

The one I see if is companies using ridiculous ETL tools to run reports that take 100x what a regular SQL takes.

TrevorJ · on July 9, 2018

That's axiomatic of the direction consumer software has gone over the last 20 years as well. How a word processor can feel sluggish on a modern PC is beyond me.

mmt · on July 10, 2018

It's not just consumer software and not just the last 20 years.

There is an axiom that software gets bigger, for no particularly good reason, for at least 40 years.

This can be witnessed in how the size of /bin/true on Unix went from 0 bytes to whatever size it is today [1]. I can't find it now, but there was an article with a bar chart showing a gradual progression through time.

Of course, on-disk size doesn't translate directly to performance, but the trivial example demonstrates mechanisms by which bloat/cruft can accumulate so gradually that no outcry is ever caused.

It also refutes the sibling/downthread comment the "cruft" is (necessarily) someone else's use case.

[1] e.g. 13k MacOS native, 52k GNU coreutils from MacPorts

majewsky · on July 10, 2018

> There is an axiom that software gets bigger, for no particularly good reason, for at least 40 years.

Sounds like a computer variant of Parkinson's law: "Work expands to fill the time available for its completion."

mmt · on July 10, 2018

Close, but not quite, since what's available (space in the specific example) isn't being filled.

Alternatively, with the complaint that desktop software feels more sluggish on a modern computer than contemporaneous software did on much older hardware suggests that something is even being overfilled.

My intuition says there's a different phenomenon of psychology at work here.

quickthrower2 · on July 10, 2018

Not experienced a sluggish word processor recently (using 6 yr old lenovo laptop, both Word and OpenOffice are fine)

majewsky · on July 10, 2018

Word 2016 takes 2 seconds to start on my pretty beefy notebook (2017 Lenovo X1 Yoga). Pretty sure that when you measure it, Word 2003 or so would take a similar amount of time to start on a 2003 PC, even though all of CPU, RAM and disk storage are vastly faster nowadays.

Why does the Word window not appear within milliseconds of me clicking on the application icon?

zeusk · on July 9, 2018

Because that word processor can do a lot more things and is not the sole thing running on a modern machine.

_emacsomancer_ · on July 9, 2018

and, one would imagine, a lot of accumulated cruft.

zeusk · on July 10, 2018

what you're calling cruft is some other person's usecase.

frankling_ · on July 10, 2018

I'd still prioritize the basic use case of comfortably typing words, which to my admittedly sensitive mind is not supported that well anymore by some common pieces of software. Between the apparent round-trip per keypress in Outlook Web, the noticeable general display delay of Windows 10, and the frequent wait for characters to appear in a reasonably small Powerpoint presentation, I feel like I might be holding it wrong.

TeMPOraL · on July 10, 2018

Quite often it would not be usecases, but externalities. That is, trading the quality and efficiency of your product for some benefits in selling or developing.

Software, for better or worse, is a social enterprise, so you can get away with dumping a lot on your users before they pack up and look for alternatives.

_emacsomancer_ · on July 11, 2018

Ok, and a lot of non-user-facing cruft too.

adamdrake · on July 9, 2018

Or, as is my preference, how to analyze 366,000 Records per Second on my laptop: https://adamdrake.com/big-data-small-machine.html

And that's with non-optimized code!

mmt · on July 10, 2018

You may want to re-submit that, since it didn't get any traction 38 days ago but might today, if this article and this comment get enough eyeballs.

I now hear actual data scientists complain that startups trying to hire them think they have "big data" problems when all they have are "data" problems.

itronitron · on July 10, 2018

That sounds like something AI can help solve. /s

golanggeek · on July 9, 2018

cwinter does Locust also support scalability across machines.. what is the timeline for it to be production ready..