Performance in Big Data Land: Every CPU cycle matters (localytics.com)
66 points by ptothek2 on Oct 28, 2015 | 50 comments



So this person says every CPU cycle matters and then immediately takes the single CPU cycle and multiplies it by billions, the scale of their data.

So no, it's not "every" CPU cycle, it's the ones that scale with the highest dimension of your data that matter. Which is the same old story we have always had: save your energy for optimising the parts that matter, because those probably matter orders of magnitude more than the ones that don't.


>If AUTOCOMMIT = ON (jdbc driver default), each statement is treated as a complete transaction. When a statement completes changes are automatically committed to the database. When AUTOCOMMIT = OFF, the transaction continues until manually run COMMIT or ROLLBACK. The locks are kept on objects for transaction duration.

This made me cringe. Whether a series of operations takes place in one transaction or many isn't something you can just turn on and off depending on what looks more expensive!

The article ended up suggesting more transactionality, which is generally good (although the reason given is not the important one, namely "you're less likely to have all your data completely ruined"), but if you make the process distributed and aren't careful about sharding you may end up trading average-case cost in network load for much worse worst-case cost due to lock contention and transaction failures.

Optimizing database access patterns at scale is hard, and blithely making major changes to things that impact correctness is not the way to do it.
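
For anyone who hasn't seen the knob in question: it's just Connection.setAutoCommit on the JDBC connection. A minimal sketch of the batched-transaction version (the Vertica URL, credentials, and table here are made-up placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class AutocommitSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; swap in a real Vertica JDBC URL.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:vertica://host:5433/db", "user", "password")) {

                // Driver default is AUTOCOMMIT = ON: every execute() is its own transaction.
                // Turning it off lets many statements share one commit (and one set of locks).
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                    for (int i = 0; i < 1000; i++) {
                        ps.setLong(1, i);
                        ps.setString(2, "row-" + i);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
                conn.commit();  // one commit instead of 1000
                // If anything above throws, nothing was committed, which is the
                // correctness point: it's all-or-nothing, not just "cheaper".
            }
        }
    }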


There should probably have been a little more of an explanation on the post, but: 99% of the connections that touch Vertica, for us, are read-only. Actually, we have no system in which there is mixed read/write to Vertica. Either a system reads from it, or writes to it. That made it very easy for us to figure out where to turn off autocommit, and to do it without losing any of the aforementioned correctness.


Is 100 Billion (order of a few TB) Big Data?

In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem and designing systems that move the least amount of data is the real challenge.


No it is not, not even close. At a job nearly 10 years ago, we had 50TB in plain old-fashioned Oracle, and we knew people with 200TB in theirs (you would be surprised who if I told you). A few TB these days, you could crunch quite easily on a high-end desktop or a single midrange server, using entirely conventional techniques.


(Toby from Localytics here)

Yep, this is a great point. Data locality/reducing IO is huge, but the way things actually play out for us is that when data isn't segmented/partitioned properly, it chews up CPU/memory. This is a lot of why the post was geared around CPU usage: concurrency in Vertica can be a little tricky, and stabilizing compute across the cluster has paid more dividends than any storage or network subsystem tweaks we've made.

We're not at the PB/day mark, though, so there are definitely classes of problems we are blissfully ignorant of. :)


The major takeaway I had from my courses in data intensive applications was that IO is all that matters. It is the limiting factor to such an extent that you don't really care about algorithmic efficiency with regard to CPU calculations or memory.

You analyse algorithms in terms of IO access, and specifically the access pattern. If you cannot make the algorithm work in a scanning fashion, you're in for a bad time.


There is always a balance here between CPU and IO. For a long time, databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at understanding how to store data via sorted and compressed columnar formats and how to exploit data locality via segmentation and partitioning. As such, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.

https://www.usenix.org/conference/nsdi15/technical-sessions/...
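
To make that tradeoff concrete, here's a toy run-length encoding of a sorted column (not how Vertica actually stores data, just the general shape of the idea): fewer bytes to read from disk, at the cost of CPU to walk the runs back out at query time.

    import java.util.ArrayList;
    import java.util.List;

    public class RleSketch {
        // A run is (value, count); a sorted column collapses into very few runs.
        record Run(long value, long count) {}

        static List<Run> encode(long[] sortedColumn) {
            List<Run> runs = new ArrayList<>();
            int i = 0;
            while (i < sortedColumn.length) {
                int j = i;
                while (j < sortedColumn.length && sortedColumn[j] == sortedColumn[i]) j++;
                runs.add(new Run(sortedColumn[i], j - i));  // fewer bytes to store and read...
                i = j;
            }
            return runs;
        }

        // ...but every scan spends CPU walking the runs instead of raw values.
        static long countGreaterThan(List<Run> runs, long threshold) {
            long matches = 0;
            for (Run r : runs) {
                if (r.value() > threshold) matches += r.count();
            }
            return matches;
        }

        public static void main(String[] args) {
            long[] column = {1, 1, 1, 2, 2, 3, 3, 3, 3};
            List<Run> runs = encode(column);
            System.out.println(runs.size() + " runs, " + countGreaterThan(runs, 1) + " values > 1");
        }
    }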


After skimming the paper, I'm fairly confident it's not the same at all. We only managed the theoretical side of a scenario where there would be multi-TB hard drives on multiple machines. Any efficient algorithm would work in a scanning manner, and not seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters.

From the paper, the following three quotes highlight exactly why they were CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and decompressing data


Specifically network traffic, rather than disk read/write.


CPU is probably not the best example, but the point is very valid: at 100B scale, anything is large.

We humans are not very good at appreciating orders of magnitude. I usually explain it this way: if it takes you 1 hour to process 1M records, then 10M will take 10 hours, and 100M will take 4.2 days while 10B will take over a year.
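
It's just linear scaling, but writing it out helps; a quick sketch assuming the same 1M-records-per-hour throughput:

    public class ScaleSketch {
        public static void main(String[] args) {
            double recordsPerHour = 1_000_000;  // assumed baseline: 1M records per hour
            long[] sizes = {10_000_000L, 100_000_000L, 10_000_000_000L};
            for (long n : sizes) {
                double hours = n / recordsPerHour;
                // 10M -> 10 hours, 100M -> ~4.2 days, 10B -> ~417 days
                System.out.printf("%,d records -> %.0f hours (%.1f days)%n",
                        n, hours, hours / 24);
            }
        }
    }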


I hope later posts in this series explore Linux perf_events or flame graphs, which is the origin of the (unattributed) background image (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html). :)


Heyo!

Sorry about the attribution. I'm trying to find who controls the blog as we speak so I can have them add it. (I work at Localytics, but I'm not the author.)

We've gingerly explored flame graphs to understand Vertica behavior under load, and we still have a lot that we want to try and use it for. I'm not sure if it will make an appearance in a further post, but we've definitely used your perf_event/ftrace-based tooling. :)


@brendangregg It was my bad, and you're totally right. I should not have let that fall to the wayside here. We're adding it ASAP. I'm sorry about that.


Since you are here, how do you handle CPU (or code) graphing on distributed systems? In his case, you do not control which nodes the query will run on, or when it will be executed. Any tips?


"Different data types will force Vertica to use a different number of CPU cycles to process a data point" At the end of the day that performance bump comes down to the data point itself, sometimes the decrease in that CPU cycle wouldn't be as significant as expected.

Would love to see whether the performance bump is still highly significant on a much larger and more complex data set.


Am I misunderstanding something? If one CPU cycle accounts for 27 seconds, then the savings of 10 seconds suggest we saved one half of a CPU cycle per iteration? Or do the queries not touch every row?

Optimizing data types and minimizing locks seem like general optimization tips; I was hoping for more advanced techniques for 100B rows.


One CPU cycle per row saves 27 seconds of one CPU's time. That 10 seconds was saved on every CPU in the cluster. So if there were 50 CPUs, that's 500 seconds, or ~20 cycles per record, by the original calculation.
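
Spelled out, using the clock rate implied by the article's 27-second figure and the hypothetical 50-CPU cluster above:

    public class CycleMath {
        public static void main(String[] args) {
            double rows = 100e9;                 // 100B records
            double secondsPerCyclePerRow = 27.0; // 1 cycle/row == ~27s of one CPU's time
            double clockHz = rows / secondsPerCyclePerRow;  // implied clock: ~3.7 GHz

            int cpus = 50;                       // hypothetical cluster size from above
            double savedCpuSeconds = 10.0 * cpus;
            double cyclesPerRow = savedCpuSeconds * clockHz / rows;

            System.out.printf("implied clock %.1f GHz, saving ~%.0f cycles per row%n",
                    clockHz / 1e9, cyclesPerRow);  // ~3.7 GHz, ~19 cycles per row
        }
    }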

In reality, the change in data type probably optimized disk access more than it did number of CPU cycles. That can often be more of a bottleneck.


Totally agree with the "Every CPU cycle matters". It might be easier to save CPU cycles by saving I/O, utilizing data locality (within datacenter racks), or even better serialization (binary, columnar, or indexed).

Reducing locking and using shorter data types seem inadequate for the "Big Data" scene.


You are exactly right; however, Vertica already handles the data locality, columnar data storage, and data compression for us. Vertica is so good at its job that we are CPU bound on most queries, and that's where these types of strategies around reducing locking and using shorter data types make a difference.


Then why is big data land dominated by JVM-based frameworks?


Because a couple decades ago Java convinced Enterprise Land that they can't hire millions of C++ jockeys and expect them to work effectively in huge projects that plan to evolve into the next decades' (aka: the present's) legacy mudball. Instead, they decided it would be easier to hire millions of Java jockeys and have them build enormous kiln-fired mudballs using the same architectural strategy as the Egyptian pyramids. They convinced academia to raise an entire generation of Java jockeys, hired them all right out of school, and set them immediately to piling up enormous mud bricks forever.

So, now they have a few million Java jockeys churning away and a few million person-decades of work put into their mud piles. When starting any new project, there isn't much question about how to build it: More Mud!


This is the problem here.

As an embedded developer, where every cycle counts, I have the same question as the poster above: why bother with such languages? If a switch can process packets at line rate with the use of ASICs, why not have some similar development in the world of big data?


Thank you


You assume that the JVM is slow, yes? That's not always the case. Interestingly, there are cases where JVM applications run just as fast as, if not faster than, native code. This blows my mind as a C++ programmer myself.

http://codexpi.com/java-vs-cpp-performance-comparison-jit-co...

http://stackoverflow.com/questions/5641356/why-is-it-that-by...

http://beautynbits.blogspot.com/2013/01/performance-java-vs-...


Once compiled to native code, which it will be for big data because the same classes are reused over and over, I would assume it would be in the same ballpark as C/C++ code.


There's still a pretty big speed penalty for Java because the object model encourages a lot of pointer-chasing, which will blow your data locality. In C++, it's common for contained structs to be flat in memory, so accessing a data member in them is just an offset from a base address. In Java, all Object types are really pointers, which you need to dereference to get the contained object. HotSpot can't really optimize this beyond putting really frequently used objects in registers.

A lot of big-data work involves pulling out struct fields from a deeply nested composite record, and then performing some manipulation on them.
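
A concrete way to see the difference in Java itself is to compare an array of objects against parallel primitive arrays (a struct-of-arrays layout). This is just an illustrative sketch with made-up field names:

    public class LayoutSketch {
        // "Array of objects": every Event is a separate heap object, so summing
        // a field means chasing a pointer per element.
        static final class Event {
            final long userId;
            final double value;
            Event(long userId, double value) { this.userId = userId; this.value = value; }
        }

        static double sumPointerChasing(Event[] events) {
            double total = 0;
            for (Event e : events) total += e.value;  // one dereference per row
            return total;
        }

        // "Struct of arrays": fields live in flat primitive arrays, so the scan
        // is sequential memory traffic with no per-row indirection.
        static double sumFlat(double[] values) {
            double total = 0;
            for (double v : values) total += v;
            return total;
        }

        public static void main(String[] args) {
            int n = 1_000_000;
            Event[] events = new Event[n];
            double[] values = new double[n];
            for (int i = 0; i < n; i++) {
                events[i] = new Event(i, i * 0.5);
                values[i] = i * 0.5;
            }
            System.out.println(sumPointerChasing(events) + " == " + sumFlat(values));
        }
    }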


Listen to the parent here; I've seen 10x performance improvements in production Java code just from using flatbuffers (and paying the marshalling costs from ByteBuffer).

50x is not unreasonable for C/C++ code that was OO and moves to a data-oriented approach instead.


Memory indirection is the biggest issue indeed. However, I'd also add that Java has a terrible performance model as a language. Unless you stick to primitives only, the abstraction costs start to add up (beyond pointer chasing). It shoves the entire optimization burden onto the JVM, which by the time it runs has lost a bunch of semantic and type information in some cases. There are also codegen deficiencies in the current HotSpot C2 compiler (i.e. the generated code is subpar compared to roughly equivalent gcc output).


> In C++, it's common for contained structs to be flat in memory, so accessing a data member in them is just an offset from a base address

The JVM inlines virtual method calls as one of its optimizations. See: http://www.oracle.com/technetwork/java/whitepaper-135217.htm...


How is that related to the parent's point?


I think this trend may stop soon. There are already OSS big data projects written in more performant languages (e.g. C++) coming around (e.g. ScyllaDB, Cloudera's Kudu).


Welp, what about Rust?


Rust is exciting, no doubt, and I have high hopes for its adoption, but I've personally not seen/heard of any visible OSS big data style projects using it. I see Frank McSherry's stuff has been mentioned, but I think that's still his pet project (hopefully not putting words in his mouth).

But really I was using C++ as an example of something more fit for these types of projects than Java; it doesn't have to be only C++, of course.


Rust has Frank McSherry (formerly working on Naiad for Microsoft Research) and his work on timely dataflow and differential dataflow: https://github.com/frankmcsherry/blog


Most JVM-based query engines use bytecode generation, and once the JIT compiler decides that a code block is hot enough and generates native code for the generated bytecode, the output is identical to C and C++.

The author actually indicates that every CPU cycle is important for code blocks that will be executed for each row. So once you optimize the hot code blocks, you're good to go.


Data access patterns are much more important than hot code optimization. Sadly, Java offers few options on this front (until maybe Java 9, when value types might become a thing).

Modern CPUs have DRAM fetch times in the hundreds of cycles. Any cache-friendly algorithm is going to walk circles around something that plays pointer pinball instead.


This is why bytecode generation is used by query engines. They're not meant to be used for creating ArrayLists or HashMaps. Generally, they work with buffers instead of objects to avoid the issues you mentioned and to reduce garbage collection pressure.

Let's say we want to compile the predicate expression "bigintColumn > 4 and varcharColumn = 'str'". A generic interpreter would suffer from the aforementioned issues, but if you generate bytecode for Java source like "return longPrimitive > 4 && readAndCompare(buffer, 3, "str".getBytes(UTF8))", then you won't create even a single Java object, and the output is usually identical to C and C++.
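
Roughly, the hand-written equivalent of what that generated code does might look like this; readAndCompare here is a made-up helper that compares bytes in place rather than materializing a String (length handling omitted):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class GeneratedPredicateSketch {
        // Hand-written stand-in for the generated "bigintColumn > 4 AND varcharColumn = 'str'":
        // the row's values arrive as a primitive and a raw buffer, never as Java objects.
        static boolean evaluate(long longPrimitive, ByteBuffer varcharColumn,
                                int offset, byte[] literal) {
            return longPrimitive > 4 && readAndCompare(varcharColumn, offset, literal);
        }

        // Byte-wise comparison in place; nothing is deserialized into a String.
        // (Real engines also check the stored length, omitted here.)
        static boolean readAndCompare(ByteBuffer buf, int offset, byte[] expected) {
            for (int i = 0; i < expected.length; i++) {
                if (buf.get(offset + i) != expected[i]) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            byte[] literal = "str".getBytes(StandardCharsets.UTF_8);
            ByteBuffer row = ByteBuffer.wrap("str".getBytes(StandardCharsets.UTF_8));
            System.out.println(evaluate(5L, row, 0, literal));  // true
        }
    }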


Wouldn't you still pay the bounds-checking penalty on those buffers, though? Also, anything that uses floating point will probably be trashed by int->float conversion (unless JVM bytecode has a load-to-float-from-address instruction, although I freely admit that I know less about bytecode than plain Java).

Either way the average Java dev isn't going to be writing bytecode so I feel like C/C++ still has the advantage in performance cases.


If you use ByteBuffer, then yes, the application may suffer from unnecessary checks. However, the performance of ByteBuffer is usually not good enough anyway, which is why people use off-heap buffers (sun.misc.Unsafe), i.e. native calls that allocate memory off-heap.

Also, bytecode has instructions for all primitive types; otherwise there wouldn't be any point in having these primitive types in the Java language, since it all gets converted to bytecode instructions anyway.

There are solutions for all the issues raised, but they take much more work to implement in Java than in C++. However, once you solve this specific problem (I admit that it's not a small one), there are lots of benefits to using Java compared to C++.
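
For anyone curious, the off-heap route looks roughly like this. A minimal sketch; real engines wrap this much more carefully, and Unsafe may need extra flags or produce warnings on newer JDKs:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OffHeapSketch {
        public static void main(String[] args) throws Exception {
            // Unsafe isn't meant to be obtained directly, hence the reflection dance.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long count = 1_000;
            long address = unsafe.allocateMemory(count * Long.BYTES);  // off-heap, invisible to GC
            try {
                for (long i = 0; i < count; i++) {
                    unsafe.putLong(address + i * Long.BYTES, i * 2);
                }
                long sum = 0;
                for (long i = 0; i < count; i++) {
                    sum += unsafe.getLong(address + i * Long.BYTES);
                }
                System.out.println("sum = " + sum);  // 999000
            } finally {
                unsafe.freeMemory(address);          // manual lifetime management
            }
        }
    }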


Well, CPU utilization is the least of Hadoop's problems (picking on it because it is the most well-known JVM-based framework).

Hadoop core has some shockingly bad design choices (lots of disk IO), and no amount of layers on top of it is going to fix latency issues.

It has nothing to do with JVM "overhead" (which is mostly a myth, anyway).


I might suggest a new definition for "Big Data" - data whose size is greater than what fits in one machine's memory.


100 billion records, 6TB max RAM for http://www8.hp.com/uk/en/products/proliant-servers/product-d... => 54 bytes per record.

BTW, some previous HN discussions along these lines:

"Don't use Hadoop - your data isn't that big " - https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html and https://news.ycombinator.com/item?id=6398650

"Your data fits in RAM " - http://yourdatafitsinram.com/ - https://news.ycombinator.com/item?id=9581862


Some propose an even stronger definition: when the indices for accessing said data no longer fit in one machine's memory.


Obligatory half-joke:

The most commonly used definition deduced from reading online articles is, in terms of size

"More data than naively fits into the memory of my (midrange) laptop using a high overhead platform"

or in terms of speed

"More data per second than can be handled using the same naive database code we used in the 90s for our website's comment section"


The definition I like is that it's when the size of the data becomes a significant challenge to solving your problem.

For example, 1TB of data won't fit in memory, but if all you need to do is a sequential read in under a day, then it's not a problem.
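
E.g., assuming 200 MB/s of sequential disk throughput (just a ballpark figure):

    public class ScanTimeSketch {
        public static void main(String[] args) {
            double bytes = 1e12;            // 1 TB
            double bytesPerSecond = 200e6;  // assumed 200 MB/s sequential throughput
            double hours = bytes / bytesPerSecond / 3600;
            System.out.printf("~%.1f hours for one sequential pass%n", hours);  // ~1.4 hours
        }
    }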


1TB /will/ fit in memory; you can get an EC2 instance with 2TB. But your point stands.


When I was taking my CS degree, the definition of big data for data intensive applications was exactly that. You cannot have it all in memory, and will have to use the disk.

If it fits in memory, you can honestly apply normal algorithmic analysis and optimize for memory access and CPU cycles. Once it no longer fits in memory, you become severely limited by IO.


Is this really about CPU rather than disk? I don't see anywhere where he attempted to control for disk IO by padding the integers.

In fact, since Vertica is column-oriented, I don't think you can pad things easily.


I'm struggling to understand why they wouldn't just use non-locking selects instead of turning auto-commit off.

Does the auto-commit add additional lock overhead for some reason?



