So this person says every CPU cycle matters and then immediately takes the single CPU cycle and multiplies it by billions, the scale of their data.
So no, it's not "every" CPU cycle, it's the ones that scale with the highest dimension of your data that matter. Which is the same old story we have always had: save your energy for optimising the parts that matter, because the ones that matter probably matter orders of magnitude more than the ones that don't.
>If AUTOCOMMIT = ON (jdbc driver default), each statement is treated as a complete transaction. When a statement completes changes are automatically committed to the database.
> When AUTOCOMMIT = OFF, the transaction continues until manually run COMMIT or ROLLBACK. The locks are kept on objects for transaction duration.
This made me cringe. Whether a series of operations takes place in one transaction or many isn't something you can just turn on and off depending on what looks more expensive!
The article ended up suggesting more transactionality, which is generally good (although the reason given is not the important one, namely that you're less likely to have all your data completely ruined). But if you make the process distributed and aren't careful about sharding, you may end up trading average-case cost in network load for much worse worst-case cost due to lock contention and transaction failures.
Optimizing database access patterns at scale is hard, and blithely making major changes to things that impact correctness is not the way to do it.
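For anyone not familiar with the JDBC setting being quoted, here's a minimal sketch of the two autocommit modes (the connection string, table, and column names are hypothetical, not taken from the post):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AutocommitSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://example-host:5433/db", "user", "pass")) {

            // Driver default: autocommit on, so each executeUpdate() is its own
            // transaction and is committed (and its locks released) immediately.
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO events (id) VALUES (?)")) {
                ps.setLong(1, 1L);
                ps.executeUpdate();
            }

            // Autocommit off: everything below is one transaction, and locks are
            // held until commit() or rollback().
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO events (id) VALUES (?)")) {
                for (long id = 2; id <= 1_000; id++) {
                    ps.setLong(1, id);
                    ps.executeUpdate();
                }
                conn.commit();   // one commit for the whole batch
            } catch (SQLException e) {
                conn.rollback(); // undo everything since setAutoCommit(false)
                throw e;
            }
        }
    }
}
```

Which statements share a transaction is a correctness decision first and a performance knob second, which is exactly the point being made above.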
There should probably have been a little more of an explanation on the post, but: 99% of the connections that touch Vertica, for us, are read-only. Actually, we have no system in which there is mixed read/write to Vertica. Either a system reads from it, or writes to it. That made it very easy for us to figure out where to turn off autocommit, and to do it without losing any of the aforementioned correctness.
In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem and designing systems that move the least amount of data is the real challenge.
No it is not, not even close. At a job nearly 10 years ago, we had 50TB in plain old-fashioned Oracle, and we knew people with 200TB in theirs (you would be surprised who, if I told you). A few TB these days you could crunch quite easily on a high-end desktop or a single midrange server, using entirely conventional techniques.
Yep, this is a great point. The data locality/reducing IO is huge, but the way things actually play out for us, when data isn't segmented/partitioned properly it chews up CPU/memory. This is a lot of why the post was geared around CPU usage: concurrency in Vertica can be a little tricky, and stabilizing compute across the cluster has paid more dividends than any storage or network subsystem tweaks we've made.
We're not at the PB/day mark, though, so there are definitely classes of problems we're blissfully ignorant of. :)
The major takeaway I had from my courses in data-intensive applications was that IO is all that matters. It is the limiting factor to such an extent that you don't really care about algorithmic efficiency with regard to CPU calculations or memory.
You analyse algorithms in terms of IO accesses, and specifically the access pattern. If you cannot structure the algorithm to work in a scanning fashion, you're in for a bad time.
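A toy sketch of the scanning vs. random access distinction (the file path is whatever large file you point it at; absolute numbers will vary wildly with the drive and the page cache):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class AccessPatternSketch {
    public static void main(String[] args) throws IOException {
        String path = args[0];            // any file much larger than RAM/cache
        byte[] block = new byte[1 << 20]; // 1 MiB blocks

        // Scanning access: one forward pass, friendly to read-ahead.
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long t0 = System.nanoTime();
            while (f.read(block) != -1) { /* process block */ }
            System.out.printf("sequential: %d ms%n", (System.nanoTime() - t0) / 1_000_000);
        }

        // Random access: seek before every read; each seek is a potential
        // full disk round trip, which is what kills non-scanning algorithms.
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            Random rnd = new Random(42);
            long blocks = f.length() / block.length;
            long t0 = System.nanoTime();
            for (long i = 0; i < blocks; i++) {
                f.seek((long) (rnd.nextDouble() * (f.length() - block.length)));
                f.readFully(block);
            }
            System.out.printf("random:     %d ms%n", (System.nanoTime() - t0) / 1_000_000);
        }
    }
}
```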
There is always a balance here between CPU and IO. For a long time, databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at understanding how to store data via sorted and compressed columnar formats and how to exploit data locality via segmentation and partitioning. As such, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.
After skimming the paper, I'm fairly confident it's not the same at all. We only covered the theoretical side of a scenario with multi-TB hard drives across multiple machines. Any efficient algorithm would work in a scanning manner and not seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters.
From the paper, the following three quotes highlight exactly why they were CPU bound:
> We found that if we instead ran queries on uncompressed data, most queries became I/O bound
> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object
> for some queries, as much as half of the CPU time is spent deserializing and decompressing data
CPU is probably not the best example, but the point is very valid: at 100B scale, anything is large.
We humans are not very good at appreciating orders of magnitude. I usually explain it this way: if it takes you 1 hour to process 1M records, then 10M will take 10 hours, and 100M will take 4.2 days while 10B will take over a year.
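The same back-of-the-envelope math in code, assuming perfectly linear scaling from the 1M-records-per-hour baseline:

```java
public class ScalingMath {
    public static void main(String[] args) {
        double recordsPerHour = 1_000_000;  // the "1 hour for 1M records" baseline
        long[] counts = {10_000_000L, 100_000_000L, 10_000_000_000L};
        for (long n : counts) {
            double hours = n / recordsPerHour;
            System.out.printf("%,d records -> %,.0f hours (%.1f days)%n",
                    n, hours, hours / 24);
        }
        // 10M  ->     10 hours
        // 100M ->    100 hours (~4.2 days)
        // 10B  -> 10,000 hours (~417 days, i.e. over a year)
    }
}
```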
Sorry about the attribution. I'm trying to find who controls the blog as we speak so I can have them add it. (I work at Localytics, but I'm not the author.)
We've gingerly explored flame graphs to understand Vertica behavior under load, and we still have a lot that we want to try and use it for. I'm not sure if it will make an appearance in a further post, but we've definitely used your perf_event/ftrace-based tooling. :)
@brendangregg It was my bad, and you're totally right. I should not have let that fall to the wayside here. We're adding it ASAP. I'm sorry about that.
Since you are here, how do you handle CPU (or code) graphing on distributed systems? In his case, you don't control which nodes the query will execute on, or when. Any tips?
"Different data types will force Vertica to use a different number of CPU cycles to process a data point" At the end of the day that performance bump comes down to the data point itself, sometimes the decrease in that CPU cycle wouldn't be as significant as expected.
Would love to see if the performance bump is highly significant on a much larger and complex data set.
Am I misunderstanding something? If one CPU cycle accounts for 27 seconds, then the savings of 10 seconds suggest we saved one half of a CPU cycle per iteration? Or do the queries not touch every row?
Optimizing data types and minimizing locks seem like general optimization tips; I was hoping for more advanced techniques for 100B rows.
One CPU cycle per row saves 27 seconds of one CPU's time. That 10 seconds was saved on every CPU in the cluster. So if there were 50 CPUs, that's 500 seconds, or ~20 cycles per record, by the original calculation.
In reality, the change in data type probably optimized disk access more than it did number of CPU cycles. That can often be more of a bottleneck.
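For anyone reconstructing the arithmetic, here's a rough sketch; the ~3.7 GHz clock is back-derived from the 27-second figure, and the 50-CPU cluster size is the hypothetical from the comment above, not a number from the article:

```java
public class CycleMath {
    public static void main(String[] args) {
        double rows = 100e9;   // 100B rows
        double hz = 3.7e9;     // assumed ~3.7 GHz core (implied by 100B cycles ~= 27 s)

        double secondsPerCyclePerRow = rows / hz;
        System.out.printf("1 cycle per row over 100B rows ~= %.0f s of one core%n",
                secondsPerCyclePerRow);

        int cpus = 50;                       // hypothetical cluster size
        double savedCpuSeconds = 10 * cpus;  // 10 s saved on every CPU
        System.out.printf("10 s saved on %d CPUs ~= %.0f cycles saved per row%n",
                cpus, savedCpuSeconds / secondsPerCyclePerRow);
        // Prints roughly 27 s and ~19 cycles per row, i.e. the "~20 cycles" above.
    }
}
```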
Totally agree with the "Every CPU cycle matters". It might be easier to save CPU cycles by saving I/O, utilizing data locality (within datacenter racks), or using better serialization (binary, columnar, or indexed).
Reducing locking and using shorter data types seem inadequate for the "Big Data" scene.
You are exactly right; however, Vertica already handles the data locality, columnar data storage, and data compression for us. Vertica is so good at its job that we are CPU bound on most queries, and these strategies around reducing locking and using shorter data types make a difference.
Because a couple decades ago Java convinced Enterprise Land that they can't hire millions of C++ jockeys and expect them to work effectively in huge projects that plan to evolve into the next decades' (aka: the present's) legacy mudball. Instead, they decided it would be easier to hire millions of Java jockeys and have them build enormous kiln-fired mudballs using the same architectural strategy as the Egyptian pyramids. They convinced academia to raise an entire generation of Java jockeys, hired them all right out of school, and set them immediately to piling up enormous mud bricks forever.
So, now they have a few million Java jockeys churning away and a few million person-decades of work put into their mud piles. When starting any new project, there isn't much question about how to build it: More Mud!
As an embedded developer, where every cycle counts, I have the same question as the poster above: why bother with such languages? If a switch processes packets at line rate with the use of ASICs, why not have a similar development in the world of big data?
You assume that the JVM is slow, yes? That's not always the case. Interestingly, there are cases where JVM applications run just as fast as, if not faster than, native code. This blows my mind, as a C++ programmer myself.
Once compiled to native code, which it will be for big data because the same classes are reused over and over, I would assume it would be in the same ballpark as C/C++ code.
There's still a pretty big speed penalty for Java because the object model encourages a lot of pointer-chasing, which will blow your data locality. In C++, it's common for contained structs to be flat in memory, so accessing a data member in them is just an offset from a base address. In Java, all Object types are really pointers, which you need to dereference to get the contained object. HotSpot can't really optimize this beyond putting really frequently used objects in registers.
A lot of big-data work involves pulling out struct fields from a deeply nested composite record, and then performing some manipulation on them.
Memory indirection is the biggest issue indeed. However, I'd also add that Java has a terrible performance model as a language. Unless you stick to primitives only, the abstraction costs start to add up (beyond pointer chasing). It shoves the entire optimization burden onto the JVM, which by the time it runs has lost a bunch of semantic and type information in some cases. There are also codegen deficiencies in the current HotSpot C2 compiler (i.e. generated code is subpar compared to roughly equivalent gcc output).
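A small illustration of the layout difference being described; it's a sketch, not a benchmark, but the shape of the two loops is the point: the object version chases a reference per element, while the primitive array is one contiguous, cache-friendly scan:

```java
public class LayoutSketch {
    // "Row object" layout: an Event[] is really an array of pointers, and each
    // field access has to dereference one of them into wherever the object
    // landed on the heap.
    static final class Event {
        long userId;
        double amount;
        Event(long userId, double amount) { this.userId = userId; this.amount = amount; }
    }

    public static void main(String[] args) {
        int n = 10_000_000;

        Event[] objects = new Event[n];
        double[] amounts = new double[n];   // flat, columnar-style layout
        for (int i = 0; i < n; i++) {
            objects[i] = new Event(i, i * 0.5);
            amounts[i] = i * 0.5;
        }

        // Pointer-chasing loop: one dereference per element.
        double sum1 = 0;
        for (Event e : objects) sum1 += e.amount;

        // Flat loop: sequential walk over a contiguous double[].
        double sum2 = 0;
        for (double a : amounts) sum2 += a;

        System.out.println(sum1 == sum2);
    }
}
```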
I think this trend may stop soon. There are already OSS big data projects written in more performant languages (e.g. C++) coming around (e.g. ScyllaDB, Cloudera's Kudu).
Rust is exciting, no doubt, and I have high hopes for its adoption, but I've personally not seen/heard of any visible OSS big data style projects using it. I see Frank McSherry's stuff has been mentioned, but I think that's still his pet project (hopefully not putting words in his mouth).
But really, I was using C++ as an example of something more fit for these types of projects than Java; it doesn't have to be only C++, of course.
Rust has Frank McSherry (formerly working on Naiad for Microsoft Research) and his work on timely dataflow and differential dataflow: https://github.com/frankmcsherry/blog
Most JVM-based query engines use bytecode generation, and once the JIT compiler decides that a code block is hot enough and generates native code for the generated bytecode, the output is essentially identical to C and C++.
The author actually indicates that every CPU cycle is important for code blocks that will be executed for each row. So once you optimize the hot code blocks, you're good to go.
Data access patterns are much more important than hot code optimization. Sadly, Java offers few options on this front (until maybe Java 9, when value types might become a thing).
Modern CPUs have DRAM fetch times in the hundreds of cycles. Any cache-friendly algorithm is going to run circles around something that plays pointer pinball instead.
This is why bytecode generation is used by query engines. It isn't meant to be used for creating ArrayLists or HashMaps. Generally, these engines work with buffers instead of objects to avoid the issues you mentioned, as well as garbage collection pressure.
Let's say we want to compile the predicate expression "bigintColumn > 4 and varcharColumn = 'str'". A generic interpreter would suffer from the issues described above, but if you generate bytecode for the Java source "return longPrimitive > 4 && readAndCompare(buffer, 3, "str".getBytes(UTF8))", then you won't create even a single Java object, and the output is usually identical to C and C++.
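Here's a hand-written approximation of what such generated code might look like; the offsets and the readAndCompare helper are illustrative, not any particular engine's API:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class GeneratedPredicateSketch {
    // Evaluates: bigintColumn > 4 AND varcharColumn = 'str', reading both values
    // straight out of a row buffer without materializing any Java objects.
    static boolean eval(ByteBuffer row, int bigintOffset, int varcharOffset, byte[] target) {
        return row.getLong(bigintOffset) > 4
            && readAndCompare(row, varcharOffset, target);
    }

    // Byte-wise comparison against the buffer; a real engine would also check
    // the stored string length.
    static boolean readAndCompare(ByteBuffer row, int offset, byte[] target) {
        for (int i = 0; i < target.length; i++) {
            if (row.get(offset + i) != target[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer row = ByteBuffer.allocate(16);
        row.putLong(0, 7L);                                  // bigintColumn = 7
        row.position(8);
        row.put("str".getBytes(StandardCharsets.UTF_8));     // varcharColumn = 'str'
        System.out.println(eval(row, 0, 8, "str".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```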
Wouldn't you still pay the bounds-checking penalty on those buffers, though? Also, anything that uses floating point will probably be trashed by int->float conversion (unless JVM bytecode has a load-to-float-from-address instruction, although I freely admit that I know less about bytecode than plain Java).
Either way, the average Java dev isn't going to be writing bytecode, so I feel like C/C++ still has the advantage in performance cases.
If you use ByteBuffer, then yes, the application may suffer from unnecessary checks. However, the performance of ByteBuffer is usually not good enough anyway, which is why people use off-heap buffers via sun.misc.Unsafe, whose native calls allocate memory off-heap.
Also, bytecode has instructions for all the primitive types; otherwise there wouldn't be any point in having primitive types in the Java language, since the language is ultimately compiled to bytecode instructions.
There are solutions for all the issues raised, but they take much more work to implement in Java than in C++. However, once you solve this specific problem (I admit that it's not a small one), there are lots of benefits to using Java compared to C++.
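A minimal sketch of the off-heap idea using a direct ByteBuffer (the simplest portable route; engines chasing the last bit of performance drop down to sun.misc.Unsafe directly, which is not shown here):

```java
import java.nio.ByteBuffer;

public class OffHeapSketch {
    public static void main(String[] args) {
        int rows = 1_000_000;

        // A direct buffer lives outside the Java heap, so filling it creates no
        // per-row objects for the garbage collector to track.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(rows * Long.BYTES);

        for (int i = 0; i < rows; i++) {
            offHeap.putLong(i * Long.BYTES, i);   // no boxing, no object churn
        }

        long sum = 0;
        for (int i = 0; i < rows; i++) {
            sum += offHeap.getLong(i * Long.BYTES);
        }
        System.out.println(sum);   // 499999500000
    }
}
```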
When I was taking my CS degree, the definition of big data for data-intensive applications was exactly that: you cannot have it all in memory and will have to use the disk.
If it fits in memory, you can honestly apply normal algorithmic analysis and optimize for memory access and CPU cycles. Once it no longer fits in memory, you become severely limited by IO.