> When they can choose (and there is a choice) to use collectors that don't pause, the pressure to keep allocation rates down changes, moving the "this is too much" lines up by more than an order of magnitude.
You mean - like choose the only garbage collector on earth that actually can do this - for the JVM only - which your old company wrote and keeps guarded as a heavily proprietary secret, with the currently published specifications for the design (in the C4 paper) requiring untenable, unmergeable and extensive hacks in the operating system virtual memory layer that will never be largely accepted, with the current design in your product (which alleviates this) again being a heavily guarded secret that will never be revealed, will never have source available, and costs a shitload of money? Which basically means Azul is going to have the only available pauseless GC for the foreseeable future?
Well, now that you said it like that - you're right! We really do have a choice... a choice of actually using and working with things that exist and are widely available today, or already living in a land of unlimited money for licensing of a single product that might be completely and totally irrelevant to most of us. I guess the 'choice' isn't so clear, when you say it like that.
Look, that aside, great blog post, and extremely informative. I'm a giant fan of Gene's work, seriously. But this work is Azul's brainchild and they'll never let it go. And I don't blame them for that. But drop the rhetoric, please - few people have the 'choice' of spending shitloads of money on a proprietary JVM like Azul.
Azul originally started out as a hardware company. A pauseless collector is easy as long as you have transactional memory (at a minimum, hardware which supports strong LL/SC, allowing you to build the rest in software). No commodity CPU supported this then, or does now, so Azul's first product was a specialized CPU and memory architecture with real transactional memory semantics.
But nobody wants to buy specialized hardware these days. So they ended up having to emulate transactional memory on x86. That means _emulating_ the memory architecture of their original CPU. Again, there's no secret sauce to doing this. Things like QEMU are free--which is to say, many people know how to do it. It just takes a lot of programmer hours to build it, and many, many more to optimize it.
Long story short, Azul's collector is truly pauseless, but its transactional throughput in the vast majority of workloads sucks because it's an emulator (for the JVM) built atop another emulator (for their original memory architecture). And while you can speed up emulating the JVM using JIT techniques, that doesn't work for emulating a transactional memory architecture. There are endless tweaks you can make, including playing around with memory page tables, etc, but at the end of the day you're still stuck emulating the semantics you need, causing huge latencies relative to your clock speed, which already suck on GHz-scale systems.
This won't change anytime soon. Even Intel's new TSX features are not strong enough to be able to implement true transactional memory. Real transactional memory requires a way to atomically signal contention for every word of memory, or at least lines of words. It requires not only a specialized CPU, but specialized memory controllers, and specialized memory. Something like TSX would only provide a marginal performance improvement for C4, if anything at all, because almost all their effort goes into arranging to avoid locking, not making locking faster.
While it's true that there is nothing magical in C4, pretty much everything you describe here is wrong (as in the exact opposite of what is going on). Specifically:
- C4 makes no use of hardware transactional memory.
- C4 works great (and natively) on commodity x86-64 cores.
- C4 does not emulate anything (not memory architecture, not HTM).
How you got to your posted conclusions from the link you posted is a mystery to me. Cliff doesn't say anything much about the GC mechanism there: he is talking about lessons learned in designing custom hardware. And yes, Vega had some cool GC-assist features, but they are in no way part of the C4 collector mechanism.
Since you raised the bogus, unsubstantiated assertion that "[Zing's] transactional throughput in the vast majority of workloads sucks" based purely on your mis-interpretation of what C4 actually does, I feel I must correct that notion.
To do that, I'll point out that Zing (with C4) is currently used, in production, to run some of the highest transactional throughput and most latency sensitive applications in the world. Zing's sustainable throughput (the throughput at which the system still meets the required SLA) is generally dramatically better than other JVMs running on exactly the same modern x86 hardware. Yes, you read that right: Zing gets more production-worthy TPS out of the same x86 hardware. Not less.
C4 is being used in everything from low latency trading (algo, HFT, smart routing, wire risk) to online retail (think Black Friday workloads) and travel sites. C4 is used for everything from 1GB compute-heavy and messaging workloads to 1TB data-heavy analytics applications. It powers big data workloads. It powers search. It powers Java servers of all kinds, big and small. And none of those are complaining about throughput. Quite the opposite.
If you read their published algorithm and are familiar with lock-less algorithms, it's clear that theirs is a transactional memory algorithm. Specifically, their LVB primitive. If this isn't obvious to you, I would recommend reading the seminal transactional memory algorithms from the 1970s and 1980s, including everything written by Hoare. Most of those are available from the ACM library. You particularly need to pay close attention to how wait-free algorithms are achieved.
"Transactional memory" is not a marketing term, nor a synonym for a particular set of CPU instructions. It's a class of lock-free algorithms, especially lock-free, wait-free algorithms. And the C4 collector very clearly fits into that class of algorithms. Its use of page remapping and read/write page protections is precisely how you would emulate strong transactional memory primitives on x86.
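For context on the LVB primitive being argued over: per the C4 paper, the LVB is a "self-healing" load barrier, not a transaction. A toy sketch of that self-healing behavior, simulating relocated pages and forwarding with plain dicts (illustrative names only, nothing like Azul's actual implementation, which uses virtual-memory protection and remapping):

```python
# Toy sketch of a self-healing load barrier in the style the C4 paper
# describes. Relocation state is simulated with dicts; the real mechanism
# uses page protection and remapping, not transactional memory.

forwarding = {}   # old address -> new address, for objects the GC relocated
heap = {}         # field location -> reference (an "address" in this model)

def load_ref(loc):
    """Load a reference through the barrier.

    Fast path: the reference is current; return it untouched.
    Slow path: the reference points at a relocated object; follow the
    forwarding pointer AND write the fix back ("heal"), so this location
    never takes the slow path again.
    """
    ref = heap[loc]
    if ref in forwarding:          # slow path: stale reference
        ref = forwarding[ref]
        heap[loc] = ref            # heal the memory location
    return ref

# Object at "address" 100 was relocated to 200 by the collector.
heap["field"] = 100
forwarding[100] = 200
assert load_ref("field") == 200    # barrier hands back the new address
assert heap["field"] == 200        # and has healed the stored reference
```

The self-healing part is the point of contention: each stale reference is fixed exactly once at the location it was loaded from, with no notion of aborting or retrying a transaction.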
I think this terse quotation (from their own research paper) sums up the relationship between the Vega hardware and the Linux software-based implementation:
"Azul has created commercial implementations of the C4 algorithm on three successive generations of its custom Vega hardware (custom processor instruction set, chip, system, and OS), as well on modern X86 hardware. While the algorithmic details are virtually identical between the platforms, the implementation of the LVB semantics varies significantly due to differences in available instruction sets, CPU features, and OS support capabilities." (http://www.azulsystems.com/sites/default/files/images/c4_pap...)
Regarding the performance of C4, the reason Azul doesn't publish TPC benchmarks is because there's no avoiding the immense costs of their page mapping hacks. From the paper above: "the garbage collector needs to sustain a page remapping at a rate that is 100x as high as the sustained object allocation rate for comfortable operation."
Page remapping is insanely expensive at the micro-granularity needed. They mitigate the cost by batching requests, but it's still significant. Furthermore, they must use atomic reads and writes for internal pointers. Those are cache killers.
I never said C4 can't be faster for particular workloads. Obviously for workloads sensitive to latency a pauseless collector can be faster overall. But as a general matter, those workloads are not in the majority. Ergo, for the majority of workloads C4 will not be faster, at least not on commodity hardware architectures.
You can continue to believe the hype, and believe that Azul possesses some sort of magical fairy dust, using techniques entirely beyond the comprehension of mere mortals. Or you can read about and learn how it _actually_ works. Their algorithm and implementations are all laudable and significant achievements. But there's nothing magical or secret about them.
I created the C4 algorithm. And I wrote the paper. I'm pretty sure I know what it does.
:-)
You raise another "I think this is how it works and therefore it must be slow" point in this posting that you didn't raise before, so let me address it to avoid confusion:
> "...Page remapping is insanely expensive at the micro-granularity needed."
If you actually read the paper (e.g. Table 2), you would have realized that C4 uses a page remapping mechanism that (in 2011) could sustain >700TB/sec of page remapping speed. Using my recommendation that the remapping rate needs to be able to handle 100x the allocation rate, that's enough to keep up with 7TB/sec of allocation. So I think we have some headroom. The numbers have only gotten better since with Zing's loadable module.
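The headroom arithmetic above, spelled out with the 2011 figures quoted from the C4 paper:

```python
# Headroom check using the numbers cited above (C4 paper, Table 2, 2011).
remap_rate_tb_per_s = 700         # sustainable page-remapping rate
rule_of_thumb = 100               # remap capacity should be ~100x allocation
max_alloc_tb_per_s = remap_rate_tb_per_s / rule_of_thumb
assert max_alloc_tb_per_s == 7.0  # 7TB/sec of sustainable allocation
```

Even a heavily allocating JVM workload runs at a few GB/sec, three orders of magnitude below that ceiling.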
To be more specific, when I said that C4 "emulates" the Vega architecture, I was referring to the way that you presumably need to be able to invalidate an operation when another thread concurrently reads or writes to shared memory in the middle of a transaction (i.e. copying a block and updating pointers). By emulate, I meant you need to construct a primitive that provides an efficient lock-free large block copying operation, similar to what the Vega cache system provides. Of course, that's only a primitive--you need to then build compound operations atop that, and there's more than enough engineering there to keep people busy for quite awhile.
Again, if you can claim that in fact your page mapping and protection schemes are not analogous to the transactional memory support in the Vega hardware (which looks to be both small block transactional memory tied into the cache controller, as well as some useful page-level operations built into the memory controller), then mea culpa.
Sheesh. Read the C4 paper. See if you can point to a single place in the paper where we mention transactional memory. Or a need to invalidate an operation in the middle of some transaction. Or any form of emulation. C4 simply doesn't do or make any use of that stuff.
You seem to conflate Vega's [very cool] hardware transactional memory capabilities (SMA, OTC) with GC. Vega never used transactional memory for GC purposes. It used transactional memory to support OTC and transparent concurrent execution of synchronized blocks. Nothing to do with GC, and nothing to do with C4.
And yes. I can assertively "claim" that the page mapping and protection schemes in C4 are not analogous to the transactional memory support in Vega, and have nothing to do with caches or memory controllers. Vega (just like C4 on x86) used page mapping and protection schemes for GC purposes.
So as the designer you're telling me that your lock-free (excluding the locking necessary for updating the page tables) and possibly wait-free algorithm (unsure on the last part), simply can't be modeled as a transactional memory algorithm? If you're willing to claim that, I'll acquiesce. Because excluding the low-level details and the context of what's happening, the barrier algorithms look exactly like what you'd get trying to implement, e.g., the N-word transactional memory copying algorithms from 30+ years ago--in particular what Hoare laid out in the paper proving the universality of the LL/SC primitive--in a way that works on x86 for the JVM.
Regarding performance: number of allocations is not the same as transactional throughput. If you can also claim that the object system and VM subsystem of your JVM has no overhead over a traditional lock-based system (which doesn't use extensive mapping tricks, nor needs as many atomic operations), _or_ if you can show me TPC benchmarks where C4 is faster than the other existing locking collectors, I'll acquiesce.
Otherwise, I'm quite comfortable sticking to my guns.
I think what you've written is exceptional in numerous regards. And I don't mean to diminish the work in the slightest, but to only counter the hype--namely, that the algorithm uses techniques unknowable to others, or (as sometimes discussed elsewhere) that it's solved the problem of implementing a generic transactional memory model using regular x86 primitives.
I'm sorry but this analysis is completely wrong. I don't work for Azul but have used C4 and spent time studying it. C4 does not use transactional memory, it does however employ a very elegant lock-free algorithm which allows continuous concurrent compacting collections. BTW that is why it is called C4.
I can tell you from first-hand experience that it does not have throughput issues. The truth is quite the opposite, as it is the only JVM I've found that scales well on large core count or large memory systems.
Do you have experimental evidence to back up your claims or are they just conjecture?
Having studied C4 it is important to note that to get the best out of it you need to allow it to avail of a large heap. If your mindset is to keep the heap small, as you would with many other collectors, then you will be restricting the C4 collecting algorithm from performing at its best. Don't be shy, give it a big heap to play with. The obvious consequence to this is that C4 is not ideal for constrained memory systems - but that is not the world the majority of our servers live in today.
Amendment: All the above applies to Zing (C4) on Intel x86_64.
There's no other way to implement a lock-less compacting (i.e. copying) collector than by using transactional memory algorithms. Period.
I've never used C4, and I don't even program in Java. But I have personally studied and implemented transactional memory algorithms, and C4 fits perfectly into that mold. In fact, what originally made me interested in transactional memory was figuring out how C4 worked.
And I did figure it out. I figured it out even before they published their 2011 paper, which I just discovered. And unsurprisingly everything in there is precisely as I predicted. It couldn't have been otherwise once you understand the fundamentals.
Um. There are 30 years worth of academic publications covering concurrent compacting collectors that don't need or use transactional memory, and are implementable on commodity hardware with no emulation or specialized instructions. C4 is just one of many examples of this fact. Sure, C4 does some cool new things, but trying to move or copy memory transactionally isn't one of them.
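One classic TM-free technique from that literature, sketched: relocate an object by copying it and then installing a forwarding pointer with a single compare-and-swap. Racing threads all converge on one winner's copy, with no transactions to abort. This is an illustrative sketch of the general idea only, not C4's actual mechanism (a lock stands in for the hardware CAS, since Python has no CAS primitive):

```python
# TM-free concurrent object relocation: copy, then CAS-install a
# forwarding pointer. If two threads race, exactly one CAS wins and
# the loser discards its copy and adopts the winner's.
import threading

class Obj:
    def __init__(self, payload):
        self.payload = payload
        self.forward = None            # forwarding pointer, set exactly once
        self._lock = threading.Lock()  # stands in for one hardware CAS

    def cas_forward(self, new_copy):
        """Atomically install a forwarding pointer if none exists yet."""
        with self._lock:               # a real collector uses a CAS instruction
            if self.forward is None:
                self.forward = new_copy
                return True
            return False

def relocate(obj):
    copy = Obj(obj.payload)   # copy first (wasted work if we lose the race)
    if obj.cas_forward(copy):
        return copy           # we won: our copy is THE copy
    return obj.forward        # we lost: adopt the winner's copy

a = Obj("payload")
c1 = relocate(a)
c2 = relocate(a)              # second relocation attempt (simulated race)
assert c1 is c2               # all racers agree on a single copy
assert c1.payload == "payload"
```

The only atomicity needed is a single-word CAS, which every commodity CPU has had for decades; losing a race costs a wasted copy, not a rolled-back transaction.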
Whatever it is you think you figured out about how C4 works seems to be wrong. You may want to re-examine those fundamentals.
I asked whether this is based on conjecture or experimental evidence. I'm happy to base my view on experimental evidence from my own research and testing.
Their use of page remapping and read/write page protections is precisely how you would emulate strong transactional memory primitives on x86, allowing you to leverage well-known lock-free and wait-free algorithms.
The 2011 C4 paper doesn't say any of the things you keep attributing to it.
Just because it may also be possible to use page remapping to save on toilet paper or rescue kittens doesn't mean that's what we do with it. We do what we say we do. No conjecture needed.
I don't really know where the "shitloads of money" notion comes from. I guess "shitloads" is a relative term, but Zing is usually no more expensive (and often cheaper) than the cost of the machines it runs on. Last I checked, those machines are usually powered by Intel-brainchild Xeon chips that Intel insists on getting money for, too.
Zing is certainly cheaper than the engineering time it offsets. If you feel that you need to avoid using Zing at all costs, you can certainly do so by spending shitloads of your best talent, time, and effort (which are presumably not free) to force your apps to run at only 300MB/sec on those 20GB/sec-capable Xeons you paid shitloads of money for, and to dramatically under-utilize them. That's certainly a choice.
I'm pretty sure one shitload is bigger than the other. The whole trick is to pay with the smaller pile.
Hmmm.... who'd have thought you can CHOOSE to pay some money and solve a problem... wonder who came up with that idea? People work hard and build a product and don't give it away for free?
Can you qualify shitload? Certainly it costs more than nothing. How much is a shitload?
"We really do have a choice... a choice of actually using and working with things that exist and are widely available today, or already living in a land of unlimited money for licensing of a single product that might be completely and totally irrelevant to most of us"
Assuming your software is making you money (but you wouldn't make money through software, right? crazy idea), you might have a choice between spending your developers' time (which is worth money, I assume) trying to reduce the rate of allocation to something the freebie JVM can handle without pausing, or spending some money on an "existing and widely available" solution to the allocation/collection problem.
If application pauses don't cost you money, indeed a pauseless JVM is "completely and totally irrelevant". If they do, get a quote from Azul, maybe it's a good deal for you, maybe it isn't...
Finally, if you are confused about money and how a free market works, just hang in there. Marx promised us the end of history is around the corner and all them capitalist pigs will be the first up against the wall come the revolution!
I don't blame Azul for making this work proprietary. I'm taking a stance against the rhetoric - this isn't a realistic choice for most people. Most people will buy more hardware over licensing expensive software, especially as hardware generally has reusable value, independent of your technology stack. This is especially true at the low end, where not everyone has a massive A series or continuous funding. You have to plan a lot up front at that level for this to pay dividends if you don't already have steady cash.
And many people don't use the JVM at all. So the 'choice' of a pauseless collector is mostly a farce, as far as I can see.
> Another choice is to write your software without a VM, many languages support this.
Thank you for completely and totally missing the point. Didn't I just say something about "Not everyone uses the JVM", and your response is "just use another completely different language and technology stack"? We're going in circles here.
Which other platforms support the C4 collector, including its pause times and sustained allocation rates, might I ask? Freely available open source ones, I might add.