If you read their published algorithm and are familiar with lock-free algorithms, it's clear that theirs is a transactional memory algorithm, specifically in their LVB primitive. If this isn't obvious to you, I would recommend reading the seminal transactional memory algorithms from the 1970s and 1980s, including everything written by Hoare. Most of those are available from the ACM library. You particularly need to pay close attention to how wait-free algorithms are achieved.
"Transactional memory" is not a marketing term, nor a synonym for a particular set of CPU instructions. It's a class of lock-free algorithms, especially lock-free, wait-free algorithms. And the C4 collector very clearly fits into that class of algorithms. It's use of page remapping and read/write page protections is precisely how you would emulate strong transactional memory primitives on x86.
I think this terse quotation (from their own research paper) sums up the relationship between the Vega hardware and the Linux software-based implementation:
"Azul has created commercial implementations of the C4 algorithm on three successive generations of its custom Vega hardware (custom processor instruction set, chip, system, and OS), as well on modern X86 hardware. While the algorithmic details are virtually identical between the platforms, the implementation of the LVB semantics varies significantly due to differences in available instruction sets, CPU features, and OS support capabilities." (http://www.azulsystems.com/sites/default/files/images/c4_pap...)
Regarding the performance of C4, the reason Azul doesn't publish TPC benchmarks is that there's no avoiding the immense costs of their page mapping hacks. From the paper above: "the garbage collector needs to sustain a page remapping at a rate that is 100x as high as the sustained object allocation rate for comfortable operation."
Page remapping is insanely expensive at the micro-granularity needed. They mitigate the cost by batching requests, but it's still significant. Furthermore, they must use atomic reads and writes for internal pointers. Those are cache killers.
I never said C4 can't be faster for particular workloads. Obviously for workloads sensitive to latency a pauseless collector can be faster overall. But as a general matter, those workloads are not in the majority. Ergo, for the majority of workloads C4 will not be faster, at least not on commodity hardware architectures.
You can continue to believe the hype, and believe that Azul possesses some sort of magical fairy dust, using techniques entirely beyond the comprehension of mere mortals. Or you can read about and learn how it _actually_ works. Their algorithm and implementations are all laudable and significant achievements. But there's nothing magical or secret about them.
I created the C4 algorithm. And I wrote the paper. I'm pretty sure I know what it does.
:-)
You raise another "I think this is how it works and therefore it must be slow" point in this posting that you didn't raise before, so let me address it to avoid confusion:
> "...Page remapping is insanely expensive at the micro-granularity needed."
If you actually read the paper (e.g. Table 2), you would have realized that C4 uses a page remapping mechanism that (in 2011) could sustain >700TB/sec of page remapping speed. Using my recommendation that the remapping rate needs to be able to handle 100x the allocation rate, that's enough to keep up with 7TB/sec of allocation. So I think we have some headroom. The numbers have only gotten better since with Zing's loadable module.
To be more specific, when I said that C4 "emulates" the Vega architecture, I was referring to the way that you presumably need to be able to invalidate an operation when another thread concurrently reads or writes to shared memory in the middle of a transaction (i.e. copying a block and updating pointers). By emulate, I meant you need to construct a primitive that provides an efficient lock-free large block copying operation, similar to what the Vega cache system provides. Of course, that's only a primitive--you need to then build compound operations atop that, and there's more than enough engineering there to keep people busy for quite a while.
Again, if you can claim that in fact your page mapping and protection schemes are not analogous to the transactional memory support in the Vega hardware (which looks to be both small block transactional memory tied into the cache controller, as well as some useful page-level operations built into the memory controller), then mea culpa.
Sheesh. Read the C4 paper. See if you can point to a single place in the paper where we mention transactional memory. Or a need to invalidate an operation in the middle of some transaction. Or any form of emulation. C4 simply doesn't do or make any use of that stuff.
You seem to conflate Vega's [very cool] hardware transactional memory capabilities (SMA, OTC) with GC. Vega never used transactional memory for GC purposes. It used transactional memory to support OTC and transparent concurrent execution of synchronized blocks. Nothing to do with GC, and nothing to do with C4.
And yes. I can assertively "claim" that the page mapping and protection schemes in C4 are not analogous to the transactional memory support in Vega, and have nothing to do with caches or memory controllers. Vega (just like C4 on x86) used page mapping and protection schemes for GC purposes.
So as the designer you're telling me that your lock-free (excluding the locking necessary for updating the page tables) and possibly wait-free algorithm (unsure on the last part) simply can't be modeled as a transactional memory algorithm? If you're willing to claim that, I'll acquiesce. Because excluding the low-level details and the context of what's happening, the barrier algorithms look exactly like what you'd get trying to implement, e.g., the N-word transactional memory copying algorithms from 30+ years ago--in particular what Hoare laid out in the paper proving the universality of the LL/SC primitive--in a way that works on x86 for the JVM.
Regarding performance: number of allocations is not the same as transactional throughput. If you can also claim that the object system and VM subsystem of your JVM has no overhead over a traditional lock-based system (which doesn't use extensive mapping tricks, nor needs as many atomic operations), _or_ if you can show me TPC benchmarks where C4 is faster than the other existing locking collectors, I'll acquiesce.
Otherwise, I'm quite comfortable sticking to my guns.
I think what you've written is exceptional in numerous regards. And I don't mean to diminish the work in the slightest, but only to counter the hype--namely, that the algorithm uses techniques unknowable to others, or (as sometimes discussed elsewhere) that it's solved the problem of implementing a generic transactional memory model using regular x86 primitives.