Properly configured big.LITTLE clusters should be set up so that all CPUs report the same cache line size (which might be smaller than the true cache line size for some of the CPUs), to avoid exactly this kind of problem. The libgcc code assumes the hardware is correctly put together.
There is a Linux kernel patchset currently going through review which provides a workaround for this kind of erratum by trapping CTR_EL0 accesses into the kernel so they can be emulated with a safe, correct value:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg...
and it seems to me that that's really the right way to deal with this.
Also, if I'm reading the proposed fix in the mono pull request correctly, it doesn't deal with the problem entirely because there's a race condition where the code might start execution on the core with the larger cache line size, and then get context-switched to the core with the smaller cache line size midway through executing its cache-maintenance loop. The chances of things going wrong are much smaller, but they're still there...
(Edit: rereading the blog post, they say they need to figure out the global minimum, but I can't see how their code actually does that, since there's nothing that guarantees that the icache flush code gets run on every cpu before it's needed in anger.)
It should converge to the right value eventually, but it does seem like there's definitely a chance for it to be wrong one or more times before the code has run on the little cores.
It's fine if core migration happens during the invalidation loop - the core migration itself surely must wipe the non-shared cache levels thoroughly, otherwise nothing would work.
EDIT: Actually if the big and little cores are used together, and not exclusively, then this might still be an issue, yeah.
No, in general Linux migrating processes between cores won't nuke the caches. The hardware's cache coherency protocols between CPUs in the cluster ensure that they are all sufficiently in sync that it's not needed.
I understood that the configurations currently in use usually only power up either the big or the little cores at any one time, and that kind of migration has to wipe the caches, right? But that might be inaccurate, and you are of course right in the general case.
The state of the art in Linux scheduler handling of big.LITTLE hardware has moved through several different models, getting steadily better at getting best performance from the hardware (wikipedia has a good brief rundown: https://en.wikipedia.org/wiki/ARM_big.LITTLE). You're thinking about the in-kernel-scheduler approach, but global task scheduling (where you just tell the scheduler about all the cores and let it move processes around to suit) has been the recommended approach for a few years now I think.
Core migration doesn't need to reach a global synchronization point, just enough synchronization that the two cores in question agree with each other. This can be done without requiring global visibility of all operations of the source core.
That probably preserves correctness, but also invalidates any attempt to rely on the cache line size to lay out data structures to avoid cache line ping-ponging. Which is a major reason a program would care about the cache line size to begin with.
I think the really correct answer might be to abandon the idea that the system has a single cache line size...
What's the incentive for a CPU manufacturer to go to the effort of building extra cache memory into the bigger ARM64 CPU if there is no sane way to use it?
The difference between cache line size and cache size is like paper. You can make it wider (bigger cache line size), taller (bigger cache size), or both.
The problem is like printing. If you put in an A4 or letter (ANSI A) sized sheet and tell your printer it's A3 or tabloid (ANSI B), you're gonna have problems.
Seeing bugs like this reminds me of how much nicer things are on x86 where JITs do not need to flush caches. You can actually modify the instruction immediately ahead of the currently executing one, and the CPU will naturally "do the right thing"[1] --- it does slow down execution, as the CPU is essentially automatically detecting and flushing its cache/pipeline, but used sparingly can be a great optimisation. The write can even come from another core ("cross-modifying code") and everything will still work. Someone I know used this to great effect in squeezing out the last bits of performance from an application by eliminating checks on a few flag variables and the associated branching in a tight loop --- it simply "poked" instruction bytes from another core into the loop when it was time for that core to do something else.
[1] With the exception of pre-Pentium CPUs, where modifying at various forward offsets from the locus of execution could give insight into how big the prefetch queue is. With the Pentium it was fully detected, and the later multithreaded/multicore CPUs react similarly to cross-modifying code as described above, which leads me to believe that Intel is very much supportive of these things, as otherwise they could've just told programmers to do as ARM does.
Maybe what ARM needs, short of doing it the Intel way, is a "flush region" instruction which takes both the address and size, so it can automatically flush the appropriate cache lines based on the current hardware's cacheline size.
x86 is really the oddity here -- it has its no-explicit-cache-maintenance design because of wanting to maintain backwards-compatibility with self-modifying code that was written for x86 cores that had no caches at all. Almost all other architectures have explicit cache maintenance because it's more efficient (and requires less hardware), at the minor cost of requiring the very few bits of software which do odd things like JITting to explicitly tell the CPU what they're doing.
An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere. The obvious observation from a RISC-architecture point of view is that you can get the equivalent effect without the pain of making a long-running interruptible instruction, by having an "invalidate one cache line" instruction plus an explicit loop in the code, and that's what most architectures do.
Actually, Intel explicitly broke backwards-compatibility starting with the Pentium, by adding the hardware to make SMC work without additional effort. The 486 and below needed an explicit branch to flush the prefetch queue, and this effect has been exploited for various anti-debugging tricks and even this amazing 8088-only optimisation:
> An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere.
x86 has the REP prefix for this purpose; used with certain instructions, it decrements a register and if it's nonzero, executes the instruction. The earlier implementations simply didn't update the instruction pointer in this case so the CPU would repeatedly fetch and execute the same instruction, and it's interruptable between each step. The register counts down how many iterations remain. Otherwise, the instruction pointer moves to the next instruction. Modern x86 handles this by generating uops instead in the decoder, but the basic functionality is the same.
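As an illustration of the pattern, here's a minimal GCC-style inline-assembly sketch (x86-64 assumed; the function name is mine): the remaining count sits in RCX and the current destination in RDI, which is exactly the architectural state that lets the CPU resume after an interrupt.

    #include <stddef.h>

    /* Zero a buffer with REP STOSB. The remaining count lives in RCX and the
       current destination in RDI, so an interrupt can arrive between
       iterations and the CPU resumes from architectural state alone. */
    static void rep_stosb_zero(void *dst, size_t len)
    {
        asm volatile ("rep stosb"
                      : "+D" (dst), "+c" (len)   /* RDI and RCX updated in place */
                      : "a" (0)                  /* AL holds the byte to store */
                      : "memory");
    }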
Different cacheline sizes for the different cores seems like an absurdly bad idea. One because it opens one up to bugs like these, but also because it makes optimization a lot harder. I have a hard time believing the savings due to a larger line size are worth it.
Perhaps a better (and simpler) workaround, then, could be to clamp the reported cache line size to 64 bytes. So even if the Samsung core reports 128-byte cache lines, the code would simply invalidate each line twice, and if it is migrated in the middle of the invalidation loop, it would still work correctly.
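A rough sketch of that clamping idea, assuming AArch64 GCC/Clang inline assembly (the 64-byte floor is the A53's line size from this article, not an architectural guarantee, and the function name is just for illustration):

    #include <stdint.h>

    /* Derive a conservative stride for data-cache maintenance: read CTR_EL0,
       but never assume lines larger than 64 bytes even if this core reports
       128, so the loop stays correct after migrating to a core with smaller
       lines (at the cost of redundant operations on the 128-byte core). */
    static unsigned int safe_dcache_stride(void)
    {
        uint64_t ctr;
        asm volatile ("mrs %0, ctr_el0" : "=r" (ctr));
        unsigned int line = 4u << ((ctr >> 16) & 0xF);   /* DminLine: words -> bytes */
        return line > 64 ? 64 : line;
    }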
You could also say that the ARM implementation is not optimal, since it uses the same cache line size for vastly different designs (different pipeline length and memory access characteristics).
Samsung tried to improve performance, I don't think they are at fault at all.
Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.
Going to 128-byte cache lines was fine, but they should have made the lower-power part of the chip match (which in this case they likely couldn't because they licensed that design), or they should have made sure 128-byte lines were reported in all circumstances (which requires a hack similar to the workaround done in the kernel because again they can't make the A53 report 64 bytes).
> Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.
This is similar to blaming OS/BIOS manufacturers for Y2K breaking "correctly written code" at the turn of the millennium. The code was incorrect in this instance because it assumed the cache line size would be the same for all cores; Samsung simply manufactured a SoC that breaks that faulty assumption.
If we forget about the ARM spec explicitly enforcing this anyway, here's some food for thought: how exactly do you suppose the GCC people could write correct code, if they aren't allowed to make that assumption?
> The CTR holds minimum line length values for:
> - the instruction caches
> ...this value is the most efficient address stride to use to apply to a sequence of address-based maintenance operations to a range of addresses...
The documentation for the CTR_EL0 reg talks about "caches under this processors control" which you could argue don't include other cores, but if you allow migration between cores, then line size changes underneath running software break the above "most efficient address stride" assumption. So you can't do that.
It boils down to this: If you can't assume that cache line size doesn't change underneath you, then you can't invalidate line by line at all, and would have to go word by word. That's terrible for performance (and a huge waste), which is why the spec says to use the above value for those operations.
I appreciate you digging up the relevant text from the manual, but I don't think you should accuse Samsung of wrongdoing based on such far-fetched assumptions.
That document does not explicitly forbid this. In addition, take a look at their sample code, which indeed reads some cache configuration registers during each call. The same code is found verbatim in the Linux kernel, if you now don't trust the ARM engineers :)
I don't think considering how to actually implement the basic operation without a terrible performance penalty is a "far-fetched assumption".
> That document does not explicitly forbid this.
Yet it makes it clear software can be written to assume it doesn't happen, which is the same thing.
> In addition, take a look at their sample code which indeed reads some cache configuration registers during each call.
This doesn't prevent the problem at all, it just reduces the window for things to go wrong.
> The same code is found verbatim in the linux kernel, if you now don't trust the ARM engineers :)
I trust the ARM engineers. As I already said, their code is right, Samsung got it wrong. Note that the libgcc code that breaks was ALSO written by ARM.
Best I can tell, the whole thing is designed so that you can go from a single "little" all the way to two sets of four. Thus each core is designed to be used both independently and in larger configurations.
wow, just wow. That is a really awesome bug (and like the authors I have issues with trying to sleep when that sort of puzzle is sitting there :-)
Still a bit hazy on why they manually flush the cache for a given block of memory (presumably to protect against disclosure?). But I'm also a bit curious how it works if you get the sequence: big fetches a cache line, switches to little which fetches a line (half as long, changing half the bytes in the cache), and then you switch back to big and it thinks it has a full cache line? Presumably there is some mechanism that invalidates cache lines?
Handling of the case where big and little both want the same thing in their cache should be dealt with by the usual cache-coherency traffic between the CPUs that ensures they don't disagree about what's in their L1 caches (very handwaved because I don't know the details).
The reason for the manual cache operations is that they're generating JITted code -- on ARM, to ensure that what you execute is the same thing you just wrote, you have to (1) clean the data cache, so your changes get out to main memory[*], and then (2) invalidate the icache, so that execution will fetch the fresh data from memory rather than using stale info. This clean-and-invalidate operation is usually informally called a flush, though it isn't really one in ARM terminology.
[*] Not actually main memory, usually: it only has to go out to the "point of unification" where the I-side and D-side come together, which is probably the L2 cache.
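For concreteness, here's a minimal sketch of that clean-then-invalidate sequence for AArch64, in the same spirit as libgcc's __clear_cache but simplified (GCC/Clang inline assembly assumed; the function name is mine, and note that reading CTR_EL0 once per call is exactly the assumption the article shows to be unsafe on cores with mismatched line sizes):

    #include <stdint.h>

    /* Make JITted code in [start, end) visible to instruction fetch:
       clean D-cache lines to the point of unification (dc cvau), then
       invalidate the corresponding I-cache lines (ic ivau), with barriers
       in between so each phase completes before the next begins. */
    static void sync_icache(char *start, char *end)
    {
        uint64_t ctr;
        asm volatile ("mrs %0, ctr_el0" : "=r" (ctr));
        uintptr_t dline = 4u << ((ctr >> 16) & 0xF);    /* D-cache min line, bytes */
        uintptr_t iline = 4u << (ctr & 0xF);            /* I-cache min line, bytes */

        for (char *p = start; p < end; p += dline)
            asm volatile ("dc cvau, %0" : : "r" (p) : "memory");
        asm volatile ("dsb ish" : : : "memory");        /* cleans have completed */

        for (char *p = start; p < end; p += iline)
            asm volatile ("ic ivau, %0" : : "r" (p) : "memory");
        asm volatile ("dsb ish; isb" : : : "memory");   /* invalidates are visible */
    }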
Not just that, the ISA usually requires that a cache invalidation instruction be issued regardless of whether the chip's coherency will automatically detect and invalidate it.
In cases such as this post, it is perfectly valid for the silicon engineers to say that it's the software's fault for not adhering to the ISA.
Thanks for the update on the manual cache management!
I'll admit that I find the question of feeding dissimilar cache line sizes into the same cache hierarchy intriguing from an architecture point of view. It has me doodling all sorts of questions into my notebook.
I don't see why they have to do this in userspace at all. If they did:
* allocate read/write buffer
* JIT instructions into it
* change mapping to read/execute
* run the JITted code
Then the kernel manages flushing the data caches on the mapping change, and Mono gets to wrap a Somebody Else's Problem field around it. It sounds like they are instead:
* allocate read/write/execute buffer
* JIT instructions into it
* manually flush relevant data caches (with an assumption that the cache line size is constant)
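For illustration, a minimal sketch of the first (remap-based) approach using POSIX mmap/mprotect; that the kernel performs all the needed cache maintenance on the permission change is the assumption this approach rests on, and the function name is just for the sketch:

    #include <string.h>
    #include <sys/mman.h>

    /* Emit `len` bytes of machine code into a fresh read/write mapping, then
       flip it to read/execute; the kernel's mprotect path is then the one
       responsible for whatever cache maintenance the new code needs. */
    void *emit_code(const void *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, code, len);
        if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(buf, len);
            return NULL;
        }
        return buf;
    }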
That approach is harder to use in practice than it sounds. It's not like people haven't tried it.
The OS only lets you allocate in large granules, like 4k or 16k, and the vast majority of methods are significantly smaller than that, meaning a JIT must colocate multiple methods in the same allocation block or waste a significant amount of memory.
We could get around that by remapping memory between read/write and read/execute and have the OS solve the problem for us. Except for a couple of small details: modifying a memory mapping is very expensive, and we're, well, in the performance business; and Mono is multi-threaded, so one thread might be executing code from the exact page we just made non-executable.
This approach, IIRC, was tried by Firefox as it has some security advantages, but discarded due to the measurable performance impact - and they don't have the second problem, as JS is single-threaded.
How is this safe in the multithreaded case anyway? If a process has just written a new JITted method and is flushing the i$ on the CPU it's executing on, but then gets scheduled away part-way through the flush, if you were very unlucky then couldn't another thread then get scheduled on that CPU and try to execute the just-written method, which failed to be fully flushed from that CPU's cache?
Multi-threaded safety is simply due to the JIT controlling the visibility of the newly compiled code. First flush, then make it visible for execution; you can't go wrong with that, and scheduling won't matter.
Things get a lot more complicated when it comes to code patching, but the principle is similar.
I don't think that helps - the point is that the flush might not be effective if the flushing thread gets scheduled away from the core which has the stale I$ before it manages to fully issue the flush.
Or is the flush guaranteed to flush all cores caches? That would be a fairly unusual design.
IC IVAU instructions are broadcast to all cores in the same 'inner shareable domain' (all cores running the same OS instance are in the same inner shareable domain).
Firefox is shipping W^X for jitcode, as far as I know. Or at least https://bugzilla.mozilla.org/show_bug.cgi?id=1215479 is marked as fixed and I don't see any obvious bugs blocking it that are fallout from that change.
And the "Firefox (Ion)" and "Firefox (non writable jitcode)" lines on https://arewefastyet.com/ seem to coincide...
But yes, actually making this work in practice is not at all simple.
I see the argument about the page size potentially being large relative to the size of a jitted function. But,
> modifying a memory mapping is very expensive
It used to be true that operations on memory mappings were appallingly expensive. However, the advent of virtualization has driven a significant change in performance. IIRC, ARMv8 has a TLB invalidation operation that is per-entry, addressed by the virtual address being invalidated. You don't need to flush the entire TLB.
> Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
I rather see the problem in the fact that there seems to be no way to tell the Linux scheduler: only schedule this process/thread on cores that have the same cache line size. Or an attribute, set when a thread is created, giving the cache line size of the cores it is allowed to run on. Or an attribute, set when a thread is created, meaning "allow an arbitrary cache line size, but don't migrate to cores with a different size". This way it would suffice to check the cache line size once at program or thread start.
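Something like this can already be approximated from user space with CPU affinity, at the cost of giving up big<->little migration entirely. A rough sketch (Linux sysfs coherency_line_size and sched_setaffinity assumed; the helper names are purely illustrative):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read one CPU's reported cache line size from sysfs (index0 is assumed
       to be an L1 cache here); returns -1 on failure. */
    static int line_size_of(int cpu)
    {
        char path[128];
        int line = -1;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index0/coherency_line_size", cpu);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &line) != 1)
                line = -1;
            fclose(f);
        }
        return line;
    }

    /* Restrict the calling thread to CPUs whose reported line size matches
       the CPU it starts on. Purely illustrative: it freezes the affinity
       mask once and forfeits big<->little migration. */
    static int pin_to_same_line_size(void)
    {
        int want = line_size_of(sched_getcpu());
        int n = (int) sysconf(_SC_NPROCESSORS_CONF);
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < n; cpu++)
            if (want > 0 && line_size_of(cpu) == want)
                CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof set, &set);
    }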
That's too specific a feature to expose to developers - most people wouldn't even know about the feature, and the ones that did might needlessly enable it just to be conservative.
The intention of the big.LITTLE architecture is to let processes be migrated seamlessly between the small and big core and let the unused core be turned off to save power. The kernel and the hardware should work together to make the core switching transparent and expose a safe way to invalidate the cache independent of the current processor.
> The intention of the big.LITTLE architecture is to let processes be migrated seamlessly between the small and big core and let the unused core be turned off to save power.
On the other hand, flushing specific cache lines is a rather special feature. If the code uses such obscure low-level features (much more than the "typical" application does), such as relying on the invariant that the size of a cache line stays constant over the execution (which most applications really don't care about), one can at least expect the developer to pass this information to the scheduler, so that the scheduler can take care that this invariant will indeed be satisfied.
Anything that uses a JIT necessarily uses this "obscure low-level feature", including any program implemented in one of many programming languages. Forcing such programs onto the heavy processor makes no sense; there just needs to be a way to properly clear the cache.
On https://news.ycombinator.com/item?id=12483698 I wrote a better idea of how this might be implemented without causing this problem. Nevertheless, even if we use my first, worse interface, this should not be a problem: spawn a thread that does the low-level stuff, sync with it, and after that run the JITted code.
But then you would never be able to take advantage of the power/performance tradeoff that big.LITTLE gives you; of being able to migrate you to a big core when you need it, but save power by migrating you to the little core when you're not doing anything CPU intensive, and shutting off the power-hungry big core.
> But then you would never be able to take advantage of the power/performance tradeoff that big.LITTLE gives you
Rather: Only applications that don't depend on the invariant that the size of a cache line will stay constant over the execution time will take advantage of this. I can imagine ways how one could increase the advantage, but I don't want to go too much into technical details here.
Why would programs want to "depend on the invariant that the size of a cache line will stay constant"? That's a minor detail that most applications don't care about.
In this case, it probably mattered because when JITting code, you need to explicitly flush the instruction cache after changing code. But that's not anything the application cares about; that's something the runtime cares about, and the runtime is going to want to support taking advantage of the speed/power tradeoff without making application developers go to the trouble of thinking about it.
Also, from a user perspective, you don't want applications to be able to demand only the big, powerful cores. That will eat through your battery much faster, with likely very little benefit.
> Also, from a user perspective, you don't want applications to be able to demand only the big, powerful cores.
This should rather be a setting in the permissions system, so that a user can refuse an application the right to demand the big, powerful cores. I can nevertheless imagine scenarios where enabling a (specific) application to do so can be quite useful.
Another idea for how one could solve this problem with a clever kernel interface. This idea is probably better than the approach in my post above, since it allows threads to migrate from big to little cores and vice versa except during critical regions.
Add an interface to the kernel/scheduler:
void lock_cacheline_size(...)
void unlock_cacheline_size(...)
Calling lock_cacheline_size tells the scheduler that from now on the thread must stay on a core with the same cache line size. Consider it as locking a mutex on the cache line size.
After this you can run critical code that depends on the fact that the cache line size stays constant for the whole time.
After you are finished with it, you call unlock_cacheline_size (a kind of unlocking of the mutex on the cache line size) - from now on the scheduler is free to migrate the code to cores with different cache line sizes again.
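To make the proposal concrete, a hypothetical usage sketch: lock_cacheline_size/unlock_cacheline_size and the flush helpers below don't exist in any real kernel or library, they are just the interface being proposed here.

    #include <stddef.h>

    /* Hypothetical calls proposed above -- not part of any real kernel. */
    void lock_cacheline_size(void);
    void unlock_cacheline_size(void);

    /* Hypothetical helpers for the sketch. */
    unsigned int current_cacheline_size(void);
    void flush_one_line(void *addr);

    /* JIT cache flush inside a "cache line size stays constant" critical
       region: the scheduler may still migrate the thread, but only between
       cores that report the same line size, so the stride stays valid. */
    static void jit_flush(char *code, size_t len)
    {
        lock_cacheline_size();
        unsigned int line = current_cacheline_size();
        for (char *p = code; p < code + len; p += line)
            flush_one_line(p);
        unlock_cacheline_size();
    }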
A simpler and more robust approach would be to add a system call to flush a memory range. The kernel can then do all the locking, without the risk that migration to the other core gets indefinitely blocked by user space code.
> An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
Doesn't ARM have an "Enter Critical Region" instruction?
Cell behaves more like a CPU + GPGPU system. big.LITTLE can schedule an instruction stream on any core of the cluster (depending on the configuration), whereas on Cell you'd primarily use the general-purpose PPE, from which you'd start (and chain) vector-based threads on the SPEs.
The PPE and SPEs don't even share an ISA; the SPEs have a custom-built SIMD-oriented ISA.
The difference is that big.LITTLE ARMs use the same instruction set on all cores. In general I wouldn't expect a lot of terminological rigor from blog posts.
> The Playstation 4 isn't a AMP machine because programming the Cell was so hard.
The PS4 could be an AMP design if you consider the (closely coupled) GPU a processor.
It doesn't require the same gymnastics as the PS3 did, though, because both the main processor and the GPU are more capable. The SPUs were required to perform computation that the anemic PPU could not, as well as fill in where the pre-unified-shader-model GPU was unable to keep pace.
Memory was shared on the PS3 but, from the SPUs, required explicit put and fetch operations.
Cell is essentially a distributed-memory cluster on a single chip, because each SPU has its own address space and cannot directly access main memory. I'm not sure what the exact definition of AMP is, but Cell does not exactly match my feeling of what AMP should be. In this regard the Wii seems more like an AMP system, with two completely different CPUs (PPC and ARM) sharing what essentially amounts to the same address space (and in the Wii U there are 3 PPC cores, where one of them is slightly different from the other two and cache coherency between them can only be described as broken).
There is no question that programming for Cell was hard, but I think it's mostly about middleware support (probably because the platform is so different from the PC and Xbox 360).
The real problem with the Cell was that each Cell SPE processor only has 256K of local memory. It has bulk DMA access to main memory, but that's more like I/O. 256K is too small for a video frame, a game level, or much else in a modern game. So everything has to be done on an assembly line basis, where data is pumped into a Cell processor, processed, and pumped out. Great for audio, terrible for everything else. In comparison, the main processor had access to 256MB of RAM.
If they'd had, say, 16MB per processor, it might have worked out. One CPU for collision detection and physics, one for NPC management and AI, etc. But giving each SPE processor only 0.1% of the total memory space was too constraining.
"There are only two hard things in Computer Science: cache invalidation and naming things" ― Phil Karlton; not sure to what extent this quote is compatible with the second one by Rob. Or does it mean by implication that simply "Computer Science is bugs waiting to happen"?...
The parent actually appears to have the quote both correct in content and attribution. Supposedly, someone else added the "off by one"[1][2][3]. That seems to be the extent of the Internet's knowledge on the quote, though the Skeptics link notes that there's nothing direct to the supposed originator.
This is one of those quotes where I feel there's more than one right answer. I like the addition of "off-by-one", and to make it a nice round three things, I usually use this version:
There are three hard things in computer science:
1. Naming things
2. 3. Concurrency
Cache Invalidation
4. Off-by-one errors
One example is when doing a DMA operation to memory. That often bypasses the cache. If the new data is to "stick", the cache needs to be convinced it doesn't know what's in that buffer any more. This is an ARM thing; Intel architectures integrate DMA with the cache.
That would create a race condition addressed at the bottom of the article: the process can get switched onto another CPU between the invocation of get_current_cpu_cache_line_size() and the invalidation.
> An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
No, you just sometimes issue twice as many flush requests as necessary. You can't flush or invalidate half a cache line, since the data is stored in units of cache lines.
In theory I think you could just invalidate the addresses byte for byte, ignoring the cache line size, but I assume the performance hit would be noticeable.
Performance. get_current_cpu_cache_line_size would need to run some code to determine the cache line size, and that code takes longer to run than using a cached value.
Along similar lines, if you have an optimized routine using specific CPU instructions, you don't want to call CPUID (or equivalent) on every call to find out if you have those instructions; you want to call it once and cache the answer. If it can return different answers on different CPUs in the same system, you need to use a different mechanism instead, such as asking the OS for the least common denominator features of available CPUs, or notifying the OS that you need to run on one of the more capable CPUs that has the feature you need.
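A sketch of that cache-the-answer pattern, assuming GCC/Clang's __builtin_cpu_supports on x86 (on ARM you'd typically ask the OS, e.g. via HWCAPs, rather than the hardware directly; the function name is just for illustration):

    #include <stdbool.h>

    /* Probe for a feature once and cache the answer, instead of re-running
       CPUID on every call into the optimized routine. This is only valid
       when every CPU the thread can run on gives the same answer. */
    static bool have_avx2(void)
    {
        static int cached = -1;     /* -1 = not probed yet */
        if (cached < 0)
            cached = __builtin_cpu_supports("avx2") ? 1 : 0;
        return cached == 1;
    }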
Stupid question but does that work in a virtualised environment where your program can be live-migrated to another physical machine with a different CPU?
As others said, one typically presents a fixed CPU model to the guest on all hosts.
Also, for this specific case, I doubt migration will keep around the contents of the cache lines and their 'dirty' bits (corollary: it will be possible to reliably detect a move, if one is willing to continuously run code that detects cache-line timing differences)
Nope. There's not really any alternative other than "Don't do that", or limit migration to machines that have a superset of the instructions on the original machine.
Usually you configure the VM to only report CPUID values corresponding to lowest-common-denominator features on everything you might want to migrate to. Then as long as the guest code plays nicely and looks at the CPUID feature flags to see what it can use, it'll migrate happily. QEMU has support for this, for instance.
It reads some coprocessor registers, which are instructions that cannot be reordered; they slow down everything on an OOO core. After that it's just some bitmasking (not slow).
For 64-bit ARMv8 (ie AArch64) system registers are in general reorderable; software must provide explicit synchronization (typically via barrier instructions) where it does not want the reordering, except for a few registers which have implicit synchronization. Since CTR_EL0 is entirely constant there's no inherent reason why it shouldn't be reorderable pretty freely, though it's an implementation detail how fast or otherwise it is in practice. (Benchmark if it matters to you!)
(This is all documented in the v8 ARM ARM section "Synchronization requirements for AArch64 System Registers".)
This is an excellent bug journey, and I'm even more impressed that the resulting discovery has already been used to improve Dolphin. A testament to the quality of both projects.
I wonder why no one tried to validate Asymmetric MultiProcessing by first validating all cases with either the little or the big cores alone, and then bisecting further down when both are enabled.
Overwriting executable machine code in memory is a rare case (only JITs need it, only for emitting code). Some CPUs do wire the memory write operation to the instruction cache to make sure the instruction cache knows some code changed, but ARM chose a simpler design where the instruction cache assumes the code never changes in memory and if this assumption is wrong, the programmer has to use a special instruction to tell the instruction cache to throw something out of cache. The cache clearing instruction throws out either 64 bytes aligned to multiple of 64 or 128 bytes aligned to multiple of 128, depending on the core.