Properly configured big.LITTLE clusters should be set up so that all CPUs report the same cache line size (which might be smaller than the true cache line size for some of the CPUs), to avoid exactly this kind of problem. The libgcc code assumes the hardware is correctly put together.
There is a Linux kernel patchset currently going through review which provides a workaround for this kind of erratum by trapping CTR_EL0 accesses into the kernel so they can be emulated with a safe, correct value:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg...
and it seems to me that that's really the right way to deal with this.
Also, if I'm reading the proposed fix in the mono pull request correctly, it doesn't deal with the problem entirely because there's a race condition where the code might start execution on the core with the larger cache line size, and then get context-switched to the core with the smaller cache line size midway through executing its cache-maintenance loop. The chances of things going wrong are much smaller, but they're still there...
(Edit: rereading the blog post, they say they need to figure out the global minimum, but I can't see how their code actually does that, since there's nothing that guarantees that the icache flush code gets run on every cpu before it's needed in anger.)
It should converge to the right value eventually, but it does seem like there's definitely a chance for it to be wrong one or more times before the code has run on the little cores.
It's fine if core migration happens during the invalidation loop - the core migration itself surely must wipe the non-shared cache levels thoroughly, otherwise nothing would work.
EDIT: Actually if the big and little cores are used together, and not exclusively, then this might still be an issue, yeah.
No, in general Linux migrating processes between cores won't nuke the caches. The hardware's cache coherency protocols between CPUs in the cluster ensure that they are all sufficiently in sync that it's not needed.
I understood that the configurations currently in use usually only power up either the big or the little cores at any one time, and that kind of migration has to wipe the caches, right? But that might be inaccurate, and you are of course right in the general case.
The state of the art in Linux scheduler handling of big.LITTLE hardware has moved through several different models, getting steadily better at getting best performance from the hardware (wikipedia has a good brief rundown: https://en.wikipedia.org/wiki/ARM_big.LITTLE). You're thinking about the in-kernel-scheduler approach, but global task scheduling (where you just tell the scheduler about all the cores and let it move processes around to suit) has been the recommended approach for a few years now I think.
Core migration doesn't need to reach a global synchronization point, just enough synchronization that the two cores in question agree with each other. This can be done without requiring global visibility of all operations of the source core.
That probably preserves correctness, but also invalidates any attempt to rely on the cache line size to lay out data structures to avoid cache line ping-ponging. Which is a major reason a program would care about the cache line size to begin with.
I think the really correct answer might be to abandon the idea that the system has a single cache line size...
What's the incentive for a CPU manufacturer to go to the effort of building extra cache memory into the bigger ARM64 CPU if there is no sane way to use it?
The difference between cache line size and cache size is like paper. You can make it wider (bigger cache line size), taller (bigger cache size), or both.
The problem is like printing. If you put in an A4 or letter (ANSI A) sized sheet and tell your printer it's A3 or tabloid (ANSI B), you're gonna have problems.
Seeing bugs like this reminds me of how much nicer things are on x86 where JITs do not need to flush caches. You can actually modify the instruction immediately ahead of the currently executing one, and the CPU will naturally "do the right thing"[1] --- it does slow down execution, as the CPU is essentially automatically detecting and flushing its cache/pipeline, but used sparingly can be a great optimisation. The write can even come from another core ("cross-modifying code") and everything will still work. Someone I know used this to great effect in squeezing out the last bits of performance from an application by eliminating checks on a few flag variables and the associated branching in a tight loop --- it simply "poked" instruction bytes from another core into the loop when it was time for that core to do something else.
[1] With the exception of pre-Pentium CPUs, where modifying at various forward offsets from the locus of execution could give insight into how big the prefetch queue is. With the Pentium it was fully detected, and the later multithreaded/multicore CPUs react similarly to cross-modifying code as described above, which leads me to believe that Intel is very much supportive of these things, as otherwise they could've just told programmers to do as ARM does.
Maybe what ARM needs, short of doing it the Intel way, is a "flush region" instruction which takes both the address and size, so it can automatically flush the appropriate cache lines based on the current hardware's cacheline size.
x86 is really the oddity here -- it has its no-explicit-cache-maintenance design because of wanting to maintain backwards-compatibility with self-modifying code that was written for x86 cores that had no caches at all. Almost all other architectures have explicit cache maintenance because it's more efficient (and requires less hardware), at the minor cost of requiring the very few bits of software which do odd things like JITting to explicitly tell the CPU what they're doing.
An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere. The obvious observation from a RISC-architecture point of view is that you can get the equivalent effect without the pain of making a long-running interruptible instruction, by having an "invalidate one cache line" instruction plus an explicit loop in the code, and that's what most architectures do.
Actually, Intel explicitly broke backwards-compatibility starting with the Pentium, by adding the hardware to make SMC work without additional effort. The 486 and below needed an explicit branch to flush the prefetch queue, and this effect has been exploited for various anti-debugging tricks and even this amazing 8088-only optimisation:
> An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere.
x86 has the REP prefix for this purpose; used with certain instructions, it decrements a register and if it's nonzero, executes the instruction. The earlier implementations simply didn't update the instruction pointer in this case so the CPU would repeatedly fetch and execute the same instruction, and it's interruptable between each step. The register counts down how many iterations remain. Otherwise, the instruction pointer moves to the next instruction. Modern x86 handles this by generating uops instead in the decoder, but the basic functionality is the same.
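As an illustration of the pattern, here's a minimal GCC-style inline-assembly sketch (x86-64 assumed; the function name is mine): the remaining count sits in RCX and the current destination in RDI, which is exactly the architectural state that lets the CPU resume after an interrupt.

    #include <stddef.h>

    /* Zero a buffer with REP STOSB. The remaining count lives in RCX and the
       current destination in RDI, so an interrupt can arrive between
       iterations and the CPU resumes from architectural state alone. */
    static void rep_stosb_zero(void *dst, size_t len)
    {
        asm volatile ("rep stosb"
                      : "+D" (dst), "+c" (len)   /* RDI and RCX updated in place */
                      : "a" (0)                  /* AL holds the byte to store */
                      : "memory");
    }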
Different cacheline sizes for the different cores seems like an absurdly bad idea. One because it opens one up to bugs like these, but also because it makes optimization a lot harder. I have a hard time believing the savings due to a larger line size are worth it.
Perhaps a better (and simpler) workaround, then, could be to clamp the reported cache line size to 64 bytes. So even if the Samsung core reports 128-byte cache lines, the code would simply invalidate each line twice, and if it is migrated in the middle of the invalidation loop, it would still work correctly.
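A rough sketch of that clamping idea, assuming AArch64 GCC/Clang inline assembly (the 64-byte floor is the A53's line size from this article, not an architectural guarantee, and the function name is just for illustration):

    #include <stdint.h>

    /* Derive a conservative stride for data-cache maintenance: read CTR_EL0,
       but never assume lines larger than 64 bytes even if this core reports
       128, so the loop stays correct after migrating to a core with smaller
       lines (at the cost of redundant operations on the 128-byte core). */
    static unsigned int safe_dcache_stride(void)
    {
        uint64_t ctr;
        asm volatile ("mrs %0, ctr_el0" : "=r" (ctr));
        unsigned int line = 4u << ((ctr >> 16) & 0xF);   /* DminLine: words -> bytes */
        return line > 64 ? 64 : line;
    }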
You could also say that the ARM implementation is not optimal, since it uses the same cache line size for vastly different designs (different pipeline length and memory access characteristics).
Samsung tried to improve performance, I don't think they are at fault at all.
Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.
Going to 128-byte cache lines was fine, but they should have made the lower-power part of the chip match (which in this case they likely couldn't because they licensed that design), or they should have made sure 128-byte lines were reported in all circumstances (which requires a hack similar to the workaround done in the kernel because again they can't make the A53 report 64 bytes).
> Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.
This is similar to blaming OS/BIOS manufacturers for Y2K breaking "correctly written code" at the turn of the millennium. The code was incorrect in this instance because it assumed the cache line size would be the same for all cores; Samsung simply manufactured a SoC that breaks that faulty assumption.
If we forget about the ARM spec explicitly enforcing this anyway, here's some food for thought: how exactly do you suppose the GCC people could write correct code, if they aren't allowed to make that assumption?
> The CTR holds minimum line length values for:
> - the instruction caches
> ...this value is the most efficient address stride to use to apply to a sequence of address-based maintenance operations to a range of addresses...
The documentation for the CTR_EL0 reg talks about "caches under this processors control" which you could argue don't include other cores, but if you allow migration between cores, then line size changes underneath running software break the above "most efficient address stride" assumption. So you can't do that.
It boils down to this: If you can't assume that cache line size doesn't change underneath you, then you can't invalidate line by line at all, and would have to go word by word. That's terrible for performance (and a huge waste), which is why the spec says to use the above value for those operations.
I appreciate you digging up the relevant text from the manual, but I don't think you should accuse Samsung of wrongdoing based on such far-fetched assumptions.
That document does not explicitly forbid this. In addition, take a look at their sample code, which indeed reads some cache configuration registers during each call. The same code is found verbatim in the Linux kernel, if you now don't trust the ARM engineers :)
I don't think considering how to actually implement the basic operation without a terrible performance penalty is a "far-fetched assumption".
> That document does not explicitly forbid this.
Yet it makes it clear software can be written to assume it doesn't happen, which is the same thing.
> In addition, take a look at their sample code which indeed reads some cache configuration registers during each call.
This doesn't prevent the problem at all, it just reduces the window for things to go wrong.
> The same code is found verbatim in the linux kernel, if you now don't trust the ARM engineers :)
I trust the ARM engineers. As I already said, their code is right, Samsung got it wrong. Note that the libgcc code that breaks was ALSO written by ARM.
Best I can tell, the whole thing is designed so that you can go from a single "little" all the way to two sets of four. Thus each core is designed to be used both independently and in larger configurations.
wow, just wow. That is a really awesome bug (and like the authors I have issues with trying to sleep when that sort of puzzle is sitting there :-)
Still a bit hazy on why they manually flush the cache for a given block of memory (presumably to protect against disclosure?). But I'm also a bit curious how it works if you get the sequence: big fetches a cache line, switches to little which fetches a line (half as long, changing half the bytes in the cache), and then you switch back to big and it thinks it has a full cache line? Presumably there is some mechanism that invalidates cache lines?
Handling of the case where big and little both want the same thing in their cache should be dealt with by the usual cache-coherency traffic between the CPUs that ensures they don't disagree about what's in their L1 caches (very handwaved because I don't know the details).
The reason for the manual cache operations is that they're generating JITted code -- on ARM, to ensure that what you execute is the same thing you just wrote, you have to (1) clean the data cache, so your changes get out to main memory[*], and then (2) invalidate the icache, so that execution will fetch the fresh data from memory rather than using stale info. This clean-and-invalidate operation is usually informally called a flush, though it isn't really one in ARM terminology.
[*] Not actually main memory, usually: it only has to go out to the "point of unification" where the I-side and D-side come together, which is probably the L2 cache.
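For concreteness, here's a minimal sketch of that clean-then-invalidate sequence for AArch64, in the same spirit as libgcc's __clear_cache but simplified (GCC/Clang inline assembly assumed; the function name is mine, and note that reading CTR_EL0 once per call is exactly the assumption the article shows to be unsafe on cores with mismatched line sizes):

    #include <stdint.h>

    /* Make JITted code in [start, end) visible to instruction fetch:
       clean D-cache lines to the point of unification (dc cvau), then
       invalidate the corresponding I-cache lines (ic ivau), with barriers
       in between so each phase completes before the next begins. */
    static void sync_icache(char *start, char *end)
    {
        uint64_t ctr;
        asm volatile ("mrs %0, ctr_el0" : "=r" (ctr));
        uintptr_t dline = 4u << ((ctr >> 16) & 0xF);    /* D-cache min line, bytes */
        uintptr_t iline = 4u << (ctr & 0xF);            /* I-cache min line, bytes */

        for (char *p = start; p < end; p += dline)
            asm volatile ("dc cvau, %0" : : "r" (p) : "memory");
        asm volatile ("dsb ish" : : : "memory");        /* cleans have completed */

        for (char *p = start; p < end; p += iline)
            asm volatile ("ic ivau, %0" : : "r" (p) : "memory");
        asm volatile ("dsb ish; isb" : : : "memory");   /* invalidates are visible */
    }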
Not just that, the ISA usually requires that a cache invalidation instruction be issued regardless of whether the chip's coherency will automatically detect and invalidate it.
In cases such as this post, it is perfectly valid for the silicon engineers to say that it's the software's fault for not adhering to the ISA.
Thanks for the update on the manual cache management!
I'll admit that I find the question of feeding dissimilar cache line sizes into the same cache hierarchy intriguing from an architecture point of view. It has me doodling all sorts of questions into my notebook.
I don't see why they have to do this in userspace at all. If they did:
* allocate read/write buffer
* JIT instructions into it
* change mapping to read/execute
* run the JITted code
Then the kernel manages flushing the data caches on the mapping change, and Mono gets to wrap a Somebody Else's Problem field around it. It sounds like they are instead:
* allocate read/write/execute buffer
* JIT instructions into it
* manually flush relevant data caches (with an assumption that the cache line size is constant)
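For illustration, a minimal sketch of the first (remap-based) approach using POSIX mmap/mprotect; that the kernel performs all the needed cache maintenance on the permission change is the assumption this approach rests on, and the function name is just for the sketch:

    #include <string.h>
    #include <sys/mman.h>

    /* Emit `len` bytes of machine code into a fresh read/write mapping, then
       flip it to read/execute; the kernel's mprotect path is then the one
       responsible for whatever cache maintenance the new code needs. */
    void *emit_code(const void *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, code, len);
        if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(buf, len);
            return NULL;
        }
        return buf;
    }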
That approach is harder to use in practice than it sounds. It's not like people haven't tried it.
The OS only lets you allocate in large granules, like 4k or 16k, and the vast majority of methods are significantly smaller than that, meaning a JIT must colocate multiple methods in the same allocation block or waste a significant amount of memory.
We could get around that by remapping memory between read/write and read/execute and have the OS solve the problem for us. Except for a couple of small details: modifying a memory mapping is very expensive, and we're, well, in the performance business; and Mono is multi-threaded, so one thread might be executing code from the exact page we just made non-executable.
This approach, IIRC, was tried by Firefox as it has some security advantages, but discarded due to the measurable performance impact - and they don't have the second problem, as JS is single-threaded.
How is this safe in the multithreaded case anyway? If a process has just written a new JITted method and is flushing the i$ on the CPU it's executing on, but then gets scheduled away part-way through the flush, if you were very unlucky then couldn't another thread then get scheduled on that CPU and try to execute the just-written method, which failed to be fully flushed from that CPU's cache?
Multi-threaded safety is simply due to the JIT controlling the visibility of the newly compiled code. First flush, then make it visible for execution; you can't go wrong with that, and scheduling won't matter.
Things get a lot more complicated when it comes to code patching, but the principle is similar.
I don't think that helps - the point is that the flush might not be effective if the flushing thread gets scheduled away from the core which has the stale I$ before it manages to fully issue the flush.
Or is the flush guaranteed to flush all cores caches? That would be a fairly unusual design.
IC IVAU instructions are broadcast to all cores in the same 'inner shareable domain' (all cores running the same OS instance are in the same inner shareable domain).
Firefox is shipping W^X for jitcode, as far as I know. Or at least https://bugzilla.mozilla.org/show_bug.cgi?id=1215479 is marked as fixed and I don't see any obvious bugs blocking it that are fallout from that change.
And the "Firefox (Ion)" and "Firefox (non writable jitcode)" lines on https://arewefastyet.com/ seem to coincide...
But yes, actually making this work in practice is not at all simple.
I see the argument about the page size potentially being large relative to the size of a jitted function. But,
> modifying a memory mapping is very expensive
It used to be true that operations on memory mappings were appallingly expensive. However, the advent of virtualization has driven a significant change in performance. IIRC, ARMv8 has a TLB invalidation operation that is per-entry, addressed by the virtual address being invalidated. You don't need to flush the entire TLB.
> Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
I rather see the problem in the fact that there seems to be no way to tell the Linux scheduler: only schedule this process/thread on cores that have the same cache line size. Or an attribute, set when a thread is created, giving the cache line size of the cores it is allowed to run on. Or an attribute, set when a thread is created, meaning "allow an arbitrary cache line size, but don't migrate to cores with a different size". This way it would suffice to check the cache line size once at program or thread start.
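Something like this can already be approximated from user space with CPU affinity, at the cost of giving up big<->little migration entirely. A rough sketch (Linux sysfs coherency_line_size and sched_setaffinity assumed; the helper names are purely illustrative):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read one CPU's reported cache line size from sysfs (index0 is assumed
       to be an L1 cache here); returns -1 on failure. */
    static int line_size_of(int cpu)
    {
        char path[128];
        int line = -1;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index0/coherency_line_size", cpu);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &line) != 1)
                line = -1;
            fclose(f);
        }
        return line;
    }

    /* Restrict the calling thread to CPUs whose reported line size matches
       the CPU it starts on. Purely illustrative: it freezes the affinity
       mask once and forfeits big<->little migration. */
    static int pin_to_same_line_size(void)
    {
        int want = line_size_of(sched_getcpu());
        int n = (int) sysconf(_SC_NPROCESSORS_CONF);
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < n; cpu++)
            if (want > 0 && line_size_of(cpu) == want)
                CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof set, &set);
    }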
That's too specific a feature to expose to developers - most people wouldn't even know about the feature, and the ones that did might needlessly enable it just to be conservative.
The intention of the big.LITTLE architecture is to let processes be migrated seamlessly between the small and big core and let the unused core be turned off to save power. The kernel and the hardware should work together to make the core switching transparent and expose a safe way to invalidate the cache independent of the current processor.
> The intention of the big.LITTLE architecture is to let processes be migrated seamlessly between the small and big core and let the unused core be turned off to save power.
On the other hand, flushing specific cache lines is a rather special feature. If the code uses such obscure low-level features (much more than the "typical" application does), such as relying on the invariant that the size of a cache line stays constant over the execution (which most applications really don't care about), one can at least expect the developer to pass this information to the scheduler, so that the scheduler can take care that this invariant will indeed be satisfied.
Anything that uses a JIT necessarily uses this "obscure low-level feature", including any program implemented in one of many programming languages. Forcing such programs onto the heavy processor makes no sense; there just needs to be a way to properly clear the cache.
On https://news.ycombinator.com/item?id=12483698 I wrote a better idea of how this might be implemented without causing this problem. Nevertheless, even if we use my first, worse interface, this should not be a problem: spawn a thread that does the low-level stuff, sync with it, and after that run the JITted code.
But then you would never be able to take advantage of the power/performance tradeoff that big.LITTLE gives you; of being able to migrate you to a big core when you need it, but save power by migrating you to the little core when you're not doing anything CPU intensive, and shutting off the power-hungry big core.
> But then you would never be able to take advantage of the power/performance tradeoff that big.LITTLE gives you
Rather: Only applications that don't depend on the invariant that the size of a cache line will stay constant over the execution time will take advantage of this. I can imagine ways how one could increase the advantage, but I don't want to go too much into technical details here.
Why would programs want to "depend on the invariant that the size of a cache line will stay constant"? That's a minor detail that most applications don't care about.
In this case, it probably mattered because when JITting code, you need to explicitly flush the instruction cache after changing code. But that's not anything the application cares about; that's something the runtime cares about, and the runtime is going to want to support taking advantage of the speed/power tradeoff without making application developers go to the trouble of thinking about it.
Also, from a user perspective, you don't want applications to be able to demand only the big, powerful cores. That will eat through your battery much faster, with likely very little benefit.
> Also, from a user perspective, you don't want applications to be able to demand only the big, powerful cores.
This should rather be a setting in the permissions system, so that a user can refuse an application the right to demand the big, powerful cores. I can nevertheless imagine scenarios where enabling a (specific) application to do so can be quite useful.
Another idea for how one could solve this problem with a clever kernel interface. This idea is probably better than the approach in my post above, since it allows threads to migrate from big to little cores and vice versa except during critical regions.
Add an interface to the kernel/scheduler:
void lock_cacheline_size(...)
void unlock_cacheline_size(...)
Calling lock_cacheline_size tells the scheduler that from now on the thread must stay on a core with the same cache line size. Consider it as locking a mutex on the cache line size.
After this you can run critical code that depends on the fact that the cache line size stays constant for the whole time.
After you are finished with it, you call unlock_cacheline_size (a kind of unlocking of the mutex on the cache line size) - from now on the scheduler is free to migrate the code to cores with different cache line sizes again.
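To make the proposal concrete, a hypothetical usage sketch: lock_cacheline_size/unlock_cacheline_size and the flush helpers below don't exist in any real kernel or library, they are just the interface being proposed here.

    #include <stddef.h>

    /* Hypothetical calls proposed above -- not part of any real kernel. */
    void lock_cacheline_size(void);
    void unlock_cacheline_size(void);

    /* Hypothetical helpers for the sketch. */
    unsigned int current_cacheline_size(void);
    void flush_one_line(void *addr);

    /* JIT cache flush inside a "cache line size stays constant" critical
       region: the scheduler may still migrate the thread, but only between
       cores that report the same line size, so the stride stays valid. */
    static void jit_flush(char *code, size_t len)
    {
        lock_cacheline_size();
        unsigned int line = current_cacheline_size();
        for (char *p = code; p < code + len; p += line)
            flush_one_line(p);
        unlock_cacheline_size();
    }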
A simpler and more robust approach would be to add a system call to flush a memory range. The kernel can then do all the locking, without the risk that migration to the other core gets indefinitely blocked by user space code.
> An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
Doesn't ARM have an "Enter Critical Region" instruction?
Cell behaves more like a CPU + GPGPU system. big.LITTLE can schedule an instruction stream on any core of the cluster (depending on the configuration), whereas on Cell you'd primarily use the general-purpose PPE, from which you'd start (and chain) vector-based threads on the SPEs.
The PPE and SPEs don't even share an ISA; the SPEs have a custom-built SIMD-oriented ISA.
The difference is that big.LITTLE ARMs use the same instruction set on all cores. In general I wouldn't expect a lot of terminological rigor from blog posts.
> The Playstation 4 isn't a AMP machine because programming the Cell was so hard.
The PS4 could be an AMP design if you consider the (closely coupled) GPU a processor.
It doesn't require the same gymnastics as the PS3 did, though, because both the main processor and the GPU are more capable. The SPUs were required to perform computation that the anemic PPU could not, as well as fill in where the pre-unified-shader-model GPU was unable to keep pace.
Memory was shared on the PS3 but, from the SPUs, required explicit put and fetch operations.
Cell is essentially a distributed-memory cluster on a single chip, because each SPU has its own address space and cannot directly access main memory. I'm not sure what the exact definition of AMP is, but Cell does not exactly match my feeling of what AMP should be. In this regard the Wii seems more like an AMP system, with two completely different CPUs (PPC and ARM) sharing what essentially amounts to the same address space (and in the Wii U there are 3 PPC cores, where one of them is slightly different from the other two and cache coherency between them can only be described as broken).
There is no question that programming for Cell was hard, but I think it's mostly about middleware support (probably because the platform is so different from the PC and Xbox 360).
The real problem with the Cell was that each Cell SPE processor only has 256K of local memory. It has bulk DMA access to main memory, but that's more like I/O. 256K is too small for a video frame, a game level, or much else in a modern game. So everything has to be done on an assembly line basis, where data is pumped into a Cell processor, processed, and pumped out. Great for audio, terrible for everything else. In comparison, the main processor had access to 256MB of RAM.
If they'd had, say, 16MB per processor, it might have worked out. One CPU for collision detection and physics, one for NPC management and AI, etc. But giving each SPE processor only 0.1% of the total memory space was too constraining.
"There are only two hard things in Computer Science: cache invalidation and naming things" ― Phil Karlton; not sure to what extent this quote is compatible with the second one by Rob. Or does it mean by implication that simply "Computer Science is bugs waiting to happen"?...
The parent actually appears to have the quote both correct in content and attribution. Supposedly, someone else added the "off by one"[1][2][3]. That seems to be the extent of the Internet's knowledge on the quote, though the Skeptics link notes that there's nothing direct to the supposed originator.
This is one of those quotes where I feel there's more than one right answer. I like the addition of "off-by-one", and to make it a nice round three things, I usually use this version:
There are three hard things in computer science:
1. Naming things
2. 3. Concurrency
Cache Invalidation
4. Off-by-one errors
One example is when doing a DMA operation to memory. That often bypasses the cache. If the new data is to "stick", the cache needs to be convinced it doesn't know what's in that buffer any more. This is an ARM thing; Intel architectures integrate DMA with the cache.
That would create a race condition addressed at the bottom of the article: the process can get switched onto another CPU between the invocation of get_current_cpu_cache_line_size() and the invalidation.
> An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.
No, you just sometimes issue twice as many flush requests as necessary. You can't flush or invalidate half a cache line, since the data is stored in units of cache lines.
In theory I think you could just invalidate the addresses byte for byte, ignoring the cache line size, but I assume the performance hit would be noticeable.
Performance. get_current_cpu_cache_line_size would need to run some code to determine the cache line size, and that code takes longer to run than using a cached value.
Along similar lines, if you have an optimized routine using specific CPU instructions, you don't want to call CPUID (or equivalent) on every call to find out if you have those instructions; you want to call it once and cache the answer. If it can return different answers on different CPUs in the same system, you need to use a different mechanism instead, such as asking the OS for the least common denominator features of available CPUs, or notifying the OS that you need to run on one of the more capable CPUs that has the feature you need.
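A sketch of that cache-the-answer pattern, assuming GCC/Clang's __builtin_cpu_supports on x86 (on ARM you'd typically ask the OS, e.g. via HWCAPs, rather than the hardware directly; the function name is just for illustration):

    #include <stdbool.h>

    /* Probe for a feature once and cache the answer, instead of re-running
       CPUID on every call into the optimized routine. This is only valid
       when every CPU the thread can run on gives the same answer. */
    static bool have_avx2(void)
    {
        static int cached = -1;     /* -1 = not probed yet */
        if (cached < 0)
            cached = __builtin_cpu_supports("avx2") ? 1 : 0;
        return cached == 1;
    }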
Stupid question but does that work in a virtualised environment where your program can be live-migrated to another physical machine with a different CPU?
As others said, one typically presents a fixed CPU model to the guest on all hosts.
Also, for this specific case, I doubt migration will keep around the contents of the cache lines and their 'dirty' bits (corollary: it will be possible to reliably detect a move, if one is willing to continuously run code that detects cache-line timing differences)
Nope. There's not really any alternative other than "Don't do that", or limit migration to machines that have a superset of the instructions on the original machine.
Usually you configure the VM to only report CPUID values corresponding to lowest-common-denominator features on everything you might want to migrate to. Then as long as the guest code plays nicely and looks at the CPUID feature flags to see what it can use, it'll migrate happily. QEMU has support for this, for instance.
It reads some coprocessor registers, which are instructions that cannot be reordered; they slow down everything on an OOO core. After that it's just some bitmasking (not slow).
For 64-bit ARMv8 (ie AArch64) system registers are in general reorderable; software must provide explicit synchronization (typically via barrier instructions) where it does not want the reordering, except for a few registers which have implicit synchronization. Since CTR_EL0 is entirely constant there's no inherent reason why it shouldn't be reorderable pretty freely, though it's an implementation detail how fast or otherwise it is in practice. (Benchmark if it matters to you!)
(This is all documented in the v8 ARM ARM section "Synchronization requirements for AArch64 System Registers".)
This is an excellent bug journey, and I'm even more impressed that the resulting discovery has already been used to improve Dolphin. A testament to the quality of both projects.
I wonder why no one tried to validate Asymmetric MultiProcessing by first validating all cases with either the little or the big cores alone, and then bisecting further down when both are enabled.
Overwriting executable machine code in memory is a rare case (only JITs need it, only for emitting code). Some CPUs do wire the memory write operation to the instruction cache to make sure the instruction cache knows some code changed, but ARM chose a simpler design where the instruction cache assumes the code never changes in memory and if this assumption is wrong, the programmer has to use a special instruction to tell the instruction cache to throw something out of cache. The cache clearing instruction throws out either 64 bytes aligned to multiple of 64 or 128 bytes aligned to multiple of 128, depending on the core.