The case where big and little both want the same thing in their caches should be handled by the usual cache-coherency traffic between the CPUs, which ensures they don't disagree about what's in their L1 caches (very handwaved, because I don't know the details).
The reason for the manual cache operations is that they're generating JITted code -- on ARM, to ensure that what you execute is the same thing you just wrote, you have to (1) clean the data cache, so your changes get out to main memory[*], and then (2) invalidate the icache, so that execution fetches the fresh data from memory rather than using stale info. This clean-and-invalidate operation is usually informally called a flush, though it isn't really one in ARM terminology.
[*] not actually main memory, usually: it only has to go out to the "point of unification" where the iside and dside come together, which is probably the L2 cache.
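For the curious, the sequence looks roughly like this on ARMv8. A minimal sketch, hardcoding a 64-byte line size (the very assumption that bites on big.LITTLE -- real code should read the line sizes from CTR_EL0 on the CPU doing the flush); the function name is illustrative:

    /* Minimal sketch of the ARMv8 clean-then-invalidate ("flush")
     * sequence described above. Assumes a fixed 64-byte line size for
     * both caches; real code should query CTR_EL0 instead. */
    #include <stdint.h>

    void flush_jit_range(char *start, char *end)
    {
        const uintptr_t line = 64;                 /* assumed line size */
        uintptr_t a;

        /* (1) clean dcache by VA out to the point of unification */
        for (a = (uintptr_t)start & ~(line - 1); a < (uintptr_t)end; a += line)
            __asm__ volatile("dc cvau, %0" :: "r"(a) : "memory");
        __asm__ volatile("dsb ish" ::: "memory");  /* wait for cleans */

        /* (2) invalidate icache by VA over the same range */
        for (a = (uintptr_t)start & ~(line - 1); a < (uintptr_t)end; a += line)
            __asm__ volatile("ic ivau, %0" :: "r"(a) : "memory");
        __asm__ volatile("dsb ish" ::: "memory");  /* wait for invalidates */
        __asm__ volatile("isb" ::: "memory");      /* discard fetched stale code */
    }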
Not just that, the ISA usually requires that a cache invalidation instruction be issued regardless of whether the chip's coherency will automatically detect and invalidate it.
In cases such as this post, it is perfectly valid for the silicon engineers to say that it's the software's fault for not adhering to the ISA.
Thanks for the update on the manual cache management!
I'll admit that I find the question of dissimilar cache line sizes within the same cache hierarchy intriguing from an architecture point of view. It has me doodling all sorts of questions into my notebook.
I don't see why they have to do this in userspace at all. If they did:
* allocate read/write buffer
* JIT instructions into it
* change mapping to read/execute
* run the JITted code
Then the kernel manages flushing the data caches on the mapping change, and Mono gets to wrap a Somebody Else's Problem field around it (a sketch of that flow follows below). It sounds like they are instead:
* allocate read/write/execute buffer
* JIT instructions into it
* manually flush relevant data caches (with an assumption that the cache line size is constant)
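For concreteness, the first flow might look like this. A minimal sketch, assuming POSIX mmap/mprotect; jit_compile_into() stands in for the actual code generator and is hypothetical:

    /* Sketch of the RW -> RX flow from the first list above. The kernel
     * performs whatever cache maintenance the architecture requires when
     * the pages become executable. */
    #include <stddef.h>
    #include <sys/mman.h>

    void jit_compile_into(void *buf, size_t len);  /* hypothetical codegen */

    typedef void (*jit_fn)(void);

    jit_fn jit_emit(size_t len)
    {
        /* allocate read/write (not executable) buffer */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;

        jit_compile_into(buf, len);   /* JIT instructions into it */

        /* change mapping to read/execute; cache flushing is now
         * Somebody Else's (the kernel's) Problem */
        if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(buf, len);
            return NULL;
        }
        return (jit_fn)buf;           /* run the JITted code */
    }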
That approach is harder to use in practice than it sounds. It's not like people haven't tried it.
The OS only lets you allocate in large granules, like 4k or 16k, and the vast majority of methods are significantly smaller than that, meaning a JIT must colocate multiple methods in the same allocation block or waste a significant amount of memory.
We could get around that by remapping memory between read/write and read/execute and having the OS solve the problem for us. Except for a couple of small details: modifying a memory mapping is very expensive, and we're, well, in the performance business; and Mono is multi-threaded, so one thread might be executing code from the exact page we just made non-executable.
This approach, IIRC, was tried by Firefox, as it has some security advantages, but was discarded due to the measurable performance impact - and they don't have the second problem, as JS is single-threaded.
How is this safe in the multithreaded case anyway? Suppose a thread has just written a new JITted method and is flushing the i$ on the CPU it's executing on, but gets scheduled away part-way through the flush. If you were very unlucky, couldn't another thread then get scheduled on that CPU and try to execute the just-written method, which was never fully flushed from that CPU's cache?
Multi-threaded safety is simply due to the JIT controlling the visibility of the newly compiled code. First flush, then make it visible for execution; you can't go wrong with that, and scheduling won't matter.
Things get a lot more complicated when it comes to code patching, but the principle is similar.
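In code, the ordering might look like this. A minimal sketch, assuming C11 atomics; flush_jit_range() is the clean+invalidate sequence sketched earlier in the thread, and the other names are illustrative:

    /* Sketch of "first flush, then make it visible": no thread can obtain
     * a pointer to the new method before the flush has been issued. */
    #include <stdatomic.h>
    #include <stddef.h>

    void flush_jit_range(char *start, char *end);  /* see earlier sketch */

    typedef void (*jit_fn)(void);
    static _Atomic(jit_fn) method_entry;           /* NULL until published */

    void publish_method(char *code, size_t len)
    {
        flush_jit_range(code, code + len);         /* 1: flush */
        atomic_store_explicit(&method_entry,       /* 2: publish */
                              (jit_fn)code, memory_order_release);
    }

    void call_method(void)
    {
        jit_fn fn = atomic_load_explicit(&method_entry, memory_order_acquire);
        if (fn)
            fn();   /* only reachable after the flush completed */
    }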
I don't think that helps - the point is that the flush might not be effective if the flushing thread gets scheduled away from the core which has the stale I$ before it manages to fully issue the flush.
Or is the flush guaranteed to flush all cores' caches? That would be a fairly unusual design.
IC IVAU instructions are broadcast to all cores in the same 'inner shareable domain' (all cores running the same OS instance are in the same inner shareable domain).
Firefox is shipping W^X for jitcode, as far as I know. Or at least https://bugzilla.mozilla.org/show_bug.cgi?id=1215479 is marked as fixed and I don't see any obvious bugs blocking it that are fallout from that change.
And the "Firefox (Ion)" and "Firefox (non writable jitcode)" lines on https://arewefastyet.com/ seem to coincide...
But yes, actually making this work in practice is not at all simple.
I see the argument about the page size potentially being large relative to the size of a jitted function. But,
> modifying a memory mapping is very expensive
It used to be true that operations on memory mappings were appallingly expensive. However, the advent of virtualization has driven a significant change in performance. IIRC, ARMv8 has a TLB invalidation operation that is per-entry, addressed by the virtual address being invalidated. You don't need to flush the entire TLB.
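For reference, the per-entry invalidate is a privileged instruction, so the kernel issues it on the JIT's behalf as part of the mprotect. A rough sketch (not actual kernel code):

    /* Illustrative ARMv8 per-entry TLB invalidate by virtual address,
     * broadcast inner-shareable so other cores drop the stale entry too.
     * TLBI is EL1-only; userspace gets this via the kernel's mprotect path. */
    static inline void tlbi_page(unsigned long va)
    {
        unsigned long arg = va >> 12;     /* Xt holds VA[55:12] */
        __asm__ volatile("dsb ishst\n\t"  /* make the PTE update visible */
                         "tlbi vae1is, %0\n\t"
                         "dsb ish\n\t"    /* wait for the broadcast */
                         "isb"
                         :: "r"(arg) : "memory");
    }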