Different cacheline sizes for the different cores seems like an absurdly bad ide...

gcp · on Sept 13, 2016

ARM's own designs (A53, A57, A72, A73) all have 64-byte cache line sizes and avoid the problem entirely.

The one at fault appears to be Samsung, who designed M1 Mongoose with 128 byte lines and packed it together with A53 cores in their SoC.

cesarb · on Sept 13, 2016

Perhaps a better (and simpler) workaround, then, could be to clamp the reported cache line size to 64 bytes. So even if the Samsung core reports 128-byte cache lines, the code would simply invalidate each line twice, and if it is migrated in the middle of the invalidation loop, it would still work correctly.

gcp · on Sept 13, 2016

This is exactly the fix used in the kernel, yes. The trickiness and complexity comes from the cache line size probe having to be intercepted.

pawadu · on Sept 13, 2016

You could also say that the ARM implementation is not optimal since it uses the same cache size for vastly different designs (different pipeline length and memory access characteristics).

Samsung tried to improve performance, I don't think they are at fault at all.

gcp · on Sept 13, 2016

Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.

Going to 128-byte cache lines was fine, but they should have made the lower-power part of the chip match (which in this case they likely couldn't because they licensed that design), or they should have made sure 128-byte lines were reported in all circumstances (which requires a hack similar to the workaround done in the kernel because again they can't make the A53 report 64 bytes).

sangnoir · on Sept 13, 2016

> Samsung's design breaks correctly written user-land code. Hence, they're 100% at fault.

This is similar to blaming OS/BIOS manufaturers for Y2K breaking "correctly written code" at the turn of the millenium.The code was incorrect in this instance because it assumed the cache size would be the same for all cores, Samsung simply manufactured a SOC that breaks that faulty assumption.

oshepherd · on Sept 13, 2016

The architecture states that CTR_EL0 reports the _minimum_ ICache line size across all cores.

In this system, that would be 64 bytes.

pawadu · on Sept 13, 2016

Can you point me to the manual that does say so? This is all I could find on the subject

> [3:0] IminLine Log 2 of the number of words in the smallest cache line of all the Instruction Caches that the processor controls.

> This values is:

> 0x4

> Smallest Instruction Cache line size is 16 words.

gcp · on Sept 13, 2016

The assumption isn't faulty, it's guaranteed to be right. And it has to be, or there's no way to write functional code for ARM.

abysmallyideal · on Sept 16, 2016

That is a bit far-fetched, as it seems to mostly affect how mono and gcc(?) JIT is designed.

pawadu · on Sept 13, 2016

> Samsung's design breaks correctly written user-land code.

Are you sure? From the article, it sounds to me that the real cause of the error is an incorrect assumption made by the GCC people.

gcp · on Sept 13, 2016

Yes, I am sure.

If we forget about the ARM spec explicitly enforcing this anyway, here's some food for thought: how exactly do you suppose the GCC people could write correct code, if they aren't allowed to make that assumption?

pawadu · on Sept 13, 2016

Okay, I did not know that. But could you point me to the part of the spec that says this? I wasn't able to find this in their public specs.

gcp · on Sept 13, 2016

ARM ARM, section B2.2.6

The CTR holds minimum line length values for: - the instruction caches

...this value is the most efficient address stride to use to apply to a sequence of address-based maintenance operations to a range of addresses...

The documentation for the CTR_EL0 reg talks about "caches under this processors control" which you could argue don't include other cores, but if you allow migration between cores, then line size changes underneath running software break the above "most efficient address stride" assumption. So you can't do that.

It boils down to this: If you can't assume that cache line size doesn't change underneath you, then you can't invalidate line by line at all, and would have to go word by word. That's terrible for performance (and a huge waste), which is why the spec says to use the above value for those operations.

pawadu · on Sept 13, 2016

I appreciate you digging up the relevant text from the manual, but I don't think you should accuse Samsung of wrongdoing based on such far fetched assumptions.

That document does not explicitly forbids this. In addition, take a look at their sample code which indeed reads some cache configuration registers during each call. The same code is found verbatim in the linux kernel, if you now don't trust the ARM engineers :)

gcp · on Sept 13, 2016

> far fetched assumptions

I don't think considering how to actually implement the basic operation without a terrible performance penalty is a "far fetched assumption".

> That document does not explicitly forbids this.

Yet it makes it clear software can be written to assume it doesn't happen, which is the same thing.

> In addition, take a look at their sample code which indeed reads some cache configuration registers during each call.

This doesn't prevent the problem at all, it just reduces the window for things to go wrong.

> The same code is found verbatim in the linux kernel, if you now don't trust the ARM engineers :)

I trust the ARM engineers. As I already said, their code is right, Samsung got it wrong. Note that the libgcc code that breaks was ALSO written by ARM.

digi_owl · on Sept 12, 2016

Best i can tell, the whole thing is designed so that you can go from a a single "little" all the way to two sets of four. Thus each core is designed to be used both independently and in larger configurations.