Sure, big.LITTLE chips have been around for a while. But my understanding is that the instruction set is identical between the cores; they just operate at different speeds. That in itself must be a fun balancing act for the scheduler.
To clarify my concerns (I haven't dug into the specific instruction set differences):
Assume the main core supports AVX2 and the smaller cores don't. Which core do you execute the code on? Which one will get you the best performance per watt? How do you account for that in the OS scheduler? What do you want to optimise for?
If your code is compiled for AVX2, it'll fail on the small cores unless it does continuous runtime checking (which is expensive, but given processes can migrate between cores, presumably necessary).
It's not so much the speeds that differ as the microarchitecture. The big cores are typically out-of-order cores with a large amount of cache, to extract as many instructions per cycle as possible. The LITTLE cores are in-order cores with a small amount of cache. These have a lower IPC, but they also use much less energy per instruction than the big cores.
You do the same with FP. The first floating-point instruction traps, and the process is then flagged as needing FP. The context-switch code uses that flag to decide whether to save and restore the FP registers as well.
The advantage is that you avoid saving FP registers unless the process is actually going to use them.
That flag could easily determine what you can run where.
As mentioned in your link, that's ultimately against ARM's requirements for big.LITTLE and was due to buggy Samsung patches. A fixed kernel only exposes the subset of features available on all CPUs.
How does the scheduler know which process will use CPU instructions that the small cores don't support? Will it just try it out and then move the process to a more capable core if it detects an error?
>How does the scheduler know which process will use CPU instructions that the small cores don't support?
I don't see why that's required. If you catch the illegal-instruction signal the CPU raises, you can just rerun it on the other CPU, since the program counter will not have advanced past the faulting instruction. There's a delay in catching the signal and retrying on the other CPU, but I don't see the big deal here?
I once ran into the same sort of issue with this very chip, in that the little A55 cores supported half-float compute, while the big Mongoose cores didn't. Seriously a pain.