There is no if statement in the hot loop or in the kernel thread, so there is no performance cost there.
The multiplexing thread is separate from the kernel threads, so you could say it's 1:M:N thread scheduling. I should have been clearer in my comment. There are three types of threads.
The multiplexing thread timeslices the preemption of the lightweight threads and kernel threads every 10 milliseconds. That is, it stops the loop in the currently running lightweight thread and causes the next lightweight thread to execute.
So there is no overhead except for a structure variable retrieval in the loop body.
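A minimal sketch of the shape I mean (the struct and names are illustrative, not the real code): the loop condition re-reads a struct field every iteration, and the multiplexer preempts by collapsing the bound onto the counter so the condition simply fails, with no if statement in the body.

```c
/* Hypothetical per-lightweight-thread state; field names are illustrative. */
struct lw_thread {
    long index;  /* loop counter, written by both threads (the race discussed below) */
    long limit;  /* loop bound */
};

/* The hot loop: no explicit preemption check. The only overhead is
   re-reading the struct fields on each iteration. */
long hot_loop(struct lw_thread *t) {
    long sum = 0;
    for (; t->index < t->limit; t->index++)
        sum += t->index;  /* the body does the real work */
    return sum;
}

/* The multiplexing thread preempts by making the loop condition fail. */
void preempt(struct lw_thread *t) {
    t->index = t->limit;
}
```

Run to completion from index 0 with limit 10, `hot_loop` just sums 0..9; a concurrent `preempt` makes the next condition check fail instead.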
The risk is that a thread fails to reschedule/be preempted/end its loop within the timeslice; if the exact same interleaving recurs in every timeslice after, it would never be preempted. But this is unlikely in practice.
A data race between the increment at the end of a loop iteration and the multiplexer setting the index to the limit could prevent preemption; this is unlikely but possible. I assume the body of the loop accounts for most of the execution time, not the loop conditional and increment.
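Concretely, the lost-preemption interleaving can be simulated single-threaded (variable names are illustrative):

```c
/* Single-threaded simulation of the lost-preemption interleaving:
   the hot thread reads the counter, the multiplexer writes the limit
   into it, then the hot thread's increment lands last and undoes
   the preemption. */
long simulate_lost_preemption(void) {
    long index = 5, limit = 1000;
    long i = index;   /* hot thread: reads counter at end of an iteration */
    index = limit;    /* multiplexer: preempt by writing the limit (1000) */
    index = i + 1;    /* hot thread: increment lands last, writes 6 */
    return index;     /* 6 < 1000, so the loop keeps running */
}
```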
I've not measured the overhead of the atomics. In theory the atomics are on the slow path of the loop: they happen during preemption and at the end of a hot loop iteration.
The loop itself is the hot part that we don't want to slow down
The worst we do is slow down the start of the next hot loop iteration with the atomic set.
Atomic reads are cheap. I'm not sure whether atomic writes are cheap when uncontended.
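For reference, the atomic variant I have in mind looks something like this (C11 atomics; names illustrative, not the real code). The loads in the condition can be relaxed, which on x86 compile to plain loads, so the cost is concentrated in the store at the end of each iteration:

```c
#include <stdatomic.h>

/* Hypothetical atomic variant of the per-thread state. */
struct lw_thread_atomic {
    _Atomic long index;  /* written by both the hot loop and the multiplexer */
    _Atomic long limit;
};

long hot_loop_atomic(struct lw_thread_atomic *t) {
    long sum = 0;
    long i;
    /* Relaxed loads keep the condition cheap; the only store is the
       end-of-iteration increment, which can still lose a race with
       preempt_atomic (the race discussed above). */
    while ((i = atomic_load_explicit(&t->index, memory_order_relaxed))
           < atomic_load_explicit(&t->limit, memory_order_relaxed)) {
        sum += i;  /* hot body: the bulk of the execution time */
        atomic_store_explicit(&t->index, i + 1, memory_order_relaxed);
    }
    return sum;
}

/* Preemption on the slow path: collapse the counter onto the bound. */
void preempt_atomic(struct lw_thread_atomic *t) {
    atomic_store_explicit(&t->index,
                          atomic_load_explicit(&t->limit, memory_order_relaxed),
                          memory_order_relaxed);
}
```

Unlike the plain-field version, this is free of undefined behaviour; the lost-preemption window still exists, but it is now an ordinary value race rather than UB.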
The data race on the volatile version of the C code is technically UB, so you are at the mercy of the architecture, optimization settings and phase of the moon.
The right answer is probably to use atomics, but that has a performance cost.
I am still trying to find out how much atomics suffer under contention.
My perfect scenario is that the atomic increment is as cheap as the non-atomic increment.
But I think the bus locking and cache coherence protocols mean the data is flushed from the store buffer to memory, which is slow. I don't know whether it acts as a lock with a cheap uncontended path.
It occurred to me that I can detect a potential race from the multiplexing thread. The code in the multiplexing thread can be slow and it would be fine (it doesn't affect hot loop performance). We essentially track whether the looping variable still holds the value we set it to, and if it changes back due to a data race, we set it again to correct it.
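A sketch of that corrective scheme on the multiplexing thread's side (hypothetical names; this loop is allowed to be slow because it never runs on the hot path):

```c
#include <stdatomic.h>

/* Hypothetical corrective preemption: keep re-asserting index = limit
   until the write is observed to stick. Runs on the multiplexing
   thread, so its cost never touches the hot loop. */
void preempt_corrective(_Atomic long *index, long limit) {
    for (;;) {
        atomic_store_explicit(index, limit, memory_order_relaxed);
        /* If the hot thread's end-of-iteration increment raced us and
           overwrote the store, the value has moved again; write it back. */
        if (atomic_load_explicit(index, memory_order_relaxed) == limit)
            break;
    }
}
```

This doesn't eliminate the race, it just bounds its effect: each lost write is detected and re-applied, so a preemption can be delayed by an iteration but not lost forever.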