I like it! Do you have any sense for what the perf hit is for making those loops...

samsquire · on Sept 29, 2022

There is no if statement in the hot loop or in the kernel thread so there is no performance cost there.

The multiplexing thread is separate to the kernel thread so you could say it's 1:M:N thread scheduling. I should have been clearer in my comment. There is 3 types of threads.

The multiplexing thread timeslices the preemption of the lightweight threads and kernel threads every 10 milliseconds. That is it stops all the loops in the lightweight thread and in the lightweight thread and causes the next lightweight thread to execute.

So there is no overhead except for a structure variable retrieval in the loop body

Rather than

For(int I = 0; I < 1000000; I++) {

}

We have

Register_loop(thread_loops, 0, 0, 1000000);

For(; thread_loops[0].index < thread_loops[0].limit; thread_loops[0].index++) {

}

Handle_virtual_interrupt()

And in the thread multiplexer scheduler, we do this

thread_loops[0].index = thread_loops[0].limit

ReactiveJelly · on Sept 30, 2022

But that means `thread_loops [0].index` must be read / written with atomic ops, right?

I saw that's how wasmtime solves pre-emption of wasm threads with their "epoch-based interruption". [1]

I see your Rust implementation does indeed use atomic ops.

Have you _measured_ that there is no overhead to atomics?

[1] https://docs.rs/wasmtime/1.0.1/wasmtime/struct.Config.html#m...

(Also it says that malicious guest code cannot skip the check, which is nice)

samsquire · on Sept 30, 2022

I didn't use atomic ops in Java or C.

It still works.

The risk is that a thread fails to reschedule/be preempted/end the loop in this timeslice but if the exact same interleaving happens to every other timeslice after, it shall never preempt. But this is unlikely in practice.

The chance of an incrementing at the end of a loop data race with setting the index to the limit and a preventing preemption a unlikely but possible. I assume the body of the loop has the most execution time, not the loop conditional and increment.

I've not measured the overhead of the atomics. In theory the atomics are on the slow path of the loop - they happen during preemption and at the end of a hot loop iteration The loop itself is the hot part that we don't want to slow down The worst we do is slow down the execution of the next hot loop with atomic set.

Atomic reads are cheap. I'm not sure if atomic writes are cheap if uncontended

mattgreenrocks · on Oct 1, 2022

The data race on the volatile version of the C code is technically UB, so you are at the mercy of the architecture, optimization settings and phase of the moon.

This is a great answer for more detail: https://stackoverflow.com/a/60482370

samsquire · on Oct 1, 2022

Thanks Matt. I shall need to do more testing.

The right answer is probably use atomics but that has a performance cost.

I am still trying to find out if atomics suffer from contendedness.

My perfect scenario is the atomic increment should be as cheap as the non atomic increment.

But I think the bus pausing and cache coherence protocols mean that the data is flushed from the store buffer to the memory, which is slow. I don't know if it acts as a lock and has an uncontended option

samsquire · on Sept 30, 2022

It occurred to me I can detect a potential race happening in the multiplexing thread. The code in the multiplexing thread could be slow and it would be fine (it doesn't affect hot loop performance) We essentially track whether or not we set the looping variable to the index and if it changes back due to a data race, we set it back again to correct it.