Wouldn't this kind of development target be better served by optimizing small C/C++ programs instead of trying to optimize for some abstract virtual machine implemented on top of the hardware? I mean, if speed really is your goal, why not do it correctly instead of hitting yourself in the face with an extra tree before starting?
Why do you assume that the JVM adds overhead? While in some cases a program is better served by C/C++ manual memory management and fine-tuned memory alignment, this is not usually the case.
You can think of the JVM as a very good optimizing compiler that compiles your program when you load it in a way that's tailored to your environment.
Also, when it comes to concurrency support, the JVM is usually years ahead of C++ (lock-free data structures, etc.). If you're doing concurrency, the JVM is usually a better target than C++.
Not to mention that the kinds of programs that would benefit most from lightweight threads are high-connection-count servers, which are precisely the kinds of applications where the JVM's weaknesses (startup time, base-level latency, etc.) are best hidden.
It's possible to perform compare-and-swaps in Java just like it is in C. They compile right down to the same primitives that flip bits in metal: you'll get CMPXCHGL instructions (using x86_64 as an example) from the JVM just as you will from GCC.
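To make the compare-and-swap point concrete, here is a minimal sketch of a CAS retry loop using `java.util.concurrent.atomic.AtomicInteger`. The class name `CasCounter` and the helper `runDemo` are made up for illustration; only `AtomicInteger.compareAndSet` is the real API being shown.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class; the names here are not from the thread.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Classic CAS retry loop: read the current value, compute the new one,
    // and try to swap it in. If another thread got there first, retry.
    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Starts `threads` workers that each increment `perThread` times,
    // waits for all of them, and returns the final count.
    static int runDemo(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // No locks are taken, yet no updates are lost.
        System.out.println(runDemo(4, 100_000)); // prints 400000
    }
}
```

If you want to check the generated machine code yourself, HotSpot can dump JIT output with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly`, assuming the hsdis disassembler plugin is installed. On x86_64 I'd expect the `compareAndSet` call to show up as a locked compare-and-exchange, though the exact output depends on the JVM version and JIT tier.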