Building a Distributed Hypervisor on FreeBSD (2015) [video] (youtube.com)
108 points by xj9 on March 7, 2016 | 3 comments



This is interesting but short on details. Is it essentially creating a virtual many-core / NUMA machine? If so, I wonder what the overhead is for things like emulating x86 cache coherency.


Furthermore, fundamentals prevent this from scaling as a single system would: CPU interconnects are in the 160+ Gbps range, compared to the 10 Gbps network in their example. Even in a single dual-socket system it is not uncommon to pin processes to a single CPU (via numactl) to get full performance out of applications and avoid saturating that interconnect.
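
For anyone who hasn't done the pinning dance, here's a minimal sketch using Linux's libnuma (the node number and buffer size are arbitrary assumptions of mine, not anything from the talk):

    #include <numa.h>      /* link with -lnuma */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        numa_run_on_node(0);                   /* keep our threads on node 0 */
        size_t len = 64UL << 20;               /* 64 MiB working set */
        char *buf = numa_alloc_onnode(len, 0); /* back it with node 0's DRAM */
        if (buf == NULL) return 1;
        memset(buf, 0, len);                   /* fault the pages in locally */
        numa_free(buf, len);
        return 0;
    }

Nothing in that program ever crosses the socket interconnect; the whole question is whether a distributed hypervisor can hide the fact that "the other node" now sits at the far end of a 10 Gbps wire.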

Accessing memory from a remote core over the network is seriously handicapped in this respect, and even purpose-built cluster interconnects such as InfiniBand RDMA, with bandwidth in the 100 Gbps realm, still have issues with the latency. A single stick of previous-gen DDR3-1600 can exceed that bandwidth, with roughly 15x faster access time.
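
Back-of-envelope on that claim, assuming a standard 64-bit (8-byte) memory channel:

    DDR3-1600:     1600 MT/s x 8 B = 12.8 GB/s ~= 102.4 Gbps
    100 Gbps link:                    12.5 GB/s

So a single DIMM just edges out the link on raw bandwidth, and that's before latency enters the picture.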

To give an example: a very low-latency (non-Ethernet) network is around 1.3 μs (1,300 ns) round trip, local DRAM (from the same socket) is around 60 ns, and L3 cache on the same socket is around 15 ns.

You can be smart about pinning memory allocations to the local CPU core requesting them and minimizing thread migration to other hosts, but there is no magic bullet for getting past these physical limits. Accessing packets arriving on host A's network card from host B's core would also halve the usable network bandwidth unless there is a dedicated network for the clustering.
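
The thread-migration half, as a minimal Linux sketch (CPU number and buffer size are again arbitrary assumptions): pin the thread, then rely on the kernel's first-touch policy to place its pages on the local node.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                  /* pin calling thread to CPU 0 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        size_t len = 64UL << 20;
        char *buf = malloc(len);
        if (buf == NULL) return 1;
        memset(buf, 0, len);               /* first touch happens on CPU 0,
                                              so pages land on CPU 0's node */
        free(buf);
        return 0;
    }

Of course, a guest kernel doing this inside a distributed hypervisor only wins if the virtual-CPU-to-host mapping is itself stable.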

That being said, virtualization isn't perfect either, and the overhead can be substantial, as we've seen when comparing against containers on bare metal.

I'd really like to take this for a test drive and benchmark it with some R jobs.

Even if it's slower, not having to use cluster-aware toolkits is valuable to many, not to mention the simplicity of operation.

I wish they would release an open source version before getting sucked up as a stop-loss by Intel.


I imagine that an untuned guest operating system scheduler would wreak havoc in this scenario.



