Why is TCP accept() performance so bad under Xen? (serverfault.com)
81 points by DanWaterworth on May 22, 2011 | hide | past | favorite | 24 comments



This is a very well known issue (at least among virtualization developers :-)). ESX does a surprisingly good job handling this but both Xen and KVM are still trying to catch up here.

The issue is small packet performance. You can isolate it pretty easily with netperf TCP_RR. In order to send a packet, the hypervisor needs to switch from the guest, to the hypervisor, and in the case of Xen, to domain-0. These switches are very expensive.

Normally, you don't notice this because almost all I/O in hypervisors is done with a data structure known as a lockless ring-queue. These data structures are extremely efficient at batching requests in such a way as to minimize the overhead of world switching by trying to never do it.
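
To make the batching idea concrete, here's a toy single-producer ring in C. This is purely illustrative (nothing like Xen's actual netfront/netback code): the producer suppresses notifications unless the ring was empty, so a burst of back-to-back requests shares a single world switch.

  /* Toy sketch of the notification-suppression idea behind hypervisor I/O
   * rings -- not real Xen code. The producer only "kicks" the other side
   * when the ring goes from empty to non-empty. */
  #include <stdint.h>
  #include <string.h>

  #define RING_SIZE 256                     /* must be a power of two */

  struct ring {
      volatile uint32_t prod;               /* written by the guest (producer) */
      volatile uint32_t cons;               /* written by the backend (consumer) */
      struct { char data[64]; } slot[RING_SIZE];
  };

  /* Stand-in for the hypercall/event-channel kick that forces a world switch. */
  static void notify_backend(void) { }

  static int ring_put(struct ring *r, const void *pkt, size_t len)
  {
      uint32_t p = r->prod, c = r->cons;

      if (p - c == RING_SIZE)
          return -1;                        /* ring full: caller must back off */

      memcpy(r->slot[p % RING_SIZE].data, pkt, len < 64 ? len : 64);
      __sync_synchronize();                 /* publish the data before the index */
      r->prod = p + 1;

      /* The expensive world switch happens only on the empty -> non-empty
       * edge. With one small request in flight at a time (the TCP_RR case),
       * that edge is hit for every single packet. */
      if (p == c)
          notify_backend();
      return 0;
  }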

But TCP_RR is the pathological test case for this. No matter how smart the data structure is, you still end up taking one exit per packet. In particular, with small packets, you've got multiple world switches to move a very small number of bytes (usually around 64).

There are ways to improve this (using things like adaptive polling) but this is still an area of active development. I don't follow Xen too closely any more but we've got quite a few new things in KVM that help with this and I would expect dramatic improvements in the short term future.


First, you should test on an unloaded domU, not anything in EC2:

- EC2 will dynamically adjust your CPU share as you try and use it, so you will _not_ get consistent results over any short period.

- EC2 is subject to other people's loads, which may be IO, CPU or network bound.

Going from there:

- Xen is slower any time you need dom0/domU coordination - it wouldn't surprise me to learn that there's some sort of coordination happening in accept() to tag the session through the upper dom0 firewalls.

- You don't describe what your backlog is on the listening socket, but you should make sure you're accepting as many as you can during your CPU share on EC2 -- your slice _will_ be interrupted at inopportune times (see the sketch below this list).
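
On the backlog point, roughly what I mean (the port and the 4096 are illustrative numbers): the kernel silently clamps the listen() backlog to net.core.somaxconn, so raise both, and then drain the queue with accept() as fast as you can.

  /* Sketch: make sure the accept queue is as deep as you think it is.
   * listen()'s backlog is silently capped at net.core.somaxconn, so bump
   * that sysctl too (the 4096 below is just an illustrative value). */
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      struct sockaddr_in addr;
      int fd, one = 1;

      fd = socket(AF_INET, SOCK_STREAM, 0);
      if (fd < 0) { perror("socket"); return 1; }

      setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);          /* illustrative port */

      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          perror("bind");
          return 1;
      }

      /* Ask for a deep accept queue; the kernel clamps this to somaxconn. */
      if (listen(fd, 4096) < 0) {
          perror("listen");
          return 1;
      }

      /* Drain as fast as possible -- every accept() frees a slot for the
       * next handshake to land in. */
      for (;;) {
          int c = accept(fd, NULL, NULL);
          if (c >= 0)
              close(c);
      }
  }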

Finally, EC2 is _lousy_ performance-wise, especially w/r/t disk IO - it doesn't sound like it, but if you're logging to disk after accept(), this could be the killer.

Tangentially -- you _might_ get better accept() performance if you turn ON syncookies, as then the handshake occurs basically at the kernel, and the accept() is only relevant _after_ the handshake is done. It's a bit hacky, but with _large_ numbers of connections, it can improve your performance a bit.
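
For reference, it's just the net.ipv4.tcp_syncookies sysctl. If you'd rather flip it from code than from the command line, a trivial sketch (needs root):

  /* Sketch: enable TCP syncookies, equivalent to
   * `sysctl -w net.ipv4.tcp_syncookies=1`. */
  #include <stdio.h>

  int main(void)
  {
      FILE *f = fopen("/proc/sys/net/ipv4/tcp_syncookies", "w");
      if (!f) {
          perror("tcp_syncookies");
          return 1;
      }
      fputs("1\n", f);
      fclose(f);
      return 0;
  }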


(I wrote the Serverfault question)

Thanks for the suggestions; here are some clarifications:

* I clarified what hardware I've tested in a comment (see below).

* I've run tests with my server ranging from 10 sec up to 10 minutes. They're consistently bad unfortunately.

* Interesting what you say about dom0/domU. I'm no Xen guru, but the culprit is probably something like that. I've been using a backlog of 1024 for the server tests (set both in Java land and sysctl.conf). The netperf runs are all defaults, both in terms of run time and backlog. I was actually trying to monitor the backlog somehow, but I'm not sure that's even possible in Linux? (See the sketch after this list.)

* The server isn't doing anything disk IO-bound so this shouldn't be the case.

* syncookies seems like a good idea, I will definitely try that along with a much bigger backlog and see if it makes any difference. I'll also see if netperf can be tweaked to provide a better, isolated test case.
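
On monitoring the backlog (the question in the list above): ss -ltn shows the accept-queue depth of a listening socket in its Recv-Q column and the limit in Send-Q. If you want it from inside the process, my understanding is that TCP_INFO on the listening socket exposes the same counters; treat the field meanings below as an assumption to verify rather than documented API.

  /* Sketch: read the accept-queue depth of a LISTEN socket via TCP_INFO.
   * Assumption to verify: on Linux, for a listening socket, tcpi_unacked
   * is the current accept-queue length and tcpi_sacked the configured
   * maximum; `ss -ltn` (Recv-Q / Send-Q) should agree. */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdio.h>
  #include <sys/socket.h>

  static void print_accept_queue(int listen_fd)
  {
      struct tcp_info info;
      socklen_t len = sizeof(info);

      if (getsockopt(listen_fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
          printf("accept queue: %u used / %u max\n",
                 info.tcpi_unacked, info.tcpi_sacked);
      else
          perror("getsockopt(TCP_INFO)");
  }

Call it with the same fd you passed to listen().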

Writing this off as Xen overhead would be such a shame; virtualization should not cause this much overhead. I'll continue investigating!


A nit re. syncookies: the TCP three-way handshake always occurs in the kernel. You can grab the kernel source; the relevant stuff is in net/ipv4/tcp_input.c. What syncookies may somewhat help you with is if you have a super-large number of half-open (SYNRCVD state) connections - and even then, the data structures for those are supposed to be efficient enough for this not to be a problem.

OTOH, what you will be trading off with syncookies is that they subtly violate the TCP standard, and this will especially be a problem with "server talks first" connections (like [E]SMTP, SSH): if your third ACK of the three-way handshake gets lost on the way from client to server, a canonical TCP implementation would have retransmitted the SYNACK from the server side. Except that the whole point of syncookies is not to keep the state on the server side, i.e. there is nothing that can retransmit the SYNACK at all.

HTTP, being by nature a "client talks first" protocol, hides this problem (the ACK for the SYNACK will be effectively retransmitted because it will be part of the data segment), but I thought it might be useful to point out that turning syncookies on by default is not the standard modus operandi.


He is not using EC2...


The author of the original article is, at least for some of his tests: “(on an 8-core EC2 instance, c1.xlarge running Xen)”


Yeah, true. Rereading it, he is not clear about whether he ran the software under Xen on the same box he ran it on raw, although it is implied. It is a very different question on EC2.


I've tried it on a variety of different servers/hardware using both my server+apachebench and netperf as benchmarking tools.

* On EC2 I've tried the c1.xlarge and the giant cc1.4xlarge. With cc1.4xlarge, I saw maybe a ~10% increase in accept() rate.

* Two separate, virtualized servers at the office.

* A private, virtualized server on Rackspace was briefly tested as well.

A compilation of netperf results is available at https://gist.github.com/985475


I doubt it'll reveal anything super useful, but running your tests on a single-tenant EC2 instance could add an extra datapoint. I suspect it'll only reveal that the problem _isn't_ due to EC2/Xen swapping your virtualized machine out. (assuming you haven't tried it already...)


The fact that the load is high on one CPU suggests to me that all interrupts from the NIC are going to just that one CPU (generally the default on Linux). An 8-core EC2 machine has lots of total CPU, but individual cores are not that fast.

Change the interrupt cpu affinity to split network interrupts over multiple cores. See:

  http://www.cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
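
In practice that means writing a CPU bitmask into /proc/irq/<N>/smp_affinity for the NIC's interrupt(s). A rough sketch; the IRQ number is hypothetical, look up the real one in /proc/interrupts first (needs root):

  /* Sketch: pin a NIC interrupt to CPUs 0-3 by writing a hex bitmask to
   * /proc/irq/<N>/smp_affinity. The IRQ number (24) is made up; find the
   * real one for the NIC in /proc/interrupts. */
  #include <stdio.h>

  int main(void)
  {
      const int irq = 24;                   /* hypothetical IRQ number */
      char path[64];
      FILE *f;

      snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
      f = fopen(path, "w");
      if (!f) {
          perror(path);
          return 1;
      }
      fputs("f\n", f);                      /* 0xf = CPUs 0-3 */
      fclose(f);
      return 0;
  }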


Doesn't Linux schedule interrupts on CPU0 by default?

I tried enabling RPS/RFS, which to my understanding does this: load-balance the interrupt handling among multiple cores. With this enabled, I saw little to no difference in connection rate. But then again I'm no guru, so I might as well double check this.

Updated my little "action plan" in the original Serverfault question with this info.


I think netfront still uses a single transmit queue, so you'll only ever see one VCPU CPU-bound for intensive I/O.


Won't irqbalance do this?


Unfortunately irqbalance has some limitations on some multi-core CPUs. On my Core 2 Duo 6400 it doesn't do anything. I also found this in the man page:

This raises a few interesting cases in which the behavior of irqbalance may be non-intuitive. Most notably, cases in which a system has only one cache domain. Nominally these systems are only single cpu environments, but can also be found in multi-core environments in which the cores share an L2 cache. In these situations irqbalance will exit immediately, since there is no work that irqbalance can do which will improve interrupt handling performance.


The clue seems to be the very high CPU load on one CPU under Xen. Need to do some digging to see what it is. I would look at the interrupts under Xen and see if they are not being balanced. The config of the network interfaces and which drivers are being used are key.

There is really not enough information in the post to diagnose, although someone might recognise the situation.


You should check whether you have listen queue overflows and syncache bucket overflows in netstat -s.

For FreeBSD you should also check if there are packet drops in sysctl net.inet.ip.intr_queue_drops


Thanks, noted. Updated the "action plan" in my question at Serverfault.


Carl, I can confirm that this issue exists. We have seen the exact same performance characteristics on EC2, regardless of instance type. The same performance characteristics apply even if the instance is a type which would be singly-hosted on the host hardware. And in spite of days of effort, nothing I could do seemed to "tune" it out.

Our most amusing result was watching a t1.micro beat the pants off a cc.4xlarge.


It would be interesting to compare accept() performance between different hypervisors to see what the performance hit is with KVM, VMware, etc.


Just a hypothesis: accept() triggers a context switch. Context switches involve MMU operations, and these can be slower under Xen if they are being emulated in software (type 2 hypervisor). This is much slower than a native OS doing a context switch, where the MMU operations are executed at the CPU level.


Real hardware (and 32-bit paravirtualized Xen) doesn't require any MMU changes during a syscall. This is because on real hardware, there is no VMM (Virtual Machine Monitor) to protect, and on 32-bit paravirt, Xen can use x86 segments to protect the monitor.

For better or for worse (most would say better), segment limit checking is disabled in 64-bit mode, so 64-bit paravirtual Xen has to use the MMU to protect its monitor. Basically, both the kernel and userspace actually run in ring 3, but on different page tables. This means that expensive MMU updates (and TLB flushes) are required both on the way into and out of the kernel for every syscall.
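
A cheap way to see how much of that per-syscall cost is actually being paid: time a trivial syscall in a loop on the Xen guest and on comparable bare metal and compare. A rough sketch (it calls syscall(SYS_getpid) directly because some libc versions cache getpid() in userspace):

  /* Rough microbenchmark: average cost of a trivial syscall. Compare the
   * number on bare metal vs. inside the 64-bit PV guest. */
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <sys/time.h>
  #include <unistd.h>

  int main(void)
  {
      const long iters = 1000000;
      struct timeval start, end;
      double usec;
      long i;

      gettimeofday(&start, NULL);
      for (i = 0; i < iters; i++)
          syscall(SYS_getpid);
      gettimeofday(&end, NULL);

      usec = (end.tv_sec - start.tv_sec) * 1e6
           + (end.tv_usec - start.tv_usec);
      printf("%.0f ns per syscall\n", usec * 1000.0 / iters);
      return 0;
  }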


I'd be surprised if it was 64-bit. 64-bit PV Xen is awfully slow because of exactly what you reference. HVM would be quite a bit better.


All the EC2 instance types mentioned in the article are 64 bit PV.


When the kernel takes control, the MMU gets involved to update the memory mapping.



