High load on a single core suggests to me that all interrupts from the NIC are going to just that one CPU (generally the default on Linux). An 8-core EC2 machine has plenty of total CPU, but the individual cores are not that fast.
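One way to check that suspicion is to look at the per-CPU counters in /proc/interrupts. Here is a rough sketch in Python that pulls out the rows for one device; the function name and the device name "eth0" are my own placeholders, and the exact description column differs between drivers and hypervisors, so treat the matching as an assumption.

```python
def nic_interrupt_counts(text, device):
    """Return {irq_label: [count_cpu0, count_cpu1, ...]} for rows mentioning device.

    `text` is the content of /proc/interrupts. The header row lists the CPUs;
    every IRQ row has one counter per CPU followed by a free-form description.
    """
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) <= ncpu:           # rows such as "ERR: 0" have no description
            continue
        counts, description = fields[:ncpu], fields[ncpu:]
        if not all(c.isdigit() for c in counts):
            continue
        if any(device in word for word in description):
            result[label.strip()] = [int(c) for c in counts]
    return result
```

Calling it as nic_interrupt_counts(open("/proc/interrupts").read(), "eth0") on the box in question should show whether one column holds nearly all of the counts.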
Change the interrupt cpu affinity to split network interrupts over multiple cores. See:
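For what it's worth, the affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity as root. A minimal sketch of building that mask follows; the helper names are mine, the IRQ number comes from /proc/interrupts, and irqbalance may overwrite the value again if it is running.

```python
def affinity_mask(cpus):
    """Build the hex bitmask the kernel expects: bit N set means CPU N is allowed.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    most significant group first.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.append(digits[-8:])        # peel off 8 hex digits (32 bits) at a time
        digits = digits[:-8]
    return ",".join(reversed(groups))


def set_irq_affinity(irq, cpus):
    """Write the mask for one IRQ. Needs root; not every IRQ can be moved."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(affinity_mask(cpus))
```

With a multi-queue NIC you would typically give each queue's IRQ its own core, for example set_irq_affinity(27, [1]) for one queue and set_irq_affinity(28, [2]) for the next (IRQ numbers made up here).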
Doesn't Linux schedule interrupts on CPU0 by default?
I tried enabling RPS/RFS which, to my understanding, does exactly this: load-balances the interrupt handling among multiple cores. With it enabled, I saw little to no difference in connection rate.
But then again, I'm no guru, so I might as well double-check this.
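For anyone wanting to double-check an RPS/RFS setup: as far as I know these are the files involved. Note that RPS spreads the packet processing that happens after the hardware interrupt, so the interrupt itself still lands on one core. This sketch only builds the list of writes; the function names, "eth0", and the 32768 flow entries are assumptions for illustration, not values from this thread.

```python
def rps_settings(device, rx_queues, cpus, sock_flow_entries=32768):
    """Return {path: value} for enabling RPS and RFS on every receive queue."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu                  # same bitmask format as smp_affinity
    settings = {
        # global RFS flow table size
        "/proc/sys/net/core/rps_sock_flow_entries": str(sock_flow_entries),
    }
    per_queue = sock_flow_entries // rx_queues
    for queue in range(rx_queues):
        base = f"/sys/class/net/{device}/queues/rx-{queue}"
        settings[f"{base}/rps_cpus"] = format(mask, "x")      # RPS: CPUs allowed to process packets
        settings[f"{base}/rps_flow_cnt"] = str(per_queue)     # RFS: per-queue share of the flow table
    return settings


def apply_settings(settings):
    """Write each value to its file. Needs root."""
    for path, value in settings.items():
        with open(path, "w") as f:
            f.write(value)
```

Reading the same files back after applying is a quick way to confirm the settings actually stuck.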
Updated my little "action plan" in the original Serverfault question with this info.
Unfortunately, irqbalance has limitations on some multi-core CPUs. On my Core 2 Duo 6400 it does nothing at all. I also found this in the man page:
This raises a few interesting cases in which the behavior of irqbalance may be non-intuitive. Most notably, cases in which a system has only one cache domain. Nominally these systems are only single cpu environments, but can also be found in multi-core environments in which the cores share an L2 cache. In these situations irqbalance will exit immediately, since there is no work that irqbalance can do which will improve interrupt handling performance.