This is great. I've come across a few articles lately about caches and their effects on performance, and it's really interesting stuff. If you liked this one, you might also enjoy this article[1] by Dan Luu on improving performance by misaligning data.
At that point, the performance of his benchmark depends on DRAM access. In modern systems, DRAM is very parallel -- a single access is glacially slow, but you can have a lot of accesses in flight. It would appear that a single thread of his benchmark cannot provide enough access parallelism to saturate his DRAM interfaces.
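To make the "accesses in flight" point concrete, here is a rough sketch in C. It is not the benchmark from the article. It contrasts one dependent pointer chase (each load needs the previous result, so at most one miss is outstanding) with four independent chases interleaved in the same loop (up to four misses outstanding). The names build_cycle, chase_one and chase_four are made up for illustration, and the choice of four chains is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG so the sketch does not depend on RAND_MAX.
   The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill next[] with a single-cycle permutation (Sattolo's algorithm):
   following next[i] repeatedly visits every slot before returning. */
static void build_cycle(uint32_t *next, size_t n, uint32_t seed) {
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) next[i] = (uint32_t)i;
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xorshift32(&state) % i;   /* j in [0, i), never i */
        uint32_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* One dependent chain: the address of each load comes from the
   previous load, so the misses cannot overlap. */
static uint32_t chase_one(const uint32_t *next, size_t steps) {
    uint32_t p = 0;
    for (size_t i = 0; i < steps; i++) p = next[p];
    return p;
}

/* Four chains with different starting slots: within an iteration the
   four loads do not depend on each other, so they can overlap. */
static uint32_t chase_four(const uint32_t *next, size_t n, size_t steps) {
    uint32_t a = 0;
    uint32_t b = (uint32_t)(n / 4);
    uint32_t c = (uint32_t)(n / 2);
    uint32_t d = (uint32_t)(3 * n / 4);
    for (size_t i = 0; i < steps; i++) {
        a = next[a];
        b = next[b];
        c = next[c];
        d = next[d];
    }
    return a ^ b ^ c ^ d;   /* combine so no chain is optimized away */
}
```

To try it, allocate an array much larger than your last-level cache, call build_cycle once, then time chase_one and chase_four with the same step count (using clock_gettime, say) and print the return values so the compiler keeps the loops. chase_four issues four times as many loads; if DRAM really can overlap them, I'd expect it to take well under four times as long, but I haven't measured that here and it will vary by machine.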
I'm not sure if it's happening here, but for what it's worth, it's not true that L2 can't be shared. On cores with hyperthreading, the L2 is shared between the hardware threads running on that core.
[1]: http://danluu.com/3c-conflict/