IMHO the problem is not the performance of the low-level crypto code itself, but the way it interfaces with the rest of the code. That's where you cross a myriad of locks (and atomic ops in newer versions), and those cost a lot as soon as you want to use more than one CPU core :-/
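To illustrate the kind of cost I mean, here's a minimal sketch in C. It is not taken from any crypto library: the names (`run`, `shared_counter`, `slots`, `ITERS`, `MAX_THREADS`) are made up for the example, and the shared counter just stands in for something like a reference count that every thread has to touch. One mode has all threads hammer a single atomic, the other gives each thread its own cache-line-sized slot.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64        /* hypothetical upper bound for the sketch */
#define ITERS       1000000u  /* increments performed by each thread */

/* One counter shared by all threads: every increment is an atomic
 * read-modify-write on the same cache line. */
static atomic_uint_fast64_t shared_counter;

/* One counter per thread, aligned to 64 bytes (an assumed cache line
 * size) so that threads do not share a line. volatile only keeps the
 * compiler from collapsing the loop into a single addition. */
struct slot { _Alignas(64) volatile uint64_t count; };
static struct slot slots[MAX_THREADS];

static void *shared_worker(void *arg)
{
    (void)arg;
    for (unsigned i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1);
    return NULL;
}

static void *private_worker(void *arg)
{
    struct slot *s = arg;
    for (unsigned i = 0; i < ITERS; i++)
        s->count++;
    return NULL;
}

/* Runs nthreads workers and returns the total number of increments.
 * use_shared != 0 selects the single shared atomic counter,
 * use_shared == 0 selects the per-thread slots. */
static uint64_t run(int nthreads, int use_shared)
{
    pthread_t tid[MAX_THREADS];
    uint64_t total = 0;

    if (nthreads < 1)
        nthreads = 1;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    atomic_store(&shared_counter, 0);
    for (int i = 0; i < nthreads; i++) {
        slots[i].count = 0;
        if (pthread_create(&tid[i], NULL,
                           use_shared ? shared_worker : private_worker,
                           &slots[i]) != 0) {
            nthreads = i;  /* only join the threads that actually started */
            break;
        }
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += slots[i].count;
    }
    return use_shared ? atomic_load(&shared_counter) : total;
}
```

Both modes do the same amount of counting, so timing `run(n, 1)` against `run(n, 0)` for growing `n` (for example with `clock_gettime(CLOCK_MONOTONIC, ...)` around each call) should give a rough idea of how much a single contended atomic costs once more than one core is involved. The actual numbers depend heavily on the machine, so I'm not quoting any here.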