AFAICT, it's a result of their internal architectural decisions vs. the structure of some of these crypto operations. The simplest example is that AMD offers a hardware bit rotate instruction and Nvidia doesn't. (The newer compute_35 devices have a funnel shifter that gets halfway there, but it requires using a 64-bit op, which is still slower.)
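To make the rotate point concrete, here's a rough sketch in plain C (not actual GPU code, and the function names are just mine for illustration). The first function is what you end up doing when there is no rotate instruction: two shifts and an OR. The second mimics the idea behind a funnel shift by putting the word into both halves of a 64-bit value and shifting that; it is an emulation of the concept, not the hardware instruction itself.

```c
#include <stdint.h>

/* Rotate left without a hardware rotate: two shifts plus an OR.
   The count is masked so n == 0 never triggers a 32-bit shift by 32. */
static uint32_t rotl32_shift_or(uint32_t x, unsigned n) {
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Funnel-shift style rotate (emulated): place x in both halves of a
   64-bit value, shift the pair left, and keep the high 32 bits. */
static uint32_t rotl32_funnel(uint32_t x, unsigned n) {
    n &= 31u;
    uint64_t pair = ((uint64_t)x << 32) | (uint64_t)x;
    return (uint32_t)((pair << n) >> 32);
}
```

Both return the same value for every input; the difference is only in how many and how wide the underlying operations are, which is where a dedicated rotate instruction wins.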
But beyond that, for scrypt, AMD has an internal architecture that encourages using internal 4-wide vector ops (the key being that threads in a stream can diverge and the scheduler handles them, but the vector ops are strictly locked). This seems to give them an advantage in several "stupidly parallel" workloads that consist of basic operations run massively in parallel, whereas Nvidia's architecture regains ground when the control flow etc. becomes a bit more complicated. But calculating 64k SHA-2 or scrypt hashes in parallel doesn't stress the thread scheduler or divergence handling at all. Also, scrypt internally does well when you parallelize it into groups of 4 uint32s - that's actually the key optimization I added to my version of the miner. Which, again, maps quite nicely to AMD's preferred optimization path.
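Here's a minimal sketch of what "groups of 4 uint32s" can look like, again in plain C. It is one plausible reading of the idea, not the miner's actual code. scrypt's inner mixing function is Salsa20/8, whose quarter-round is just add, rotate and XOR; if you hold the state as four 4-lane vectors, one call processes four quarter-rounds in lockstep. The `vec4` type and all helper names are hypothetical stand-ins for a GPU's native 4-wide integer vector.

```c
#include <stdint.h>

/* Hypothetical 4-wide vector of 32-bit lanes (stand-in for a GPU uint4). */
typedef struct { uint32_t lane[4]; } vec4;

/* Lane-wise addition modulo 2^32. */
static vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Lane-wise XOR. */
static vec4 vec4_xor(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] ^ b.lane[i];
    return r;
}

/* Lane-wise rotate left; n must be in 1..31. */
static vec4 vec4_rotl(vec4 a, unsigned n) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (a.lane[i] << n) | (a.lane[i] >> (32u - n));
    return r;
}

/* Salsa20 quarter-round applied to four independent lanes at once.
   Every lane executes the same add/rotate/XOR sequence, so there is
   no divergence between them. */
static void salsa_quarter_round4(vec4 *a, vec4 *b, vec4 *c, vec4 *d) {
    *b = vec4_xor(*b, vec4_rotl(vec4_add(*a, *d), 7));
    *c = vec4_xor(*c, vec4_rotl(vec4_add(*b, *a), 9));
    *d = vec4_xor(*d, vec4_rotl(vec4_add(*c, *b), 13));
    *a = vec4_xor(*a, vec4_rotl(vec4_add(*d, *c), 18));
}
```

Note that every step is a rotate sandwiched between an add and an XOR, with no branches anywhere, which is why this kind of code fits strictly locked vector ops so well.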
Graphics and HPC are mostly floating point, but SHA and scrypt are all integer, so it's kind of an accident that any GPU does well on them. AMD GPUs have a rotate instruction that's worth a lot in SHA and tend to have better integer performance in general.
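For a sense of how much rotating SHA does, here are the four SHA-256 mixing functions as a small C sketch (function names are mine; the rotation and shift amounts are the ones from the standard). Each of the 64 rounds evaluates both big sigmas, which is six rotates, and each expanded message word uses both small sigmas, which is four more.

```c
#include <stdint.h>

/* Rotate right; n must be in 1..31. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* Used once per round on working variable a: three rotates. */
static uint32_t sha256_big_sigma0(uint32_t x) {
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

/* Used once per round on working variable e: three rotates. */
static uint32_t sha256_big_sigma1(uint32_t x) {
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

/* Message schedule: two rotates and one plain shift. */
static uint32_t sha256_small_sigma0(uint32_t x) {
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

/* Message schedule: two rotates and one plain shift. */
static uint32_t sha256_small_sigma1(uint32_t x) {
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}
```

Without a rotate instruction each `rotr32` costs two shifts and an OR, so the per-round overhead adds up quickly.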