Basic Chacha C implementations do not get auto-vectorized into ultra-efficient code. The most efficient implementations use intrinsics or assembly and process 4 (SSE2/AVX/NEON) or 8 (AVX2) Chacha20 blocks at once. This works because the layout of variables and operations was designed for efficient SIMD use, and because the blocks are independent of each other. (Shay's Chacha20 implementation is also not the fastest!)
Basic GNU C implementations don't get auto-vectorised, full stop. But with a little effort Chacha20 can be made to vectorise. The implementation here is vectorised by GNU C:
If "ultra-efficient code" means what a programmer highly skilled in some amd64 implementation (Intel Core 2, AMD Bulldozer, ...) could produce for that implementation, then yes, I doubt GNU C produces it. But the odds are that GNU C's output runs faster than that guru's code on other amd64 implementations.
That Salsa implementation is not being vectorized? Salsa also requires some values to be shuffled around to work in SSE registers; djb made a bit of a boo-boo when designing it. Chacha fixes that, so its SIMD implementations are a bit more straightforward.