> Only the non portable AVX2 based libraries are faster
The paper in OP reports roughly twice the speed of the authors' previous AVX2 code. I think it is superior to TurboBase64 here (though with more stringent hardware requirements).
Even 5% sounds sketchy indeed. However, AVX2 (aka Haswell New Instructions) hasn't quite been on every Intel processor this decade: it only arrived with the 4th generation (Haswell, late 2013), and only in Core-branded processors.
I think the 40 GB/s figure is for data that doesn't fit in L1 cache. If you check the results graph, memcpy is ahead by a wide margin on data that fits in L1; beyond that, they are nearly head to head.
It is faster than SSE on Intel/AMD and NEON SIMD on ARM; see the benchmarks at Turbo Base64: https://github.com/powturbo/TurboBase64