
Nothing terribly complex.

For the typical case of 2D uncompressed textures with some kind of trilinear/anisotropic filtering enabled, you are basically just calculating 8 addresses based on texture coordinates, clamping, and mipmap levels, then doing a gathered load (and CPUs hate doing gathered loads). Remember to optimise for the case where each group of 4 addresses is typically, but not always, right next to each other in memory.

Then you use trilinear interpolation to filter your 8 raw texels into a single filtered texel. With the exception of the gathered load, none of these operations is actually that expensive on a CPU, but shaders do a lot of these texture reads, and it's really cheap and fast to implement it all in dedicated hardware.
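The filtering step itself is just three nested lerps. A sketch for a single channel, assuming `t[0..3]` come from mip level N and `t[4..7]` from level N+1, with `fx`/`fy` the bilinear fractions and `fm` the fraction between mip levels (names are illustrative):

```c
/* Linear interpolation between a and b by fraction f in [0,1]. */
static float lerpf(float a, float b, float f) {
    return a + (b - a) * f;
}

/* Trilinear filter: bilinear within each mip level,
 * then one final lerp between the two levels. */
float trilinear(const float t[8], float fx, float fy, float fm) {
    float lo = lerpf(lerpf(t[0], t[1], fx), lerpf(t[2], t[3], fx), fy);
    float hi = lerpf(lerpf(t[4], t[5], fx), lerpf(t[6], t[7], fx), fy);
    return lerpf(lo, hi, fm);
}
```

Seven multiply-adds per channel — trivial per sample, but multiplied across every texture read in a shader it adds up, which is the point of baking it into fixed-function hardware.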

You can also put these texture decoders closer to the memory, so the full 8 texels don't need to travel all the way to the shader core, just the final filtered texel. And since each texture decoder serves many threads of execution, you have chances to share resources between similar/identical texture sample operations.

And while the texture sampler is doing its thing, the shader core is free to do other operations (typically for another thread of execution). It's not that a CPU can't decode textures, it's just that CPU cores without dedicated texture decoding hardware can't compete with hardware that has it.



