In most MCUs there is no FPU, so all floating-point computation is emulated in software, which makes it really slow. But yes, simple integer SIMD improves performance a lot!
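To make the integer point concrete, here is a minimal sketch of the kind of inner loop a quantized model runs instead of float math. It is plain portable C, not taken from any particular library, and the name `dot_q7` is just something I made up for illustration. It assumes weights and activations have already been quantized to int8.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: dot product of two int8 vectors.
 * The products are accumulated in 32 bits so the sum does not overflow
 * for realistic vector lengths. No floating point is involved, so no
 * software float emulation is needed on an FPU-less core. */
int32_t dot_q7(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

On cores that have integer SIMD/DSP instructions, this is the loop that gets several lanes processed per instruction, either by the compiler or by hand-written intrinsics.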

The main limitation is often not processing time but available RAM: some model architectures need to keep multiple layers in RAM at once, or have very big layers, and you hit the hard RAM limit pretty quickly.
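A rough way to see where the RAM goes, as a sketch under simplifying assumptions: for a purely sequential model (no skip connections), the input and output buffers of a layer must both be alive while that layer runs, so the peak activation memory is about the largest input-plus-output pair. The function name and the example sizes below are hypothetical, and weights, scratch buffers, and the stack are ignored.

```c
#include <stddef.h>

/* Hypothetical estimate of peak activation RAM for a sequential model.
 * tensor_bytes[i] is the size of the i-th activation tensor, so layer i
 * reads tensor i and writes tensor i + 1. Returns the largest sum of an
 * input/output pair, or 0 if there are fewer than two tensors. */
size_t peak_activation_bytes(const size_t *tensor_bytes, size_t n_tensors)
{
    size_t peak = 0;
    for (size_t i = 0; i + 1 < n_tensors; i++) {
        size_t live = tensor_bytes[i] + tensor_bytes[i + 1];
        if (live > peak) {
            peak = live;
        }
    }
    return peak;
}
```

For example, with made-up tensor sizes of 3072, 16384, 4096 and 10 bytes, the estimate is 16384 + 4096 = 20480 bytes, dominated by one big intermediate layer. Models with branches or skip connections keep more tensors alive at once, so the real number is higher.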

As for training on an MCU, it's possible, but only for simple needs and with a specially designed model architecture; again, RAM is the limit.