Basic question: Why is this faster than running Intel Linux apps in an emulated Intel Linux VM? Because Rosetta is faster than QEMU, and you only need to emulate one application rather than the entire kernel?
Emulating at the x86 kernel level means you lose the hardware-assisted virtualization support you'd get with an ARM kernel, and emulating an MMU is slow (among other things).
Technically this would be replacing QEMU user-mode emulation, which isn't fast, in large part because keeping QEMU portable to all host architectures has always mattered more than raw speed.
A lot of the performance gains in Rosetta 2 come from load-time translation of executables and libraries. When you run an x86 binary on the host Mac, Rosetta jumps in, does a one-time translation to something that can run natively on the Mx processor, and then (probably) caches the result on the filesystem; the next time you run it, it runs the cached translation. If you're running a VM without support for this inside the guest, you're just running a big Intel blob on the Mx processor and doing real-time translation, which is really slow. (Worst case, you take an interrupt for every instruction just to translate it, although I assume it must be better than that.) Either way you're constantly interrupting and context switching between the target code you're trying to run and the translation environment; context switches are expensive in and of themselves, and they also defeat a lot of caching.
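To make that concrete, here's a minimal sketch of the translate-once-and-cache idea in Python. This is toy code, not Rosetta's actual mechanism: the cache directory, the translate() stand-in, and the content-hash key are all assumptions made up for illustration. The point is only that the expensive translation step happens once per binary instead of once per executed instruction.

    import hashlib
    from pathlib import Path

    # Hypothetical on-disk cache of translated binaries, keyed by content hash.
    CACHE_DIR = Path("/tmp/toy-translation-cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def translate(guest_binary: bytes) -> bytes:
        """Stand-in for the expensive x86_64 -> arm64 translation pass."""
        # A real translator rewrites instruction sequences; here we just
        # tag the bytes so the example runs end to end.
        return b"TRANSLATED:" + guest_binary

    def load(path: Path) -> bytes:
        guest = path.read_bytes()
        key = hashlib.sha256(guest).hexdigest()
        cached = CACHE_DIR / key
        if cached.exists():              # later runs: reuse the cached translation
            return cached.read_bytes()
        native = translate(guest)        # first run: pay the translation cost once
        cached.write_bytes(native)
        return native

    if __name__ == "__main__":
        demo = Path("/tmp/demo-guest-binary")
        # A few real x86_64 bytes: push rbp; mov rbp, rsp; ret
        demo.write_bytes(b"\x55\x48\x89\xe5\xc3")
        print(load(demo)[:20])

Run it twice and the second call skips translate() entirely, which is the same reason a cached ahead-of-time translation beats per-instruction, real-time translation inside a VM.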
I think NAS is a bit higher level than what the OP had in mind - NAS isn't usually used to search for fundamental operations like self-attention or convolution. But I guess you could probably adapt it quite easily.
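For what it's worth, a typical NAS loop looks roughly like the sketch below: the search space is built out of existing primitives and the search only varies how they're stacked, it doesn't invent a new primitive like self-attention itself. The PRIMITIVES list and the evaluate() function are placeholders (a real run would train each candidate, or a cheap proxy for it, and report validation accuracy).

    import random

    # Fixed menu of known ops; NAS recombines these, it doesn't discover new ones.
    PRIMITIVES = ["conv3x3", "conv5x5", "self_attention", "max_pool", "skip"]

    def evaluate(architecture):
        """Placeholder score; a real NAS loop trains the candidate model here."""
        return random.random()

    def random_search(depth=4, trials=20, seed=0):
        random.seed(seed)
        best_arch, best_score = None, float("-inf")
        for _ in range(trials):
            arch = [random.choice(PRIMITIVES) for _ in range(depth)]
            score = evaluate(arch)
            if score > best_score:
                best_arch, best_score = arch, score
        return best_arch, best_score

    if __name__ == "__main__":
        arch, score = random_search()
        print("best stack of known ops:", " -> ".join(arch), f"(score {score:.3f})")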
In the same theme, there’s a recent documentary called “The Weight of Gold”, narrated by Michael Phelps and featuring other Olympic champions. Really poignant stories.
1. I wonder how many times the test set can be used on "incremental changes in future versions of the model" before losing statistical validity.
2. This article describes their process, but not the FDA's process. Are there specific regulatory requirements for ML models beyond their four types of reports?
1) AFAIK there are no hard-and-fast rules for this. I think it would have to be the manufacturer's judgment call. Good point though: with enough time you may end up just overfitting to the test set.
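Here's a rough illustration of the reuse problem (a toy simulation with made-up numbers, not anything about the actual device): every "model version" below is pure noise with 50% true accuracy, but if you keep each change whenever the fixed test set says it helped, the measured test accuracy drifts upward even though nothing real improved.

    import numpy as np

    rng = np.random.default_rng(0)
    n_test = 200                                   # one fixed labeled test set
    labels = rng.integers(0, 2, n_test)

    best_preds = rng.integers(0, 2, n_test)        # "version 1": random guessing
    best_acc = np.mean(best_preds == labels)

    for version in range(2, 52):                   # 50 "incremental changes"
        candidate = best_preds.copy()
        flip = rng.integers(0, n_test, 5)          # tweak a handful of predictions
        candidate[flip] = rng.integers(0, 2, 5)
        acc = np.mean(candidate == labels)
        if acc >= best_acc:                        # accept if the test set says it helped
            best_preds, best_acc = candidate, acc

    print(f"true skill: 50%, measured test accuracy after reuse: {best_acc:.1%}")

The measured number climbs well above 50% purely from selection on the same test set, which is the statistical-validity worry in a nutshell.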
This article is about optimization (finding good parameters), not the approximation power of neural networks (which is well-known through the universal approximation theorem).
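To make the distinction concrete, here's a toy sketch (numpy, hand-rolled gradient descent, my own example rather than anything from the article): a two-unit ReLU network can represent |x| exactly with w = [1, -1], v = [1, 1], which is the approximation/existence side; whether gradient descent from a random init actually finds those parameters is the separate optimization question.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(256, 1))
    y = np.abs(x)                       # target: f(x) = |x| = relu(x) + relu(-x)

    # Tiny one-hidden-layer ReLU net (no biases needed for this target).
    w = rng.normal(size=(1, 2))         # input -> hidden weights
    v = rng.normal(size=(2, 1))         # hidden -> output weights

    lr = 0.1
    for step in range(2000):
        h = np.maximum(x @ w, 0.0)      # hidden activations
        pred = h @ v
        err = pred - y
        loss = np.mean(err ** 2)
        dpred = 2 * err / len(x)        # dL/dpred for mean squared error
        grad_v = h.T @ dpred
        grad_h = (dpred @ v.T) * (x @ w > 0)   # ReLU mask
        grad_w = x.T @ grad_h
        v -= lr * grad_v
        w -= lr * grad_w

    print(f"final loss: {loss:.5f}")
    print("learned w:", w.ravel(), "v:", v.ravel())

Depending on the random init, the optimizer may converge near the exact solution or stall (e.g. if both hidden units start on the same side), even though the universal approximation theorem guarantees the good parameters exist. That gap is what "optimization, not approximation power" refers to.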