The mathematics of BNNs is sound. The Shannon entropy of a word is really small (I vaguely remember ~2 bits). Also, all neural networks are ridiculously over-provisioned.
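For what it's worth, word-level Shannon entropy is just -Σ p(w) log2 p(w) over word frequencies. A toy sketch (tiny made-up corpus, so the number means nothing; real estimates need a large corpus and smoothing):

```python
import math
from collections import Counter

# Toy corpus -- purely illustrative, not a real entropy estimate.
text = "the cat sat on the mat the cat ran"
words = text.split()
counts = Counter(words)
total = len(words)

# Shannon entropy in bits per word: H = -sum p(w) * log2(p(w))
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy, 3))
```

The more repetitive the text, the lower this number gets, which is the intuition behind "a word carries very few bits."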
I worked on this about 7 years ago, trying to efficiently binarize CNNs from existing models. The difficult part was getting training to run without the losses going too high. I think vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math (fp16/int8).
Just to be clear, it's all theoretically possible. There are already BNN versions of YOLO and other CNNs. No reason why transformers wouldn't work for that, or audio. It just might be harder to get them to train well enough.
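For reference, the classic trick from that line of work (XNOR-Net style, which the YOLO BNNs build on) is to approximate each real-valued weight tensor by a single scale times its signs, with the scale α = mean(|W|) being the closed-form least-squares choice. A toy sketch in plain Python:

```python
# XNOR-Net-style weight binarization: W ~= alpha * sign(W),
# where alpha = mean(|W|) minimizes the L2 reconstruction error.

def binarize(weights):
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def reconstruct(alpha, signs):
    # The binarized approximation of the original weights.
    return [alpha * s for s in signs]

W = [0.4, -0.2, 0.9, -0.7]
alpha, signs = binarize(W)
print(alpha, signs)  # alpha = mean(|W|) = 0.55
```

The hard part I mentioned isn't this forward approximation, it's the backward pass: the sign function has zero gradient almost everywhere, so training relies on tricks like the straight-through estimator, and that's where losses tend to blow up.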
Speech to text, however, is super interesting. You just gave me an idea! I'm gonna go run some experiments :D