Transcribe it locally using whisper and output tokens/sec?
Yeah, totally easier than `len(transcribe(a))/len(a)`
The tokens/second values can then be used as ground-truth labels for an FFT -> small neural net model.
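Rough sketch of the labelling step, in case it helps. Note that `len(a)` only gives a rate if it is the clip duration in seconds, so this reads the duration from the WAV header instead. `transcribe` here is a hypothetical callable (path in, list of tokens out), for example a thin wrapper around a local Whisper model. It is not a real library function, and the helper names are made up for illustration.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def tokens_per_second(path, transcribe):
    """Label one clip with its speech rate in tokens per second.

    `transcribe` is an assumed callable: path -> list of tokens.
    Swap in whatever local transcription wrapper you actually use.
    """
    duration = wav_duration_seconds(path)
    if duration <= 0:
        raise ValueError("empty audio clip")
    return len(transcribe(path)) / duration
```

Each clip's value would then be paired with that clip's FFT features as the regression target for the small net.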