WER is heavily dataset-dependent; you can’t really compare it across different test sets unless they’re almost identical (acoustic environment, microphone quality, speaker profiles, linguistic domain, etc.).

WER for the exact same model can vary wildly between datasets, so it’s really only useful for comparing models against each other on a single test set.
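
To make that concrete, here is a rough sketch of how WER is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, pooled over one test set. The function names and the two toy "test sets" are made up for illustration and are not from any real benchmark or library.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over one test set: total errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("test set contains no reference words")
    return errors / words


# Invented toy data: (reference, hypothesis) pairs for the "same model"
# on an easy set and on a harder set. Purely hypothetical transcripts.
clean_set = [
    ("turn on the lights", "turn on the lights"),
    ("what time is it", "what time is it"),
]
noisy_set = [
    ("turn on the lights", "turn on the light"),
    ("what time is it", "what time is"),
]

print(f"clean: {corpus_wer(clean_set):.2f}  noisy: {corpus_wer(noisy_set):.2f}")
```

The number you get depends entirely on which set of references you score against, which is why two WER figures only mean something side by side when they come from the same test set.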