OK, I understand. That is indeed a different problem. I am not sure BERT would help much more there either, though, unless you have quite a large training corpus. The simplest/cheapest approach might be some kind of transcription normalization. As you say, the ambiguity comes from transcribing these sounds into the English alphabet. Is there not a single source language/alphabet? Also, I am not convinced that fastText/GloVe would fail to pick up the semantic similarity despite the differing transcriptions, as long as your training corpus is large enough. I’d experiment with the settings here (for fastText: CBOW vs. skipgram, min/max character n-gram lengths, etc.).
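
To make the transcription normalization idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption on my part: the `RULES` table and the `normalize` function are hypothetical names, and the rewrite rules are invented for illustration, since the right ones depend entirely on your source language(s) and on how people actually spell things in your data.

```python
import re
import unicodedata

# Hypothetical rewrite rules, applied in order. These are illustrative only;
# real rules have to come from looking at the spelling variants in your corpus.
RULES = [
    (r"ph", "f"),        # "ph" and "f" often transcribe the same sound
    (r"ee", "i"),        # long-vowel spellings mapped to a single letter
    (r"oo", "u"),
    (r"(.)\1+", r"\1"),  # collapse any remaining repeated letters
]


def normalize(token: str) -> str:
    """Map spelling variants of a transcribed word to one canonical form."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", token)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Apply the rewrite rules one after another.
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

With these made-up rules, `normalize("Mohammed")` and `normalize("Mohamed")` come out the same, as do `normalize("Shookriya")` and `normalize("Shukriya")`. You would run this over the corpus before training embeddings, so that the variants share one vector instead of splitting their counts. Rules this crude will also merge some words that should stay distinct, so I'd check the collisions on a sample first.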