There seems to be a small error in the reported results: in most rows the better-performing model is highlighted, but in the row reporting results on the FLEURS test, the losing model (Gemini, which scored 7.6% while GPT4-v scored 17.6%) is highlighted instead.