Actually, GPT-4 is at least marginally capable in almost all languages, and given how nascent LLMs are, I expect them to become proficient in more and more of them. The thing is that an LLM is fundamentally simpler to train than prior NLP translation models. For example, I can carry on a reasonably good conversation with ChatGPT in the ancient dead language of Pali, have it translated into cuneiform, then Chinese, then Esperanto, and back to English in a new session, and the translation is almost flawless compared to the original. (I have in fact done this.)
If it weren’t for this I would have agreed with you last year. But I see fairly clearly that the way to preserve native tongues is to take a base LLM like Llama 2 and fine-tune it with your native language. As people are invested in their native languages, this doesn’t seem unreasonable. As things develop and sharing LLM LoRA adapters becomes easier, I think we will find a universal translator for all spoken languages forthcoming.
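The reason sharing LoRA adapters is cheap, for what it's worth, is that a fine-tune is distributed as two small low-rank matrices rather than a full copy of the model's weights. A minimal pure-Python sketch of that idea (no ML framework assumed; the matrices and sizes here are made up for illustration):

```python
# LoRA in miniature: instead of shipping a full fine-tuned weight matrix
# W' (d x d), you ship two small factors B (d x r) and A (r x d) with
# rank r << d, and the recipient reconstructs W' = W + (alpha / r) * B @ A.

def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]

def apply_lora(W, A, B, alpha):
    """Merge a low-rank adapter (A, B) into frozen base weights W."""
    r = len(A)                      # adapter rank
    scale = alpha / r
    delta = matmul(B, A)            # d x d update built from the small factors
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

d, r = 4, 1
W = [[0.0] * d for _ in range(d)]   # frozen base weights (d x d)
B = [[1.0] for _ in range(d)]       # d x r factor
A = [[2.0] * d]                     # r x d factor
W_adapted = apply_lora(W, A, B, alpha=1.0)

# What gets shared: 2 * d * r adapter values vs. d * d full weights.
adapter_params = d * r + r * d
full_params = d * d
```

At these toy sizes the saving is trivial, but with d in the thousands and r around 8–64, an adapter is a tiny fraction of the model, which is what would make community-shared per-language fine-tunes practical.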
But I challenge the notion that not having a translation device assures native languages survive. The same pressures exist either way. You simply can’t tell the regionally powerful speakers to learn the minority languages and expect them to actually do it, and you can’t expect minority languages not to carry a stigma. Humans just don’t work that way, and never have. “When in Rome” applied in Roman times, applies in our time, and will apply in all times. The only way to eliminate that pressure is to make it not an issue.
> For example, I can carry on a reasonably good conversation with ChatGPT in the ancient dead language of Pali and have it translated into cuneiform
ChatGPT’s ability to translate into Pali or cuneiform is limited by the size of the corpus. As I said, most languages of the world do not have a sizeable electronic corpus, and what has appeared in writing is only a limited portion of those languages. ChatGPT cannot magically guess words or idioms that it was never trained on.
This is well known to anyone working in corpus linguistics. Do you have any formal background in the field?
> As people are invested in their native language this doesn’t seem unreasonable.
Outside a few relatively privileged languages, people are much less invested in their native language than you assume. Due to the pressures of poverty, political oppression, and social stigma, it can be difficult for linguists to even find speakers willing to answer some questions about their language, let alone help train a machine-translation system.
Then it’s up to those who care to preserve the languages they care about. I am certainly not arguing against people dedicating their lives to preserving languages. That’s a wonderful thing to do, and now we have a tool they can use to encode those languages in a way that is functionally accessible to everyone, for all time.
Reading the thread, I don’t see any alternative being proposed. Without one, I will hold onto the idea that we can improve things with the miracles we create in our technologies. While corporations might not see a profit potential here, the open-source world has surely shown we don’t need corporations to do amazing things.