
I am one of these people! I am one of a handful of people who speak my ancestral language, Kiksht. I am lucky to be uniquely well-suited to this work, as I am (as far as I know) the lone person from my tribe whose academic research background is in linguistics, NLP, and ML. (We have, e.g., linguists, but very few computational linguists.)

So far I have not had much luck getting the models to learn Kiksht grammar and morphology via in-context learning; I think the model will have to be trained on the corpus for this to actually work. I think this mostly makes sense, since Kiksht has functionally nothing in common with Western languages.

To illustrate the point a bit: the bulk of training data is still English, and in English, the semantics of a sentence are mainly derived from the specific order in which the words appear, mostly because it lost its cases some centuries ago. Its morphology is mainly "derivational" and mainly suffixal, meaning that words can be arbitrarily complicated by adding suffixes to them. So baked into English is word order that sometimes we insert words into sentences simply to make the word order sensible. e.g., when we say "it's raining outside", the "it's" refers to nothing at all—it is there entirely because the word order of English demands that it exists.

Kiksht in contrast is completely different. Its semantics are nearly entirely derived from triple-prefixal structure of (in particular) verbs. Word ordering almost does not matter. There are, like, 12 tenses, and some of them require both a prefix and a reflective suffix. Verbs are often 1 or 2 characters, and with the prefix structure, a single verb can often be a complete sentence. And so on.

I will continue working on this because I think it will eventually be of help. But right now the deep learning that has been most helpful to me has been to do things like computational typology. For example, discovering the "vowel inventory" of a language is shockingly hard. Languages have somewhat consistent consonants, but discovering all the varieties of `a` that one can say in a language is very hard, and deep learning is strangely good at it.




Awesome. Good luck to you!

I am also working on low-resource languages (in Central America, but not my heritage). From Wikipedia [0], it seems to be a case of revival. Are you collecting resources/data or using existing ones? (I see some links on Wikipedia.)

[0] https://en.wikipedia.org/wiki/Upper_Chinook_language


We are fortunate to have a (comparatively) large amount of written and recorded language artifacts. Kiksht (and Chinookan languages generally) were heavily studied in the early 1900s by linguists like Sapir.

re: revival, the Wikipedia article is a little misleading: Gladys was the last person whose first language was Kiksht, not the last speaker. And, in any event, languages are constantly changing. If we had been left alone in 1804, the language would be different now than it was then. We will mold the language to our current context just like any other people.


Super interesting, thank you very much for sharing your thoughts!

HN is still one of the few places on the internet to get such esoteric, expert and intellectually stimulating content. It's like an Island where the spirit of 'the old internet' still lives on.


I am applying for graduate school (after 20 years in the software industry) with the intent of studying computational linguistics; specifically, for the documentation and support of dying/dead languages.

While I am not indigenous, I hope to help alleviate this problem. I'd love to hear about your research!


I did research on entirely unrelated NLP, actually. I worked on search for a while (I am a corecipient of the SIGIR best paper '17) and a bit on space- and memory-efficient NLP tasks.


Still, that's cool. Is this the paper in question? https://dl.acm.org/doi/10.1145/3077136.3080789


That’s the one!


Wow kiksht sounds like a pretty cool language! Are there any resources you'd recommend for the language itself? I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!


> I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!

That's a fairly common language feature; such languages are generally called "agglutinating".

Prominent examples of agglutinating languages are the Eskimo languages, Turkic languages, and Finnish.

https://archive.is/QQnB6

There should be no shortage of resources available if you want to learn Turkish or Finnish.



So, bad news. Culturally, the Wasq'u consider Kiksht something that is for the community rather than outsiders. So unfortunately I think it will be extremely challenging to find someone to teach you, or resources to teach yourself.


How do you combine that feeling/observation with what you're working on now, which I'm guessing you'll eventually want to publish/make available somehow? Or maybe I misunderstand what the "it will eventually be of help" part of your first comment means.


I myself am very conflicted. I will always do what the elders say but opening things up greatly enhances the chance of survival.


Yeah, I understand, somewhat of a dilemma. Still, I wish you luck and hope that you manage to find a way that is acceptable to everyone involved.


Does that raise philosophical questions about whether an AI is part of the community or an outsider?


I’m not sure that the AI itself has many ethical complications. Many terms of service grant AI companies ownership of the data you supply to them, though, and that is very problematic for many tribes which consider the language sacred and not to be shared.


I wonder why their language is dying. /s


Having a language that only your in group can speak is a good survival strategy


Good luck, I wish you the best. I think you will almost certainly need to create a LoRA and fine-tune an existing model. Is there enough written material available? I think this would be a valuable effort for humanity, as I think the more languages we can model, the more powerful our models will become, because they embody different semantic structures with different strengths. (Beyond the obvious benefits of language preservation.)


There is more material than you'd normally find, but it is very hard to even fine-tune at that volume, unfortunately. I think we might be able to bootstrap something like that with a shared corpus of related languages, though.


You can always fine-tune with the corpus you have, even if it’s insufficient, and then try in-context learning on top of the fine-tuning. Then with that, and perhaps augmenting with RAG against the corpus, you might be able to build a context that’s stable enough in the language that you can generate human-mediated synthetic corpora and other reinforcement to enrich the LoRA.
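
For what it's worth, here is a minimal sketch of just the retrieval part of that idea, assuming a plain-text corpus and simple TF-IDF cosine ranking rather than any particular RAG library. The function names (`build_index`, `retrieve`, `build_prompt`) and the English placeholder sentences are hypothetical and made up for illustration; a real setup would use the actual corpus and probably a learned embedding model.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a morphologically rich
    # language would need something far more careful than this).
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and smoothed inverse document frequencies."""
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + len(docs)) / (1 + n)) + 1.0 for t, n in df.items()}
    return counts, idf


def retrieve(query, docs, counts, idf, k=2):
    """Return up to k documents ranked by TF-IDF cosine similarity to the query."""
    def vec(c):
        return {t: n * idf.get(t, 0.0) for t, n in c.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(tokenize(query)))
    scored = sorted(((cosine(q, vec(c)), i) for i, c in enumerate(counts)),
                    reverse=True)
    # Drop documents that share no terms with the query.
    return [docs[i] for score, i in scored[:k] if score > 0]


def build_prompt(query, examples):
    """Prepend retrieved corpus lines to the task as in-context examples."""
    lines = ["Reference sentences from the corpus:"]
    lines += [f"- {e}" for e in examples]
    lines += ["", f"Task: {query}"]
    return "\n".join(lines)


# Placeholder corpus (English stand-ins, not real language data).
corpus = [
    "the river is high today",
    "she walks to the river",
    "they ate salmon yesterday",
]
counts, idf = build_index(corpus)
examples = retrieve("river high", corpus, counts, idf, k=2)
print(build_prompt("Describe the river", examples))
```

The prompt built this way would then go to the fine-tuned model, with a human reviewing whatever comes back before it is added to any synthetic corpus.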


> I will continue working on this because I think it will eventually be of help.

They say language shapes thought. Having an LLM speak such a different language natively seems like it would uncover so much about how these models work, as a side effect of helping preserve Kiksht. What a cool place to be!


Almost sounds like Cebuano / Waray-Waray in that sense.


I was wondering about what the limitations were.

Lots of languages, even Indo-European languages, have very different word order from English or a much less significant word order.


According to Wikipedia there were 69 fluent speakers of Kiksht in 1990, and the last one passed away in 2012. How did you learn the language?

https://en.wikipedia.org/wiki/Upper_Chinook_language


I learned it from my grandmother and from Gladys's grandkid. Gladys was the last person whose first language was Kiksht, not the last person who speaks it.


FYI, the Wikipedia article currently states:

> The last fully fluent speaker of Kiksht, Gladys Thompson, died in July 2012

Which is from https://web.archive.org/web/20191010153203/http://www.opb.or...

Maybe there are some sources talking about fully fluent people still being alive? As currently the article gives the impression they were the last person to "fully" speak it.


I’m aware and I think people do not care much to correct it. The tribe has had consistently bad experiences with outsiders attempting to document this kind of thing and I sense that people are mostly wanting to be left alone at this point. There is some possibility that I would make a different decision, but it’s not for me to decide, it’s for the elders.


This is the way ;)



