Hacker News | infruset's comments

Does anyone have a clue how far we are from having "LLMs for animals"? Even if we don't understand what the LLM is saying to a dolphin or a monkey, does it change much from feeding millions of texts to a model without ever explaining language to it as a prerequisite?


A predictive/generative model of animal "vocalizations" would be almost trivial to do with current speech or music generation models. And those could be conditioned with contextual information easily.


Wouldn't we need several hundred gigabytes of ingestible/structured contextual info for animal vocalizations in order to train a model with any accuracy? Even if we had it, seems to me the model would be able to tell us what sounds probably “should” follow those of a given recording, but not what they mean.


We could train a transformer that could predict the next token, whether it's the next sound from one animal or a sound from another animal replying to it. However, we wouldn't understand the majority of what it means, except for the most obvious sounds that we could derive from context and observation of behavior. This wouldn't result in a ChatGPT-like interface, as it is impossible for us to translate most of these sounds into a meaningful conversation with animals.
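
A minimal sketch of the "predict the next token" idea, assuming the recordings have already been cut into discrete sound tokens. A simple bigram count model stands in for the transformer here, and the token labels (`A:whistle`, `B:click`, ...) are hypothetical. Note that it only says which sound tends to follow which, and nothing about what any of them mean.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for each token, which tokens follow it in the recordings."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical tokenized recordings: the prefix marks which animal made the sound.
recordings = [
    ["A:whistle", "B:click", "A:whistle", "B:click"],
    ["A:whistle", "B:click", "B:buzz"],
]

model = train_bigram(recordings)
print(predict_next(model, "A:whistle"))  # -> B:click
```

A real system would replace the counts with a learned sequence model, but the interface stays the same: context in, likely next sound out.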


Why not label a fine-tuning dataset with human descriptions based on video recordings? We explain in human language what they do, and then tune the model. It doesn't need to be a very large dataset, but it would allow for models to directly translate to human language from bird calls.


What if they just sit and talk? What is the description of this? What if only part of the communication is relevant? What if it's not relevant at all because they reacted to atmospheric changes? Or electromagnetic signals, that can't be observed on video? Or smell? Or sound outside of human hearing frequency? What if the decision based on communication is deferred? etc etc

As I mentioned before, only the most obvious examples of behaviors and context can be translated into anything meaningful.


But then it's not a translation of the bird tweets, but more like a predictive mapping from tweets to behaviors.


Reminds me of Wittgenstein's remark that if a lion could speak, we would not understand it.



Generative models yes, since there are terabytes of audio available. High quality contextual info is much harder to obtain. It’s like saying that we could easily build a model for X if we had training data available.

With LLMs we can leverage human insight to e.g. caption or describe images (which was what made CLIP and successors possible). With animals we often have no idea beyond a location. There is work to include kinematic data with audio to try and associate movement with vocalisation but it’s early days.

https://cloud.google.com/blog/transform/can-generative-ai-he...


It's "almost trivial" and "easily" done, I only wonder why we aren't speaking to animals already.

Oh wait. Because the devil's in the details, the ones SW dev hubris glosses over ;) ;)


To clarify: I didn't mean a model that would "translate" animal sounds to some representation of language or meaning. I meant a model that would capture statistical regularities in animal sounds and perhaps be able to link these to contextual information (e.g. time of day, other animals around, season etc).

By almost trivial I mean it wouldn't require much new technology. Something like WaveNet or VQ-VAE could be applied almost out of the box.

Data availability may be a significant problem, but there are some huge animal sound datasets. E.g. https://blog.google/intl/en-au/company-news/technology/a2o-s...


Someone already mentioned Aza Raskin, but the organisation you should look up is Earth Species Project. It’s a fairly open question and fairly philosophical - do the semantics of language transcend species? Certainly there is evidence that “concepts” are somewhat language agnostic in LLM embedding spaces.

https://www.earthspecies.org/about-us#team


Captivating watch from Aza Raskin on the subject:

https://youtu.be/3tUXbbbMhvk


I had the pleasure of hanging out with him at Stochastic Labs in 2018 while he was working on this, and I was working on 3D fractal stuff there. Pretty fun place, and was my first time living in the US.

At the time it seemed a bit wild / long shot, but now he just looks like a pioneer.


Presumably anyone with a multimodal transformer already pretrained on human data could further pretrain it on animal vocalizations. I don't know whether any of the large model owners are doing this.


Very nice! How might one go about adapting this to other languages? Does a downloadable version of the model exist somewhere?


I would think translation to other languages would be trivial. The model is just a map of word to vector. Every word is converted to its vector representation; then the query word is compared to the input words using cosine similarity.
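
A minimal sketch of that lookup, assuming a tiny hand-made embedding table. The 3-dimensional vectors and the names `EMBEDDINGS` and `rank_by_similarity` are made up for illustration; real word2vec or fastText vectors have hundreds of dimensions and are loaded from a file.

```python
import math

# Toy embeddings, invented for this example only.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates, embeddings=EMBEDDINGS):
    """Sort candidate words by cosine similarity to the query word."""
    q = embeddings[query]
    scored = [(w, cosine_similarity(q, embeddings[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("cat", ["car", "dog"])
print([word for word, _ in ranking])  # -> ['dog', 'car']
```

Swapping in another language would then mostly mean swapping the table for one trained on that language.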


I assumed you meant computer languages :-) If you mean human languages, yes Google publishes word2vec embeddings in many different human languages. Not sure though how easy it is to download.


Added support for multiple languages, using fasttext's embeddings


If on top of rigorous, you want them to be formally verified in Coq at the same time as they are computed: https://www.lri.fr/~melquion/doc/18-jar.pdf


is this the one mentioned in fredrik's post? he links https://www.lri.fr/~melquion/doc/16-itp-article.pdf which is presumably a different paper by the same author


I think they are the conference and journal versions of the same paper. Hadn't seen it was mentioned in the article, I should have read it more thoroughly!


Can it be pointed at a remote Ollama server?


It's interesting that historically, in France, the more prestigious a publisher is, the more bland the cover. Book covers with colors all over the place feel cheap over here.


Those quotes from reviews that American publishers like to put on covers make books feel particularly cheap and disposable.


And this French idea goes all the way to omitting any kind of blurb. It's pretty shocking. There is no information whatsoever on the book itself beyond author, title - and publisher's name because that's important /s. ... The book may as well not be on the shelf.


Oh I would love for my covers not to be blighted by "New York Times Bestselling author" or "Now a popular <TV channel> show!" and the like. Keep the blurb on the back, but don't pollute the cover!


I buy a lot of used books and I can never bring myself to pick up one of those. I'd have to be desperate.


Fitzcarraldo Editions is currently trying to ape this for the English language market.

https://fitzcarraldoeditions.com/


Oh, I like that.


I get that HN is an English-language website from the US, and that I'm not entitled to have this website include me. Don't take this as personal criticism -- if anything I like your project --, I'm hating the game and not the player. But the English language and US/UK culture are hegemonic today, which means people from all around the world read and write the words "book" and "author" on HN without realizing, or realizing too well, depending on who they are, that what is actually meant is "English-language books" (including a moderate amount of translations, thank goodness) and "English-speaking authors" (a majority of whom are American or British, even when they are immigrants). I'm not sure this comment will do anything, but someone had to say it, maybe just to maintain a modicum of awareness that millions of books are written in other languages and never make it to the English-speaking eye, and are thus buried alive as not-really-books and their writers not-really-authors in the global conversation. Maybe the relevant adjectives ("English-language", "American", "British") might be used more to remind readers that by "books" we do not mean "all books", but "books accessible in the US"?


Isn’t it kind of implicit in the English language description? Like if there was a site/post “les cent meilleurs livres de 2023”, it would be fairly unsurprising if they were all in French.

I would love a more inclusive, multilingual culture, but I’m not sure qualifying sentences to refer to the language they’re written in is the right place to start?


There would never be a site/post "les cent meilleurs livres de 2023" on a global discussion board such as this because by definition it would not be in English. There might be a "the best French books of 2023" post, although there wouldn't be, because French books simply don't exist in the anglosphere, but my point is precisely that the global conversation happens in English, the only word that ever reaches it is "book", and it's never about anything other than English books.


You're just wrong. You're defining "global" discussion boards to only include that which is in english. There are vast francophone sections of the internet, not to mention chinese, malay, hindi, spanish etc. You are fully welcome to go and look at these places as well; they are also global in the same sense that y combinator is.

hacker news is a discussion board attached to a startup accelerator based in silicon valley. it's only global because many people find the content that's *already here* valuable.


> You are fully welcome to go and look at these places as well; they are also global in the same sense that y combinator is.

Allow me to respectfully disagree: they are not. The dominance of US culture around the world, for the best and for the worst, is a fact of life for all of us who live outside it. If you look at the rates of translation of books from and to English, you will immediately see where the center and the periphery lie.


That seems to be an isolated standpoint. While the US and UK undeniably have a big influence on global thinking, they are surely far from dominant in any but some specialised categories?

Going through some of the biggest countries by population: China, India, Indonesia, Pakistan, Brazil, Nigeria, Bangladesh, Russia, Mexico, Japan, Iran, UK, Germany, France, which one would you say would have a big to moderate US cultural influence? I would say UK, Mexico, and to a lesser degree Germany, and then France, everything else hardly?

Book publishing also looks healthy in a lot of countries https://en.m.wikipedia.org/wiki/Books_published_per_country_...


Random people from all these countries can name off the top of their head a bunch of any of the following:

- US presidents

- US pop artists

- US filmmakers

- US cities

- US CEOs

- US companies

- US TV shows

I stopped there but I could go on for long. Now, take any country X other than the US and ask a random resident of any of the other countries to name just one of each category: a president of X, a CEO, a film, etc... If you think the answer has any chance to compete with the equivalent question asked of the US, well, I think you don't realize how big the cultural influence of the US is. What is domestic news in the US is still news in the rest of the world, but the reverse is simply not true.


We apparently live in very different societies.

Of course a random Indonesian would know Indian, Chinese, and French presidents, and vice versa. Young people I see hype up Korean bands. Chinese, Japanese, German, French and Indian companies are well known and highly influential (Ant, Tata, Huawei, all car brands, Samsung, Sony, Nintendo, …). Who are US filmmakers that are more known than non-US ones, Spielberg and Nolan?

Your experience is obviously vastly different. Let’s just not state it as fundamental truth.


> books accessible in the US

Sadly books accessible in the US / English language is slowly becoming synonymous with general availability, through no fault of the US/UK or other English speaking countries. A huge amount of books are simply never translated to languages with smaller markets. I can read English just fine, but an entire book is a struggle. There are so many books that I want to read, mostly non-fiction, and they are never going to be translated, and even when they are, circulation is low and reprints are rare. Normally I can read a book in Danish in about a week, depending on the time available, and an English book normally takes about three to four weeks.

To be clear: I solely blame Danish publishers and bookshops. They churn out crime/detective novels at an absolutely insane pace. Want to read about someone being killed and have the murder investigated by an alcoholic Scandinavian cop? The Danish publishers have you covered. Want to read "Meditations" in Danish? Well, screw you. Want to read the most popular book on this list, Demon Copperhead? Well, too bad. Slaughterhouse-Five you can get, for more than three times the price of the English version.

I get this is probably different for Chinese, German, French or Spanish, but for smaller languages you either read what everyone else reads or you read English language books.


> Sadly books accessible in the US / English language is slowly becoming synonymous with general availability, through no fault of the US/UK or other English speaking countries. A huge amount of books are simply never translated to language with smaller markets.

Depends on the market I guess. I've checked the availability of the top 10 books of 2023 on Shepherd (https://shepherd.com/bboy/2023) in Hungary, which is a tiny market of ~10 million people. 50% of the books are available in translation, which I think is not terrible.


Well, I don't think anyone in particular is to blame; I'm guessing Danish publishers and booksellers cater to their public and have little economic leeway to risk translating risky books for a small linguistic group. That's the problem, it's the effect of the forces at play in a global free market economy dominated by the English language: we all have to roll over and make way for what sells, regardless of the respect literatures and languages otherwise deserve.


I'd agree with that, it's just that the bookstores complain about rapidly declining business, and honestly most bookstores aren't bookstores anymore, they are arts and crafts stores with a few books. It's just that if they want to sell a larger number of books, then they need to have a larger selection, which would require them to invest in translations.

It's true that translations are costly, so instead they opt to push books they know will sell well, but the supermarkets sell those exact same books. It's just that without a large selection there's no reason to specifically go to a bookstore and so they die out.


Agreed.

A side note I want to add to this: copyright and other IP laws make sure a lot more money goes from Denmark to the US than from the US to Denmark. I think smaller cultural domains are losing in a sense. European movie makers constantly need grants where Hollywood booms.


I appreciate the discussion but this does read a bit ridiculous to me. If we are speaking English and we say the best books of 2023, I cannot imagine any scenario in which we would think the books are not written in English. I would immediately assume this includes translated books, but again in English. Similarly, if it were the best book list in German or any other language, I would assume all of the books are written in that language.

I don’t understand what there is to hate. If I was visiting a site written entirely in French I would have zero expectation that any book list would clarify that the books are French language. What game is being played?


Thanks for your answer. I will try to clarify a bit.

Of course we all know the books are going to be written in English. I am not trying to ask for an idiot-proof label stating the obvious lest a reader might waste a click expecting German-language books.

The point I am trying to make is that the word "book" in the dominant anglosphere has come to mean almost exclusively books coming from an English-speaking country (and even among those I'm sure the proportions are skewed towards the US/UK, although I would be happy to be disproved). So if I discuss "books" in an English conversation (English being the language we are all forced to speak globally now) it is often implicitly expected that we are discussing those books, the books of the anglosphere. Some food for thought: less than 1% of books read in the US are translations[0], which is not the case in other countries (if only because a lot of countries read a lot of translated books from... English).

> If I was visiting a site written entirely in French I would have zero expectation that any book list would clarify that the books are French language

This comment seems to assume that all languages are equal and interchangeable; they are not. This is maybe hard to realize from within the English-speaking global culture, but other languages are now vassals of English. What I'm saying is that it would be a small act of acknowledgement of this hegemony to remember what is being left out of the conversation.

[0] https://lithub.com/why-do-americans-read-so-few-books-in-tra...


Creator here: Translations are on my radar, and I will look at how I can identify translations as well to break them out into their own subsection of this list in future years (I think there is value in that). I don't know how to do that yet :).

To the larger point, I understand what you are saying. The world is moving toward one global language, and that has pros and cons. My hope is it has more pros than cons, but we could also be losing or minimizing some very special aspects of culture/thinking that language impacts. I think about this a lot as I live in Portugal, and I am likely moving to France in 2024.

I was just reading this NY Times article about this type of thing happening within French today: https://www.nytimes.com/2023/12/12/world/africa/africa-frenc...


Thanks for the link! And I really appreciate any work about books, really, and yours is of great magnitude, thanks for sharing your project. If you ever look to expand it to French, I could be interested in helping.


Sweet, how can I get in touch? My email is ben@shepherd.com if you want to drop me a hello :)


I think you might have some sympathy with the arguments made in the book-length essay 日本語が亡びるとき: 英語の世紀の中で, which also argues that the English language dominance has a deleterious effect on non-English-language literature and that English speakers don't really notice this. (The book is available in English translation under the title "The Fall of Language in the Age of English". I don't think it's available in French translation, which is another example of English-language-dominance: English books get translated into lots of other languages, non-English books often get an English translation only, if that.)


I've read it, and I warmly recommend it! And it is true and ironic that the only language I could read it in was... English.


It’s funny, because your response highlights the problem:

For a native speaker, it’s obvious. For the rest of the world that speaks English as a second language, there is no such implication; most of the content we consume and the platforms we visit online are in English, and the language also serves as a common denominator between speakers of several foreign languages, where the only shared one is English. Hence, we absolutely use English platforms to discuss non-English content.


Creator here: I would love to expand and support books in other languages. But the technology challenges and costs associated with that are massive. It is on my radar but not something I can examine until I meet my costs on this project. I have talked to friends in Spain and France about this (I live in Portugal).


I've seen the same reaction from people learning to program. Why are there thousands of programming languages? Why not put everything under one standard?

The main answer, as for many variations of this question (languages, laws, units of measure, programming languages), is history. Efforts to formalize mathematics have sprung up in different universities, at different times, in different teams with different cultures and approaches. The mathematical theories underlying the software also vary greatly. There isn't one agreed-upon formalization of mathematics. There's classical logic and there's intuitionistic logic, which wants to see every existence theorem backed by an actual witness and does not accept that `not (not A)` implies `A`. (Speaking of which, different systems have (very) different notions of equality! In case you thought this one would at least be simple.) Sometimes two pieces of software have nearly identical foundations, such as Lean and Coq to some extent; but one is decades old, and the other is a rewrite from scratch using other unification algorithms and a different programming language. Sometimes people just don't get along and start competing projects.
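
To make the double-negation point concrete, here is a small sketch in Lean 4. The theorem names `dn_intro` and `dn_elim` are my own, chosen for illustration. The first direction goes through with no axioms; the second only goes through by appealing to a classical principle, which is exactly what an intuitionist declines to do.

```lean
-- Constructively fine: from A we can always derive not (not A).
theorem dn_intro (A : Prop) (h : A) : ¬¬A :=
  fun hn => hn h

-- The converse relies on classical reasoning (`Classical.byContradiction`).
theorem dn_elim (A : Prop) (h : ¬¬A) : A :=
  Classical.byContradiction h
```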

Note that some people have attempted to unify various proof systems, which reminds me of the classic https://xkcd.com/927/


Do any commercial domestic solutions exist?


I have the same question, and more generally: Any generic way of doing this for any of the open source or semi open source models, especially Mistral[0]?

[0] https://news.ycombinator.com/item?id=37675496


> Encompasses Replit's top 30 programming languages with a custom trained 32K vocabulary for high performance and coverage

Any idea where the list can be found?


> The model is trained in bfloat16 on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages from a subset of permissively licensed code from Bigcode's Stack Dedup V2 dataset and a dev-oriented samples from StackExchange.

Following the link to the "Stack Dedup V2" page: https://huggingface.co/datasets/bigcode/the-stack-dedup

> The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages. The full list can be found here.

https://huggingface.co/datasets/bigcode/the-stack-dedup/blob...

It requires login to see the JSON file.


just added the list to the README on Hugging Face!

