
I've got bad news for you – that term was used in deep learning research well before LLMs came on the scene. It has nothing to do with pundits trying to popularize anything or trying to justify LLMs' shortcomings; it was just a label researchers gave to a phenomenon they were trying to study.

A couple papers that use it in this way prior to LLMs:

- 2021: The Curious Case of Hallucinations in Neural Machine Translation (https://arxiv.org/abs/2104.06683)

- 2019: Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation (https://aclanthology.org/W19-6623/)


Sort of like this? It does help: Source-Aware Training Enables Knowledge Attribution in Language Models (https://arxiv.org/abs/2404.01019)

From the abstract:

> ... To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted.
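
To make the recipe concrete, here's a hypothetical sketch of what the two stages' training examples might look like (the field names, ID format, and content here are my own invention, not the paper's):

    # Stage (i): continued pretraining where each document carries a
    # unique source identifier, so the model learns to associate the
    # ID with that document's content.
    pretrain_example = {
        "text": "<source: doc_00042> The Eiffel Tower was completed in 1889. ..."
    }

    # Stage (ii): instruction tuning that teaches the model to cite
    # the supporting pretraining source when prompted.
    instruction_example = {
        "prompt": "When was the Eiffel Tower completed? Cite your source.",
        "response": "It was completed in 1889 [source: doc_00042].",
    }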


OpenAI stated [1] that one of the breakthroughs needed for o1's chain of thought to work was reinforcement learning to teach it to recover from faulty reasoning.

> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.

That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.
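
To illustrate the "collapse" problem, here's a toy sketch of the kind of reward shaping that discourages it (my own illustration, not OpenAI's or the paper's actual objective): score the final attempt, but pay an extra bonus for fixing a wrong first attempt, so a policy that ignores the revision step is no longer optimal.

    def self_correction_reward(first_correct: bool, second_correct: bool,
                               fix_bonus: float = 0.5) -> float:
        # The final attempt is what ultimately gets scored.
        reward = 1.0 if second_correct else 0.0
        # Pay extra for turning a wrong first attempt into a correct
        # second one, so the policy can't maximize reward by only
        # optimizing its first try and ignoring the revision step.
        if second_correct and not first_correct:
            reward += fix_bonus
        return reward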

[1]: https://openai.com/index/learning-to-reason-with-llms/


SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is *the* library for tokenizers, as it has been standard in research for years now.

Tiktoken is a library which only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT and GPT-4, even though this is actually just a specific tokenizer included in tiktoken.

What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more multilingually balanced than the data used for their previous BPE model. It so happens that they trained one with SentencePiece and the other with tiktoken, but that really shouldn't make any difference in tokenization quality or compression efficiency. The switch to tiktoken probably had more to do with latency, or something similar.
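
For anyone who hasn't touched either library, a minimal sketch of the difference (the corpus path and vocab size are made-up placeholders):

    import sentencepiece as spm
    import tiktoken

    # SentencePiece: train your own BPE model on a raw-text corpus.
    spm.SentencePieceTrainer.train(
        input="corpus.txt",        # hypothetical training corpus
        model_prefix="my_bpe",
        model_type="bpe",          # could also be "unigram"
        vocab_size=32000,
    )
    sp = spm.SentencePieceProcessor(model_file="my_bpe.model")
    print(sp.encode("Bonjour tout le monde", out_type=str))

    # tiktoken: load one of the pretrained BPE encodings it ships with.
    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4/ChatGPT one
    print(enc.encode("Bonjour tout le monde"))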


I'm also confused about some of the figures' captions, which don't seem to match the results:

- "Only Sonnet-3.5 can count the squares in a majority of the images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have accuracy of >50%

- "Sonnet-3.5 tends to conservatively answer "No" regardless of the actual distance between the two circles.", but it somehow gets 91% accuracy? That doesn't sound like it tends to answer "No" regardless of distance.


Any career in fundamental research is more or less like that. From what I've seen personally, academia and government labs are the two places where you'll find the most open-ended roles. Each comes with its own caveats, of course.


Interesting to know. In my limited experience, though, the politics in academia is so heavy (more than in your average workplace) that political matters often overshadow research matters. Maybe that is one of the caveats you had in mind. I am curious about government labs, though – it seems country-dependent. How are they different from academia?


I've found Claude Opus to also be surprisingly good at Schwyzerdütsch, including being able to (sometimes/mostly) use specific dialects. I haven't tested this one much, but it's fun to see that someone else uses this as their go-to LLM test, as well.


Is "meringues me" a typo, or a really fun new vocab word for me?


Yeah, this is correct, and I'm not sure what paper GP was thinking of – Chinchilla is only about finding the point at which it becomes more useful to scale the model than to train it longer.

Chinchilla-optimal scaling is not useful if you want to actually use the model, only if you want to beat some other model on some metric at minimal training cost.


Well, my point is that "scale the model" is equivalent to upping inference costs.
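
A back-of-envelope illustration with the usual approximations (training compute ≈ 6·N·D FLOPs, Chinchilla-optimal data D ≈ 20·N tokens, inference ≈ 2·N FLOPs per generated token – all rough rules of thumb, and the model sizes below are hypothetical):

    # Two hypothetical model sizes trained on the same compute budget.
    N_big, N_small = 70e9, 7e9

    C = 6 * N_big * (20 * N_big)      # big model, Chinchilla-optimal
    D_small = C / (6 * N_small)       # same budget spent on the small model
    print(D_small / (20 * N_small))   # -> 100.0: small model is 100x "overtrained"
    print(N_big / N_small)            # -> 10.0: big model costs ~10x more per token served

Same training bill either way, but every token you ever serve from the bigger model costs roughly 10x more.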


I believe this was one of Machiavelli's big arguments in The Prince – that sometimes a country in crisis needs a single strong leader/monarch/dictator, using cruelty if necessary to keep control and bring stability.

