Parameters: In the context of AI and language models, parameters are the internal numerical values (the weights) that a model uses to make predictions or generate responses. Think of them as the knobs and switches that get adjusted to fine-tune how the AI understands and generates language. These parameters are learned during the training process, where the model analyzes vast amounts of data and adjusts them to optimize its performance.
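To make that concrete, here's a tiny PyTorch sketch (the layer sizes are made up purely for illustration) showing that a model's parameters are just learned numerical tensors you can inspect and count:

```python
import torch.nn as nn

# A toy two-layer network; its "parameters" are the weight matrices and
# bias vectors inside each Linear layer, learned during training.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Each parameter tensor is one of the "knobs" adjusted by training.
total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total}")  # (128*256 + 256) + (256*10 + 10) = 35,594
```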
Layers: In AI, layers are like stacked building blocks within a neural network, which is the fundamental structure of many AI models. Each layer performs its own computation, transforming the input data as it passes through. Think of a layer as a specific task or filter that the AI model can use to understand and process information. The deeper the neural network, the more layers it has, allowing more complex patterns and representations to be learned.
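A minimal sketch of stacked layers, assuming a plain PyTorch feed-forward setup with made-up sizes, just to show the input being transformed step by step as it passes through each layer:

```python
import torch
import torch.nn as nn

# Three stacked layers; the input is transformed step by step as it
# passes through each one. Deeper networks simply stack more of these.
layers = nn.ModuleList([
    nn.Linear(16, 32),
    nn.Linear(32, 64),
    nn.Linear(64, 8),
])

x = torch.randn(1, 16)          # a single example with 16 features
for i, layer in enumerate(layers):
    x = torch.relu(layer(x))    # each layer applies its own transformation
    print(f"after layer {i}: shape {tuple(x.shape)}")
```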
"Mixture Of Experts": An "MoE" is an approach in AI that combines multiple specialized AI models, known as "experts," to work together on a task. Each expert focuses on a particular subset or aspect of the problem, leveraging their expertise to contribute to the final result. It's like having a team of experts who specialize in different areas collaborating to provide the best solution. By dividing the task and letting each expert handle their niche, the AI model can achieve better overall performance.
Tokens: In the context of AI and language models, tokens are chunks of text that are used as input or output. They can be individual words, characters, or even subwords, depending on how the language model is designed. For example, in the sentence "I love cats," the tokens would be "I," "love," and "cats." Tokens help the AI model understand and process language by breaking it down into manageable units. They allow the model to learn patterns, context, and relationships between words to generate meaningful responses or predictions.
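As a toy illustration (a crude whitespace split, not the learned subword tokenizers real models use), here's how text becomes tokens and then the integer IDs a model actually works with:

```python
# A crude whitespace "tokenizer" for illustration; real language models use
# learned subword tokenizers (e.g. BPE), which may split rare words into pieces.
sentence = "I love cats"
tokens = sentence.split()            # ['I', 'love', 'cats']

# Models operate on integer IDs rather than strings, so each token is
# mapped through a vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(tokens, token_ids)
```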
Multi-Query Attention (MQA): This is the original paper: https://arxiv.org/abs/1911.02150 . The idea is that a transformer has many attention heads, say 64 for LLaMA, and for each head you have one "query" vector, one "key" vector, and one "value" vector per token. During autoregressive inference, much of the cost comes from loading those cached key and value vectors from GPU memory onto the GPU's compute units. The idea behind MQA is that instead of having 64 queries, 64 keys, and 64 values, you have 64 queries, 1 key, and 1 value ("Multi-Query" as opposed to "Multi-Head", the original name). This means there is much less data to load from GPU memory during inference.
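A rough back-of-the-envelope sketch of why this matters, with an assumed head dimension, sequence length, and fp16 storage (the 64 heads match the description above; the other numbers are made up for the example), comparing the per-layer key/value cache under multi-head vs. multi-query attention:

```python
import torch

# Illustrative sizes: 64 heads as described above; head_dim and sequence
# length are assumptions for the example.
num_heads, head_dim, seq_len = 64, 128, 2048

# Multi-Head Attention: a separate key and value vector per head, per token.
k_mha = torch.zeros(seq_len, num_heads, head_dim)
v_mha = torch.zeros(seq_len, num_heads, head_dim)

# Multi-Query Attention: still 64 query heads, but one shared key and one
# shared value vector per token.
k_mqa = torch.zeros(seq_len, 1, head_dim)
v_mqa = torch.zeros(seq_len, 1, head_dim)

bytes_mha = (k_mha.numel() + v_mha.numel()) * 2   # assuming fp16 (2 bytes/element)
bytes_mqa = (k_mqa.numel() + v_mqa.numel()) * 2
print(f"KV cache per layer: MHA {bytes_mha/1e6:.1f} MB vs MQA {bytes_mqa/1e6:.2f} MB")
# Roughly 64x less key/value data to stream from GPU memory while decoding.
```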