Parameters: In the context of AI and language models, parameters are the internal numerical values (the weights) that a model uses to make predictions or generate responses. Think of them as the knobs and switches that get adjusted to fine-tune how the AI understands and generates language. These parameters are learned during the training process, where the model analyzes vast amounts of data and adjusts them to optimize its performance.
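To make that concrete, here's a tiny PyTorch sketch (the layer sizes are made up purely for illustration) showing that a model's parameters are just learned numerical tensors you can inspect and count:

```python
import torch.nn as nn

# A toy two-layer network; its "parameters" are the weight matrices and
# bias vectors inside each Linear layer, learned during training.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Each parameter tensor is one of the "knobs" adjusted by training.
total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total}")  # (128*256 + 256) + (256*10 + 10) = 35,594
```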
Layers: In AI, layers are like stacked building blocks within a neural network, which is the fundamental structure of many AI models. Each layer performs its own computation, transforming the input data as it passes through. Think of a layer as a specific task or filter that the AI model can use to understand and process information. The deeper the neural network, the more layers it has, allowing more complex patterns and representations to be learned.
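A minimal sketch of stacked layers, assuming a plain PyTorch feed-forward setup with made-up sizes, just to show the input being transformed step by step as it passes through each layer:

```python
import torch
import torch.nn as nn

# Three stacked layers; the input is transformed step by step as it
# passes through each one. Deeper networks simply stack more of these.
layers = nn.ModuleList([
    nn.Linear(16, 32),
    nn.Linear(32, 64),
    nn.Linear(64, 8),
])

x = torch.randn(1, 16)          # a single example with 16 features
for i, layer in enumerate(layers):
    x = torch.relu(layer(x))    # each layer applies its own transformation
    print(f"after layer {i}: shape {tuple(x.shape)}")
```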
"Mixture Of Experts": An "MoE" is an approach in AI that combines multiple specialized AI models, known as "experts," to work together on a task. Each expert focuses on a particular subset or aspect of the problem, leveraging their expertise to contribute to the final result. It's like having a team of experts who specialize in different areas collaborating to provide the best solution. By dividing the task and letting each expert handle their niche, the AI model can achieve better overall performance.
Tokens: In the context of AI and language models, tokens are chunks of text that are used as input or output. They can be individual words, characters, or even subwords, depending on how the language model is designed. For example, in the sentence "I love cats," the tokens would be "I," "love," and "cats." Tokens help the AI model understand and process language by breaking it down into manageable units. They allow the model to learn patterns, context, and relationships between words to generate meaningful responses or predictions.
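As a toy illustration (a crude whitespace split, not the learned subword tokenizers real models use), here's how text becomes tokens and then the integer IDs a model actually works with:

```python
# A crude whitespace "tokenizer" for illustration; real language models use
# learned subword tokenizers (e.g. BPE), which may split rare words into pieces.
sentence = "I love cats"
tokens = sentence.split()            # ['I', 'love', 'cats']

# Models operate on integer IDs rather than strings, so each token is
# mapped through a vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(tokens, token_ids)
```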
Multi-Query Attention (MQA): This is the original paper: https://arxiv.org/abs/1911.02150 . The idea is that a transformer has many attention heads, say 64 for LLaMA, and for each head you have one "query" vector, one "key" vector, and one "value" vector per token. During autoregressive inference, much of the cost comes from loading those cached key and value vectors from GPU memory onto the GPU's compute units. The idea behind MQA is that instead of having 64 queries, 64 keys, and 64 values, you have 64 queries, 1 key, and 1 value ("Multi-Query" as opposed to "Multi-Head", the original name). This means there is much less data to load from GPU memory during inference.
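A rough back-of-the-envelope sketch of why this matters, with an assumed head dimension, sequence length, and fp16 storage (the 64 heads match the description above; the other numbers are made up for the example), comparing the per-layer key/value cache under multi-head vs. multi-query attention:

```python
import torch

# Illustrative sizes: 64 heads as described above; head_dim and sequence
# length are assumptions for the example.
num_heads, head_dim, seq_len = 64, 128, 2048

# Multi-Head Attention: a separate key and value vector per head, per token.
k_mha = torch.zeros(seq_len, num_heads, head_dim)
v_mha = torch.zeros(seq_len, num_heads, head_dim)

# Multi-Query Attention: still 64 query heads, but one shared key and one
# shared value vector per token.
k_mqa = torch.zeros(seq_len, 1, head_dim)
v_mqa = torch.zeros(seq_len, 1, head_dim)

bytes_mha = (k_mha.numel() + v_mha.numel()) * 2   # assuming fp16 (2 bytes/element)
bytes_mqa = (k_mqa.numel() + v_mqa.numel()) * 2
print(f"KV cache per layer: MHA {bytes_mha/1e6:.1f} MB vs MQA {bytes_mqa/1e6:.2f} MB")
# Roughly 64x less key/value data to stream from GPU memory while decoding.
```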