

Absolute BS. Transformers were literally introduced by that paper.


Maybe ELI5 was the wrong level for my ask. I still want to understand it from a technical perspective.

Maybe ELI am a software engineer but not in AI/ML?


*Prompt:* I'm a software engineer who has never done anything with AI/ML. Explain to me transformers and the "Attention is all you need" paper.

*GPT4:* "Attention is All You Need" is a seminal paper published in 2017 by Vaswani et al. that introduced the Transformer architecture, a breakthrough in the field of AI and natural language processing (NLP). Transformers have since become the foundation for many state-of-the-art models, including GPT (Generative Pre-trained Transformer) series, BERT (Bidirectional Encoder Representations from Transformers), and more.

At a high level, the Transformer architecture is designed to handle sequential data, like text, in parallel rather than sequentially, as was common with earlier models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. This parallelization allows for faster training and better scalability. The key innovation in Transformers is the attention mechanism, which is used to weigh the importance of different input tokens when generating an output token.
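To make that "weighted average" concrete, here is a minimal NumPy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k)V. This is a single head with no masking or batching, and the toy shapes are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query/key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, head dimension 4 (toy sizes)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (3, 4)
```

Each output row is a mixture of the value vectors, with mixing weights determined by how strongly that token's query matches every key; multi-head attention (below) just runs several of these in parallel on learned projections.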

Here's a simplified explanation of the main components of a Transformer:

    Input Embeddings: Transformers take input data in the form of text tokens. These tokens are converted into continuous vectors through an embedding layer.

    Positional Encoding: Unlike RNNs and LSTMs, Transformers don't have an inherent sense of sequence order. To address this, positional encoding is added to the input embeddings to provide information about the position of each token in the sequence (a sketch of this follows after the list).

    Encoder and Decoder: The Transformer architecture consists of an encoder and a decoder. The encoder is a stack of identical layers, each containing two primary components: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder is also composed of a stack of identical layers, but it has an additional multi-head attention mechanism that attends to the encoder's output.

    Multi-Head Attention: The attention mechanism is the core innovation in the Transformer architecture. It computes a weighted average of input tokens to generate context-aware representations. Multi-head attention allows the model to focus on different aspects of the input by employing multiple attention heads, each with its own set of learnable parameters.

    Feed-Forward Networks: These are simple fully connected layers that follow the attention mechanism in each layer of the encoder and decoder.

    Output: Finally, the decoder generates the output sequence token by token. This is done by applying a linear layer followed by a softmax activation function to get the probabilities of each possible output token.
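
To make that final step concrete, here is a toy sketch of the linear-plus-softmax projection. The weights are random and the dimensions are invented for illustration; in a trained model the projection matrix is learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab_size = 512, 32000                    # toy dimensions
decoder_output = rng.normal(size=(d_model,))        # final hidden state for one position
W = rng.normal(size=(d_model, vocab_size)) * 0.01   # learnable output projection (random here)

logits = decoder_output @ W                 # one score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax -> probability distribution
next_token = int(np.argmax(probs))          # greedy decoding picks the most likely token
```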
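And here is the sinusoidal positional encoding mentioned above, again as a minimal NumPy sketch. It implements the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and toy shapes are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                       # cosine on odd dimensions
    return pe

# The encodings are simply added to the token embeddings:
embeddings = np.zeros((10, 512))                       # e.g. 10 tokens, d_model = 512
inputs = embeddings + sinusoidal_positional_encoding(10, 512)
```
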
In summary, the Transformer architecture introduced in the "Attention is All You Need" paper revolutionized NLP by enabling more efficient and powerful models. The key innovation is the attention mechanism, which helps the model learn complex relationships between input tokens and produce context-aware representations. This has led to a wide array of applications, from machine translation and text summarization to sentiment analysis and language modeling.

*Me:* It's okay-ish, but you're probably interested in a more detailed, low-level explanation. Also, transformers are widely used for computer vision nowadays (a bit forgivable given ChatGPT's 2021 knowledge cutoff, but vision transformers had already taken off by then).



