Is it only me, or does reading this article, with its high-level, vague phrases and anecdotes that skip the actual essence of the clever tricks making transformers computationally efficient, actually make it harder to grasp how transformers "really work"?
I recommend Andrej Karpathy's videos on this topic. Well delivered, clearly explaining the main techniques, and they come with a Python implementation.
There's also this type of article where the first half is easily understandable by a layman, but then it suddenly drops a lot of jargon and math formulas and you get completely lost.
A friend once described these kinds of descriptions by analogy with a recipe that went:
Recipe for buns:
First you need flour: a white, fine-grained powder produced from ground wheat, which can be acquired by exchanging money (a standardised convention for storing value) at a store that contains many such products. Once it is mixed with the raising agent and other ingredients, you should remove the buns from the oven when golden brown.
For this situation, if it feels worth it, I have been using ChatGPT Q&A on the jargon to bridge the gap. I haven't read this article all the way through yet, so I can't recommend it, but in many cases it's a super useful contextual jargon clearer.
Agreed. I made my own Shakespeare babbler following Karpathy's videos. I have a decent understanding of the structure and the process, but I don't really grasp how they work.
It's obvious how the error goes down, but I feel like there's something semantic going on that isn't directly expressed in the code.
I'm saving the latter half for tomorrow, but so far it's making sense. People have different learning styles, and I think this one is lacking in the visual department. Parts like the vectors displayed next to a word like "cat" could have been better annotated to show visually where those numbers come from.
Sign me up for that. I made my own out of Legos as a kid; I managed to make two that looked like both robots and vehicles, using just hinges and the small squares with rotating centers.
If it looks like Frankenstein, and it often acts like Frankenstein...
These things can be, are, and will be useful, but they're not a magic ticket to knowing things. I booted up Llama 2 at home; my first question was "how do I make you smarter?", and all it returned were responses like "tell me a book or field to research". Obviously that doesn't work, and it never suggested importing libraries or running training cycles.
Hey, I search a Linux problem on Brave and the AI gives me the right answer 75% of the time, which is nice, but the other 25% it's just regurgitating outdated Reddit threads. It has no cognition, no sense of what the consequences are; it just pronounces the result of a statistical model and doesn't know what any of it means.
Articles like this pop up often on HN, but I don't think they can lead to genuine comprehension. I doubt people outside ML circles will dive into Mamba or analogous architectures if they supplant transformers; they're just harder to understand.
Even as a kid, I was scared of thinking about what happened to humans in the vehicles when they transformed. You'd always be one bad pivot away from being crushed to a pulp.
Writing a "how does X work" post is a good way to learn about X. And when X is very popular, lots of people will be writing and sharing those posts, both for their own study and just for clicks.
Teach someone something, then reflect on the experience. You'll find you put additional effort into formalizing your own understanding. I always try to teach something to 2-3 people right out of the gate when I learn it.
Why do we add the positional encoding to the original vector instead of appending it? Doesn't adding positional data to the embedding destroy information?
The hope is that the neural network will learn to undo the positional information and treat it as an augmentation that tells it where the word occurred in the sequence.
"Sequence" is a misleading term here, since the input to the network is actually a set of tokens, meaning there is no positional information unless that extra information is added.
No, they are a type that takes a monad as a type parameter and creates a new monad that extends the original monad's capabilities. The downside is the need to use lift all the time to access the underlying monad.