Hacker News
How do transformers work? (nintyzeros.substack.com)
134 points by willem19x 11 months ago | 39 comments



Is it just me, or does this article, with its high-level, vague phrases and anecdotes that skip the essence of the many clever tricks making transformers computationally efficient, actually make it harder to grasp how transformers "really work"?

I recommend Andrej Karpathy's videos on this topic. Well delivered, clearly explaining the main techniques, and providing a Python implementation.


There's also this type of article where the first half is easily understandable by a layman, but then it suddenly drops a lot of jargon and math formulas and you get completely lost.


A friend once described this kind of explanation by analogy with a recipe that went:

Recipe for buns: First you need flour, a fine-grained white powder produced from ground wheat, which can be acquired in exchange for money (a standardised convention for storing value) at a store, which contains many such products. When mixed with the raising agent and other ingredients, you should remove the buns from the oven when golden brown.


For this situation, when it feels worth it, I've been using ChatGPT Q&A on the jargon to bridge the gap. I haven't read this article all the way through yet, so I can't recommend it, but in many cases it's a super useful contextual jargon-clearer.


Agreed, I made my own Shakespeare babbler following Karpathy's videos. I have a decent understanding of the structure and process, but I don't really grasp how they work.

It's obvious how the error reduces, but I feel like there's something semantic going on that isn't directly expressed in the code.


I'm saving the latter half for tomorrow, but so far it's making sense. People have different learning styles, and I think this one is lacking in the visual department. Parts like the vectors displayed next to a word like "cat" could have been better annotated to show visually where those numbers come from.
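For what it's worth, those per-word numbers are usually just rows of a learned lookup table. A minimal NumPy sketch (the toy vocabulary and random table here are placeholders of my own, not the article's actual values; in a real model the table entries are trained parameters):

```python
import numpy as np

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2}

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # 3 tokens, 4 dimensions

def embed(word):
    # The "numbers next to the word" are just this row of the table.
    return embedding_table[vocab[word]]

vec = embed("cat")
print(vec.shape)  # (4,)
```

During training, gradients flow back into the rows of the table, which is why semantically similar words end up with similar vectors.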


Super Data Science had a nice episode on this recently.


Does this apply to both types of transformers? Decepticons and Autobots?


I would actually love to see an article on the design process for transforming toys like Transformers.

It seems like such a cool design feat to come up with those. At least I thought so as a kid...


Sign me up for that. I made my own out of Legos as a kid; I managed to make two that worked as both robots and vehicles, using just hinges and the small squares with rotating centers.


I guess I should just have googled it. First result:

https://interestingengineering.com/culture/the-design-proces...


There's more to them than meets the eye.


How do you determine the required permeability of the core material?


You have something against Maximals and Predacons?


No, I was just thinking of OG Transformers. Although some of the very new stuff isn't my thing.


Is it normal not to come away from these with any kind of intuition, like "ah, that's why transformers work so well"?

Instead I just come away feeling like it's a Frankenstein of different matrices and statistics.


If it looks like Frankenstein, and it often acts like Frankenstein...

These things can be, are, and will be useful, but they're not a magic ticket to knowing things. I booted up Llama 2 at home; my first question was "how do I make you smarter?", and all it returned were responses like "tell me a book or field to research". Obviously that doesn't work, and it never suggested importing libraries or running training cycles.

When I search a Linux problem on Brave, the AI gives me the right answer 75% of the time, which is nice, but the other 25% it's just regurgitating outdated Reddit threads. It has no cognition, no sense of consequences; it just pronounces the result of a statistical model, and it doesn't know what any of it means.


And we already have people zealously arguing that LLMs have cognition, just as predicted by sci-fi.


The reason they're good is that they're not recurrent: you can do attention over the whole sequence at once.

Recurrent models had two flaws: (1) the vanishing gradient problem, and (2) they're not parallelizable.
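A rough sketch of what "attention over the whole sequence at once" means in practice: every position attends to every other position via a couple of matrix multiplies, with no loop over time steps. (Random weights here, purely to show the shapes; this is the scaled dot-product form, without masking or multiple heads.)

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T): all pairs of positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over positions
    return weights @ V                             # one matmul, no loop over t

rng = np.random.default_rng(0)
T, d = 5, 8                                        # sequence length, model width
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Contrast with an RNN, where computing position t requires the hidden state from position t-1, forcing a sequential loop.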


Wasn't there a better name that could have been used for these things?

One that wasn't already used for an extremely common electrical device, as well as a toy/movie franchise?


Articles like this pop up often on HN, but I don't think they can lead to genuine comprehension. I doubt people outside ML circles will dive into Mamba or analogous architectures if they supplant transformers; they're just harder to understand.


Aren't they aliens? So their technology for transforming between vehicle and robot would be somewhat alien to humans.


Even as a kid, I was scared to think about what happened to the humans inside the vehicles when they transformed. You'd always be one bad pivot away from being crushed to a pulp.


Transformers use magnetic fields to transform electricity.

That's why Tesla's AC system beat out Edison's DC system.


Sir, Teslas have batteries; those are DC.


If this article is of interest to you, this was a good discussion of a nice 3D visualization of a small LLM (nanoGPT) in motion:

LLM Visualization https://news.ycombinator.com/item?id=38505211 (https://bbycroft.net/llm)


Time for the weekly LLM explainer.


Writing a "how does X work" post is a good way to learn about X. And when X is very popular, lots of people will be writing and sharing those posts, both for their own study and just for clicks.


Hence the weekly LLM explainer at the top of the Hacker News feed.


How does that work?


Teach someone something, then reflect on the experience. You'll find you put additional effort into formalizing your own understanding. I always try to teach 2-3 people something right after I learn it.


Waiting for @dang to post a list


Why do we add the positional encoding to the original vector instead of appending it? Doesn't adding positional data to the embedding destroy information?


The hope is that the network will learn to undo the positional information and understand that it's an augmentation used to tell it where the word occurred in the sequence.

"Sequence" is a misleading term here, since the input to the network is actually a set of tokens, meaning there's no positional information without the extra signal added.
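For reference, the original sinusoidal scheme from "Attention Is All You Need" adds the encoding to the token embedding, which keeps the model width fixed (appending would grow every downstream weight matrix); the network is expected to learn to separate the two signals. A sketch (dimensions are arbitrary):

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(T)[:, None]                  # positions 0..T-1
    i = np.arange(d // 2)[None, :]               # frequency index
    angles = pos / (10000 ** (2 * i / d))        # one frequency per pair of dims
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

T, d = 10, 16
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(T, d))

# Added elementwise, not concatenated: the result is still (T, d).
X = token_embeddings + sinusoidal_pe(T, d)
print(X.shape)  # (10, 16)
```

Whether addition "destroys" information is a fair question; the usual argument is that embeddings live in a high-dimensional space where the positional signal occupies a roughly separable subspace, so the network can learn to disentangle them.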


What's up with the excessive reliance on nested bullet points? It certainly doesn't look like a good blog post.


I want to see a "how do state space models work" paper.


Oh, that RNN-CNN duality is such a nice math trick: you can do recurrence in parallel.
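The duality in a nutshell: a linear recurrence x_t = a*x_{t-1} + b*u_t unrolls into a causal convolution with kernel k_t = a^t * b, so you can compute it step by step (RNN view) or all positions at once (CNN view). A scalar toy example (coefficients and inputs picked arbitrarily):

```python
import numpy as np

# Scalar linear state space model: x_t = a*x_{t-1} + b*u_t, output y_t = x_t.
a, b = 0.9, 1.0
u = np.array([1.0, 2.0, 0.5, -1.0, 3.0])
T = len(u)

# RNN view: step through time sequentially.
x, y_rnn = 0.0, []
for t in range(T):
    x = a * x + b * u[t]
    y_rnn.append(x)
y_rnn = np.array(y_rnn)

# CNN view: the same output is a causal convolution of u
# with the kernel k_t = a**t * b, computable in parallel.
k = b * a ** np.arange(T)
y_cnn = np.array([np.sum(k[:t + 1][::-1] * u[:t + 1]) for t in range(T)])

print(np.allclose(y_rnn, y_cnn))  # True
```

The trick only works because the recurrence is linear in the state; that's the core idea behind training state space models (S4, Mamba's predecessors) in parallel while running them recurrently at inference time.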


No, they use two coils around a metallic core. I made a few when I was a kid, so I know.
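For the electrical kind, the governing relation is at least easy to state: for an ideal (lossless) transformer, the secondary voltage scales with the turns ratio, Vs/Vp = Ns/Np. A toy calculation (the function name and numbers are my own, purely illustrative):

```python
def secondary_voltage(v_primary, n_primary, n_secondary):
    """Ideal (lossless) transformer: Vs = Vp * Ns / Np."""
    return v_primary * n_secondary / n_primary

# e.g. stepping 120 V down with a 10:1 turns ratio
print(secondary_voltage(120.0, 100, 10))  # 12.0
```

Real transformers lose a few percent to core and winding losses, so this is an upper bound.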


No, they're a type that takes a monad as a type parameter and creates a new monad extending the original monad's capabilities. The downside is needing to use lift all the time to access the outer monad.



