This is a _great_ article. One of the things I enjoy most is finding new ways to understand or think about things I already feel like I know. This article helped me do both with transformer networks. I especially liked how explicitly and simply it explained things like queries, keys, and values; permutation equivariance; and even the distinction between learned model parameters and parameters derived from the data (like the attention weights).
The author quotes Feynman, and I think this is a great example of his concept of explaining complex subjects in simple terms.
Same. Magnetization current and core saturation are fairly fundamental properties influencing transformer design, but they're barely even mentioned in introductory texts.
I feel like a lot of modern transformers are just sort of cargo-cult imports of old designs because everyone who knew the salient parameters has retired and the current crew just kinda nudges things until they work. A from-scratch explanation, up to the current state of the art, would be invaluable to anyone who deals with them.
But nah. This is HN, where headlines are their own code.
RF transformers (in the form of various coils, chokes, baluns, etc.) are more interesting and complex to analyze. I once wound three of them, and none worked. I thought I'd finally found an article that explains the subject; well, not a chance ;-)
Wow. I have been looking for a good resource on implementing self-attention/transformers on my own for the last week - can't wait to read this through.
Noob question: I have some 1D conv net for financial time series prediction. Could a transformer architecture be better for this task? Is it worth a try?
If you think a longer context length might be helpful, consider stacking convolutions to give higher units a bigger receptive field, or try a convolutional LSTM. If that helps, and you have a further argument for why an even larger context window would be useful, then perhaps try attention; in that case a Transformer would be reasonable. But your stacked conv net would be the fastest and most obvious thing that should work (with the caveat that I know nothing else about your data and its characteristics, which is a really big caveat).
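To make the receptive-field point concrete, here's a minimal PyTorch sketch of stacking dilated convolutions (the module name and sizes are my own, purely illustrative): with kernel size 3 and dilations 1, 2, 4, 8, four layers already cover a window of roughly 31 time steps, no attention needed.

```python
import torch
import torch.nn as nn

# Sketch: each layer doubles the dilation, so the receptive field grows
# exponentially with depth (1 + 2*(1+2+4+8) = 31 steps for 4 layers).
class StackedConv1d(nn.Module):
    def __init__(self, channels: int = 32, layers: int = 4):
        super().__init__()
        blocks = []
        for i in range(layers):
            dilation = 2 ** i  # 1, 2, 4, 8
            blocks += [
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=dilation, padding=dilation),  # same-length output
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)

x = torch.randn(8, 32, 128)   # e.g. 8 series, 32 features, 128 time steps
y = StackedConv1d()(x)        # same shape out: (8, 32, 128)
```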
Consider looking at your errors and judging whether they stem from things your current model doesn't do well but that Transformers do, i.e., correlating two steps in a sequence across a large number of time steps. Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
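For intuition on what that "memory" is, here's a bare-bones scaled dot-product attention sketch (my own minimal version, not code from the article): the (time × time) score matrix is exactly what lets any step read directly from any other step, however far apart they are.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, time, dim)
    # scores: (batch, time, time) -- every step attends to every other step
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v  # weighted read over the whole sequence

x = torch.randn(8, 128, 32)   # 8 series, 128 steps, 32 features
out = attention(x, x, x)      # self-attention: (8, 128, 32)
```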
Thanks for the insight, also for mentioning convolutional LSTM, I wasn't aware such a thing existed.
> Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
But aren't CNNs also like a memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.
Anyway, I'm glad to hear that a transformer architecture isn't totally stupid for my task. I'll look up the literature; there seems to be a bit on this matter.
Yeah, in some sense any layer is a "memory module". Perhaps more specifically, attention solves the problem of directly correlating two items in a sequence that are very, very far away from each other. I'd generally caution against using attention prematurely as it's extremely slow, meaning you'll waste a lot of your time and resources without knowing if it'll help. Stacking conv layers or using recurrence is an easy middle step that, if it helps, can guide you on whether attention could provide even more gains.
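As one concrete version of that middle step, here's a sketch of conv features feeding recurrence (my own naming, and note it's a plain LSTM over conv outputs, not a true convolutional LSTM where the gates themselves are convolutions): you get context beyond the conv receptive field without attention's quadratic cost in sequence length.

```python
import torch
import torch.nn as nn

# Sketch: a small conv front-end extracts local features, an LSTM carries
# context across the whole sequence, and a linear head predicts from the
# last step. Much cheaper than attention's O(T^2) score matrix.
class ConvThenLSTM(nn.Module):
    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, channels, time)
        h = torch.relu(self.conv(x))         # local features, same length
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, time, hidden)
        return self.head(h[:, -1])           # predict from the final step

pred = ConvThenLSTM()(torch.randn(8, 32, 128))  # (8, 1)
```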
FWIW I immediately thought of the ML architecture, not the toy or electrical device. HN is very ML-heavy these days, and in that context 'transformers' has a familiar and obvious meaning.
Huh? The website's TLD is NL not ML. (And I'll also chime in to say that I thought it was either electrical transformers or 3D printed transformers toys).