This is a _great_ article. One of the things I enjoy most is finding new ways to understand or think about things I already feel like I know. This article helped me do both with transformer networks. I especially liked how explicitly and simply it explained things like queries, keys, and values; permutation equivariance; and even the distinction between learned model parameters and parameters derived from the data (like the attention weights).
The author quotes Feynman, and I think this is a great example of his concept of explaining complex subjects in simple terms.
Same. Magnetization current and core saturation are fairly fundamental properties influencing transformer design, but they're barely even mentioned in introductory texts.
I feel like a lot of modern transformers are just sort of cargo-cult imports of old designs because everyone who knew the salient parameters has retired and the current crew just kinda nudges things until they work. A from-scratch explanation, up to the current state of the art, would be invaluable to anyone who deals with them.
But nah. This is HN, where headlines are their own code.
RF transformers (in the form of various coils, chokes, baluns, etc.) are more interesting and complex to analyze. I once wound three of them, and none worked. I thought I'd finally found an article that explains the subject; well, not a chance ;-)
Wow. I have been looking for a good resource on implementing self-attention/transformers on my own for the last week - can't wait to read this through.
Noob question: I have some 1D conv net for financial time series prediction. Could a transformer architecture be better for this task? Is it worth a try?
If you think a longer context length might be helpful, consider stacking convolutions to give higher units a bigger receptive field, or try a convolutional LSTM. If that helps, and you have a further argument for why an even larger context window would be useful, then perhaps try attention; in that case a Transformer would be reasonable. But your stacked conv net would be the fastest and most obvious thing that should work (with the caveat that I know nothing else about your data and its characteristics, which is a really big caveat).
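To make the receptive-field point concrete, here's a minimal PyTorch sketch of stacking dilated convolutions (the module name and sizes are my own, purely illustrative): with kernel size 3 and dilations 1, 2, 4, 8, four layers already cover a window of roughly 31 time steps, no attention needed.

```python
import torch
import torch.nn as nn

# Sketch: each layer doubles the dilation, so the receptive field grows
# exponentially with depth (1 + 2*(1+2+4+8) = 31 steps for 4 layers).
class StackedConv1d(nn.Module):
    def __init__(self, channels: int = 32, layers: int = 4):
        super().__init__()
        blocks = []
        for i in range(layers):
            dilation = 2 ** i  # 1, 2, 4, 8
            blocks += [
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=dilation, padding=dilation),  # same-length output
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)

x = torch.randn(8, 32, 128)   # e.g. 8 series, 32 features, 128 time steps
y = StackedConv1d()(x)        # same shape out: (8, 32, 128)
```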
Consider looking at your errors and judging whether they stem from things your current model doesn't do well but that Transformers do, i.e., correlating two steps in a sequence across a large number of time steps. Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
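For intuition on what that "memory" is, here's a bare-bones scaled dot-product attention sketch (my own minimal version, not code from the article): the (time × time) score matrix is exactly what lets any step read directly from any other step, however far apart they are.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, time, dim)
    # scores: (batch, time, time) -- every step attends to every other step
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v  # weighted read over the whole sequence

x = torch.randn(8, 128, 32)   # 8 series, 128 steps, 32 features
out = attention(x, x, x)      # self-attention: (8, 128, 32)
```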
Thanks for the insight, also for mentioning convolutional LSTM, I wasn't aware such a thing existed.
> Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
But aren't CNNs also like a memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.
Anyway, I'm glad to hear that a transformer architecture isn't totally stupid for my task. I'll look up the literature; there seems to be a bit on this matter.
Yeah, in some sense any layer is a "memory module". Perhaps more specifically, attention solves the problem of directly correlating two items in a sequence that are very, very far away from each other. I'd generally caution against using attention prematurely as it's extremely slow, meaning you'll waste a lot of your time and resources without knowing if it'll help. Stacking conv layers or using recurrence is an easy middle step that, if it helps, can guide you on whether attention could provide even more gains.
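As one concrete version of that middle step, here's a sketch of conv features feeding recurrence (my own naming, and note it's a plain LSTM over conv outputs, not a true convolutional LSTM where the gates themselves are convolutions): you get context beyond the conv receptive field without attention's quadratic cost in sequence length.

```python
import torch
import torch.nn as nn

# Sketch: a small conv front-end extracts local features, an LSTM carries
# context across the whole sequence, and a linear head predicts from the
# last step. Much cheaper than attention's O(T^2) score matrix.
class ConvThenLSTM(nn.Module):
    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, channels, time)
        h = torch.relu(self.conv(x))         # local features, same length
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, time, hidden)
        return self.head(h[:, -1])           # predict from the final step

pred = ConvThenLSTM()(torch.randn(8, 32, 128))  # (8, 1)
```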
FWIW I immediately thought of the ML architecture, not the toy or electrical device. HN is very ML-heavy these days, and in that context 'transformers' has a familiar and obvious meaning.
Huh? The website's TLD is NL not ML. (And I'll also chime in to say that I thought it was either electrical transformers or 3D printed transformers toys).