I made a transformer to predict a simple sequence manually (vgel.me)
360 points by lukastyrychtr on Sept 22, 2023 | 94 comments



A related line of work is "Thinking Like Transformers" [1]. They introduce a primitive programming language, RASP, composed of operations that can be modeled with transformer components, and demonstrate how different programs can be written in it, e.g. histograms and sorting. Sasha Rush and Gail Weiss have an excellent blog post on it as well [2]. Follow-on work demonstrated how RASP-like programs can be compiled into model weights without training [3].

[1] https://arxiv.org/abs/2106.06981

[2] https://srush.github.io/raspy/

[3] https://arxiv.org/abs/2301.05062


Huge fan of RASP et al. If you enjoy this space, it might be fun to take a glance at some of my work on HandCrafted Transformers [1], wherein I hand-pick the weights in a transformer model to do long-hand addition similar to how humans learn to do it in grade school.

[1] https://colab.research.google.com/github/newhouseb/handcraft...


It seems like a functional language like Haskell would be the right tool for this.

Also going from a net to code would be super interesting in terms of explain-ability.


I thought I understood transformers well, even though I had never implemented them. Then one day I implemented them, and they didn't work/train nearly as well as the standard pytorch transformer.

I eventually realized that I had ignored the dropout, because I thought my data could never overfit. (I trained the transformer to add numbers, and I never showed it the same pair twice.) Turns out dropout has a much bigger role than I had realized.

TLDR, just go and implement a transformer. The more from scratch the better. Everyone I know who tried it, ended up learning something they hadn't expected. From how training is parallelized over tokens down to how backprop really works. It's different for every person.
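
For readers who have not met dropout before, here is a minimal sketch of the common "inverted dropout" formulation. The function name `dropout` and its parameters are made up for illustration; this is not the commenter's code, nor PyTorch's implementation.

```python
import random

def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest.

    Dividing the survivors by (1 - p) keeps the expected value unchanged,
    so nothing has to be adjusted at inference time.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so repeated runs agree
activations = [1.0] * 10
print(dropout(activations, p=0.5, rng=rng))          # a mix of 0.0 and 2.0
print(dropout(activations, p=0.5, training=False))   # unchanged at inference
```

Because a different random subset of activations is zeroed on every step, the network cannot lean on any single unit, which acts as a regularizer even when no training example is ever repeated.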


Would you have any literature to recommend, to help get some insight in how to approach the task?


Andrej Karpathy has a series on YouTube where he builds everything up from scratch. He implements a transformer when he builds a GPT-style model: https://www.youtube.com/watch?v=kCc8FmEb1nY


Thank you for the tip!


There are a lot of "simple transformer" implementations on GitHub. But I'll recommend Microsoft's new Phi 1.5: https://huggingface.co/microsoft/phi-1_5/blob/main/modeling_... It's well written, and very modern, including rotary embeddings and a kv-cache for inference.


Thank you!


I love Karpathy et al but this is what finally made it click for me: https://youtu.be/kWLed8o5M2Y?si=SJT5_lCJ0hSR7Z_k


I've been kicking around a similar idea for a while. Why can't we have an intuitive interface to the weights of a model, that a domain expert can tweak by hand to accelerate training? For example, in a vision model, they can increase the "orangeness" collection of weights when detecting traffic cones. That way, instead of requiring thousands/millions more examples to calibrate "orangeness" right, it's accelerated by a human expert. The difficulty is obviously having this interface map to the collections of weights that mean different things, but is there a technical reason this can't be done?


> weights of a model, that a domain expert can tweak by hand

This sounds similar to how image recognition was done before deep learning [1]

[1] https://www.youtube.com/watch?v=8SF_h3xF3cE&t=1358s


Great example. Right, the deep learning approach uncovers all kinds of hidden features and relationships automatically that a team of humans might miss.

I guess I'm thinking about this problem from the perspective of these GPT models requiring more training data than a normal person can acquire. Currently, it seems you need the entire internet worth of training data (and a lot of money) to get something that can communicate reasonably well. But most people can communicate reasonably well, so it would be cool if that basic communication knowledge could be somehow used to accelerate training and minimize the reliance on training data.


I am still learning transformers, but I believe part of the issue may be that the weights do not necessarily correlate to things like "orangeness"

Instead of a transformer for each color, you have like 5 to 100 weights that represent some arbitrary combination of colors. Literally the arbitrariness is defined by the dataset and the number of weights allocated.

They may even represent more than just color.

So I am not sure if a weight is actually a "dial" like you are describing it, where you can turn up or down different qualities. I think the relationship between weights and features is relatively chaotic.

Like you may increase orangeness but decrease "cone shapedness" or accidentally make it identify deer as trees or something, all by just changing 1 value on 1 weight


It is possible that the parameters, like weights in a machine learning model, interact to yield outcomes in a manner analogous to the interactions between genes in biological systems, which produce traits. These interactions involve complex interdependencies, so there really aren't 1 to 1 dials.


> the deep learning approach uncovers all kinds of hidden features and relationships automatically that a team of humans might miss

sitting in a lecture from a decent DeepLearning practitioner, there were two questions from the audience (among others). The first question asked "How can we check the results using other models, so that computers will catch the errors that humans miss?"

The second question was more like "when a model is built across a non-trivial input space, the features and classes that come out are one set of possibilities, but there are many more possibilities. How can we discover more about the model that is built, knowing that there are inherent epistemological conflicts in any model?"

I also thought it was interesting that the two questioners were from large but very different demographic groups, and at different stages of learning and practice (the second question was from a senior coder).


The reason you're looking for is called "The Bitter Lesson".

The short version is, trying to give human assistance to AIs is almost always less cost-effective than making them run on more computing power.

By the time your human expert has calibrated your weight layers to detect orange traffic cones, your GPU cluster has trained its AI to detect traffic cones, traffic lights, trees, other cars, traffic cones with a slightly different shade of orange, etc.


The AI community has over-learned this bitter lesson and is now unprepared to make useful AI in data-constrained environments.

This is why almost all of the countless enterprise AI demos taking place as we speak are doomed to failure.


"The bitter lesson of machine learning is that building knowledge into agents does not work in the long run, and breakthrough progress comes from scaling computation by search and learning. This applies to domains where domain knowledge is weak or hard to express mathematically. The rapid progress of ML applied to LQCD, mol. dyn., protein folding, and computer graphics is the result of combining domain knowledge with ML."

This passage says that the advantages of scaling a certain kind of learning are especially good when .. two conditions.. but a side-effect of that statement is, when knowledge is well known, and maybe straightforward to express, these kinds of learning systems, not as great.. this is true.

without taking on big other topics, I think this "bitter lesson" is unspecific enough to include some self-serving utility.. just tell the other camps to give up, you lost. that sort of thing.


The number of layers and weights is really not at a scale we can handle updating manually, and even if we could the downstream effects of modifying weights are way too hard to manage. Say you are updating the picture to be better at orange, but unless you can monitor all the other colours for correctness at the same time you probably are creating issues for other colours without realizing it


The technical reason it can't be done (or would be very difficult to do) is that weights are typically very uninterpretable. There aren't specific clusters of neurons that map to one concept or another, everything kind of does everything.


I wonder if an expert can "impose" weights onto a model and the model will opt to continue with them when it resumes training. For example, in the vision example, the expert may not know where "orangeness" currently exists, but if they impose their own collection of weight adjustments that represent orangeness, will the model continue to use these weights as the path of least resistance when continuing to optimize? Just spitballing, but if we can't pick out which neurons do what, the alternative would seem to be to encourage the model to adopt a control interface of neurons.


That would make it less efficient - since learning is compression, a less compressed model will also learn less at the same size.


Humans have much more compressed models, trying to transfer human learning to machine learning could potentially be a way to get more efficient models.

The way we do that currently is by labeling data for training, but maybe there are better ways to do it. Like some semi code to write with hints for the model. Like, instead of labeled data, could have a series of "lectures" of labeled data that would lead to a good end state, instead of training on all the data in parallel.

You don't teach a child calculus by showing them a million calculus problems after all; you start with simple numbers and then slowly ramp up with more concepts. But to do that we would need to change how we train models.

Edit: By doing it that way you could see the skill of the model after each lecture, and update the lecture to try to make the model learn better. Not sure how to do that, but such ways to work with parts of models is a potential way forward.


There are techniques for this called "curriculum learning" and "textbooks".

I'm not sure exactly what is in the textbooks since I admit to not reading the papers yet.

I'm personally wondering if you could increase the reliability of training on web data by labeling it with where each document came from, so it knows different authors disagree on things. But this brings back the issue where people don't like it if a model can write "in the style of Author Name"…


Might be a worthwhile tradeoff. Less efficient model but a clear control interface, versus more efficient model but it's a blackbox.


On the idea of interpreting the weights, I've been very interested if it's possible to compute basis vectors of the weights matrix to define the core concepts within the model and then do a change of basis to allow reorganizing the model to more human understood concepts?

I think the inherent compression of a specific training set into a matrix makes this more difficult cause the basis vectors likely won't contain clean representations of human ideas, but I also wonder if starting a new training set with an initialized (or fixed) matrix of human defined concepts would help align the model's weights to something that can be interpretable


The attention mechanisms present in transformers don’t seem easy to map to semantics that humans can understand. There are too many parameters involved


I always wanted to at least have a shallow understanding of Transformers but the paper was way too technical for me.

This really helped me understand how they work! Or at least I understood your example, it was very clear. And I also got to brush up my matrix stuff from uni lol.

Thanks!


It's some kind of abstract machine, like a turing machine or the machine that parses regexes, isn't it?


This is simplified a bit - It's just a "machine" that maps [set of inputs] -> [set of probabilities of the next output]

First you define a list of tokens - let's say 24 letters because that's easier.

They are a machine that takes an input sequence of tokens, does a deterministic series of matrix operations, and outputs what is a list of the probability of every token.

"learning" is just the process of setting some of the numbers inside of a matrix(s) used for some of the operations.

Notice that there's only a single "if" statement in their final code, and it's for evaluating the result's accuracy. All of the "logic" is from the result of these matrix operations.
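
A toy sketch of that "tokens in, probabilities out" mapping, using only the Python standard library. The three-letter vocabulary, the hand-picked matrix `W` and the function names are assumptions made for illustration. This toy also looks only at the last token, whereas a real transformer mixes information across the whole sequence with attention.

```python
import math

VOCAB = ["a", "b", "c"]

# Hand-picked (not trained) weights. Row i holds the score that current
# token i gives to each candidate next token: a -> b, b -> c, c -> a.
W = [
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [5.0, 0.0, 0.0],
]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(tokens):
    """Map a token sequence to a probability for every vocabulary entry."""
    last = tokens[-1]
    one_hot = [1.0 if t == last else 0.0 for t in VOCAB]
    # Row vector times matrix: the "logic" is just multiply and add.
    logits = [
        sum(x * row[j] for x, row in zip(one_hot, W))
        for j in range(len(VOCAB))
    ]
    return softmax(logits)

probs = next_token_probs(["a", "b"])
print(dict(zip(VOCAB, probs)))
print("prediction:", VOCAB[probs.index(max(probs))])  # prediction: c
```

"Learning" in this picture would mean finding the numbers in `W` by optimization instead of typing them in.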


It's kind of hard to interpret these things as "automata" in the sense that one might usually think of them.

Everything is usually a little fuzzy in a neural network. There's rarely anything like an if/else statement, although (as in the transformer example) you have some cases of "masking" values with 0 or -∞. The output is almost always fuzzy as well, being a collection of scores or probabilities. For example, a model that distinguishes cat pictures and dog pictures might emit a result like "dog:0.95 cat:0.05", and we say that it predicted a dog because the dog score is higher than the cat score.

In fact, the core of the transformer, the attention mechanism, is based on a kind of "soft lookup" operation. In a non-fuzzy system, you might want to do something like loop through each token in the sequence, check if that token is relevant to the current token, and take some action if it's relevant. But in a transformer, relevance is not a binary decision. Instead, the attention mechanism computes a continuous relevance score between each pair of tokens in the sequence, and uses those scores to take further action.

But some things are not easily generalized directly from a system based on binary decisions. For example, those relevance scores are used as weights to compute a weighted average over the tokens in the sequence, and thereby obtain an "average token" for the current position in the sequence. I don't think there's an easy way to interpret this as an extension of some process based on branching logic.
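
The "soft lookup" and the -∞ masking can be sketched in a few lines of plain Python. This is a minimal single-head version with hypothetical names (`causal_attention`, `x`). It feeds the same vectors in as queries, keys and values, and skips the learned projections a real transformer applies first.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Continuous relevance score of position j for position i.
        # Future positions (j > i) are masked with -inf instead of an if/else
        # on the result: after softmax their weight is simply 0.
        scores = [
            dot(q, k) / math.sqrt(d) if j <= i else -math.inf
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors: the "average token".
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention(x, x, x):
    print(row)
# The first row is exactly [1.0, 0.0]: position 0 can only attend to itself.
```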


It’s just any pile of linear algebra that’s been in contact with the AllSpark, right?


Yes! Check out this paper describing how Linear Transformers are secretly Fast Weight Programmers: https://arxiv.org/abs/2102.11174.


Jürgen Schmidhuber reminds me of Richard Feynman in one very specific way: while everybody else in their respective fields uses math to hide the deep insights they've been mining for publications, Schmidhuber and Feynman just simply tell you the big insight, and then proceed to refine and illuminate it with math.


Neural networks are Turing machines. You can make them perform any computation by carefully setting up their weights. It would be nice to have compilers for them that were not based on approximation, though.


> It would be nice to have compilers for them that were not based on approximation, though.

Could you elaborate?


People typically set the weights of a neural network using heuristic approximation algorithms, by looking at a large set of example inputs/outputs and trying to find weights that perform the needed computation as accurately as possible. This approximation process is called training. But this approximation happens because nobody really knows how to set the weights otherwise. It would be nice if we had "compilers" for neural networks, where you write an algorithm in a programming language, and you get a neural network (architecture+weights) that performs the same computation.

TFA is a beautiful step in that direction. What I want is an automated way to do this, without having to hire vgel every time.
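
As a toy illustration of setting weights by hand rather than by training, the sketch below hard-codes a two-layer ReLU network that computes XOR. The weights are picked for this example and are not taken from the article. Compiling arbitrary programs into a transformer is far more involved than this.

```python
def relu(x):
    return max(0.0, x)

def xor_net(a, b):
    """Two hidden ReLU units and one linear output, all weights hand-set."""
    # Hidden layer: weights (1, 1) for both units, biases 0 and -1.
    h1 = relu(1.0 * a + 1.0 * b + 0.0)
    h2 = relu(1.0 * a + 1.0 * b - 1.0)
    # Output layer: weights (1, -2), bias 0.
    return 1.0 * h1 - 2.0 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Here the "compiler" was a person reasoning about two inputs. The wish expressed above is for a tool that does this translation automatically for larger programs.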


A Turing-complete system isn't necessarily useful; it just means that it's equivalent to a Turing machine. The ability to describe any possible algorithm is not that powerful in itself.

As an example, algebraic type systems are often TC simply because general recursion is allowed.

Feed forward networks are effectively DAGs and while you may be able to express any algorithms using them they are also piecewise linear with respect to inputs.

Statistical learning is powerful in finding and matching patterns, but graph rewriting, which is what you're doing with initial random weights and training, is not trivial.

More importantly it doesn't make issues like the halting problem decidable.

I don't see why the same limits in graph rewriting languages which were explored in the 90s won't hit using feed forward networks as computation systems outside of the application of nation-state scale computing power.

But I am open to understanding where I am wrong.


Short version of this without the caveats: It's not even Turing complete.

I review a few papers on the topic here: https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/


The point of training is to create computer programs through optimization, because there are many problems (like understanding language) that we just don't know how to write programs to do.

It's not that we don't know how to set the weights - neural networks are only designed with weights because it makes them easy to optimize.

There is no reason to use them if you plan to write your own code for them. You won't be able to do anything that you couldn't do in a normal programming language, because what makes NNs special is the training process.


Why would you do that when it's better to do the opposite? Given a model, quantize it and compile it to direct code objects that do the same thing much much much faster?

The generality of the approach [NNs] implies that they are effectively a union of all programs that may be represented, and as such there needs to be the capacity for that, this capacity is in size, which makes them wasteful for exact solutions.

it is fairly trivial to create FFNNs that behave as decision trees using just relus if you can encode your problem as a continuous problem with a finite set of inputs. Then you can very well say that this decision tree is, well, a program, and there you have it.

The actual problem is the encoding, which is why NNs are so powerful, that is, they learn the encodings themselves through grad descent and variants.
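
A small sketch of the ReLU-as-decision-tree idea, assuming integer inputs so that the thresholds come out exact. The tree, its constants, and the names `gate`, `tree_net` and `tree_reference` are invented for illustration.

```python
def relu(x):
    return max(0.0, x)

def gate(x, threshold):
    """1.0 if x > threshold else 0.0, exact for integer x and threshold."""
    return relu(x - threshold) - relu(x - threshold - 1)

def tree_net(x, y):
    """ReLU network equivalent to the tree in tree_reference below."""
    g1 = gate(x, 2)                 # first layer: x > 2 ?
    g2 = gate(y, 5)                 # first layer: y > 5 ?
    both = relu(g1 + g2 - 1.0)      # second layer: logical AND of the gates
    return 10.0 + 10.0 * g1 + 10.0 * both   # linear read-out of the leaf value

def tree_reference(x, y):
    if x > 2:
        return 30.0 if y > 5 else 20.0
    return 10.0

# The network and the branching code agree on every point of a small grid.
print(all(tree_net(x, y) == tree_reference(x, y)
          for x in range(10) for y in range(10)))
```

The encoding is doing the real work here: the thresholds only behave like hard branches because the inputs were assumed to be integers.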


That makes no entropy or cybernetic sense at all. You would just get a neural network that outputs the exact formula, or algo. Like, if you would just do a sine it would be a Taylor series encoded into neurons.

It's like going from computing PI as a constant to computing it as a gigantic float.

You lose info


Something close exists:

RASP https://arxiv.org/abs/2106.06981

python implementation: https://srush.github.io/raspy/


The whole point is that we don’t know what rules to encode and we want the system to derive the weights for itself as part of the training process. We have compilers that can do deterministic code - that’s the normal approach.


Would I be able to solve the Travelling Salesman Problem with a Transformer with the appropriately assigned weights? That would be an achievement. You'd beat some known bounds of the complexity of TSP.


We don't know.

And even if we could try to solve this problem, there is no known way of verifying the solution would be correct in general.


TSPs are not unsolvable.

My point was that Transformers and neural networks as they are now are not Turing machines if you don't allow for the model to grow with the input size. That said, it has to grow in "depth" not just parameters. The fact that people think fixed depth computation can universally compute everything is a worrying trend.


> maybe even feel inspired to make your own model by hand as well!

Other than a learning exercise to satisfy your curiosity, what are you doing with this? I'm starting to get the feeling that anything complex with ML models is unreasonable for an at-home blog reader?


In nanoGPT, you pre-train a model on Shakespeare, and in 3 minutes it gets to a Lewis Carroll's Jabberwocky level of fidelity on the source material. It makes up lots of plausible-seeming old English words, learns the basics of English grammar, the layout of the plays, etc. I was pretty amazed that it got that good in such a short period of time.

I think locally training a bunch of models to the fidelity of Shakespeare-from-Wish.com might tell you when you've hit on a winning architecture, and when to try scaling up.


Author states in the first paragraph of their blog post:

"I've been wanting to understand transformers and attention better for awhile now—I'd read The Illustrated Transformer, but still didn't feel like I had an intuitive understanding of what the various pieces of attention were doing. What's the difference between q and k? And don't even get me started on v!"


lol three people said the same thing, i get that. thats why i said other than learning and satisfying curiosity...


It's an excellent learning exercise, not just to satisfy curiosity but to develop and deepen understanding.


I dunno, maybe they actually enjoy hacking on projects like this? Weird I know.


Darn it Theia: I can't stop playing your hoverator game!


[stub for sweeping offtopicness under the rug]


i thought, here is a guy building an electrical transformer by hand, but no, more AI shite


same lol but this post actually looks rly good


Minor request: can we have 'neural network' or something in the title? This is related to the machine learning 'transformer' architecture, rather than the bundle of coils that couples two circuits electromagnetically.


And not about "alien robots who can disguise themselves by transforming into everyday machinery, primarily vehicles" [1] either. :)

[1] https://en.wikipedia.org/wiki/Transformers_(film)


"Transformers: More than meets the eye!"

That was my second thought. My first thought was coils of wire wrapping iron cores done by someone who does not know what they're doing.


Ha, I didn't think of that. But if it was, that might be more impressive than either coiling wire or writing Python code :)


You might appreciate this then https://m.youtube.com/watch?v=uFmV0Xxae18


"no training" gave it away for me.


I misinterpreted that as "I have no training; I am a beginner" rather than "I am not going to train this thing" :)


That's how I interpreted it as well.

"Electrical transformer without any formal training in electric engineering" is basically how the title read to me. Followed by a lot of confusion when the article was not that.

Good on him, it's just... an odd way to phrase it :)


Heh, I went a step further and thought "ooh, a novice EE project, but titling it like that everyone's going to be confused about it involving machine learning" until I realized I was the one that got it backwards...


I didn't even consider that it COULD be about LLMs until I saw the comment above.


Thought he was a novice and had no training in the field.


I also thought the physical transformer


Funny how even though I studied EE, it didn't cross my mind it could be an electrical transformer.


It's always mildly annoying when different technologies have the same name or acronym


Somehow I expected he made it out of mechanical parts. Like literally “by hand”.


Nitpick: the author is female


[flagged]


I don't think that is the prevailing opinion among trans folks.


[flagged]


I don't think they undergo hormonal treatment just to transition their gender.


Well, one comment at the root was flagged and the thread was deleted, so hopefully I can reply to you here, plus you were starting to reply to another guy you thought was me.

My answer was: "Gender-affirming care and gender transition (as the results on Google will call it if you accidentally search "sex transition"), none of this changes your sex, most mammals (including humans) have the XY sex determination system, taking estrogens or other hormones doesn't make you lose your Y and gain an extra X, nor does it make you start producing ova, the only difference is how you look and sound, which (hopefully) makes you more comfortable with your gender, and that's why it's called "gender-affirming care" and "gender transition"."


I don't understand the focus on genes that some people seem to be placing. After the gonads are formed the Y chromosome does not really matter (unless you have a particularly nasty mutation). All Y does is carry the SRY gene. If you magically managed to lose your Y chromosome, would that make you a woman? No! You would stay a man and literally nothing would change in your life. Same goes if you have two X chromosomes and one magically gets replaced with a Y.

I am also worried about the focus on genes given the existence of intersex people. It's especially weird since sex categories existed long before the creation of modern biology and at least for some people the term sex got redefined to exclude certain men and women who often don't even realize it.

The term gender affirming care seems accurate to me, as part of affirming their gender, trans people (who used to be called transsexual) attempt to transition so that their sex matches their gender identity. On the other hand, the term gender transition seems to include both sex changes and social transition.

I should also mention that estrogen does not change how you sound; it does, however, cause other changes not related to one's appearance.


XY sex-determination system is based on the fact that the chromosomes one has at birth cannot change naturally throughout their lifetime, so magical or hypothetical scenarios where chromosomes can be altered doesn't align with our current scientific understanding, but if it makes you happier, sure, in a magic world humans would be able to change their sex.

Also, I don't see how the existence of intersex people should be somewhat problematic in this regard, intersex is a congenital condition, like people who are born with six fingers, that doesn't make the fact that humans have 5 fingers controversial.

To be honest I think I'm kind of stating the obvious and most people (trans people included obv) would agree with me, which is why they use "gender affirming care", "gender transition" and "transgender" (instead of transsexual which is in fact seen as an outdated term). If you don't care about true biological data (plural of datum), but only the opinions of trans people, I can ask my trans friends (one nonbinary and two trans women) what they think about it, even though I actually know them well, and they would agree with me.


Please help me understand this. You are in fact claiming that if somehow someone lost their Y chromosome that would make him a woman despite the lack of any physical or mental changes? I think you have a radically different view of sex compared to most people if so as it is fundamentally incompatible with our social structure.

The claim that most humans have 5 fingers or that most men have XY chromosomes is not controversial. However the claim that you must have 5 fingers to be a human is! This is what the XY sex-determination system is based on, they noticed that most men have XY chromosomes and that most women have XX chromosomes and declared intersex men and women who do not match this model to actually be of the other sex by making an inaccurate map from genes to sex (which is perceived socially and existed before said model).

I believe that the above 2 arguments together are enough to show that it is not an accurate nor a useful sex determination system from a logical perspective.

I think people nowadays avoid the term transsexual because it might be confused with a sexual orientation. I also feel like some people consider the term sex to be vulgar, which is why even those who consider gender=sex often prefer the term gender even when it does not refer to the grammatical gender. Finally the term transgender is often used by some in order to be inclusive to people who are not medically transitioning.

You are free to show this thread to your friends if you want. I am sure there are trans people in both camps of the argument.


I think I was very clear when I said: "XY sex-determination system is based on the fact that the chromosomes one has at birth cannot change naturally throughout their lifetime, so magical or hypothetical scenarios where chromosomes can be altered doesn't align with our current scientific understanding", if people manages to change their chromosomes I guess the sex-determination system would probably change but we don't live in that magical world so things are much easier (yeah!)

>However the claim that you must have 5 fingers to be a human is! This is what the XY sex-determination system is based on

No idea what you're talking about ahah, people with a congenital condition are still human and no one said otherwise.


[flagged]


Gender-affirming care, also known as transition (transition what? Their sex!), is done in order to make trans people more comfortable in society, in their inner self (hormones affect mood, behavior, feelings, etc) and in their body by changing their sex. This is why they are prescribed sex hormones as these are what guides gene expression and decide sex-dependent characteristics. While some trans people would be content with just changing their physical appearance most of them seem to actually aim to change their sex.


Gender-affirming care and gender transition (as the results on Google will call it if you accidentally search "sex transition"), none of this changes your sex, most mammals (including humans) have the XY sex determination system, taking estrogens or other hormones doesn't make you lose your Y and gain an extra X, nor does it make you start producing ova, the only difference is how you look and sound, which (hopefully) makes you more comfortable with your gender, and that's why it's called "gender-affirming care" and "gender transition".


[dead]


We are not discussing whether sex is a spectrum or not though. Was the nonbinary person that you talked with undergoing medical transition? If not this might explain their view.


The wider point made in that piece is, it's impossible to change sex. For humans and other mammals at least.


Let's agree to disagree. Out of curiosity, did you adopt this belief before or after talking with your friend?


(you are talking with another person, not me)


[flagged]


This isn't a color comment. It's a small correction to a factual error. No culture war to be had here.


[flagged]


Why are people allowed to post this hateful culture war shit on this website?



_clenches fists_ “Replying is against the rules!!”


They aren't being mean. They are just nitting that s/he/she/g


What the hell are you talking about?


me too, i was hoping to see a flyback transformer wound with a power drill or something



