I wrote a home-brew neural network around 2006, just to see what would happen. I'd read no papers on it, and just kind of made it up as I went along. The result was basically a cube of "neurons" which had stronger and weaker trigger points to their neighbors and would propagate "spark" to one or more neighbors based on the strength and direction of spark they got from their other neighbors. Each of the connections and weights had a "certainty" attached to it which would go up when the output was closer to the ideal and would go down when the output was garbage. When those certainty values hit zero, the weights for that neuron would be re-randomized. I quickly realized that the only thing this could really do was try to transform one 10x10 pixel bitmap into a different one. But it was fascinating that it actually seemed to "learn" patterns as they got baked in.
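Roughly, the certainty update worked something like this (a from-memory sketch in Python rather than the original code, with made-up names and numbers):

```python
import random

# From-memory sketch of the scheme described above, not the actual code.
# Each "neuron" keeps weights to its neighbors plus one certainty score.
# Good outputs nudge certainty up, garbage nudges it down, and when it
# hits zero the weights get thrown away and re-randomized.

class Neuron:
    def __init__(self, num_neighbors):
        self.weights = [random.uniform(-1.0, 1.0) for _ in range(num_neighbors)]
        self.certainty = 1.0

    def update_certainty(self, error, step=0.05):
        # error: 0.0 means the output matched the ideal, 1.0 means garbage
        if error < 0.5:
            self.certainty = min(1.0, self.certainty + step)
        else:
            self.certainty = max(0.0, self.certainty - step)
        if self.certainty == 0.0:
            # certainty exhausted: re-randomize this neuron's weights
            self.weights = [random.uniform(-1.0, 1.0) for _ in range(len(self.weights))]
            self.certainty = 0.5  # give the fresh weights room to prove themselves
```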
What would be called "attention" now, though, basically didn't exist in that system. Once you started training it on something new, it lost everything.
For anyone wondering, it was written in ActionScript 3, and ridiculously, each neuron was bound to a display class that rendered it as a semitransparent cube, which lit up as the inputs propagated through it. A thoroughly ridiculous side project.
But other than scaling that from 1000 neurons to billions, I'm curious what has changed about the concepts of pathing or tolerance to make these models better? Maybe my concept of the principle behind modern LLMs is too archaic or rooted in a cartoon understanding of our own wetware that I tried to reproduce.
[edit: I'm describing an ancient home project... for anyone downvoting this, I'm more than receptive to hearing your reasons why it's stupid. I'm the first to admit it seems stupid!]
> But other than scaling that from 1000 neurons to billions, I'm curious what has changed about the concepts of pathing or tolerance to make these models better? Maybe my concept of the principle behind modern LLMs is too archaic or rooted in a cartoon understanding of our own wetware that I tried to reproduce.
Realistically... the problem is a data problem. The math is mostly there. Transformers et al. would have been figured out in no time had we had the sort of data tools we have today. I'm talking about the ease with which you can take gigabytes of data, throw it into S3, and analyze it in minutes.
That, combined with cheap and accessible compute (cloud) and the maturation of CUDA, meant it was all a matter of time before this took place.
I'd need to go back and look at the code, but it was something primitive but similar to backprop. There was an evaluation routine that compared what lit up on the back of the cube to the desired output. The farther off it was, the more the certainty got docked from any of the [connected] neurons one layer up from the bad pixel, and that dragged down the certainties on the next layer, and so on. It didn't have the ability to back-check which neurons were involved in which specific output node. But I guess it was a scattershot attempt at that.
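So roughly (again a from-memory sketch, not the real code, and the parameter names are invented), the blame pass walked backward from the output pixels like this:

```python
# From-memory sketch of the blame pass, not the real code. An error at an
# output pixel docks the certainty of the neurons one layer up, then a diluted
# average of the error rolls on to the layer behind that, and so on. Nothing
# tracks which neuron caused which bad pixel -- hence "scattershot".

def dock_certainties(certainties, output_errors, penalty=0.1, decay=0.5):
    """certainties: layers of per-neuron certainty values in [0, 1],
    input layer first, output layer last.
    output_errors: one error value in [0, 1] per output-layer neuron."""
    blame = list(output_errors)
    for i in range(len(certainties) - 1, -1, -1):
        layer = certainties[i]
        for j, err in enumerate(blame):
            # dock certainty in proportion to how far off the output was
            layer[j] = max(0.0, layer[j] - penalty * err)
        if i > 0:
            # the layer behind just inherits a diluted average of the error
            avg_err = sum(blame) / len(blame)
            blame = [avg_err * decay] * len(certainties[i - 1])
    return certainties
```

Actual backprop uses the chain rule to assign blame per connection, which is the part this never had.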