I learned about neural nets in roughly the late 80s and early 90s and thought they would transform science. I struggled to understand backpropagation and gradient descent. My undergrad work was training the weights of a linear model using gradient descent to predict gene-encoding regions. By the time I got to grad school in '95, a very small number of people were using MLPs to predict protein secondary structure; they had reached about 75% accuracy and gotten stuck. I was told, at the time, that there wasn't enough data or enough CPU cycles to train good networks, so why bother?
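To be concrete, that undergrad exercise amounted to something roughly like this, a minimal NumPy sketch (illustrative only, not the original code; X and y stand in for the sequence features and labels):

    import numpy as np

    def train_linear_model(X, y, lr=0.01, epochs=1000):
        """Fit w and b to minimize mean squared error with batch gradient descent."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0.0
        for _ in range(epochs):
            pred = X @ w + b                           # linear prediction
            err = pred - y                             # residuals
            grad_w = (2.0 / n_samples) * (X.T @ err)   # dL/dw for MSE
            grad_b = (2.0 / n_samples) * err.sum()     # dL/db
            w -= lr * grad_w                           # step downhill
            b -= lr * grad_b
        return w, b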
A few years later SpamAssassin came out and I tried to use it to train a spam classifier, without any luck. I also tried my hand at a few protein structure SVM classifiers (failing miserably; I didn't understand that it was critical to have a balanced set of positive and negative training examples).
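The class-imbalance issue is handled routinely nowadays; for example, something like this sketch with scikit-learn (the dataset here is synthetic and the numbers are made up, just to show the idea):

    import numpy as np
    from sklearn.svm import SVC

    # Synthetic stand-in for an imbalanced dataset:
    # 950 negative examples, 50 positive ones.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = np.concatenate([np.zeros(950), np.ones(50)])

    # Without reweighting, an SVM can hit ~95% accuracy by always predicting
    # the majority class; class_weight="balanced" scales each class inversely
    # to its frequency so the minority class still influences the fit.
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(X, y)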
A few years after that I landed at Google (around 2007), and very few people there were doing machine learning outside of ads and search, and the work that was being done was far from what I knew (mainly supervised training with SGD on batched data). Eventually Google adopted the paradigm I enjoy: synchronized SGD using allreduce.
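For anyone unfamiliar with that paradigm: each worker computes gradients on its own shard of the batch, the gradients are summed across all workers with an allreduce, and every replica applies the identical update. A minimal sketch with torch.distributed (assumes the process group is already initialized elsewhere; model, loss_fn, batch and target are placeholders):

    import torch
    import torch.distributed as dist

    def sync_sgd_step(model, loss_fn, batch, target, lr=0.01):
        """One step of synchronized data-parallel SGD via allreduce."""
        model.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        world_size = dist.get_world_size()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    continue
                # Sum this worker's gradient with everyone else's, then average,
                # so all replicas take exactly the same step.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
                p -= lr * p.grad
        return loss.item()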
Nowadays, we have ample CPU and data to train amazing models (with the most interesting work being around large language models). It took a lot longer to get there than I expected.
I am sympathetic to Gates' position to this day. Some thoughts:
If you gave someone in 1994 the GPT-3 code and dataset, it would be impossible for them to train it, and very difficult for them to even run it if we trained it for them. AI algorithms may not be limited by "what we tell them", but they ARE limited by the hardware we run them on.
NN models do converge to something (both in the sense of the regression converging, and in the sense that adding more nodes eventually stops improving performance at a given task). I suspect but cannot prove that in most cases what it converges to could be expressed more concisely and efficiently as something other than a NN. (I.e., that NNs can approximate any function does not imply they can do so efficiently.)
So at the end of the day, the programmer needs to understand the algorithm well enough to know whether a NN-based implementation of it would achieve sufficient performance on available hardware. If the answer is no, then the programmer still has to come up with something on their own.
26 years ago... How common was this sentiment back then? How many had serious faith in the neural network approach, or at least were very excited about its potential?
Also, what are some current technologies and ideas that are unpopular but which have some dedicated fans who insist there's major untapped potential (even if maybe it can't be fully tapped into today)? Specifically, little-known things rather than pretty popular stuff like quantum computing.
My dad was a welder and HVAC technician by day and a bit of an intellectual by night. He was all over expert systems in the late 80’s. I can distinctly remember him excitedly explaining them to me and I just couldn’t fathom what was interesting about it at all. Then in the early 90’s he started pivoting to neural networks.
Then he got killed at work.
I didn’t hear about neural networks again for close to 20 years, but I can attest at least someone was stoked about them back then.
Thank you. It's one of those things where I loved the guy but it's taken a while to really unpack the layers he was operating at. I'm older now than he was when he passed and I marvel at the diversity of stuff he got into.
I got one of the earliest PhDs in neural networks (in 1992!), and my dissertation was about improving backpropagation. Even though I actually saw how well it worked, I somehow had no faith in the technology, as it seemed like a toy due to the hopelessly slow processors we had at the time. I did implement backpropagation on a Connection Machine CM2 but found it too complex and ultimately wound up running my code on a bunch of IBM RS 6000 workstations instead.
You know this, but among others Werbos (dissertation 1974, the one you were building on) would probably raise an eyebrow at your "earliest" assertion :)
NNs have been cyclical for decades now and I expect they'll cycle again.
You are right. I got caught in the wave of excitement after the Rumelhart, McClelland and Hinton PDP book chapter of 1986. But as you say, there were a lot of things that had already been done in NN going back to perceptrons. My first exposure to neural networks was in 1987, in Byte magazine of all places. There was a great article by Bart Kosko on associative memories.
When I graduated, there were absolutely no jobs in machine learning and I used to get blank looks when I told people about my work. Fortunately, I was able to get a job in the systems field and worked there for many years. I was able to get back into machine learning about 8 years ago and am semi-retired now.
I'm not complaining as I was always well paid in my career. But sometimes, being too early in a field can be as bad as being too late :-)
Over the years, I have acquired a fair number of citations for my papers, which were mostly published in IEEE Transactions on Neural Networks. But I am certain that my work did not inspire any magazine articles - the field was considered too far away from mainstream CS at the time. There was even grumbling from some faculty in my CS department that my dissertation should not have been in CS but rather in statistics :-)
I am still getting about a new citation every month for one of my papers that was published 27 years ago!
I was in college and interested in neural nets when this was written, and as it happens later graduated and worked with the author of this post. In academia at the time, I believe neural nets were seen as a fading trend, something that was a cute idea but wasn’t destined to work out because they didn’t work well enough and even if they did we didn’t have the compute for anything useful. Personally, I was fascinated, and I know several other people who kept hope as well, one who is now quite a well known AI / neural nets researcher. I didn’t pursue neural nets personally, but I did keep using related genetic algorithms and artificial evolution, which I think was viewed with the same mix of skepticism generally, with pockets of enthusiasts. Gates in the story was probably sharing what he’d been hearing from academics. Lawrence (author of the post) is an incredibly smart person, and I have no doubt was able to see past what many of the researchers were saying.
BTW, I recently happened across a paper from 1967 (53 years ago) that mentions neural networks in passing as if it was a popular idea at the time - the original paper on the Medial Axis Transform! Scroll to the 2nd page (“364”) near the top: “Consider a continuous isotropic plane (an idealization of an active granular material or a rudimentary neural net) that has the following properties at each point”
My classmate wrote back-propagating neural nets on his Amiga 1200 in 1994.
We had pretty high hopes back then. I, and I think many others, sort of assumed we would have had General AI by now. Or at least be going in that direction. Another classmate said in 1992 that scientists had simulated the neural net equivalent of a worm or something. This topic came up re the neural net featured in the movie Terminator 2.
The current use of "neural nets" is both under- and overwhelming. But definitely overwhelmingly boring, to me. It's good, don't get me wrong. But from my viewpoint, it looks like we have found a trick, and now try to apply this same trick to as many new fields as possible.
> now try to apply this same trick to as many new fields as possible.
Right, this is normal and expected. While scientific growth looks like an exponential curve, the lateral application path is equally important to maintaining that curve!
We may get to AGI one day (I have my doubts), but I have to remind myself that all these clunky things we see today are v1 -- and virtually everything can improve!
It is normal, but it's kind of disappointing that from my (uneducated) vantage point, we don't seem to be at all closer to AGI than we were in 1994.
It's like the difference between Kitty Hawk and a Boeing airliner. Yes, they are impressive feats of engineering, and the massive training hours tell the story.
But the Boeing airliner is still less clever than a fruit fly, to merge the metaphors.
Edit:
to expand: it seems that getting something as smart as a dog is almost as hard as getting something as smart as a human:
1994 neural net
2020 GPT-3
.
.
. (unknown number of dots)
.
.
dog
human
In the early 90's I worked in a research lab that studied cortical EEG data. I experimented with using neural nets to categorize brain activity, a task where statistical methods had already shown some success. The professor I worked for felt that neural networks would not perform better than the statistical methods. Neural nets could be trained on the data and make predictions on the held-out set, but no better than the statistical method. So I didn't pursue it.
The funny thing for people on this forum to think about is how weak the computing power available to me was at that time. As I recall, I did this whole experiment in C (using the TLearn libraries) on a Masscomp workstation. The cool thing about the Masscomp was that it had a vector processing board that could do some parallel computations. Well, I didn't use that, but the lab had the hardware. So I was doing all of this on a Motorola 68000 with less RAM and way less CPU than your watch.
> How many had serious faith in the neural network approach, or at least were very excited about its potential?
There were at least some economists that had hope back then. Here's a paper my advisor published in 1995, but that he had written several years earlier for his dissertation:
In '03 I was desperately trying to write neural net algorithms for the BeOS, but I had no idea how to train them (much less implement backprop).
I think distributed systems research is enabling personal, user-level distributed networks (think Mastodon, but to the point where a lay person can own their own node without much trouble). But will the economies of scale for commodity goods like motherboards make sense against behemoths buying up lots of hard drives? Who knows.
In 1984 Teuvo Kohonen told us exactly how neural nets work. Albeit in Finnish. In this video he has a matrix of "learning vectors" which changes as the net learns. "Meta-learning" happens when several learning matrices are stacked on top of each other. https://youtu.be/Qy3h7kT3P5I?t=2481
I worked with an intern back in 2015 who had built a Farnsworth fusor in his basement during high school. That is of course cool by itself, but a while later it occurred to me that he probably made one of the most durable products ever assembled by humans. Those little helium atoms are going to be floating about the earth probably until it’s gobbled up by the sun, and then what?
The helium atoms last a long time, but they probably won't still be on Earth on those kinds of timescales; helium and hydrogen have a tendency to "leak" out of our atmosphere and into space.
I'd gotten a copy of Patrick Henry Winston's "Artificial Intelligence" 2nd edition around 1990. The subject matter was fascinating to me but too far over my head for me to do anything practical. I picked it up a few years later and noodled around w/ neural networks a little bit but it never "clicked".
1994 is around the time I read Steven Levy's "Artificial Life"[1]. I definitely had the excitement the author has about genetic algorithms and classifier systems. They seemed much more approachable than neural networks, too. They appealed to the assembler programmer in me, I guess.
I'll echo what others are saying here. Brute force compute has opened the door for so many possibilities. Evolution has had so much more "CPU" and "compute substrate" to play with than we can possibly imagine. I'm unsettled by the idea that we're training models we don't actually understand. I'm also not dismissive of the possibilities that they open up, though.
It really makes me appreciate how “brute force”-like NNs are, especially back when MLPs were the cool kids. Even today, they are seriously compute constrained and memory constrained. Bigger actually is better. There’s all this debate about how things should work and ideas going in and out of fashion, but once chips are strong enough, a lot of that doesn’t matter.
I’d bet this holds in all sorts of fields. Somebody has a great idea that really needs good steel, and they’re unfashionable (“that’s dumb! It’ll never work (today)!”) until we invent a better steel process. Or ideas that just took a microscope to verify. So many cycles of debate and orthogonal progress rendered unnecessary when a relatively unrelated change happens.
Heh, the URL is "plan" because this was a .plan, a file in your home directory that others could see using the "finger" utility. I remember having flame wars entirely over .plans. It was like a super primitive Twitter.
Lawrence was referring to a class of optimization algorithm that does not adapt to the data. In the terminology of neural networks, it is only inference, not training. It only samples the function stochastically, and tries to walk downhill, but there is no feedback loop to improve the speed of the walk, there is only knowledge of how good the samples are.
There are improvements within the GA field that try to approximate gradients or do other trickery to create a "feedback loop to improve the speed of the walk" while still being fundamentally not much more than the traditional evolution loop. This is mostly nice in fields like Reinforcement Learning where computing gradients can be difficult.
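For anyone who hasn't seen it, the traditional evolution loop being referred to looks roughly like this (a generic sketch, not tied to any particular GA library; fitness is whatever black-box score you care about):

    import random

    def evolve(fitness, dim, pop_size=50, generations=200, sigma=0.1):
        """Plain mutate-and-select loop: no gradients, only fitness scores."""
        pop = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
        for _ in range(generations):
            # Scoring the candidates is the only feedback the loop ever gets.
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[: pop_size // 2]            # keep the best half
            children = [
                [g + random.gauss(0, sigma) for g in random.choice(parents)]
                for _ in range(pop_size - len(parents))
            ]
            pop = parents + children                     # next generation
        return max(pop, key=fitness)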
GAs can be quite useful in situations where there are tons of very bad local minima/maxima and you want the global minimum/maximum. Unfortunately, neural networks have lots of really good local minima/maxima, and the state space is so large that you'll likely never find the global optimum (and you wouldn't know it if you did). This is why "Neuroevolution" of neural network weights hasn't really caught on.
I think the author is referring to the possibility of using a well defined fitness function rather than a vast array of training data.
One of the beautiful aspects of Evolutionary Computation is how you can use it as a generative mechanism. A fairly simple fitness function can produce complex and intriguing outputs.
> It seems to me that if the programmer understands the algorithm, then the algorithm is limited by the programmer!
Well, a learning algorithm is always going to be limited by something. In AI literature any such something is grouped under "inductive bias", and there is no machine learning algorithm that doesn't incorporate some kind of inductive bias. Bayesian learners have their priors, distance learners have their distance functions, Support Vector Machines in particular have their kernels, and of course neural networks have their intricate architectures, painstakingly hand-engineered and fine-tuned to a particular domain, or even a specific dataset [1]. Indeed, it is probably impossible to have learning without inductive bias [2] [3].
The opinion in the short piece above is representative of a current trend in machine learning of taking "the human out of the loop", which loosely translates into trying to learn everything end-to-end, only from examples, while pretending that no attempt is made at any point to guide the learner to find a consistent hypothesis that explains the examples.
Unfortunately, in practice, this ideal remains a fantasy. All the progress achieved with deep learning in the last few years would not have been possible without the discovery (by chance or concerted effort) of good biases that are conducive to learning in specific domains, e.g. convolutional layers for image recognition, or Long Short-Term Memory cells for sequence learning, etc.
And where did these good biases come from? Why, from human programmers. Humans ourselves most likely come equipped with very strong, very useful biases. We've used those biases and our ability to generalise to come up with powerful abstractions, such as the laws of physics or mathematics. It took us literally thousands of years to amass this fortune of knowledge (possibly millions, during our evolution).
Why would we not use those finely-honed biases of ours and all the knowledge we've collected to kickstart a new form of intelligence? After all, when we want fire, we don't sit around waiting for lightning to strike a tree anymore. We can start fires on our own. Because of our intelligence, because to build upon prior knowledge is intelligent.
__________________
[1] See for example the Neural Network Zoo, a collection of neural net architectures:
> Why would we not use those finely honed biases of ours and all the knowledge we've collected to kickstart a new form of intelligence?
Because priors are hard to embed in a model. For example, CNNs are great for translation invariance, but rotation and scaling don't come out of the box. Why don't they simply add the rotation invariance to the model? Because it's hard to express.
Also, human priors are limited. If AlphaGo had been limited to human priors, it would never have surpassed us.
The best approach so far is to make a network as free from priors as possible (like the transformers) and let it learn from the data, in essence let it rediscover the convolution or other efficient operations from mountains of data.
That's a limitation of neural networks, where inductive biases are very difficult to represent. Other approaches don't have any such problem, e.g. all the other approaches I listed above have well-defined, clean and tidy representations for inductive bias.
As to AlphaGo, this may come as a shock, but AlphaGo was not the first system to surpass humans in anything. It was the first system to surpass humans _in Go_, but e.g. the first computer system to beat a human grandmaster in chess was Deep Blue [1], in 1997, the first computer system to win a world championship against human players was the checkers (draughts) player Chinook [2], in 1994, the first computer system to outperform humans in medical diagnosis was MYCIN in the 1970's, and so on. All those were systems that used strong inductive biases.
And of course, AlphaGo itself was limited by human priors: e.g. the game's legal moves and the board's dimensions and structure were hard-coded into its architecture, and the search over moves was performed by MCTS.
In any case, the lack of good enough priors is not a reason to not use any priors; it's a reason to look for better priors.
Edit: Has any end-to-end approach rediscovered an entire state-of-the-art architecture, like CNNs or LSTMs?
Yes, on self play, but not the supervised model. The supervised model was decent but nowhere near super human. It was used as a starting point for the self play model.
Do you know how your brain works? I mean, in detail. No you do not, and you cannot. Nothing can truly understand itself. Even if the brain is composed of higher-level "computation units" and you can understand how those units are linked, you still don't really know how those black boxes work. And even if you understand how one of those black boxes works, you can't really know the details of how they all work, because if you did you would need a second brain to hold it all.
So, it's fine that we don't understand in detail what every artificial neuron firing means...
I’m personally tired of this ass backwards argument.
Of course it’s possible to understand how brains work. It’s called science. We (collectively) are well on our way to understanding the biochemistry of neurons and brain structures.
Everything exists in the same universe and follows the same physical laws.
Agreed. Even if it's impossible for someone to understand all of it, it's possible for scientists to understand parts of it, make models, and use those models to understand more, and improve them. After many iterations you'll get a full working model that can be used to invent better brains than ours.
That we're so early in this cycle doesn't mean it's impossible.
Let me try to explain another way. Let's say your brain had 10 neurons... Well, then you certainly could understand it. What if it had 1000? Sure... You could probably handle that. OK, but the human brain has 100 billion neurons. Where are you storing that knowledge?
The good news is that there is a lot of evidence that computation in the brain is more modular than that... It's not necessary to understand each neuron... Just like it's not necessary to understand each artificial neuron in a large neural network.
I'm not entirely convinced of this argument myself, that it's impossible for us to "truly" understand how brains work.
However, I entirely agree with the sentiment: We don't necessarily understand brains now, but we don't dismiss their capabilities just because we can't explain precisely how they do what they do. By the same token, complex tools like deep learning shouldn't be dismissed just because we can't satisfactorily interpret them.