
This is a bit lateral, but there is a parallel where Marvin Minsky will most likely be best remembered for dismissing neural networks (a single-layer perceptron can't even handle an XOR!). We are now sufficiently removed from his heyday that I can't really recall anything he did besides the book Perceptrons with Seymour Papert (who went on to do some very interesting work in education). There is a chart out there about ML progress that makes a conjecture about how small the gap is between what we would consider the smartest and dumbest levels of human intelligence (in the grand scheme of information-processing systems). It is a purely qualitative, vibes sort of chart, but it is not unreasonable that even the smartest tenured professors at MIT might not be that much beyond the rest of us.
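For concreteness, the XOR point is easy to check by hand. A minimal numpy sketch (my own toy code, not anything from Perceptrons) showing that no single linear threshold unit handles XOR, while one hidden layer does:

    import itertools
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])                        # XOR truth table
    step = lambda z: (z > 0).astype(int)              # linear threshold unit

    # Brute-force a single unit y = step(w1*x1 + w2*x2 + b): nothing on this
    # grid works (and no weights can, since XOR is not linearly separable).
    grid = np.linspace(-2, 2, 21)
    found = any(np.array_equal(step(X @ np.array([w1, w2]) + b), y)
                for w1, w2, b in itertools.product(grid, repeat=3))
    print("single-layer solution found:", found)      # False

    # Two hidden units (an OR-like and a NAND-like one) feeding an AND unit
    # compute XOR exactly.
    h = step(X @ np.array([[1.0, 1.0], [-1.0, -1.0]]).T + np.array([-0.5, 1.5]))
    print("two-layer output:", step(h @ np.array([1.0, 1.0]) - 1.5))   # [0 1 1 0]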



This dismissal of Minsky misses that Minsky actually had extensive experience with neural nets (starting in the 1950s, with neural nets in hardware) and was, around 1960, probably the most experienced person in the field. Also, in Jan. 1961, he published “Steps Toward Artificial Intelligence” [0], where we not only find a description of gradient descent (then "hill climbing", compare sect. B in “Steps”, as it was still measured towards a success parameter and not against an error function), but also a summary of experiences with it. (Also, the eventual reversal of success into a quantifiable error function may provide some answer to the question of success in statistical models.)

[0] Minsky, Marvin, “Steps Toward Artificial Intelligence”, Proceedings of the IRE, Vol. 49, No. 1, Jan. 1961: https://courses.csail.mit.edu/6.803/pdf/steps.pdf
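To make that terminology concrete, a minimal toy sketch (my own, not from “Steps”): "hill climbing" maximizes a success measure, while gradient descent minimizes an error function, and with error = -success the two take exactly the same steps.

    def success(theta):                  # toy success measure, peaking at theta = 2
        return -(theta - 2.0) ** 2

    def slope(f, x, h=1e-5):             # crude finite-difference derivative
        return (f(x + h) - f(x - h)) / (2 * h)

    theta_up, theta_down, lr = 0.0, 0.0, 0.1
    for _ in range(100):
        theta_up += lr * slope(success, theta_up)                    # climb the success hill
        theta_down -= lr * slope(lambda t: -success(t), theta_down)  # descend the error surface
    print(theta_up, theta_down)          # both approach 2.0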


Gradient descent was invented before Minsky. IMO, Minsky produced some vague writings with no significant practical impact, but this is enough for some people to claim a founder's role for him in the field.


Minsky was actually a pioneer in the field when it came to working with real networks. Compare:

[0] “A Neural-Analogue Calculator Based upon a Probability Model of Reinforcement”, Harvard University Psychological Laboratories, Cambridge, MA, January 8, 1952

[1] “Neural Nets and the Brain Model Problem”, Princeton Ph.D. dissertation, 1954

In comparison, Frank Rosenblatt's Perceptron at Cornell was only built in 1958. Notably, Minsky's SNARC (1951) was the first learning neural network.


> when it came to working with real networks. Compare

My understanding is that no one knows what that SNARC thing was; he built something on the grant, abandoned it shortly after, and only many years later did he and his fanboys start using it as the foundation for bold claims about his role in the field.


Well, his papers are out there to read.


Yes, and I read them: https://dspace.mit.edu/bitstream/handle/1721.1/6103/AIM-048....

A vague essay without specifics.


So you may like this better:

> “Multiple simultaneous optimizers” search for a (local) maximum value of some function E(λ1, …, λn) of several parameters. Each unit Ui independently “jitters” its parameter λi, perhaps randomly, by adding a variation δi(t) to a current mean value μi. The changes in the quantities λi and E are correlated, and the result is used to slowly change μi. The filters are to remove DC components. This technique, a form of coherent detection, usually has an advantage over methods dealing separately and sequentially with each parameter.

(In “Steps”)

:-)
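For what it's worth, a rough toy sketch of the jitter-and-correlate scheme that passage describes (my own reading, in numpy, not Minsky's analog circuit; the running baseline is a crude stand-in for the DC-removing filters):

    import numpy as np

    def jitter_maximize(E, mu, sigma=0.1, lr=0.3, steps=5000, seed=0):
        """Drift the means mu_i toward a (local) maximum of E by correlating
        simultaneous random jitters delta_i(t) with the resulting changes in E."""
        rng = np.random.default_rng(seed)
        mu = np.array(mu, dtype=float)
        baseline = E(mu)                                   # slow estimate of the "DC" level of E
        for _ in range(steps):
            delta = sigma * rng.standard_normal(mu.shape)  # the jitter delta_i(t)
            e = E(mu + delta)
            mu += lr * (e - baseline) * delta              # coherent detection of the slope
            baseline += 0.1 * (e - baseline)               # update the running mean
        return mu

    # Toy check: E has its maximum at (3, -1).
    print(jitter_maximize(lambda p: -(p[0] - 3) ** 2 - (p[1] + 1) ** 2, [0.0, 0.0]))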


Can you provide a link? And what conclusions did you derive from this text, if your interest is a meaningful discussion?


The link has already been provided above (op. cit.); it's directly connected to the very question of gradients, providing a specific implementation (it even comes with a circuit diagram). As you were claiming a lack of detail (but apparently not honoring the provided citation)…

(The earlier you go back in the papers, the more specifics you will find.)


You didn't give me any links.

And what are your conclusions from the citation? Are you claiming again that Minsky invented gradient descent?


For the link and claims, see the very comment you initially answered to.


That claim was answered: Minsky didn't invent gradient descent.


That claim was never made, except by you. The claim was that Minsky had practical experience and wrote about experiences with gradient descent (aka "hill climbing") and problems of locality in a paper published in Jan. 1961.

On the other hand: who invented "hill climbing"? You've contributed nothing to the question you've posed (which was never mine, nor even an implicit part of any claims made).


OK, so Minsky's "pioneering" is his writing about something invented before him. Anything else? :)


Well, who wrote before 1952 about learning networks? I'm not aware that this was already mainstream then. (Rosenblatt's first publication on the Perceptron is from 1957.)

It would be nice if you contributed anything to the questions you are posing: who invented gradient descent / hill climbing, or who can it be attributed to? What substantial work precedes Minsky's writings on their respective subject matter? Why was this already mainstream, or how and where were these experiments already conducted elsewhere (as in "not pioneering")? Where is the prior art to SNARC?


> Well, who wrote before 1952 about learning networks?

The "Steps" paper you referred to is not from 1952.

> Where is the prior art to SNARC?

We don't know what the SNARC was, so we can't say whether there was prior art.

Any other fantasies? :-)


This is ridiculous. Pls. reread the threads; you'll find the answers.

(I really don't care about whatever substantial corpus of research on reinforcement-learning networks in the 1940s, which of course does not exist, you seem to be alluding to without caring to share any of your thoughts. This is really just trolling at this point.)


> you'll find the answers.

I think you perfectly understand that we are in disagreement about this; my point of view is that your "answers" are just fantasies about your idol, without grounding in actual evidence.

What is your goal in this discussion?


Minsky is not my idol. It's just part of reality that Minsky's writings exist, that they contain certain things and were published at certain dates, and that, BTW, Minsky happens to have built the earliest known learning network.


Take the amount of language a blind six-year-old has been exposed to. It is nothing like the scale of these corpora, but they can still develop a rich use of language.

With current models, if you increased the parameter count but gave them a similar amount of data, they would overfit.
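(For concreteness, a toy numpy sketch of that effect, using a polynomial with as many coefficients as data points as a stand-in for an over-parameterized model:)

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 8)
    y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(8)

    coeffs = np.polyfit(x_train, y_train, 7)      # 8 coefficients for 8 points
    rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))

    x_test = np.linspace(0, 1, 200)
    print("train RMSE:", rmse(np.polyval(coeffs, x_train), y_train))                    # ~0: memorized
    print("test RMSE: ", rmse(np.polyval(coeffs, x_test), np.sin(2 * np.pi * x_test)))  # much larger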


It could be because kids are gradually and structurally trained through trial, error, and manual correction, which we somehow don't do with NNs. A kid wouldn't be able to learn language if the only exercise they did was guessing the next word in a sentence.


For me this is a prototypical example of compounded cognitive error colliding with Dunning-Kruger.

We (all of us) are very bad at non-linear reasoning and at reasoning with orders of magnitude, and (by extension) we have no valid intuition about emergent behaviors/properties in complex systems.

In the case of scaled ML this is quite obvious in hindsight. There are many now-classic anecdotes about even those devising contemporary-scale LLMs being surprised and unsettled by what even their first versions were capable of.

As we work away at optimizations, architectural features, and expediencies which render certain classes of complex problem solving tractable by our ML, we would do well to intentionally filter for further emergent behavior.

Whatever specific claims or notions any member has, right or wrong, the LessWrong folks are at least taking this seriously...



