I think that Norvig hits the nail on the head near the beginning of his piece:
"I believe that Chomsky has no objection to this kind of statistical model [the Newtonian model of gravitational attraction]. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two."
This is no more than an objection to problems of fitting your chosen model to data. If you only have a small number of free parameters, then you can fit your model with a reasonable amount of data. If you have a large number of parameters then you have to introduce some extra assumptions, as Norvig (of course) acknowledges slightly earlier (described as "smoothing", in context):
"For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data."
Thus, although both models are statistical, it is much easier to have confidence in Newton's law of gravitation than it is in a Markov model of some communication channel, because the data paint a clear picture. The imprecision of Newton's law in certain parts of the problem space (unobserved during his time) is a moot point - any such objections apply equally well to models with many parameters, and then you _still_ have to accept that you have made extra assumptions "outside" the scope of your model.
If you can explore your entire problem space, then you can build a complete "model". If not, then having more parameters than data _requires_ additional assumptions. Chomsky's point stands.
One small point to add: Chomsky's interest has been in identifying the abstract characteristics of the language center of the human brain, which, for various reasons, does not seem likely to work like a Markov model.
Analogously, one could look at the inputs and outputs of the human heart and potentially imagine a variety of physical structures that would explain them, and some of those physical structures would be biologically real/plausible and others would not.
Things like constraints on working memory, exposure to input, and cross-language studies have informed the constraints that Chomsky has proposed to determine what kind of model would best capture the essential quality of the brain system.
No, Chomsky's problem with statistical models is that you can't learn much about the underlying system from them, and so it isn't science, which has as a fundamental aim to understand the world.
Statistical models can be useful but they generally aren't meaningful to scientific progress.
I get Chomsky's point about the importance of fundamental models and understanding in principle, but I'm confused why it should apply to machine learning.
The problem with Chomsky's knowledge-based approach is that it seems to be domain specific. You can learn the fundamentals of music but you can't apply them to physics. You can learn the fundamental mechanics of weather but you can't apply them to psychology, etc.
General learning, though, requires exactly that, and there the statistical model seems to be much better because it doesn't run into the domain problem.
For the purposes of this discussion, Norvig has defined a statistical model as:
"a mathematical model which is modified or trained by the input of data points."
He then illustrates how what would be considered a "scientific" model, Newton's law of gravitation, is a statistical model under his definition, but a simple one with few parameters. He contrasts this with a Markov model of a communication channel with a large vocabulary, which has many parameters. His argument is, then, that Chomsky dislikes statistical models with large numbers of parameters, as stated in the passage I quoted before.
My point was that Chomsky's concerns are, with reference to Norvig's argument, equivalent to concerns about model fitting, namely that to fit models with many more parameters than you have data, you require additional assumptions about your model structure. It is difficult (though not necessarily impossible) to learn about the system from your model, because in order to construct your model you have had to assume things about reality that you will not be verifying against observations.
In the opposite case, where you have many more data than parameters, you can fit your model with confidence, given only assumptions about your sampling (which you address by being a good experimentalist). This is what allows you to "learn about the underlying system" - you have a model that describes reality well by itself, without requiring additional assumptions about the nature of reality, so the structure of your model reflects something about the structure of reality, and you can explore your model as though you were exploring reality. Of course, sometimes it turns out the equivalence wasn't as good as we thought, but often it provides us with new directions of investigation.
Hopefully that clarifies the equivalence between the two statements - I apologise for not making it more obvious earlier.
It's more than just overfitting: e.g. a neural net model with several hundred or several thousand parameters can be trained on a dataset of billions of observations, and may even be OK at predicting behavior both in and out of sample, but it still remains a black box, since we generally do not know what phenomena particular parameters (or their combinations) represent.
On the other hand, IMO there is nothing wrong or unscientific with having empirically estimated relationships as part of the model -- I just see them as shortcuts whose purpose is to parcel the problem so as to allow other analysis, and as something to potentially investigate further to see why the relationship takes a particular form.
Some ML methods are more amenable to this type of analysis than others though.
To your first point - as you increase the number of parameters in a model you (quickly) begin to suffer from the curse of dimensionality. In a crude sense, the data requirement for a similar level of confidence is exponential in the number of parameters, so it can be difficult to use a high-dimensional model to understand a problem, even with a billion (or trillion...) observations. The best one can hope for is that, if there is a simple relationship hidden in the data, your high-dimensional model captures that relationship in a way that is amenable to extraction - I presume this is what you are saying in your third paragraph.
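Here is a crude sketch of what I mean by the data requirement blowing up (the 10 distinguishable values per parameter and the trillion-observation dataset are arbitrary assumptions, purely for illustration):

    # If each parameter needs roughly 10 distinguishable values, the number of
    # "cells" to populate grows as 10**d, quickly swamping any dataset.
    observations = 10 ** 12   # an assumed, very generous trillion observations

    for d in (3, 6, 9, 12, 15):
        cells = 10 ** d
        print(f"{d:2d} parameters: {cells:.0e} cells, "
              f"{observations / cells:.0e} observations per cell")

Somewhere around a dozen parameters, even a trillion observations leave most of the space empty.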
To your second point - I agree. I too do not reject that there is utility in constructing models that make no effort to match the form of the underlying reality. However, the fact remains that in such cases it is very difficult to use your model to gain deeper understanding, and as such these models simply aren't useful for a lot of science in their current form, precisely because they don't tell you anything about reality. Now if someone were to devise a way of extracting "intelligent" (meaning sensible given existing understanding) simplified relationships from high-dimensional models, that might be a different matter...
The issue has nothing to do with model fitting or the number of parameters. The key question is whether the model is explanatory (of nature) rather than merely predictive. The case of a Markov model is a red herring because those Markov models are designs of human engineered communication systems rather than models of nature.
What is your definition for "explanatory", in this context?
I agree with your second sentence - the key question is indeed whether the model is explanatory rather than merely predictive. I offered a definition for explanatory as being when "the structure of your model reflects something about the structure of reality, and you can explore your model as though you were exploring reality", which isn't a terrible attempt, from my experience.
At the risk of repeating myself ad nauseam, the relationship to model fitting is found in the presence or absence of additional assumptions required for finding your fit. The difference is between having a very low-dimensional model that fits the data and requires few if any extra assumptions to fit (fitting the model being equivalent to "validating your theory", in this context), or a very high-dimensional model that fits the data (making no claim to "theory"), but by definition requires extra assumptions to get the fit.
In another sub-thread you said:
> If you model a system using the smallest possible mathematical model you don't, from that act alone, understand how the system works.
This is correct inasmuch as the understanding doesn't leap forth immediately, but if the model is a good representation of the data (an important if), then modelling a system in the most parsimonious way possible does provide you with understanding, pretty much for free, by looking for systems with a similar structure and learning about their properties. As an example, if you have some random variable, and you realise that it might be modelled with a Poisson distribution, then (assuming you are correct) you immediately gain a lot of understanding, because there is an enormous amount of literature exploring the implications of such a model.
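A minimal sketch of the Poisson case, with simulated count data standing in for the observations (the rate and sample size are made up):

    # Fitting a Poisson model is just estimating one rate parameter, but the
    # model's well-studied properties then come along for free.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=3.2, size=1000)     # stand-in for observed counts

    lam_hat = counts.mean()                      # maximum-likelihood estimate
    print(f"estimated rate: {lam_hat:.2f}")
    print(f"sample variance: {counts.var():.2f} (a Poisson predicts ~ the mean)")
    print(f"P(X >= 8) under the fitted model: {stats.poisson.sf(7, lam_hat):.4f}")

The point is that the moment the small model fits, everything already known about Poisson processes (waiting times, variance behaviour, and so on) becomes available as understanding of the system.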
This is what substantiates the link between model fitting and the explanatory vs. predictive question. If you can successfully fit a small model to a problem, without adding assumptions, then that model gives you understanding, by virtue of being a good representation of the data, and having structure. That is simply not the case with the high-dimensional models used in machine learning.
I would be interested to see a counter example - a small model that fits a particular set of data well, but does not provide any explanatory power.
Model fitting is a useful analogy only in the very abstract sense of "is there something missing in my model". But usually people mean model fitting in the sense of choosing parameters in some restrictive model schema (i.e. selecting from a set of machine learning models), but those typically don't provide any meaningful understanding of the system, unless you apply a much weaker definition of "explanatory" or "understanding" than is typical. From my perspective, when you fit a linear regression I don't think you have any understanding of why the parameters are what they are, in the same way that I don't think a coder who tweaks constants in a buggy program until it works has any understanding of why those constants are what they are.
> I would be interested to see a counter example - a small model that fits a particular set of data well, but does not provide any explanatory power.
As I said above, a linear regression typically doesn't lend itself to understanding. In fact, I would be interested in an example of a statistical model that does provide any meaningful explanation. Most ML and AI researchers don't even seem to pursue scientific understanding as a goal; predictive power is their measure of success. That is Chomsky's criticism.
> No, Chomsky's problem with statistical models is that you can't learn much about the underlying system from them, and so it isn't science, which has as a fundamental aim to understand the world.
Science has the fundamental aim to make testable predictive models of the world; any "understanding" other than that represented by a testable predictive model is irrelevant to science, except insofar as it might provide intuition on which to found hypotheses of better predictive models.
Statistical models, to the extent that they are testable and predictive, are exactly the kind of thing that science is about.
Putting aside how we want to define "science", there are models which attempt to represent the underlying system (e.g. generative grammars) and other "models" which merely try to predict the input/output behavior of those systems (e.g. neural nets). The latter typically don't lend any understanding of the underlying system (e.g. we don't understand much about language from a deep neural net translation system), which isn't surprising because they are designed for prediction, not for representing the internal behavior of the system.
To use one of Chomsky's examples, if you built a deep neural net that could predict the weather with 100% accuracy then you have performed an amazing feat of engineering that is incredibly useful to the world. But you haven't necessarily learned anything about the weather.
The problem there seems to be not one with statistical models, but with the fact that extracting a parsimonious mathematical model from a neural net is nontrivial (presumably, you can extract a mathematical model from the internal weights, etc., but even for known simple relationships, neural net models produced without an a priori prediction of the general shape of the relationship tend to be very complex.)
Of course, were the model really 100% accurate, then converting it to the most parsimonious perfectly accurate model would be simply a matter of reduction; the real problem is that real neural net models are not 100% accurate, and are often far more complex than more accurate models, and often aren't convenient to analyze to deduce the more-accurate and simpler model.
> extracting a parsimonious mathematical model from a neural net is nontrivial
Has this ever been done? It strikes me as impossible.
> Of course, were the model really 100% accurate, then converting it to the most parsimonious perfectly accurate model would be simply a matter of reduction
There are entire sub-fields of physics focused on reduced order models and phenomenological prediction. It is not 'simply a matter of reduction', it is non-trivial and the resulting models are almost always highly imperfect.
"Impossible" is probably, in principle, not quite right, but when I said "nontrivial", I really did mean "almost certainly impractical in virtually all real cases".
> There are entire sub-fields of physics focused on reduced order models and phenomenological prediction. It is not 'simply a matter of reduction', it is non-trivial and the resulting models are almost always highly imperfect.
"Simply" there meant that it was just reduction, not that reduction is necessarily simple. The point was that in real-world cases, reduction to an equivalent model isn't the only problem.
The problem is not one of minimality, but of meaning and exposition. If you model a system using the smallest possible mathematical model you don't, from that act alone, understand how the system works.
Well, no, Chomsky's problem with statistical models is that they are based on data.
Chomsky is a Grand System Builder in the style of the medieval and renaissance thinkers. Data is, at best, irrelevant and, at worst, a confusing distraction that gets in the way of his program, which is elaborating on his system, basing it entirely on intuition.
One advantage of such grand systems is that they can be made to look like a mathematical model, based on "intuitively obvious" axioms and built by thinking deeply in the comfort of one's armchair.
It's sad to see HN descend to this level. You clearly aren't very familiar with anything that Chomsky's written. Maybe just refrain from commenting instead of making up nonsense?
Let's consider the supreme court handshake problem, but say there is an unwritten social law that forces judges to only ever initiate a handshake with judges less senior than themselves. If the seniority of two particular judges happened to be exactly equal, they would not shake hands at all.
Let's assume (I don't know if it is actually true or not) that in the history of the supreme court, there have never been two judges of the exact same seniority. In that case, a model learned from handshake data would not include the slightest hint of this unwritten social law.
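A toy sketch of that thought experiment (the judges and the seniority values are made up):

    # Generate handshake data from the unwritten rule.  Because no two judges
    # share a seniority value, the "equal seniority -> no handshake" clause
    # never fires, so nothing in the resulting dataset hints that it exists.
    from itertools import combinations

    seniority = {"A": 30, "B": 21, "C": 14, "D": 9, "E": 5}   # all distinct

    handshakes = []
    for x, y in combinations(seniority, 2):
        if seniority[x] == seniority[y]:
            continue                              # the clause the data never reveal
        initiator, other = (x, y) if seniority[x] > seniority[y] else (y, x)
        handshakes.append((initiator, other))     # the more senior judge initiates

    print(handshakes)   # indistinguishable from data generated without the tie clause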
I think what Chomsky is saying is that if we do not understand the generative principle behind any data, we cannot possibly know what circumstance might completely invalidate our model. There may not be a way to smooth this out.
Language understanding, contrary to things like speech recognition, does not lend itself very well to smoothing.
The way Norvig tells it, language lends itself perfectly well to smoothing and even Chomsky's favourite absurd phrase "colorless green ideas sleep furiously" is assigned a (very small but non-zero) probability in some statistical models of English.
Truth is you can't do much with statistical models of language without some sort of way to account for what's missing from your data, which is always most of language. On the other hand, anything you might do is never going to be enough when that's the case: that you're missing the majority of language from your training data.
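For what it's worth, here's a minimal sketch of the kind of smoothing being referred to, with a made-up toy corpus; add-one (Laplace) smoothing is just one of many schemes:

    # Add-one smoothed bigram probabilities: an unseen bigram such as
    # ("colorless", "green") gets a non-zero probability instead of zero.
    from collections import Counter

    corpus = "the green ideas sleep and the ideas sleep furiously".split()
    vocab = set(corpus) | {"colorless"}            # assumed vocabulary
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def prob(w1, w2):
        # every possible continuation gets a pseudo-count of 1
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

    print(prob("ideas", "sleep"))       # seen bigram: relatively large
    print(prob("colorless", "green"))   # unseen: non-zero (tiny with a realistic vocabulary)

With a realistic vocabulary the unseen-bigram probability would be vanishingly small rather than merely smaller, but the point survives: the model never assigns exactly zero, which is how "colorless green ideas sleep furiously" ends up with a very small, non-zero probability.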
Some NLP tasks do indeed lend themselves well to smoothing, but I was thinking of language understanding tasks like question answering where changing or misunderstanding a single word in an entire novel can easily make a correct answer completely incorrect.
I agree with your second paragraph, and if I understand Chomsky correctly, that is part of why he argues in favor of a generative grammar. I can't say that I completely understand how such a grammar would be linked to semantics and experience though.
It demonstrates that human languages all have a certain structure, so it must be an innate faculty of human beings (as opposed to acquired). A neural network could not tell you that. It also suggests further avenues of inquiry (Why that particular structure?)
I don't think much is understood about how linear externalizations of language are deserialized into symbolic structures, what those structures are, and how those symbol structures are mapped into mental representations.
The issue Chomsky raises isn't that statistical models aren't predictive enough, but that they don't typically yield any understanding of the underlying system.
Similarly (and I understand this argument went through physics long before quantum mechanics showed up and made the argument seem silly), the ideal gas law, say, is predictive, but doesn't yield any understanding of the underlying system.
> If the seniority of two particular judges happened to be exactly equal, they would not shake hands at all.
Somewhat similar to Buridan's Ass.
... an ass that is equally hungry and thirsty is placed precisely midway between a stack of hay and a pail of water. Since the paradox assumes the ass will always go to whichever is closer, it will die of both hunger and thirst since it cannot make any rational decision to choose one over the other.
If this is unwritten (I take this to mean a piece of culture people practice without ever formalizing), how would the judges deal with the novel scenario? That rule will not have existed until it starts being practiced.
The judges would simply apply the existing rule as it exists in their mind, but there is no trace of that state of mind in the handshake dataset.
Yes the machine could guess. It could even guess correctly. But I think what Chomsky means is that we have no valid scientific reason to believe that it would.
> I think what Chomsky is saying is that if we do not understand the generative principle behind any data, we cannot possibly know what circumstance might completely invalidate our model. There may not be a way to smooth this out.
No, what Chomsky is saying is that if you don't know the generative principle behind the data, then you don't know the generative principle behind the data. You don't understand a system just because you can faithfully predict it.
Can't you learn this rule by generating (simulating) the behaviour with various hypotheses?
So you'd have a model where handshakes are dependent on the relative heights, and it would fail because some senior judge who is short goes around and initiates.
And then you'd come to the correct one, if the data is there for it. Like a hypothesis that's never refuted.
Of course then there's noise. If it's a general rule, but on casual Fridays...
I think part of the problem with this approach is that there are infinitely many hypotheses (due to the infinitely generative nature of language) that are compatible with the behaviour, so to test and refute them all would be impossible. E.g. "only initiate handshakes with your juniors" is compatible with "only initiate handshakes with your juniors who have not been to Mars" etc. This example is silly and may seem trivial, but I think it illustrates the problem of trying to derive a principle or rule simply from observed behaviour.
But isn't that the nature of scientific hypotheses? They are tenuous, only valid until you come across some data that tells you "aha, there's another piece to this rule". The classic example would be special relativity vs. old Newtonian mechanics.
If you have a limited dataset containing the actual generating data, say heights, seniority, age, hair color, day of the week, gender, etc, can't you end up discovering at least the parts that are exposed in the data?
>> But isn't that the nature of scientific hypotheses?
I guess scientists don't guess completely at random though, do they? They use some sort of heuristics, or their instinct (I've no idea what that is), to decide which hypotheses are worth pursuing.
Most statistical models aren't very good at taking background knowledge like that into account. Not to mention most are also sensitive to "noise" (which is to say, useful information that they can't make use of).
Again, this boils down to the rationalist vs. empiricist stance. Hidden variable vs. probabilistic approximation and so forth.
The partisan error is to consider these two stances as mutually exclusive. They are not, as illustrated by decoherence for example.
On the other hand, it is, I believe, undeniable that the rationalist endeavor is much more complex than the empiricist one and therefore also less linear. Less predictable and, please, let's not forget, less financially profitable.
Chomsky's position has always been to advocate for the rationalist stance in a world conveniently inebriated with its empiricist successes to the point of turning this one aspect of thinking into a belief system.
It is a waste of energy to argue for a monopoly of the yin over the yang or the other way around. They are complementary but not mutually exclusive. Privileging one over the other will result in an increase of ideological/mystical bias and essentially miss the whole point: it is their interaction, the transition from and to one another, that holds the key to a global understanding.
In this video[1] Varoufakis takes on modern economics and its models as a way of understanding the real world and predicting what's going to happen.
On a higher level, I believe that he speaks for most (if not all) social sciences and what happens when they adopt mathematical models that blindly try to understand and predict the real world through a flawed, limited formalism.
There is a problem with the simple economic models that don't claim to predict real world behavior because they capture only one aspect of reality (everything else being equal...). They are not really scientific because they are not falsifiable. Math works and it seems to reflect reality, but we have no epistemic mechanism to verify this.
If we can do this eventually, then perhaps we can make our way to models that capture a larger fraction of the truth and can be tested against reality.
Economic models that don't claim to predict real world behaviour are like thought experiments (with more maths). Their purpose is to help people understand that small isolated part of the whole, because a model that includes all the complexities of the real world would be incomprehensible.
Nonetheless, many of these models are still falsifiable. For example, the textbook model of the benefits of trade due to comparative advantage only contains two countries and two goods. Although extremely simplistic, the theory does have falsifiable predictions and there is good empirical evidence supporting the broad theory. But this simple model will not be able to accurately predict how much a particular country will benefit from free trade.
If you have big data about all transactions occurring in real-time, along with socioeconomic factors on all the actors, it is not difficult to verify models and see how well the relationship between their dependent and independent variables holds up in the real world.
That is a big if. Nobody has data for all transactions occurring in real time and nobody knows all the socioeconomic factors for all the actors.
Even if you did have all of that data, you would still be missing a lot of data. For example, an important factor in economics is asymmetric information. An actor's decision may be optimal given the information available to them at the time even if the decision is suboptimal given perfect information. So for an even more accurate model, you would also need to have data on what each actor knows at any given point in time.
Furthermore, an actor probably does not have the ability to calculate the optimal outcome given the information available and there are innumerable subjective factors in economic decision making.
Big data will help you find some kind of estimate, but it will not suddenly make economic modelling simple.
Criticisms of mathematical modeling in economics were most definitely given by Mises and Hayek. Historian of economic thought Mark Blaug was also very critical of how neoclassical production theory (particularly Edgeworth's formulation) distorted the meaning of "competition" from a dynamic process to a quantity. Ironically, it is the same far-left heterodox demagogues like Varoufakis (though he's hardly the worst) who nevertheless use it in this sense to argue for mercantilist programs. Neoclassical when it suits them, heterodox when it doesn't.
Anyone who uses the term "neoliberalism" is a charlatan. It is an absurd conspiracy theory that has since blown into an amorphous meme out of Marxist historiographers who struggle with the explanatory power (or lack thereof) of their framework.
Mathematical models do have their uses in formalizing assumptions and analyzing dependencies, so they're not all bad, even if the Cowles Commission did go overboard.
(Also let's not forget that the modern methodology of economics came as a result of the Keynesian research program, particularly since Hicks (1937)'s introduction of IS-LM, first modeled by Keynes himself in 1933 as four simultaneous equations.)
What makes the use of the term 'neoliberalism' a conspiracy theory? I thought it was a blanket term for the sorts of laissez-faire capitalist ideologies that have sprung up since the 1980s, and whose proponents did have a big thing for namechecking Adam Smith and other classical liberal economists of the 19th century. You're surely not going to deny that there has been some sort of ideological drive in favour of free market reforms, privatization and "free trade" agreements of the WTO/TTIP ilk, are you?
I don't think people who use the term think it refers specifically to shadowy cabal of free marketeers conspiring with each other over some hidden agenda. It's just a term for what a bunch of people, many of whom are in positions of power, happen to openly think and do.
One reason is that the term is exclusively used by the ideological opponents of this "neoliberalism", and as a result it is necessarily ill-defined. A lot of people describe themselves as libertarians of various types, or even as classical liberals, but I've never met a person calling herself a neoliberal. Outlining a difference between these ideologies might be a good start, and would at least give the term some meaning, without it though it is nothing more than handwaving at people the speaker doesn't like. Had that actually been done, I might even call myself a neoliberal one day -- but as it stands, I can't, because no one knows what it is :)
> A lot of people describe themselves as libertarians of various types, or even as classical liberals, but I've never met a person calling herself a neoliberal.
Which says a lot more about who you do (and don't) know than it says about anything else.
Thanks for the references, these are good articles, and I am happy to see that I was wrong and there are in fact some defenders of the term. Still, it is very predominantly seen in a critical context -- quite unlike, for example, "libertarianism" or "economic liberalism" or "classical liberalism".
But, even here it does not look well-defined at all: at best, authors simply classify specific policies as "neoliberal".
Even if everything you just said is true, not one word of it justifies the use of the term 'conspiracy theory' - which, in turn, is almost always a term used by the detractors of conspiracists. (And, of course, that doesn't mean the term 'conspiracy theory' isn't useful)
"Neoliberalism" is just an Americanism for economic liberalism, because the word liberalism is overloaded in the U.S., and is an autoantonym. That part of vezzy-fnord's comment is just a reflexive libertarian expulsion.
Not sure at all -- a lot of critique I've seen directed at "neoliberals" (and the term is exclusively used by critics), seems to target policies that are decidedly non-libertarian.
It is always used to criticize things from the left, which creates an impression that the term describes economically right-of-center ideologies, but that's IMO just because of the difference in vocabulary between different groups -- e.g. you may see the same thing described as neoliberal by the left, liberal by the right, and "statist" by the libertarians.
Thinking about it, "statist" is a similarly empty term used exclusively by opponents -- I've never seen anyone calling herself a "statist" either.
> Anyone who uses the term "neoliberalism" is a charlatan.
> It is an absurd conspiracy theory that has since blown into an amorphous meme out of Marxist historiographers who struggle with the explanatory power (or lack thereof) of their framework.
"neoliberalism" as a term is no more (and, arguably, much less) an invention of Marxist historiography than is "capitalism"; its true that both terms were coined by critics of the systems they describe, but that's not uncommon.
>> But isn't that the nature of scientific hypotheses? They are tenuous, only valid until you come across some data that tells you "aha, there's another piece to this rule".
Chomskian grammars (CFGs) are used widely in compilers and similar tools to model computer languages and even for limited subsets of natural language they don't do half bad.
The problem with phrase structure grammars is that they're very costly to develop and maintain and so far there's never been one such grammar that can model the whole of a natural language.
If there was a way to learn CFGs with good coverage from text they'd be in much wider use, but unfortunately grammar induction is hard.
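For a sense of what such a grammar looks like in practice, here's a toy sketch (assuming NLTK is installed; the rules and lexicon are invented, and a grammar with real coverage would need vastly more of both):

    # A tiny context-free grammar for a fragment of English, parsed with a
    # chart parser.  Coverage grammars for real languages run to thousands of rules.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N
        VP  -> V NP
        Det -> 'the' | 'a'
        N   -> 'dog' | 'cat' | 'idea'
        V   -> 'chased' | 'saw'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the dog chased a cat".split()):
        print(tree)   # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N cat))))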
Also, Google has its own political reasons not to want to use grammars. They champion neural networks and statistical AI. It's their schtick, innit.
Chomsky's aversion to statistical techniques is much deeper than most of this discussion focuses on.
Here's an enlightening quote from Chomsky:
"Linguistic theory is concerned primarily with an ideal speaker-listener, in a completely homogeneous speech community, who know its (the speech community's) language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic) in applying his knowledge of this language in actual performance. (Chomsky, 1965, p. 3)"
Chomsky is uninterested in linguistic data of the kind used to build statistical language models; those are "linguistic performances" and he is only looking at "linguistic competence", the ability of an ideal speaker to "produce and understand an infinite number of sentences in their language, and to distinguish grammatical sentences from ungrammatical sentences."[1]
Now, I'm personally happy to criticise statistical techniques for their lack of explanatory power. But I'm not willing to go further and say that data is irrelevant. Chomsky is.
Chomsky is not saying that "data is irrelevant". Aspects contains a detailed discussion of lots of linguistic data points, as you'd know if you weren't just cherry-picking quotations.
It seems to me that overarching theories of a Platonic bent miss the embodied-ness, Dasein-ness of human activity.
Nevermind the debates about the ultimate nature of the human cognitive process; fact of the matter is that as observed, it's always-already wrapped in emotional-social thinking. Enough that there's reason to question the subject-object split altogether.
Now, maybe Chomsky is a kind of extreme social-cognitivist and his abstract generative trees apply to societies as learning and meaning-producing wholes. But on the face of facts, rather than metaphysical speculation as to the nature of personality, intentionality and individuality, it would seem to me that the statistical/machine learning approach already faces language as it happens: as embodied in media, social context and so on.
In other words: I fail to see much value in an abstract account of "pure language" as dissociated from the real communicative process as it happens right now as you read me. Sure, "insights" -- but it remains to be shown that "pure linguistics" is a worthwhile endeavor on the level of "pure quantum mechanics" as formal model.
> I fail to see much value in an abstract account of "pure language" as dissociated from the real communicative process
But this assumption makes assumptions of purity as well. The communicative process may just be a byproduct of a mental process which has little to do with communication. A mutation happens tens of thousands of years ago (say 50,000 years ago), a change happens in the Broca (and/or Wernecke) area of the brain, and suddenly a new mental process kicks off. This mental process can be modeled as a state machine, and has the abilities and limitations of a state machine. It also has known limitations of output, which Chomsky has talked about.
You're assuming the mutations which gave rise to the brain changes which created an internal language generator and parser have only one purpose - communication. But that's an assumption on your part. The ability to communicate may be just one byproduct of those changes which made things like communication possible.
Is there anything that isn't "observed"? Are you suggesting that everything except formal logic (or perhaps including?) cannot be objectively characterized? Have we been awarding the Nobel prize in Chemistry to too many chemists, while failing to recognize the emotional-social validity of alchemy?
Welp, a proposition in chemistry is cleanly falsifiable, so we can gradually (as we have historically) delimit "chemistry proper" from what we could call the general phenomenology of matter. Of course, taken to a metaphysical limit, it's not even the case that things are embedded in a context, but that they're the context itself (as Derrida famously put it), but this is completely irrelevant to chemistry.
I don't believe this is the case with psychology or linguistics. Of course, I could be wrong; in particular, they may work as "applied pseudosciences" (much like economics, which is really useful) even though we will never arrive at their foundations.
EDIT: More succinctly put: the objects of interest of chemistry have always arrived already abstracted from context, while the objects of linguistics and psychology are the context themselves.
In my defense, it was late into the small hours and I couldn't sleep or think right.
Maybe "embeddedness" is the better word. The symbol system that is language is never the whole story of communication with language; we hardly code expression that's verbal but nonlexical (tone of voice, prosody, etc.); nor we're able to incorporate the layers upon layers of material mediation (reproduction and transmission technologies, air, light) into our generative account of linguistic phenomena.
Which is maybe another way of putting that what you say isn't what you mean - it's what's out there.
Maybe it's still a stupid point, but I thought it was worth trying to restate it once more.
One can try to divine what Chomsky really thinks, believes and means, but one thing that has always annoyed me is his repeated absolute stances and the way he expresses his viewpoint as being the only right one, often with disdain or dismissal of the opposition. You can disagree with someone, but doing it elegantly is of higher value to me.
This is generally the philosophical tradition: you present ideas, wholly and completely, as an explanation or a solution to a problem. You do not pre-suppose disagreements for your opponents (this will almost certainly end up being a straw man; if you let them make their own arguments they will make them more favorably than you could, and it is a sign of respect). You express your idea as a firm, unyielding truth and wait until it is appropriately refuted, then reassess.
And when it is refuted, you assert something akin to:
"Linguistic theory is mentalistic, since it is concerned with discovering a mental reality underlying actual behavior. Observed use of language ... may provide evidence ... but surely cannot constitute the subject-matter of linguistics, if this is to be a serious discipline."
There is no value in that. If you have a unique perspective, you MUST be outspoken and opinionated. Being quirky also helps. Otherwise, no one gets to hear about your unique perspective.
If you don't consider your viewpoint the most correct of all viewpoints known to you, why would you even have it?
As for disdain, while I agree he can be overly dry sometimes, seeing the kind of stuff he often has to put up with, around political subjects anyway, I'd say he shows a lot of patience, too. If more people would do their part most of the more unpleasant subjects he debates people on wouldn't even be an issue. He wouldn't be able to use harsh words for, say, war criminals and their apologists, if those didn't exist in the first place. It's not actually his or anyone's job to try and help clean up the mess others are making, they do that on the side because it's required to look in the mirror.
^ This is what and who matters. Without a (decent) world, all the rest ain't happening anyway. He's the kind of person I'd imagine to be rather friendly to, say, a cleaning lady, and those to me are the more significant bits of the number that makes up grace. At least from what I see, he's sometimes a dick to people who are used to being lauded (and well paid), and he's a champion of people who are used to getting shit on.
Furthermore:
> "Courage is indispensible because in politics not life but the world is at stake." -- Hannah Arendt, "Between Past and Future"
We're now at the point where we're not just worried about dying, or getting jailed, or losing a friend or two; but about merely being offended. Crooks and madmen aren't content anymore with merely getting away with it, now they want respect, too. I'm not even sure how polemic or exaggerated that is. Criminals are tired of laundering money so to speak, they just want to whitewash the crimes themselves and do them in the open. You can respectfully disagree in a way that leaves room for you not actually being sure of what you're saying, and that's it.
Also, saying "this house is on fire" is not an "absolute stance" on fires or building safety, and if the house is on fire, and as long as no serious efforts are made to change that, it's simply principled to repeat this over and over, instead of coming up with something "new" for the sake of it. Hannah Arendt again:
> The ceaseless, senseless demand for original scholarship in a number of fields, where only erudition is now possible, has led either to sheer irrelevancy, the famous knowing of more and more about less and less, or to the development of a pseudo-scholarship which actually destroys its object.
I think this could also be applied to discussion of politics. Every time a government or the NSA does something, a bunch of people say "why does this surprise anyone?", as if something that sucked on day 1 would somehow be less alarming on day 50, with the same or even increased levels of suckage.
Same for Chomsky criticizing his own, which is still valid. Why does it have to be new, or elegant, or otherwise pleasing? If a society shits its pants, that's already not something pleasant to point out, but the more years pass, the longer it keeps sitting in and adding to it, the less pleasant it becomes. It's not anyone's job to add sugar to the medicine, or give it with a big smile, or anything. I think it's awesome if they do, but still perfectly fine if they don't. Insofar his or anyone's arguments have merit, you can always repackage it without any vitriol or snark, and they stay intact. That is what really matters, and that kind of elegance is also in the ear of the listener.
I hold some viewpoints that I know are less likely correct than others I know about, because the benefit of being right (if it is correct) is far larger than the cost of being wrong (if it's not correct). I find I have to embrace a viewpoint in order to think through all the consequences of it and arrive at an insight, which could either be a contradiction (and thus a proof of its falsity) or a good strategy if it turns out to be true.
It's important to know that Chomsky used the same ideas in the 70s. He defended biologicism (not sure about the English translation of the term), which is basically machine learning in sociology. He fell on his face and is probably now in a very good position to throw this criticism.
This seems a little ironic; I feel there are some incredibly useful things in AI/ML that rarely get applied to real world problems, and instead academics just spend all their time trying to unravel the black box behavior and come up with some model for it.
Chomsky is wrong that statistical models of language provide no insight. They do.
Chomsky is right that language has meaning and that many modern statistical techniques essentially ignore this.
My take on the issue is that you can't separate linguistic command from true intelligence / cognition. There's a long tail of tricks that us intelligent people can use, but fundamentally they'll only be tricks. And if we truly get something resembling a perfect linguistic-aware AI by this long tail of tricks then we've probably accidentally created real cognition. Maybe after typing this all out I finally understand what Turing meant.
Here is a personal anecdote that will hopefully make clear what I'm talking about. When I first used LDA I was blown away at how context could disarm homonyms. This type of insight isn't exactly mind blowing after the fact, but it's still compelling enough to expect that statistical models produce insights that are broadly applicable without having to resort to proving it.
[Edit] No really, take the ideal gas law. To get a real understanding of a given situation, you would have to know the position and momentum of all of the molecules of the gas. But the ideal gas law says, no, you don't need all of that to get a very good approximation; all you need to know to get the pressure of a gas is the temperature, volume, and amount of gas.
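For reference, the law itself, with n the amount of gas and R the gas constant:

    PV = nRT \quad\Longrightarrow\quad P = \frac{nRT}{V}

which is why temperature, volume, and amount are all you need for the pressure.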
Physics went through this same argument with thermodynamics and statistical mechanics. Predictive, statistical models pretty much won. And then along came quantum mechanics---there's nothing descriptive there at all.
We were discussing machine learning not models which contain statistical properties. And both statistical mechanics and quantum mechanics are descriptive, what they are not is determinate.
tl;dr
"Classical" (natural) scientist: "X is only understood once I found first principles that x can be reduced to."
"Modern" scientist: "X might be so complicated that there are no first principles that x can be reduced to. Rather, finding a neural network that can do x is the best I can do, and it explains why x can be done."
I think you should delete the "first" in both cases. Chomsky merely wants a principled model. He's not making any pretense that we're anywhere close to understanding language at a fundamental level. Norvig isn't really interested in principled models at all.
Language as a phenomenon is obviously neither purely functional nor purely statistical. Purity is abstract nonsense, an abstract category of abstractions.
It is obvious from serious psychological studies of the process of language acquisition that it is similar to training a neural network - some knowledge representation grows up in the brain, but the process of training/learning is possible only because the brain has the appropriate machinery.
It seems that we have more than two a priori notions. Besides time and space, we perhaps have, at the very least, a priori notions of a thing (noun), a process (verb), an attribute (adjective), and even a predicate, as reflections of our perceptions of the physical universe around us through the sensory-input-processing machinery we happened to evolve.
It is a mutually recursive process - we evolved our "inner representation" of reality constrained by the senses, but nature selects, in some cases, those with more correct representations.
How these a priori notions map to sounds - the details of phonology and morphology - is rather irrelevant; we evolved machinery for that. This is why there are no fundamental, principled differences between human languages. The difference is one of degree, not of kind.
It also seems that we learn not rules (schools are very recent innovations) but "weights", by being exposed to the medium of a local spoken language. Children do it on their own, at least in remote areas, like among the nomads of the Himalaya, no worse than Americans do. This, by the way, is proof that we have everything we need to be a Buddha or an Einstein.
How exactly the training occurs is absolutely unknown, but it has nothing to do with probabilities. Nature knows nothing about probabilities, but it obviously "knows" rates - how often something happens. Animals "know" how often something happens.
Probability is an invention of the mind, which leads to so many errors in cases where not all possible outcomes and their causes are known, which is almost always the case. Nature could not rely on such a faulty tool.
So, like every naturally complex system, it has both "procedures" and "weighted" data. Language capacity is hardwired, but grammar "grows" according to exposure.
To speak about hows, and especially how-exactlys in terms of either pure procedures or pure statistics is misleading. It is both.
And Mr. Chomsky is right - mere data, let alone probabilistic models, describe nothing about the principles behind what is going on. They do not even describe what's going on correctly, only some approximation to an overview of something unknown being partially observed.
The more or less correct model, as a philosophy, must be grounded in reality, especially in that part of it which we call the mind. It has been pointed out that the mind itself is possible because of hardwired a priori notions (grounded in the physical universe) of succession and distance, so models should be augmented with these notions too. Pure statistics is nothing.
"I believe that Chomsky has no objection to this kind of statistical model [the Newtonian model of gravitational attraction]. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two."
This is no more than an objection to problems of fitting your chosen model to data. If you only have a small number of free parameters, then you can fit your model with a reasonable amount of data. If you have a large number of parameters then you have to introduce some extra assumptions, as Norvig (of course) acknowledges slightly earlier (described as "smoothing", in context):
"For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data."
Thus, although both models are statistical, it is much easier to have confidence in Newton's law of gravitation than it is in a Markov model of some communication channel, because the data tell a clear picture. The imprecision of Newton's law in certain parts of the problem space (unobserved during his time) is a moot point - any such objections apply equally well to models with many parameters, and then you _still_ have to accept that you have made extra assumptions "outside" the scope of your model.
If you can explore your entire problem space, then you can build a complete "model". If not, then having more parameters than data _requires_ additional assumptions. Chomsky's point stands.