That some aspects of grammar can be captured by statistical analysis was never really in dispute. The OP is slightly confused about the hierarchy part. Chomsky never said that you couldn't discover by statistical analysis that languages have a hierarchical structure. He rather said that a baby, based on the data it has available, would not be able to determine statistically that certain rules of grammar are defined over this structure rather than over the linear sequence of words.
Don't waste your time arguing with Chomsky supporters; it's a cult. They keep wringing non-falsifiable theories out of lengthy hallucinations, but it all reveals itself to be just Trek physics later on.
He's not just problematic in politics; he is in linguistics too, it's just that the BS is harder to notice. I suppose his compiler theories were legit, and his pioneering spirit that led to the establishment of multiple fields of research might be too, but the leading theories he created are just as dubious as the first man-made shed built on a newly discovered island would be.
And the problem is not just that early speculations in a novel field are often wrong; it's that his supporters don't care. They regurgitate those scientific theories(tm) ad infinitum and waste resources for all of humanity. So don't bother trying to fix them or motivate them.
First, Chomsky (and I) were talking about language acquisition in children, where there aren't billions of examples, so it's completely irrelevant whether some other system can do something; the question is how humans do it.
Second, there isn't any evidence that LLMs have captured grammar rules in any meaningful sense, just as they can't do addition or any other recursive computation.
Sure, here are a couple of examples of ECP violations removing ambiguities.
1a. How often did you tell John that he should take out the trash?
b. How often did you tell John why he should take out the trash?
(1a) can either be a question about frequency of telling or frequency of trash disposal, whereas (1b) can only be a question about frequency of telling. I asked GPT-4 to explain how each sentence was ambiguous and it seemed to entirely miss the embedded readings (the ones about frequency of trash disposal) for both sentences, while finding some other ambiguities that were spurious (such as suggesting erroneously that (1b) could be a question about how many different reasons you gave John in a single instance).
Similarly, (2a) has both a de re and a de dicto reading, whereas (2b) has only a de re reading:
2a. How many books did Bill say that Mary should read?
b. How many books did Bill explain why Mary should read?
That is, (2a) can be asked either in a scenario where Bill has said "read 10 books!" or in a scenario where Bill has said "read Book A, Book B and Book C!" without necessarily counting the books himself. (2b), on the other hand, only has the second kind of interpretation. I've had mixed results with GPT-4 in this case (depending on exact choices of vocabulary, etc.), but it certainly makes some mistakes. For example, it says that (2b) can mean "John explained the reason for a certain number of books that Mary bought".
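If anyone wants to poke at this themselves, here is roughly the kind of probe I mean, as a minimal sketch using the OpenAI Python client. The model name, prompt wording, and temperature are my own illustrative choices, not anything canonical, and the replies still have to be scored by hand against the judgments above.

```python
# Minimal sketch of the kind of ambiguity probe described above.
# Model name, prompt wording, and temperature are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SENTENCES = {
    "1a": "How often did you tell John that he should take out the trash?",
    "1b": "How often did you tell John why he should take out the trash?",
    "2a": "How many books did Bill say that Mary should read?",
    "2b": "How many books did Bill explain why Mary should read?",
}

PROMPT = ("Is the following English question ambiguous? "
          "List every interpretation it can have, and only those it can have.\n\n{s}")

for label, sentence in SENTENCES.items():
    reply = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever model you're testing
        messages=[{"role": "user", "content": PROMPT.format(s=sentence)}],
        temperature=0,   # dampens (but does not remove) run-to-run variation
    )
    print(f"--- {label}: {sentence}")
    print(reply.choices[0].message.content)
```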
As the sibling comment points out, it would not show very much if GPT-4 did correctly determine these ambiguities, as it has had access to much more data than a child. You would also need to show that the same statistical techniques would work when applied to a realistic dataset.
As a side note, my instinctive reading is the one about telling frequency. Sure, one can construct a garden-path sentence, but (to my ESL eyes and ears) it would be more straightforward to say, "Did you tell John how often he should take out the trash?"
(2b) does not feel right on its own (and I am not an AI). I can understand it, but it feels like reverse engineering rather than reading a normal sentence.
The whole issue is the difference between (1a) and (1b), not whether the AI can understand one of the sentences under some of its available interpretations. Indeed, with GPT-4o, I get the same result as you for (1a), but also a description of a spurious parallel ambiguity in (1b). Part of the trouble here is the inconsistency of results depending on the exact phrasing of the question and random variation in GPT-4's responses. I wouldn't be surprised if it sometimes gives the right answers, but I don't think it does so reliably.
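To put a rough number on that unreliability, one could resample the same question at a nonzero temperature and count how often the embedded reading turns up. A sketch of that, where the keyword check is a hypothetical stand-in for hand-scoring each reply:

```python
# Crude reliability check: resample one question and count how often the
# embedded (trash-disposal-frequency) reading is mentioned at all.
# The keyword list is a hypothetical stand-in for scoring replies by hand.
from openai import OpenAI

client = OpenAI()

QUESTION = ("What interpretations does this question have? "
            "\"How often did you tell John that he should take out the trash?\"")
KEYWORDS = ["how often he should", "frequency of taking out", "how often john should"]

N = 20
hits = 0
for _ in range(N):
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1.0,
    )
    text = reply.choices[0].message.content.lower()
    if any(k in text for k in KEYWORDS):
        hits += 1

print(f"embedded reading mentioned in {hits}/{N} samples")
```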
(2b) is what's known in classical terms as a 'subjacency violation', so yes, it sounds imperfect. Nonetheless, native speakers agree on which interpretations it can and cannot have. GPT-4 does not have the same capacity, as far as I've been able to determine. You sometimes have to be a little creative with scenario construction for sentences like (2b) to click.
"Ok, So Bill explained why Mary should read War and Peace, then he explained why she should read The 39 Steps, and then he explained why she should read some other titles that I can't remember. I wonder just how many books Bill explained why Mary should read."
But try constructing a scenario for the other interpretation and you'll find that it's still just as bad.
Well, prompting is not a nice-to-have addition to an LLM; it is a necessary part of using one.
> Nonetheless, native speakers agree on which interpretations it can and cannot have. GPT-4 does not have the same capacity, as far as I've been able to determine.
This is an expectation, not a factual statement. A factual statement would be "95% of native English speakers with a college degree" or the like. Among the less educated, the numbers could be depressingly low, even for much more straightforward tasks.
The question, then, is how a given ML model fares against real data, not against some Platonic ideal.
> The statistical model is confused by the superficial similarity between (a) and (b), just as Chomsky predicted decades ago.
Well, WE are statistical models as well. So any overly broad claim about the inability of ANY statistical model to understand natural language is falsified the moment it is spoken (or written), unless you go in for Penrose-style Neoplatonism.
---
Consider your doubt addressed if I find a convincing (counter)example. Sure, any benchmark of human vs. model performance is an empirical verification. And yes, some artificial models may struggle with some tasks. For the Winograd schema, for example, there is a leap between GPT-3 and GPT-4: https://d-kz.medium.com/evaluating-gpt-3-and-gpt-4-on-the-wi....
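The rough shape of that kind of check is something like the sketch below: one hand-written item using the classic trophy/suitcase schema, with a model name and prompt that are my own choices; a real evaluation would loop over a full set such as WSC273 or WinoGrande and report accuracy.

```python
# Toy Winograd-schema check: show the schema, ask which referent the pronoun
# picks out, and compare against the intended answer. One hand-written item;
# a real run would iterate over a full schema set and report accuracy.
from openai import OpenAI

client = OpenAI()

item = {
    "text": "The trophy doesn't fit in the brown suitcase because it is too big.",
    "question": "What is too big?",
    "answer": "trophy",
}

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"{item['text']}\n{item['question']} Answer in one or two words."}],
    temperature=0,
)
model_answer = reply.choices[0].message.content.strip().lower()
print("correct" if item["answer"] in model_answer else f"incorrect: {model_answer}")
```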
What this is (to me) is defining ex cathedra what the English language is. The actual natural language people use is full of utterances that are not correct (yet easily understandable) and of "correct forms" that are rarely used and, when they are, misunderstood by anyone except those who put in conscious effort (linguists, lawyers, etc.).
For any discussion, it is essential to know whether we are working with the real (and highly probabilistic) natural language or with one of its concrete models (i.e., an abstraction).
My prompts are in my comment. Your ChatGPT link sends me to a login screen.
Humans (or English speakers at least) aren’t confused at all by the pair of sentences in my last comment. If you’re just going to try to deny the plain facts about how these sentences are (and aren’t) interpreted by English speakers then that’s really just a kind of grammatical flat-Earthism. The judgments at issue aren’t remotely subtle.
> Well, WE are statistical models as well.
This is begging the question. Chomsky would deny this.
So, I think that we differ at a fundamental level.
While you prefer to work in a Neoplatonic world of ideas, I prefer empirical facts and the conviction that all models are approximations.
English grammar is not fixed per se; it evolves with region (have you ever been to Singapore?) and time. Your judgment (or Chomsky's, or anyone's), however well founded, is not a fact. It is an opinion open to experimental scrutiny.
I'm not saying that your examples are incorrect. Still, measure the percentage of correct (or consistent) answers for humans against particular models. Otherwise, it might be maths, it might be philosophy, it might be art, but it is not (empirical) science.
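Concretely, the measurement I have in mind looks something like this sketch. All the counts are made-up placeholders, not survey data; the point is only that both humans and the model get a rate with an error bar that can then be compared.

```python
# Sketch of the proposed comparison: judgments from a human panel and from
# repeated model runs on the same items, each reported as a rate with a
# confidence interval. All counts below are made-up placeholders.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# hypothetical counts: item -> (expected-interpretation answers, total asked)
human_counts = {"(1b)": (48, 50), "(2b)": (44, 50)}
model_counts = {"(1b)": (31, 50), "(2b)": (22, 50)}

for item in human_counts:
    for group, counts in (("humans", human_counts), ("model", model_counts)):
        k, n = counts[item]
        low, high = wilson_interval(k, n)
        print(f"{item} {group}: {k}/{n} = {k / n:.0%}  (95% CI {low:.0%} to {high:.0%})")
```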
As far as I can see, your issue with Chomsky has nothing to do with the performance of modern LLMs. You just reject all the data that generative linguists take to be crucially informative as to the grammatical structure of natural languages. You would hold the same view if LLMs had never been invented. So it is really a common case of AI and cognitive science talking entirely at cross purposes.
> English grammar is not fixed per se; it evolves with region (have you ever been to Singapore?) and time.
Sure, but this is not the case for the examples I gave. There aren’t dialects of English where (b) has the interpretation that GPT-4o thinks it can have. It’s no use trying to muddy the empirical waters in the case of completely clear judgments about what English sentences can and can’t mean.
There is no example or standard that would satisfy you. Any failing example can be added to the training set in the next version, and even if it couldn't, it wouldn't matter, because you could find a person somewhere who would also fail it.