A statistical approach to model evaluations (anthropic.com)
66 points by RobinHirst11 39 days ago | 49 comments



This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it’s helpful?


As an ML researcher who started in physics (this seems common among physics/math-turned-ML people, Evan included), I cannot tell you how bad it is... One year at CVPR when diffusion models hit the scene I was asking what people's covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but should. (I guess this explains why Mission Impossible Language Models won best paper...)

I swear, the big reason models are black boxes is because we _want_ them to be. There's a clear sentiment against people doing theory, and the result of this shows. I remember not too long ago Yi Tay (under @agihippo but main is @YiTayML) said "fuck theorists". I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.

Also, I'd like to point out, the author uses "we" but the paper only has one author on it. So may I suggest adding their cat as a coauthor? [0]

[0] https://en.wikipedia.org/wiki/F._D._C._Willard


Personal sad story, but hopefully relevant: during my recent PhD I worked on a problem where I used a Dirichlet Process in my solution. That paper has been bouncing around for the past few years getting rejected from every venue I have submitted it to. My interpretation is that most reviewers (there are exceptions - too few to impact the final voting) don't understand any non-DL theory anymore and are not willing to read up for the sake of a fair review. This is based on their comments, where we have been told that our solution is complex (maybe? - but no one suggests an alternative), exposition is not clear (we have rewritten the paper a few times - we rewrite it based on comments from venue i to submit to venue i+1 - its a wild goose chase), and in one case, someone said the paper is derivative because it uses Blackwell-MacQueen sampling; their evidence? - they skimmed through a paper we had cited that also used the sampling algorithm. This is like saying a paper is derivative because it uses SGD.

I am on the review panel of some conferences too and it is not uncommon to be assigned a paper outside of my comfort zone. That doesn't mean I cut and bail. You set aside time, read up on the area, ask authors questions, and judge accordingly. Unfortunately this doesn't happen most of the time - people seem to be in a rush to finish their review no matter the quality. At this point, we just mechanically keep resubmitting the paper every once in a while.

Sorry, end of rant :)


Ah, Dirichlet Processes, such lovely things.

Reading this paper, I was struck by how obvious most of the solutions were given my own background from grad school benchmarking quantum annealers and other classical solvers for spin lattices (mostly thermal sampling inspired approaches). I'd argue one could do an even better job than the analysis in Anthropic's paper, but it's astonishing how basic questions like "well how sure are we this is real?" just aren't asked seemingly in ML papers.

I developed a passion for Bayesian statistics approaches in grad school, and had a lovely time specifically thinking quite a bit about DPs, Bayesian bootstraps, etc. I'm sorry your paper is bouncing around. I think folks underestimate these days the value of really thinking about what you know and how you know it, and how to really model uncertainty, and definitely underrate non-DL approaches to problems.
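
As a rough illustration of the "how sure are we this is real?" question, here is a minimal sketch of a Bayesian bootstrap over per-question eval scores. It is not taken from the Anthropic paper; the function names, the score data, and the number of draws are all assumptions made up for the example.

```python
import random


def bayesian_bootstrap_mean(scores, draws=2000, seed=0):
    """Return posterior draws of the mean score under a Bayesian bootstrap."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(draws):
        # Dirichlet(1, ..., 1) weights, built from normalized Exponential(1) draws.
        g = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(g)
        means.append(sum(w * s for w, s in zip(g, scores)) / total)
    return means


def credible_interval(samples, level=0.95):
    """Equal-tailed interval read off the sorted posterior draws."""
    s = sorted(samples)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]


# Hypothetical per-question correctness (1 = correct) for a single model.
scores = [1] * 70 + [0] * 30
posterior_means = bayesian_bootstrap_mean(scores)
lo, hi = credible_interval(posterior_means)
print(f"accuracy = {sum(scores) / len(scores):.2f}, 95% interval ~ ({lo:.2f}, {hi:.2f})")
```

With only 100 questions the interval is wide enough that a difference of a few points between two models could easily be noise, which is the kind of check that tends to go unreported.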


Thanks, yes, a lot of good ideas in ML seem to be slowly vanishing from the collective awareness. I have nothing against the current spate of methodologies which are empirically great - and if one needs proof, I am a "happy customer" at my day job which is mostly DL and a lot of LLMs - but it seems we are buying into a world where it is one versus the other. And this need not be. Great ideas are great ideas irrespective of age and there is value in preserving them.

Anyway, since this thread surprisingly evoked a mini-discussion on Dirichlet Processes (DP), if someone needs an intro, I have tried to balance math and intuition in a description in my thesis: Section 2.2 in [1].

[1] https://drive.google.com/file/d/1zf_MIWyLY7nxEr5UioUQ7KhOQ1_...

EDIT: I looked at the description and I confess it still has a lot of math (since it is part of thesis). I will probably translate this to be more friendly and put it on my blog.


Just a note

> exposition is not clear (we have rewritten the paper a few times - we rewrite it based on comments from venue i to submit to venue i+1 - its a wild goose chase)

Does not mean that the paper is invalid, but maybe the storyline is difficult to follow, the results not easy to interpret, or overall badly written or missing justifications. Even if you take into account the reviews to rewrite it, it doesn't mean the paper is clear and easy to understand.

As you noted, researchers need to read material outside of their comfort zone, and the publications have shifted in focus. Before, you could expect a reader to be familiar with the topic; now you need to educate them as clearly as possible.

I picked a random text inside the paper:

> The workings of the technique itself are presented at a high-level in Figure 2.

Annoying to read.

> Instead of learning the training distribution directly, which might be expensive because of the dimensionality of the data, we first project the data down to one dimension.

Why is that good enough? Justification missing

> This is done just once, and is shown in the left panel in Figure 2. Since we are solving for classification, we pick this dimension to be a numeric indicator of how close an instance is to a class boundary.

Why is it a good indicator, justification

> As a convenient proxy, we train a separate highly accurate probabilistic

Ok, references on previous research that show it can work?

So in essence, I don't say you need to explain everything, but the text could be more clear on the choices and why they make sense.

My gut feeling is that you know and understand what you are doing, but you are missing too many of the justifications that prove your work is valuable.

I didn't read the whole thing, so maybe I'm missing the picture, but from random sampling on the text I expect the rest to follow the same.

When I read the introduction, I don't want to read 'we did that and that and that', but 'there was this issue, and we solve it in this way for this reason'.

And following issues->solution->why should give me enough understanding of what you are trying to achieve.

Follow-up sections should refine the solutions


Thank you for these comments. I appreciate them and I'll consider them in my next draft. However, I would like to point out a few things; just so that we have the larger picture in mind. Again, I do appreciate you took the time to look up the paper.

1. When I said we revise the paper between two submissions, I wasn't implying it was becoming "better". The message was that there is no general consensus around what should be expanded and what might be concise. Someone believes you should discuss prior work more, someone thinks the main algorithm requires more elaboration, someone wants you to talk more about BayesOpt etc., but you just have <10 pages in the main paper, and putting this stuff in the Appendix, or citing source, doesn't seem to be good enough in many cases (another comment in a sibling thread gives an example wrt GANs, and my experiences have been no different).

2. You say you randomly picked a few sentences to read; that's good for a casual discussion but that should not be how a review process functions. Some of the best reviewers I've encountered (and I hope I am continuing in that tradition) come back to say something like "I see what you're getting at, but your intro. doesn't sell it well enough; think about writing it like this ...". Rejecting based on random skimming is exactly one of the things I'm calling out. Let's face it - like a lot of things, high quality reviewing is hard. It isn't supposed to be quick or easy.

3. Predicting how much to elaborate: this is probably an extension of the first point, but I feel like this has become way harder in the recent years. The rule that mostly works seems to be that if its not a trending topic explain it as much as you can, because cited background material is overlooked. This is unfair for areas that are not trending - the goal of research should be to situate itself closer to "explore" on the "explore-exploit" spectrum, but the review system today heavily favors "exploit". And like I mentioned, a page limit means that the publication game stacked against people not working on mainstream ideas. This should not be the case.


> 1. When I said we revise the paper between two submissions, I wasn't implying it was becoming "better". The message was that there is no general consensus around what should be expanded and what might be concise. Someone believes you should discuss prior work more, someone thinks the main algorithm requires more elaboration, someone wants you to talk more about BayesOpt etc., but you just have <10 pages in the main paper, and putting this stuff in the Appendix, or citing source, doesn't seem to be good enough in many cases (another comment in a sibling thread gives an example wrt GANs, and my experiences have been no different).

That's exactly my point: the reviews do not converge because the message is too diffuse or not justified enough. I recently had a paper rejected because it was too difficult to understand. It was 4 pages long; now it has been expanded to 20 pages and sent to a better journal. The content was too big for 4 pages, and we couldn't fit in enough justifications. But in your paper you still have many places where the text could be shorter and clearer, gaining at least 1 page of content. Learning to write good research takes a lot of time, and a PhD is the place where ideally this should happen. It's difficult, but you'll get there if you work on it enough! Read best paper awards of good conferences, notice how much material is there in the same number of pages, and reverse engineer what they did to make the paper clear, concise and easy to follow.

> 2. You say you randomly picked a few sentences to read; that's good for a casual discussion but that should not be how a review process functions. Some of the best reviewers I've encountered (and I hope I am continuing in that tradition) come back to say something like "I see what you're getting at, but your intro. doesn't sell it well enough; think about writing it like this ...". Rejecting based on random skimming is exactly one of the things I'm calling out. Let's face it - like a lot of things, high quality reviewing is hard. It isn't supposed to be quick or easy.

You cannot choose who will read. But even for the more thorough readers, if it's difficult to understand or missing justifications from the beginning, they will give a bad review, even if they read the whole thing. Reading should be like a conversation with the author. If I find that conversation too sloppy or erratic, I will not understand the message, and that's when I ask the author for more justification on some part: either I couldn't follow the logic or I didn't agree with something, so I require more justification.

> 3. Predicting how much to elaborate: this is probably an extension of the first point, but I feel like this has become way harder in the recent years. The rule that mostly works seems to be that if its not a trending topic explain it as much as you can, because cited background material is overlooked. This is unfair for areas that are not trending - the goal of research should be to situate itself closer to "explore" on the "explore-exploit" spectrum, but the review system today heavily favors "exploit". And like I mentioned, a page limit means that the publication game stacked against people not working on mainstream ideas. This should not be the case.

I agree, there are no more general experts, everyone works in a very niche subfield, you don't get people that know the sota. Learning the good tradeoff is difficult. My threshold is: don't explain the math unless it's not self obvious why. For example for some equation I can give more insights on how it affects my method and if a parameter of the equation is very important to my method, a complete analysis of its effects and analogies and experiments to see its impact. I try to make the main story line as crystal clear as possible, if I deviate too much, it's better on a second paper. My experiments should reflect not trivial things. Finally I make sure the abstract corresponds to the text. I mainly don't work in deep learning, so by default my topics are extremely hard to find reviewers, I feel the pain. But it's my work to make them understand what I'm achieving and why it's important.

Hope that helps :)


I feel like we are debating slightly different perspectives, and with that lens I agree with what you say. Here is the difference in perspectives (and this is decoupled from this particular paper): your take is that today the reviews work in a certain way and here are some things we could do to maximize our chances of acceptance. My take is that reviews shouldn't work this way.

To take some examples:

1. > You cannot choose who will read.

Specifically no, but generally, yes. I'd expect the reviewer to understand ML. And if this is not the brand of ML they're familiar with, I'd expect them to put in the work to familiarize themselves during the review process, in the interest of fairness. After all are we not seeking out qualified reviewers for the review process? This is not just anyone who stumbles across a paper on the internet.

2. > message is too diffuse

Any message would appear diffuse/opaque/abstract to someone unfamiliar with the area. This is exactly why an objective review process must equalize such communicative biases. This is partly facilitated by the conference picking the right reviewers and with their review-assignments, and partly it is also the duty of a reviewer to fill in whatever gaps of comprehension remain.

3. > Read best paper awards of good conferences, notice how much material is there in the same number of pages, and reverse engineer what they did to make the paper clear, concise and easy to follow.

Good general advice but you are preaching to the choir. I do read best papers from various conferences and I run reading groups where we discuss papers from ongoing conferences. I run an applied ML research group in the industry - this pretty much comes with the job. Further, I don't think that best papers are head-and-shoulders above non-best papers; they are often voted to the top because they solve a broadly known problem, or they further the understanding of such a problem. Writing plays some role here, but is not the discriminative factor.

4. Requiring justifications. Yes, there is a rebuttal phase for that.

Just to be doubly clear, I am not saying papers (and this paper) can't be improved. But that is not the argument I am making.


I totally understand your take, and you are right in some respects. Reviews SHOULD NOT WORK IN THIS WAY. I totally agree on that, but consider this: the population is increasing, the number of topics is increasing, and we are no longer in a system where reviewing works as it was designed to.

When it was designed, only a few people got the chance to read, and information was not easily available. So, people became experts, and knowing the sota was mandatory. Now, due to the high quantity of (good and bad) research, you cannot expect the review system to work properly.

But you are still stuck in this system. So consider what is important:

- do you want to write and hope that by chance the right people will read it, and they will be educated enough in your topic and have enough time at their disposal. Or

- Do you think your idea is good and should be more known.

If it's the latter, it's your job to make your idea as clear as possible so that any (good) researcher can understand it, and therefore use it. We must work in the reality of the current system if we want to spread interesting ideas to the community. The publication system is a social system, and it evolves with the people inside. You want to write to spread knowledge. How can you do that if the probability is that the reader will not fully understand?

Time is very limited and I always have many things to do, so I only read what I filter as worthy enough. That filter is based on the quality of writing. If some paper is important but badly written, it will automatically fall into my 'if I have time to read it' pile, and most of the time it will never reach the 'to read' category, because there are many, many well-written papers with good ideas inside.

We work in a biased system. It's extremely difficult to find reviewers, we do what we can with what we have.

I also was infuriated when I got a review saying 'you didn't explain structure from motion' in a conference with a topic on structure from motion. But the reality is this. If I want my papers to be read, I must adapt to my audience.

> Read best paper awards of good conferences

With that sentence, I did not mean 'read it for knowledge' but read it through the lens of the writer: why did they present the topic in this way, what makes this paper clear and another on the same topic not clear at all? Reverse engineer the writing style. It's not about knowledge of the content of the paper, it's about communication. Best papers do not always have the best ideas inside, but they are presented in a way that, even if the topic is difficult, provides insights on it. And often those insights are what readers want to read. The maths are not important, the important thing is the insight you give to the readers. That insight can be translated into their field if they have internalized it enough.


  > I also was infuriated when I got a review saying 'you didn't explain structure from motion' in a conference with a topic on structure from motion. But the reality is this. If I want my papers to be read, I must adapt to my audience.

Honestly, I hate this take. I don’t think it’s good for science or academia. Papers do not need to be readable to everyone. The point is to be readable by other experts in your niche. Otherwise, I don’t know who to write to, and that’s exactly the same problem the parent is having.

Writing to too broad of an audience also makes papers unnecessarily long. You have to spend more time motivating the work and more time on the background. This has spinoff effects where reviewers can demand you cite them, contributing to the citation mining nightmare. I’ve seen 8 page papers with 100+ references (the paper I referenced has 78). This is more what we expect from a survey paper. When background sections are minimal you can’t justify asking unless you are critical to the exact problem being solved.

Every paper rewrite is time and money that should be better spent on research or other activities. Every rewrite is an additional submission into the next conference. I don’t think labs are submitting 20+ papers per round because they wrote 20 papers in that 3 months (with some exceptions) but rather because they wrote a bunch and are recycling works from the previous few years. This increases reviewer load as well.

The question then is how people enter a topic. Truth be told, I don’t think it’s any easier than when papers had under 30 references. For reference, that one Cybenko paper we all know has under 30 references but is 10.5 pages. What I think we should do instead is allow citing of blogs and encourage people to write tutorials. I think this would actually be a really useful task for 2nd and 3rd year PhD students. You learn a lot when writing those things and that’s the stage where you should be entering expert level at your niche. The problem is that we have no incentive to do any of the other critical tasks in academia. This is why I personally hate it. We are hyper focused on this novelty thing but in reality that doesn’t exist and is highly subjective. It just encourages obfuscation, which we’ve routinely seen from high profile labs.

We work in ML, how are we not all keenly aware of reward hacking and knock-on effects?! I honestly think the fact that we cannot get our house in order is evidence that we can’t safely build safe AGI yet. This is certainly magnitudes easier of a task, not to mention has significant reward (selfishly, it highly benefits us too!). Everyone feels that something is off but no one wants to do anything about it. We’ve only implemented half-assed measures that are more about signaling. Can’t let an LLM review for you? But the author is responsible for proving the reviewer used one? We’re all ML experts… we all know this isn’t possible except in the most cases. It’s as if you got shot while blindfolded and the police won’t investigate until you bring them evidence of who shot you and with what kind of gun. It shouldn’t matter if a review is bad because it was written by an LLM or because it was by a human. Just like it shouldn’t matter if you were shot or stabbed.

  > The maths are not important, the important thing is the insight you give to the readers.
I also hate this take. The math often __is__ the insight. I agree that a lot of papers have needless math (look at any diffusion paper or any paper with attention copy pasting the same equations). But other works need them. The reason to use math is the same reason we program. If there was an easier way to communicate, we would (note: math isn’t just symbols, it can be words too). Math and programming are hard because they are languages that are precise and dense. The precision is important when communicating. Yes, it might take longer to parse but it is unambiguous when interpreted (it is also easier to parse when you’re trained and in the habit. Just like any other language).

I think we lost our way in academia. We got caught up in the excitement. We let the bureaucrats take too much control and dictate the universities. We got lost in our egos (definitely not new) and too focused on prestige and fame. Our systems should be fighting these things, not enabling or encouraging them. Yes, the people at the top benefit from these systems, but the truth is that even they benefit from fixing things.


Thank you for taking the time to read this already long thread.

> Honestly, I hate this take. I don’t think it’s good for science or academia

I completely understand, I don't say that I like the system as it is, just that we are stuck in it because there is no real alternative.

I would also love to have experts reading papers, but the sad truth is that it is often a first year PhD student doing the review outside of their field.

> The question then is how people enter a topic.

I really like the idea of encouraging PhDs to write blogs and supporting material for their paper, or at least to reference the good quality blogs that led them to insights. But it takes time, time that usually they don't have. Teaching assistants in particular spend most of their time working with students, following projects, etc.

If you have a solution to how to do it properly I'm interested. I had myself some ideas on how to solve this, but it would require time and money that I don't have.

> We are hyper focused on this novelty thing but in reality that doesn’t exist and is highly subjective.

Thankfully I don't work on fashionable research. Of course I have some people working on fashionable DL, but we stay focused on the why and not just putting boxes one with the other and hoping it works.

> The math often __is__ the insight.

When the math is the insight, it should be there and followed by the insight in text and analysis. I don't say don't put math, but put interesting math. Nobody cares about copy pasted math from any other paper, with possible mistakes inside.

But I keep my view: the math is a means, but ultimately what we want to transmit is the insight behind it; the math can be recreated at will when the insight is there.

I love math, and some fields are more mathematical than others, and they profit from it.

> We let the bureaucrats take too much control and dictate the universities.

Agree totally.


Sorry the topic is obviously a bit sensitive to me haha. Thanks for the tone, I can tell you'd make a good mentor and manager.

  > If you have a solution to how to do it properly I'm interested.
I have proposed solutions to lots of this stuff actually. But it does require a push from others, and it’s necessary for those in prominent academic positions to push them. I think the issue is that there's a lot of interconnected parts to all of this. I doubt that there's an optimal solution, and in those settings I think flexibility is what one should err on the side of. It gives us the room to adapt more easily to local changes in environment. But there will always be some downside and I think it is easy to focus on those and not compare against the value of the gains.

For blogs:

I think we should just count these as publications. People are just looking at citations anyways (unfortunately). We should at least count these as citable.

There’s a good knock-on effect to this one too. It can encourage replications (the cornerstone of science), tutorials (which is teaching and like a stepping stone towards books. This also helps students write better), and can help us shift towards other publication media in general. Like why do we publish works about video or audio in a format that can’t play video or audio? It's absurd if you ask me. The only reason we do is momentum.

I think it is also important to just be writing. To be a good scientist you must also be a philosopher. It is not just about doing the work, it is about why. The meta-science is just as important as the science itself. I think so is the fun and creativity that we have more room for in blogs (I think there should be more room for this in papers too). Research is just as much of an art as it is a science, and we need to be working the creative muscles just as much as the analytical ones. I also think it helps with the discourse between papers. I mean Andrew Gelman's and Scott Aaronson's blogs are famous for having all of these things. They are valuable to their communities and I think even a broader audience. But as the environment is now, this is disincentivized. I think more people want to do it and are motivated, but it is easy to push off when there is little to no reward (sometimes even punishments, like an advisor or boss saying "spend less time writing and more time researching"). If you're going to "slack off", you might as well do something that is more recovering, right? [0]

For reviewers/review system:

Again, incentive structures. The problem right now is that it’s seen as a chore. One that if you do poorly it doesn’t matter. So my first suggestion is we throw out the conference and journal structure as we know it.

The purpose of these structures was because we didn’t have the ability to trivially distribute works. You hire editors to make sure you don’t publish garbage because it’s expensive to publish and they correct spelling and grammar to make sure it’s providing the most value (communication). There may be conversations to improve the works, but not to downright throw them away. Everyone here is well aligned in goals. They're all on the same team! But none of that happens now. We have a system where it is reviewers vs authors. This should not be an adversarial setting. In fact, a few spelling mistakes are justification to reject a paper now (ask me how I know). The purpose was always about communicating work, not about optimizing for what is the best type of work and most important. Truth be told, no one knows and we can only tell later down the line.

There’s two common critiques I hear with regard to this:

  1) How do we discover work?
  2) How do we ensure integrity of the work?
Well who actually goes to the conference or journal websites to read them? We all use the arxiv versions, which are more up to date. We're also reading preprints, and especially in ML this is the way to keep up to date. I only go to grab the bibtex because the authors only have the arxiv one on their GitHub or webpage (pet peeve of mine). We pretty much discover from other papers, google, peers, twitter, and well you’re a researcher you know. The “getting published” is just a byline in a tweet and a metric for the bureaucrats.

The physicists created arxiv because it was what they were doing already. Which was you publish your draft in your big lab and others read it and critique it and you take that feedback to improve. There's always mean people, but mostly everyone is on the same side here. We just extended who had access to the library, that's all.

Discovery is a problem. But what about integrity and verifiability? I find this a feature, not a bug (and you'll be able to infer how this couples with writing directly to niche peers instead of broader groups). Sometimes if you try to take too much control, you end up getting less control. The truth is that the validity of a work is at least doubly intractable. You can't figure it out just from reading the paper. The only way verification happens is through replication. This cannot be done by reading alone. Works are (often, but not always) falsifiable through review, but not verifiable. The distinction matters.

And I actually think the increased noise here is a good thing. Too many conflate "published" (in a conference or journal) as a mark of credibility. Both outsiders and insiders (other academics). It is too lazy of a metric. Many of our problems arise from such an issue and I'd say that oversimplification of a metric is a corollary to Goodhart's Law. We researchers can all read papers and determine pretty quickly if it is fraudulent or not, at least if it is in our field. But outsiders can't. They can't even tell the credibility of a conference or journal, and there's too many scam ones these days. This creates an extra problem where the science journalists, who are also caught in the ridiculous expectation of generating work in infinitesimal amounts of time, end up writing on works with poor understanding of them (and the context surrounding them). Adding noise here pushes them to reach out to experts which will increase the quality overall, as the expert talking to them will not just filter crap but also be able to point to important nuance, things that a novice would miss. Especially when those things are buried in technical language or equations :)

In addition to this it removes power from the lazy bureaucrats AND harms the frauds. I think it is easy to believe that fraudulent work would flourish under this environment, but I think the opposite. Yes, arxiv has plenty of fraudulent works on it, but they are small in comparison. The frauds go to the fraud journals. Their scheme only works because they are able to have someone give their crap a mark of approval. When there is no seal of approval, one must go ask the experts. It is just a shift in the trust structure, not a destruction of it. There'll be collusion rings, but we already have those (including in high profile venues!). I do suspect there may be a bit more at first, before it stabilizes, as everyone adapts. But I think we already do most of this stuff naturally, so it won't be that hard.

But I do think we should keep conferences. There is value in meeting in person (I also think universities should encourage more inter-departmental collaboration, as well as extra-departmental and extra-university collaboration. I do think it is silly that we aren't having deep conversations with our neighbors). But I think these should be more invitations. You have invited speakers and the rest is you focusing on getting people to talk and facilitating these collaborations. That's one of the primary goals of conferences, building these social networks. Social media helps, but there's a big difference sitting face to face. I also think they should have training sessions (like they do) and workshops should be focused around these, not around publication. So less stuff is "published" in conferences, because publishing is just releasing work!

There's obvious downsides to this and there definitely is a lot of room for refinement, but I think this is a fairly good structure. At the end of the day we need to have faith in one another. The structure of science has always been "trust, but verify." But we stopped doing the latter and pigeonholed our measures. So now we have neither trust nor verification. I think it has all been good intentions. I'll save you the cliché, but what is a cliché if not something everyone can state but few people actually follow? I get the desire to remove all noise, but I think such a goal is fruitless, it is impossible. So instead I think it is about finding the optimal noise. Instead of trying to get rid of it, we should embrace it. I hope as ML researchers we can recognize this, as noise is critical to our works. That without it, it all falls apart. Because, it is a feature, not a bug. It is a necessary attribute for generalization, and I think it isn't just for machines.

[0] Personally I find that the big reason for stress is that we remove creativity, flexibility, and a lot of the humanity from the work. We're overburdened by deadlines, there's too much work to do, and the truth is that the work is difficult to measure. Progress is difficult to see a priori, so you can constantly feel "behind". This just leads to slowdown and burnout. We're all here out of passion (at least I hope so! It'd be insane to do a PhD or research without the passion!). The structure should be made to gently push us back on track for when we get too lost by some other fascinating rabbit hole (who knows if it goes anywhere. But is going down it a bad thing?). But if we can't have fun, we are less effective at our jobs. If we can't take time to think about the meta, we get worse at our jobs. If we aren't talking to our peers, we get worse at our jobs.


Is a preprint of your paper available?

I looked at your blog a bit and was able to find this, which may be it?

> Learning Interpretable Models Using Uncertainty Oracles

https://arxiv.org/abs/1906.06852

https://doi.org/10.48550/arXiv.1906.06852



I copied the DOI for convenience but they’re the same paper.

I have no formal math background really so I can’t speak to your methods but I appreciate that you have shared your work freely.

Did you have any issues defending your thesis due to the issues you described above related to publishing?

Noticed a typo in your abstract:

“Maybe” should be “may be” in sentence below (italics):

> We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe ~ 100% improvement over baselines), and also, (b) that this maybe applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models.


Yes, it did come up during my defense, but it was deemed not to be a concern since I had one prior paper [1] (the original one in this thread of work, the paper I linked above was an improvement over it), and my advisor (co-author on both papers) vouched for the quality of the work.

Thank you for pointing out the typo - will fix it!

[1] https://www.frontiersin.org/journals/artificial-intelligence...


Thank you for writing the blog post on Jensen's inequality. It is one of the best pieces of introductory material on this topic I've seen.


Thank you!


  > Sorry, end of rant :)
Don't be. Issues like this are the reason I haven't defended yet. The fact that an AC didn't laugh at that "critique" is itself indicative of a problem. They're as checked out as the reviewers. I was doing work in a more mathy area and could not get assigned reviewers that understood what was being done. To try to get something through I tried a more popular domain, won a bet with my advisor that I could get SOTA on a very popular dataset in a few months, but I have no compute left. I can beat big labs on one dataset with far less compute, but how can I compete when reviewers want dozens? Even if others weren't held to that standard... There's not enough compute for that. You can always have "more experiments"

For review, I set aside hours for each paper, and more the further out of my domain that they are (I'm also very happy to increase my score with a rebuttal and mark lower confidence (I frequently write what would change my mind to help authors). My best post rebuttal ever was "The authors answered all my questions, but due to the lack of novelty I'm lowering my score"). I'll keep doing this, but to be honest, after I defend I have no intention to push to conferences or journals. I just fail to see the value. It has caused me to spend more time rewriting and taking away from research. It just makes me upset and isn't making me a better researcher. I crave for someone to actually _criticize_ my work. I have a good citation count and h-index. My best paper is "unpublished", has hundreds of citations, resulted in a very popular blog post, and years later people are still messaging me for using it in their work. I don't think I'm a top researcher, but I don't think I'm well below the pack.

I just hate that my research directions are pigeonholed. That you need to do topics that people care about. That you need to evaluate with large scale. As if we can't have conclusions beyond the empirical. As if this isn't about communicating our work. That I need to write to those that are not "peers" (niche domain experts, as opposed to broader domain experts). As if experiments aren't proxies, but are demonstrations of a product. I think this significantly slows down the progress to AGI since it causes us to railroad to build from large models from big companies, and there is so little interest in anything else. How can we explore more architectures, learning methods, and all that if we're required to get SOTA out of the gate?

I don't want to say too much about my work since it is still bouncing around in review and I don't want to dox myself. But I'll say something about a work that I __reviewed__. It was for Neural PDEs. Review was for a workshop, and it was clear to me that this paper was rejected from the main conference. What was not clear is why. Until I got to see the reviews from my peers. Their complaints had the standard "novelty" and "not well written" (it was very well written btw), but the kicker for them was that the datasets were synthetic... Like... what?! Why does that even matter? They're solving equations! Luckily they had low confidence and I got the paper through. I wasn't surprised when a few months later I stumbled upon the paper again and found out it was from Welling's group.

  > At this point, we just mechanically keep resubmitting the paper every once a while.
I really wonder how long it will take conference organizers to recognize that the noise in the review process is a significant contributor to the increasing number of submissions. This seems a rather obvious connection but I rarely hear it discussed. Not to mention that it can damage paper quality (this certainly happened to mine, and I suspect yours). Reviews can improve the papers if the review contains actual critiques. But hey, why do work when no one questions a reject?

I feel like mine was more ranty lol. But it helps to not feel alone.


In my view, in most cases ACs now perform one of two roles: (1) summarize what reviewers have said, and/or (2) if there is a high variance in scores, they request the reviewers to sort it out among themselves, and THEN summarize :). I am exaggerating here - sometimes they do engage, but these cases are few and far between. Among my personal experiences, this has only happened twice: (a) during an ARR review where I was an author, we had requested the committee to intervene because one set of reviews went wildly off-track (along the lines of "more experiments", like you mention), (b) as a reviewer, I had pointed out some glaring benchmarking omissions, and the AC took time to understand the concern and decided to bring it up for discussion.

  > I crave for someone to actually _criticize_ my work
Yes! But if you are not doing mainstream DL, good constructive criticism is almost impossible to obtain. I feel like many reviewers expect half the paper to be a tutorial if it is not a trendy topic. Which I find is unfair for multiple reasons: (1) for a trendy topic much more complex topics are unexplained, because it is assumed the reviewer has heard of it, (2) yes, the previous point makes sense, but this is what cited materials or the Appendix is for, (3) most conferences have a page limit for the main paper, so you cannot go about rambling and arbitrarily explaining ideas, and (4) this is supposed to be a rigorous review process, it is not supposed to be easy. It shouldn't come as a shock that some papers take (sometimes a lot of) work to review!


  > I feel like many reviewers expect half the paper to be a tutorial if it is not a trendy topic. 
I tried to push a paper using a GAN to a workshop. I was asked to spend more time explaining what a GAN is "for those unfamiliar with the topic." I was baffled. Sure, ML moves fast, but that's catastrophic forgetting right there... (and in a fucking workshop?!)

I honestly believe that if someone is not intimately familiar with GANs then they should not be reviewing for a generative vision workshop.

  > I feel like many reviewers expect half the paper to be a tutorial if it is not a trendy topic. 
The difficulty of this gets harder with different topics. Generative works should show samples. But how much? How representative? These are crucial to evaluating the work but they devour your text limits.

It is always easy to ask for more. But with page limits there are clearly limits. I think it is too subjective. I wish we went more the direction of math papers which are often VERY short. Use as much space as you need. No more, no less. I think the formats are just too limiting (not to mention that paper isn't great for a lot of topics like video, point cloud, audio, pose estimation, and many others). But momentum is a powerful force.


The front matter in Vladimir Vapnik’s book “Statistical Learning Theory” (first edition published 1995) has this quote:

*

During the last few years at various computer science conferences, I heard reiteration of the following claim:

“Complex theories do not work; simple algorithms do.”

One of the goals of this book is to show that, at least in the problems of statistical inference, this is not true. I would like to demonstrate that in the area of science a good old principle is valid:

“Nothing is more practical than a good theory.”

*

It’s seen in page xii of the front matter at: https://link.springer.com/content/pdf/bfm:978-1-4757-3264-1/...

Vladimir was a friend during this time, and I think about this quote a lot with regards to ML tinkering.


I haven't had a chance to read that, but that quote suggests I should (especially considering the author and the editors).

I often refer to "Elephant Fitting" w.r.t these systems. I suspect you understand this, but I think most think it is just about overfitting. But the problem isn't about the number of parameters, but that parameters need to be justified. As explained by Dyson here[0]. Vladimir's quote really reminds me of this. Fermi likewise was stressing the importance of theory.

I think it is a profound quote, and you were (are?) lucky to have that friendship. I do think abstraction is at the heart of intelligence. François Chollet discusses it a lot, and he's far from alone. It seems to be well agreed upon in the neuroscience and cognitive science communities. I think this is essential to understand in our path forward to developing intelligent systems, because there are plenty of problems that need to be solved in which there is no algorithmic procedure. Where there is no explicit density function. Intractable, doubly intractable, and more problems. Maybe we're just too dumb, but it's clear there are plateaus where luck is needed to advance. I do not believe our current machines would be capable of closing a gap.

[0] https://www.youtube.com/watch?v=hV41QEKiMlM


Thank you for the link.

Yeah, what we are doing now in ML isn’t really “engineering” in the best sense of the word. We don’t have a theoretical machinery that can predict performance of ML designs. (Like the way you can for a coding scheme in communications theory.) We have a lot of clever architectures and techniques, but not a design capacity.


As someone who had questions about some of what you said and feels legitimately scared to ask what you meant out of fear of being judged:

> I've been told I'm "gatekeeping"

I mean...when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.

> I swear, the big reason models are black boxes are because we _want_ them to be.

Talk is cheap.

> I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.

I agree "fuck theorists" is in no way constructive. But, Deep Mind has objectively helped move the field forward. And your criticism of "get good" stuff? Did you not just tell people to "learn math" rather than help them to understand it yourself? That's the _exact_ meaning of the phrase "get good" on the internet. At best you're both being about as toxic (at least from your own description).


  > when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.
Gatekeeping is controlling access. Not to be confused with hurdles. I'm more than happy to have more people in "the party." No one is being excluded in the way that isn't also true for any other field. You unfortunately need some level of expertise to be able to understand discussions between experts. But am I stopping you from getting that expertise? No, in fact I'm very happy to lend a hand! Those aren't gates, they're hurdles. You don't need a specific PhD or to go to a good school or anything. It's about the knowledge. If you need a helping hand to get over, ask, because others may not know or may not know if you're struggling fruitlessly or struggling as part of the process of improving.

But yes, hurdles exist and they are not bad. I sure as hell don't want someone that can't do calculus designing rocket engines. And you probably don't want a rocket engineer performing surgery on you. Call them what you will, but it's not a bouncer at the door telling you you're not "pretty enough", which is what gatekeeping is generally used to refer to.

  > Talk is cheap.
Sure, but we actually know a lot more about the inner workings of networks than most people realize. Sure, they aren't transparent, but that doesn't mean they are completely opaque either.

But I have no idea how to respond to this comment. What I said was fairly broad and this response is broader. Are you asking for "proof"? Of what? Interpretability? Is not the article proof of it to some degree? Or Golden Gate Claude?

  >  Did you not just tell people to "learn math" rather than help them to understand it yourself? 
No? I think you misunderstand. Mind you, this is hacker news. Would you like some books for reference? A roadmap? If you have suggestions for how I should phrase my venting differently, I'm all ears. But it feels like that would be out of left field to just drop a bunch of random books and requires a lot of words to explain how all these things connect. I've written many "walls of text" here and frankly, anything longer than a paragraph often gets skipped over. It's fine, it's HN after all.

  > you're both being about as toxic
Are you aware of the things I'm referencing? It seems like you are not. Given that, I think you should reserve your judgement and accusations until you know more about the context. (e.g.

So I will add more context to clarify my complaints, for any of those interested. I specifically called out Mission Impossible Language Models[0], so what's that about? I suggest reading the paper. The authors create a continuum of difficulties in impossible languages. The hardest being a random word ordering. The claim is that LLMs can't learn impossible languages just as well as natural languages. It's fairly easy to understand the error in this work. They use perplexity, which is sometimes called "surprisal." You take it conditioned on the previous words and you calculate what is likely to come next. But perplexity doesn't tell you that the model didn't learn the language, or even that it learned it less efficiently. The metric isn't going to work for a one-to-one comparison with a structured language. The reason being that there is naturally more entropy in the impossible language. Frankly, because there are more words that are equally likely to come next. It's comparing coin flips to dice throws.

Let's use an example: our target sentence will be "My name is godelski." In a random shuffle language we have 4! (24) ways to represent that sentence that are all __equivalent__. That's the key part. In natural language, all I can think of is 2 ("Godelski, my name is" as a highly unlikely alternative). So in natural language if we have "My name" and are predicting the next word, "is" is pretty likely. But in the random language "is" is just as likely as "<name>". This isn't proof that the language isn't learned, it is just that the language isn't structured. "My name is godelski" and "My name godelski is" are equivalent sentences in a random ordering language. But actually, this gets even harder because the tokenization was trained on natural word order. If you look at Table 1 you'll see how this gets messy (notice that "bookshelf" is the tokens " books" (space intentional) "he" "lf"). The picture gets clearer when you look at how they prepared the data (it isn't shuffled each time the model gets the data, it is shuffled once and then the model is trained on that. This is not the same as the random language and unless you're really lucky, there's going to be certain patterns more common than others and so that'll just make it more difficult for the model. The dataloader should shuffle sentences, which will teach the model to ignore the patterns. You should also measure perplexity against all valid predictions, not a single one. This one is a killer for me).
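To put a number on the "My name is godelski" example, here is a minimal sketch in Python. It is not the paper's evaluation code. It assumes a toy setting where the model already knows which words are in the sentence and only the order is unknown, and the function name `ideal_shuffle_perplexity` is made up for illustration. Under that assumption even a perfect model of the random-shuffle language has a per-token perplexity above 1, while a fixed-order language can reach 1.

```python
import math
from itertools import permutations


def ideal_shuffle_perplexity(words):
    """Per-token perplexity of a *perfect* model on a uniformly shuffled sentence.

    Toy assumption: the model knows the bag of (distinct) words, so at each
    step the next word is uniform over the words not yet emitted.
    """
    n = len(words)
    log_prob = 0.0
    for remaining in range(n, 0, -1):
        # Each of the `remaining` unused words is equally likely to come next.
        log_prob += math.log(1.0 / remaining)
    return math.exp(-log_prob / n)


sentence = "My name is godelski".split()

# All orderings are equivalent sentences in the random-shuffle language.
n_orderings = len(set(permutations(sentence)))

shuffle_ppl = ideal_shuffle_perplexity(sentence)
fixed_ppl = 1.0  # toy assumption: a fixed word order is fully predictable

print(f"equivalent orderings: {n_orderings}")
print(f"ideal perplexity, shuffled: {shuffle_ppl:.3f}")
print(f"ideal perplexity, fixed order: {fixed_ppl:.3f}")
```

The gap between the two numbers comes purely from the language having more valid continuations, not from any failure to learn it.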

Side note:

  > fear of being judged:
You're always going to be judged. Stand up for yourself and what you believe in. Don't be afraid of being wrong either. Truth is, you're always wrong. It's a matter of degree. The problem isn't being wrong, it is not being able to change your mind. Even if things get heated between people, there typically isn't ill will between them if they believe the other person is capable of changing their mind.

Clearly, you have judged me quite harshly.

[0] https://arxiv.org/abs/2401.06416


Doesn't seem like any further discussion will be worthwhile or constructive for me. Sorry, hope you understand.


That's okay. I'm not offended. And I hope you know I have no hard feelings. Despite disagreeing, I do respect your opinions. These are hard problems to solve and I think there's no perfect solutions.


Thanks!


What's your objection to Mission Impossible Language Models?


The real problem with the paper is not any of the mathematical details that others have described; it is more fundamental. Chomsky's claim is that humans have a distinctive property that they seem to not be able to process certain synthetic language constructions, namely linear (non-hierarchical) languages, as well as synthetic human-like (hierarchical) languages, and they use a different part of the brain to do so. This was shown in experiments (see Moro, Secrets of Words, I think his Nature paper also cites the studies).

Because the synthetic linear languages are computationally/structurally simple LLMs will, unlike humans, learn them just as easily as real human languages. Since this hierarchical aspect of human language seems fundamental/important LLMs therefore are not a good model of the human language faculty.

If you want to refute that claim then you would take similar synthetic language constructions to those that were used in the experiments and show that LLMs take longer to learn them.

Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).

The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.


> Instead you mostly created an abstraction of the problem that no one cares about: that there exist certain synthetic language constructions that LLMs have difficulty with. But this is both trivial (consider a language that requires you to factor numbers to decode it) and irrelevant (there is no relation to what humans do except in an abstract sense).

Thanks for your feedback. I think our manipulations do establish that there are nontrivial inductive biases in Transformer language models and that these inductive biases are aligned with human language in important ways. There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't. It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.

Furthermore, this alignment is interesting even if it isn't perfect. I would be shocked if GPT language models happened to have inductive biases that perfectly match the structure of human language. Why would they? But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do. As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits. Something similar is happening now with neural language models as models of language learning and processing, a very active research field. So I wouldn't say that neural language models can't shed any light on language simply because they're not a perfect match for a particular aspect of language.

As for using languages more directly based on the Moro experiments, we've discussed this extensively. There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set, where the control condition is a real language instead of a toy language, without introducing confounds of various kinds. We're open to suggestions. We've had very productive conversations with syntacticians about how to formulate new baselines in future work.

More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully, to the point that they can be used to test the inductive biases of neural models. It's not as simple as hierarchical vs. linear, since there are purely linear phenomena in syntax such as Closest Conjunct Agreement, and also morphophonological processes can act linearly across constituent boundaries, among other complications.

> The one language that you use that is most similar to the linear languages cited by Moro, "Hop", shows very little difference in performance, directly undermining your claimed refutation of Chomsky.

I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences.


> these inductive biases are aligned with human language in important ways.

They aren’t, which is the entire point of this conversation, and simply asserting otherwise isn’t an argument.

> It seems that GPT language models do favor real language over the perturbed ones, and this shows that they have a simplicity bias which aligns with human language. This is remarkable, considering that the GPT architecture doesn't look like what one would expect based on existing linguistic theory.

This is a nonsensical argument: consider if you had studied a made up language that required you to factor numbers or do something else inherently computationally expensive. LLMs would favor simplicity bias “just like humans” but it’s obvious this doesn’t tell you anything and specifically doesn’t tell you that LLMs are like humans in any useful sense.

> There's no universal a priori sense in which Moro's linear counting languages are "simple" but our deterministically shuffled languages aren't.

You are missing the point, which is that humans cannot as easily learn Moro languages while LLMs can. Therefore LLMs are different in a fundamental way from humans. This difference is so fundamental that you need to give strong, specific, explicit justification why LLMs are useful in explaining humans. The only reason I used the word “simple” is to argue that LLMs would be able to learn it easily (without even having to run an experiment) but the same would be true if LLMs learned a non-simple language that humans couldn’t.

Again it doesn’t matter if you find all the ways that humans and LLMs are the same (for example, that they both struggle with shuffled sentences or with a language that involves factoring numbers); what matters is that there exists a fundamental difference between them exemplified by the Moro languages.

> But it is still worthwhile to probe what those inductive biases are and to compare them with what humans do.

Why? There is no reason to believe you will learn anything from it. This is a bizarre abstract argument that doing something is useful because you might learn something from it. You can say that about anything you do. There is a video on YouTube where Chomsky engages with someone making similar arguments about chess computers. Chomsky said that there wasn’t any self evident reason why studying chess playing computers would tell you anything about humans. He was correct, we never did learn anything significant about humans from chess computers.

> As a comparison, context-free grammars turned out to be an imperfect model of syntax, but the field of syntax benefited a lot from exploring them and their limits.

There is a difference between pursuing a reasonable line of inquiry and having it fail versus pursuing one that you know or ought to know is flawed. If someone had pointed out the problems with CFG at the outset it would have been foolish to pursue it, just as it is foolish to ignore the Moro problem now.

> There are nontrivial challenges in scaling those languages up to the point that you can have a realistic training set

I can’t imagine what those challenges are, I don’t remember the details but I believe Moro made systematic simple grammar changes. Your Hop is in the same vein.

> where the control condition is a real language

Why does the control need to be a real language? Moro did not use a real language control on humans. (Edit: Because you want to use pre-trained models?).

> More generally our goal was to get formal linguists more interested in defining the impossible vs. possible language distinction more carefully

Again you’ve invented an abstract problem to study that has no bearing on the problem that Chomsky has described. Moro showed that humans struggle with certain synthetic grammar constructions. Chomsky noted that LLMs do not have this important feature. You are now trying to take this concrete observation about humans and turning it into the abstract field of the study of “impossible languages”.

> It's not as simple as hierarchical vs. linear

There are different aspects of language but there is a characteristic feature missing from LLMs which makes them unsuitable as models for human language. It doesn’t make any sense for a linguist to care about LLMs unless you provide justification for why they would learn anything about the human language faculty from LLMs despite that fundamental difference.

> I wouldn't read much into the magnitude of the difference between NoHop and Hop, because the Hop transformation only affects a small number of sentences, and the perplexity metric is an average over sentences

Even if this were true we return to “no evidence” rather than “evidence against”. But it is very unlikely that Moro-languages are any more difficult for LLMs to learn because, as I said earlier, they are very computationally simple, simpler than hierarchical languages.


I see you're one of the authors.

I disagree with the conclusions of the paper. Maybe I have some misunderstandings, and if so, please do correct me. But my reading of it is that the experiments and evaluations are insufficient to support the conclusion made. I think the results even make sense with Chomsky's claim. (I'll stick to the random shuffle for clarity)

It does not appear that the evaluations are considering all possible valid outputs for the next token. Perplexity is not actually a measure of language performance, though it is wonderful that it has worked out so well so far (I suspect due to the structure in languages). The perplexity being higher is not necessarily indicative of poorer performance. I view this as analogous to sequences of coin flips (our natural language) to sequences of dice rolls (our shuffle). One naturally has more randomness than another. A model that successfully learns the former will have lower perplexity than the model that learns the latter.

To properly evaluate we need to consider if the model is able to produce valid sentences, and consistently. With our coin and dice analogy let's assume we have a sequence of 3 events. Our model conditions on a single flip of heads and we can estimate likelihoods for the sequences HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Our successful model will tell us that the last 4 are not possible, but that the others are equally likely. Now if we compare to a dice roll, conditioned on a roll of a 1, then the model is not invalid for suggesting higher entropy. That is exactly what we want our model to do. There are just more _valid_ answers. In the same way if we're predicting (conditionally) next token, then we should expect a higher perplexity in the "more impossible" languages, but that does not tell us the success of learning the language (I would also expect these models to take longer to converge due to this, just as with coins and dice. I'll leave "learn just as well" to Chomsky, as this is ambiguous).

Entropy isn't enough. Our metric needs to be based on the mass distribution. To compare against one another, we'd have to normalize the values to their distributions. A direct comparison to one another will always lead to the random shuffle model having higher perplexity (just as with coins and dice), so it is an unfair comparison. Without the normalization we'd expect to find exactly what is shown in Figure 2.
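A small Python sketch of the coin and dice point. This is my own illustration, not anything from the paper, and dividing by the entropy floor is just one assumed way to "normalize the values to their distributions"; the helper names are hypothetical.

```python
import math


def perplexity(model_probs, true_probs):
    """exp(cross-entropy) of a model against the true outcome distribution."""
    cross_entropy = -sum(
        p * math.log(q) for p, q in zip(true_probs, model_probs) if p > 0
    )
    return math.exp(cross_entropy)


def entropy_floor(true_probs):
    """Best achievable perplexity: the model equals the true distribution."""
    return perplexity(true_probs, true_probs)


coin = [0.5, 0.5]        # stand-in for the structured language
die = [1.0 / 6.0] * 6    # stand-in for the shuffled language

perfect_coin = perplexity(coin, coin)
perfect_die = perplexity(die, die)
bad_coin = perplexity([0.9, 0.1], coin)  # a clearly miscalibrated coin model

# Assumed normalization: ratio to the floor, so 1.0 means "learned perfectly".
norm_perfect_die = perfect_die / entropy_floor(die)
norm_bad_coin = bad_coin / entropy_floor(coin)

print(f"raw:        bad coin {bad_coin:.2f} vs perfect die {perfect_die:.2f}")
print(f"normalized: bad coin {norm_bad_coin:.2f} vs perfect die {norm_perfect_die:.2f}")
```

On raw perplexity the miscalibrated coin model looks better than the perfect die model. After dividing by the floor the ordering flips, which is the sense in which the direct comparison is unfair.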

As I understand the writing and the code, you do not compare against all valid tokens, but rather the fixed ones. I'm just seeing the perplexities counted in the usual way (I see loop over batch, but not for valid permutations). I see the line in the text

  > dataset shuffling during training.
So I assume that this means the dataloader is shuffling the selected sentences? I don't see this in the code but I'm happy to trust you if you say yes. But the code makes me think this was generated beforehand (I'm having dependency issues so can't verify). But if you are generating the perturbations beforehand, then I think the results are irrelevant because you haven't been implicitly teaching the model that ordering doesn't matter. The fact that results get worse for the models without positional encoding is suggestive of concern here. If position does not matter, why does the positional information increase the model's ability to learn? It should be irrelevant to a non-deterministic shuffled language. I am also suspicious since the "no shuffle" model appears to have identical learning capabilities w.r.t Fig 2 and 6. (I'm also seeing a lot of reference to error bars but it isn't clear to me what the variance is. Is the bar smaller than the markers? Scaling could really help here as well as placing horizontal bars at the bounds given the visualization of the markers in the legends).

As for limitations, I also suspect there's a bias introduced due to tokenization, since the tokenization embedding is generated from the expected ordering. I think this adds additional complexity that could be reduced, but not eliminated, by shuffling words instead of tokens. Not eliminated because tokens are not only dependent on single words, but also on the sentences themselves. Word pairs and sequences matter.

Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.


> Fwiw, I don't agree with Chomsky. Clearly LLMs are extracting structure in language and I think it is obtuse to claim that a system designed for pattern matching won't identify these patterns. One doesn't need reasoning or abstraction to converge to this, one simply needs sufficient sampling and for structure to exist. Clearly structure exists in the language, so we should expect a sufficient pattern matcher to be able to extract these patterns.

Chomsky has never said that LLMs can't extract patterns from language. His point is that humans have trouble processing certain language patterns while LLMs don't, which means that LLMs work differently and therefore can't shed any light on humans.


Thanks for the feedback! The point about perplexity is totally valid for the nondeterministic shuffle baseline. This seems to have misled a lot of people. But for all the other baselines, we're applying some one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the distribution being learned.

As for tokenization, good point: it's worth retokenizing based on the altered datasets to see if that changes anything. I think it might not, because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.


Thanks for the reply!

  > one-to-one transformation function to the original training set, and so not increasing the inherent entropy of the _distribution being learned_.
I disagree. Entropy of the model? The language? The sentence? The tokens? The distinction is subtle but important. A one-to-one mapping is not structure preserving. For a trivial example: {a,b,c} -> {c,a,b} doesn't preserve alphabetical ordering. The distribution the LM is learning is that of the intractable(?) distribution of the language itself. Certainly the entropy here changes, and I believe your results demonstrate this. I think the entropy would only be the same if we preserved all structure[0], but my understanding of the impossible languages is that they remove (all) structure. I'm not sure if that'd yield worthwhile results, but I think it could be a good sanity check -- or at least assumption check -- to do deterministic permutations based on syntactic structure. E.g. replace all S,V,O -> O,V,S for all sentences.

  > because the tokenizers we use are based on the frequency distribution of "words" identified by whitespaces, but we'll have to check.
Maybe I was misled by Table 1. Noting that "bookshelf" -> {books,he,lf}. That isn't whitespace delimited. The examples only show "bookshelf" being preserved in the Shuffle (s=21) and the HOP cases (well... split by the hop token). Partial reverse showing {books, [R], ., lf, he}. I think this is a good example of token bias as our token "he" can hold multiple meanings, but I suspect that the statistics change when we consider the word structure and how these tokens will appear conditionally. This is also a case where I think Figure 2 doesn't do a great job at proving the conclusion. The differences are small, so are they completely offset by the bias? HOP seems even more convincing as the results converge and this bias should be much more easily accounted for. What is unclear to me here is why there's a significant difference in the beginning of training. I am a bit surprised that training from scratch would have this initialization bias.

(Also, just to note, I wouldn't reject this paper were it to come across my desk, but I bias towards accepting works. I am picking on you because you won best paper and a lot of the narrative that formed around the paper)

[0] I think we have "natural experiments" here w.r.t learning other languages and translation. Though not all structure is preserved. Some is lost and some is gained. But this again can be affected by the tokenizer and if you are doing things like stripping accents. Clearly that removes structure, but isn't going to have a big effect on English.


> A one-to-one mapping is not structure preserving.

That's just the point we were making: if you mess with the structure of natural language, then language models don't learn it as well any more.

The transformations do preserve the entropy in the sense that the lowest achievable perplexity is the same for everything except the nondeterministic shuffle. Since applying a one-to-one function preserves the entropy of a discrete random variable, in this case the random variable over documents in the training and test sets. In principle, a universal approximator should be able to learn to invert any one-to-one transformation we apply, although of course in practice a GPT architecture doesn't achieve this.
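
A toy sketch of the entropy claim above, using a made-up four-document "corpus" (not the paper's data) and empirical entropy over whole documents. Word reversal stands in for a one-to-one transformation, and sorting the words stands in for a many-to-one transformation for contrast; both are illustrative choices.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy (bits) of a list of discrete outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def reverse_words(doc):
    """One-to-one on documents: applying it twice gives the original back."""
    return " ".join(reversed(doc.split()))

def sort_words(doc):
    """Many-to-one: different orderings collapse to the same output."""
    return " ".join(sorted(doc.split()))

docs = ["a b c", "a b c", "c b a", "b a c"]  # toy corpus, hypothetical

h_original = entropy_bits(docs)
h_reversed = entropy_bits([reverse_words(d) for d in docs])
h_sorted = entropy_bits([sort_words(d) for d in docs])
print(h_original, h_reversed, h_sorted)
```

The document-level entropy is unchanged by the reversal and drops under the many-to-one map. This says nothing about how the local, next-token statistics change, which is the part the thread is arguing about.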

The transformations do mess up the local entropy in strings. But I think that's part of the point. Human language seems to be structured in a way so that things are locally predictable. When you screw up that structure, then languages are harder to learn.

We are working on a followup with more transformations including syntactic ones, as you might imagine. It's surprisingly hard to come up with manipulations that target specific aspects of linguistic structure, while fitting the criteria that (1) they don't affect the lowest achievable perplexity, (2) they clearly change the language from "possible" to "impossible" in a way that all or most linguists would agree with, (3) they can actually be implemented using the data we have---for example a transformation that relies on detailed syntactic parses would require that we parse the whole dataset, which is not only time consuming but also introduces possible confounds from errors in the parser, etc. We're talking to a lot of people, if you have ideas we'd be happy to hear them!


As I said in another comment the only relevant synthetic language that would refute Chomsky's claim are the ones we have human experiments for. Specifically those of Moro.

I believe the relevant papers are referenced here on page 4. (Tettamanti et al., 2002; Musso et al., 2003; Moro, 2016)

https://acesin.letras.ufrj.br/wp-content/uploads/2024/02/Mor...


I was going to write more but I wanted to just simplify the comment. I think we agree more than disagree and that I've not communicated effectively. So I want to focus on my main point about the metric (and one other part).

In the intro when you reference Chomsky it says

  | [Chomsky] make very broad claims to the effect that large language models (LLMs) are equally capable of learning possible and impossible human languages.
My objection here is measuring success of learning a language, not to the difficulty of the learning process.

What I'm trying to say is that in the shuffled languages that when we are doing next token prediction, the perplexity is necessarily higher. This must be true because of the destruction of local structure. BUT this does not mean that the language wasn't learned successfully. For example, in the next token setting, if we are predicting the sentence "My name is godelski" then certainly the perplexity is higher when "Name godelski my is", "godelski name is my", etc are also valid sequences. The perplexity is higher BUT the language is successfully learned.
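
A small worked version of the "My name is godelski" example, as a sketch under strong toy assumptions: the sentence is the whole language, the fixed-order variant is fully deterministic, and in the free-order variant every ordering of the four words is equally valid. An ideal model is assumed in both cases.

```python
import math
from itertools import permutations

def per_token_perplexity(seq_prob, n_tokens):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-math.log(seq_prob) / n_tokens)

words = ["my", "name", "is", "godelski"]
n = len(words)

# Fixed word order: an ideal model puts probability 1 on each next token.
fixed_prob = 1.0

# Free word order: at step i there are n - i unused words, all equally valid,
# so an ideal model spreads its mass uniformly over them.
free_prob = 1.0
for i in range(n):
    free_prob *= 1.0 / (n - i)

n_valid = len(set(permutations(words)))  # number of valid orderings (24)

print(per_token_perplexity(fixed_prob, n))  # ideal model, fixed order
print(per_token_perplexity(free_prob, n))   # ideal model, free order
```

Both models have learned their language perfectly, but the free-order one reports a per-token perplexity of 24^(1/4), roughly 2.2, against 1 for the fixed-order one.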

My point is that we have to be careful about how we define what it means for the language to be successfully learned.

(I'm not sure there is a tractable measure for the success of learning a language, I don't know of a proof in either direction. But I do know that perplexity is a proxy and we must be careful about the alignment of any measures, as there is always a difference. Even in seemingly trivial things we cannot measure anything directly, it is always indirect (e.g. we measure distance with a ruler, which is an approximation based on the definition of a meter, but is not a meter itself. Though this is going to typically be very well aligned))

  > a universal approximator should be able to learn to invert any one-to-one transformation we apply
I agree, but I think we need to be a bit careful here. Extending the universal approximation theorem to discrete distributions introduces new conditions; in its usual form, the function we're approximating must be continuous, closed, and bounded. But I think we also need to be careful with how we look at complexity. Yes, a bijective function is going to have the same complexity in both directions, but this will not hold if there is any filtering. But the part where we really have to be careful is the difference in difficulty of _learning_ D and _learning_ T(D). Even if T is simple, these learning processes may not be equally simple. It's subtle but I believe important. As a clear example of this, we will often scale data (one might call this normalization, but let's not be ambiguous and let's make sure it is bijective), and it will be clear that learning the scaled data is simpler than learning the unscaled data. So while yes, a universal approximator is __capable__ of learning to invert any bijection, this does not mean that __learning__ to invert it is easy.

I do really appreciate the chat and the explanations. I'm glad to know that there is a followup and I'm interested to see the results.


Maths also mean different things. Your average number theorist or algebraic geometer will most likely not touch statistical techniques day-to-day. Reading this Anthropic article was helpful because I am constantly catching up on my statistical background


I don't know what it's true to suspect, since clearly a lot of very smart people are working in the field, in places.

It is empirically true that none of the industry discourse around leaderboards and benchmarks uses any of the techniques this article discusses.


AI engineers just use "vibes" currently haha


All things considered, although I'm in favor of Anthropic's suggestions, I'm surprised that they're not recommending more (nominally) advanced statistical methods. I wonder if this is because more advanced methods don't have any benefits or if they don't want to overwhelm the ML community.

For one, they could consider using equivalence testing for comparing models, instead of significance testing. I'd be surprised if their significance tests were not significant given 10000 eval questions and I don't see why they couldn't ask the competing models 10000 eval questions?
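
To make the contrast concrete, here is a rough sketch of a two one-sided tests (TOST) equivalence check next to an ordinary two-sided z-test. The accuracy counts and the margins are invented for illustration, the normal approximation is assumed, and the two models are treated as unpaired with independent questions, so it ignores the paired and clustered structure the article discusses.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def diff_and_se(k1, n1, k2, n2):
    """Difference in accuracy and its unpooled standard error."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, se

def significance_p(k1, n1, k2, n2):
    """Two-sided z-test of H0: the accuracies are equal."""
    diff, se = diff_and_se(k1, n1, k2, n2)
    return 2 * (1 - PHI(abs(diff / se)))

def tost_p(k1, n1, k2, n2, margin):
    """TOST p-value for H0: |difference| >= margin (i.e. not equivalent)."""
    diff, se = diff_and_se(k1, n1, k2, n2)
    p_lower = 1 - PHI((diff + margin) / se)  # H0: diff <= -margin
    p_upper = PHI((diff - margin) / se)      # H0: diff >= +margin
    return max(p_lower, p_upper)

# Hypothetical results: 80.0% vs 81.5% accuracy on 10000 questions each.
p_sig = significance_p(8000, 10000, 8150, 10000)
p_equiv_wide = tost_p(8000, 10000, 8150, 10000, margin=0.03)
p_equiv_narrow = tost_p(8000, 10000, 8150, 10000, margin=0.01)
print(p_sig, p_equiv_wide, p_equiv_narrow)
```

With these made-up numbers the difference is "significant" and the models are also equivalent within a 3-point margin, but not within a 1-point margin. The useful question becomes which margin matters, not whether the p-value clears 0.05.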

My intuition is that multilevel modelling could help with the clustered standard errors, but I'll assume that they know what they're doing.


I wouldn't be surprised if the benefits from doing something more advanced aren't worth it.


Well, I think it's usually more complicated than that. An over-simplification is that there's no free lunch.

If you use a robust sandwich estimator, you're robust against non-normality and the like, but you lower the efficiency of your estimator.

If you use bayes, the con is you have a prior and the pro is you have a prior + a lot of other things.

And strictly speaking, these are benefits on paper based on theory. In practice, of course, the drawback to using a new advanced technique is that there may be a bug(s) lurking in the software implementation that might invalidate your results.

In practice, we generally forget to account for the psychology of the analyst. Their biases, what they're willing to double-check and what they're willing to take for granted. There's also the issue of bayesian methods being somewhat confirmatory, to the point that the psychological experience of doing bayesian statistics makes one so concerned with the data generating process and of the statistical model, that one might forget to 'really check their data'.


I have been promoting this and saying it since at least 2018. You can see my publication record as evidence!!!

"Random seed xxx is all you need" was another demonstration of this need.

You actually want a Wilcoxon rank-sum test, as many metrics are not Gaussian, especially as they get to their limits!! I.e. accuracy roughly 99 or 100! Then it becomes highly sub-Gaussian.
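
For reference, a bare-bones sketch of the rank-sum test using only the standard library. It uses the large-sample normal approximation with average ranks for ties and no tie or continuity correction, so treat it as illustrative; for real use a maintained implementation such as SciPy's `scipy.stats.ranksums` or `scipy.stats.mannwhitneyu` is the safer choice. The accuracy values are invented.

```python
import math
from statistics import NormalDist

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test (normal approximation).

    Returns (W, z, p): the rank sum of x, its z-score, and a two-sided p-value.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values, which share an average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Hypothetical per-seed accuracies for two models near the ceiling.
model_a = [0.991, 0.993, 0.995, 0.996, 0.998]
model_b = [0.980, 0.984, 0.986, 0.989, 0.990]
w, z, p = rank_sum_test(model_a, model_b)
print(w, round(z, 3), round(p, 4))
```

Because the test only looks at ranks, it does not care that accuracies bunched up against 100% are far from normally distributed.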


Since when the heck did evals change what they referred to? Evals were what you did to check if the output of a model was correct. What happened?



