
But if you have a large set of problems to which you already know the answers and use them for reinforcement learning, wouldn't the expertise later transfer to problems with no known answers? That seems like a feasible strategy, right?

Another issue is how much data you can synthesize that way, constructing the problem and the solution together so that you already know the answer before using it as a sample.

I.e., some problems are easy to create when you get to construct them yourself, but would be hard to solve with no prior knowledge, so they could be used as a scoring signal?

I.e., you are the oracle, and whatever model is being trained doesn't know the answer, only whether it is right or wrong. But I don't know if the reward function must be binary or can be on a scale.
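
Something like this toy sketch is what I have in mind (the names and numbers are all made up, just to illustrate the idea, not a real training setup):

  import random

  def make_problem(n_digits):
      """Construct a problem whose answer is known by construction (the oracle side)."""
      a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
      b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
      return f"{a} * {b} = ?", a * b

  def binary_reward(model_answer, true_answer):
      """The oracle only says right or wrong."""
      return 1.0 if model_answer == true_answer else 0.0

  def scaled_reward(model_answer, true_answer):
      """The oracle says how close the guess was (one of many possible graded signals)."""
      return 1.0 / (1.0 + abs(model_answer - true_answer))

  prompt, answer = make_problem(3)
  guess = 41976  # whatever the model being trained outputs
  print(prompt, binary_reward(guess, answer), scaled_reward(guess, answer))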

Does that make sense or is it wrong?






I don't think this makes sense and I'm not quite sure why you went to ML, but that's okay. I am a machine learning researcher, but also frustrated with the state of machine learning, in part because, well... you can probably see how "proof by empirical evidence" is dialed up to 11.

Sorry, long answer incoming. It's far from complete, but I think it will help build strong intuition around your questions.

Will knowledge transfer? That entirely depends on the new problem: how related it is, but also what information was used to solve the pre-transfer task. Take LLMs for example. Plenty of work has shown they are difficult to train to do calculations reliably: they do well on problems with the same number of digits they were trained on, but performance degrades rapidly as the number of digits increases. Some of these papers can be weird to read because there are sometimes periodic relationships with the number of digits, but that should give us information about how the models are encoding the problems. That lack of transferability indicates that what we'd believe is just the same problem isn't the same problem to the model. So you have to be really careful here, because us humans are really fucking good at generalization (yeah, we also suck, but a big part of our proficiency is recognizing where we lack. Also, this is more a "humans can" than a "humans do" type of thing, so be careful when comparing). Our generalization really comes from building causal relationships, while ML algorithms are built around compression (i.e. fitting data). Which, if you notice, is the same issue I was pointing to above.
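
To make that digit experiment concrete, a hypothetical version of the evaluation looks roughly like this, where `ask_model` is just a stand-in for however you query the model under test (not any particular API):

  import random

  def ask_model(prompt):
      """Stand-in for querying whatever model is being probed; replace with a real call."""
      raise NotImplementedError

  def addition_accuracy(n_digits, n_trials=100):
      """Fraction of n-digit addition problems answered exactly right."""
      correct = 0
      for _ in range(n_trials):
          a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
          b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
          reply = ask_model(f"What is {a} + {b}? Answer with only the number.")
          correct += reply.strip() == str(a + b)
      return correct / n_trials

  # Typical finding in those papers: accuracy holds at digit lengths like the
  # training data and falls off quickly past them, i.e. "the same problem"
  # does not transfer the way it would for a person.
  # for d in range(2, 12): print(d, addition_accuracy(d))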

  > I.e., you are the oracle, and whatever model is being trained doesn't know the answer, only whether it is right or wrong. But I don't know if the reward function must be binary or can be on a scale.
This entirely depends on the problem. We can construct simple problems that illustrate both success and failure. What you really need to think about here is the information gain from the answer. If you check how that is calculated, you will see the dependence (we could get into Bayesian learning or experiment design, but this is long enough). Let's think of a simple example in the negative direction. If I ask you to guess where I'm from, you're going to have a very hard time pinning down the exact location. There is an efficient method for this example, but our ML algorithms don't start with prior knowledge about strategies, so they aren't going to know to binary search. If you gave that to the model, you baked in that information. This is a tricky form of information leakage. It can be totally fine to bake in knowledge, but we should be aware of how that changes how we evaluate things (we always bake in knowledge, btw; there is no escaping this). But most models would not have a hard time if instead we played "hot/cold", because the information gain is much higher: we've provided a gradient toward the solution. We might call these hard and soft labels, respectively.
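
Here's a toy version of that hard-vs-soft label point (pure sketch, not tied to any real model): a learner guessing a hidden number gets almost no information per query from a right/wrong oracle, but converges quickly from a hotter/colder signal.

  import random

  SECRET = 731                 # the "where I'm from" the learner has to find
  SPACE = range(1, 10_001)

  def binary_oracle(guess):
      return guess == SECRET   # hard label: ~0 bits of information unless you hit it

  def hot_cold_oracle(guess):
      return abs(guess - SECRET)  # soft label: a gradient toward the answer

  # Hard labels: blind random search, expected on the order of |SPACE| queries.
  tries = 1
  while not binary_oracle(random.choice(SPACE)):
      tries += 1
  print("binary feedback queries:", tries)

  # Soft labels: simple hill climbing on the distance signal.
  guess, steps = random.choice(SPACE), 0
  while hot_cold_oracle(guess) > 0:
      step = max(1, hot_cold_oracle(guess) // 2)
      # move in whichever direction the oracle says is "hotter"
      guess += step if hot_cold_oracle(guess - step) > hot_cold_oracle(guess + step) else -step
      steps += 1
  print("hot/cold feedback queries:", steps)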

I picked this because there's a rather famous paper about emergent abilities (I fucking hate this term[0]) in ML models[1], and a far less famous counter to it[2]. There are a lot of problems with [1] that require a different discussion, but [2] shows how a big part of the issue is that many of the loss landscapes are fairly flat, so when feedback is discrete the smaller models just wander around that flat landscape, needing to get lucky to find the optima (btw, this also shows that technically it can be done! But that would require different training methods and optimizers). When given continuous feedback (i.e. "you're wrong, but closer than your last guess"), they are able to actually optimize. A big criticism of the work is that it is an unfair comparison because there are "right and wrong" answers here, but it'd be naive not to recognize that some answers are more wrong than others. Plus, their work gives a clear, testable way to confirm or deny whether this works. We schedule learning rates; there's no reason you cannot schedule labels. In fact, this does work.
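
The metric part of [2]'s argument is easy to show in miniature with purely synthetic numbers: let per-token performance improve smoothly and watch what an all-or-nothing exact-match metric does to the curve.

  # Synthetic illustration: the underlying skill improves smoothly,
  # but a discrete all-or-nothing metric makes it look "emergent".
  answer_len = 10  # every token must be right for an exact match

  for p_token in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
      exact_match = p_token ** answer_len   # probability every token is right
      print(f"per-token acc {p_token:.2f} -> exact-match acc {exact_match:.3f}")

  # per-token acc 0.50 -> exact-match acc 0.001
  # per-token acc 0.90 -> exact-match acc 0.349
  # per-token acc 0.99 -> exact-match acc 0.904
  # A steady gain on the soft metric shows up as a sharp "jump" on the hard one.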

But also look at the ways they tackled these problems. They are entirely different. [1] tries to do proof by evidence while [2] uses proof by contradiction. Granted, [2] has an easier problem since they only need to counter the claims of [1], but that's a discussion about how you formulate proofs.

So I'd be very careful when using the recent advancements in ML as a framework for modeling reasoning. The space is noisy. It is undeniable that we've made a lot of advancements, but there are issues with what work gets noticed and what doesn't. A lot comes down to this proof-by-evidence fallacy. Evidence can only bound confidence; it unfortunately cannot prove things. But that is still helpful: we can bound our confidence to limit the search space before we change strategies, right? I picked [1] and [2] for a reason ;) And to be clear, I'm not saying [1] shouldn't exist as a paper or that the researchers were dumb for doing it. Read back over this paragraph, because we've got multiple meta layers here. It's good to place a flag in the ground, even if it is wrong, because you gotta start somewhere, and science is much better at ruling things out than ruling things in. We mostly focus on proving things don't work until there's not much left, and then accept what remains (limits here too, but this is too long already).

I'll leave with this, because now there should be a lot of context that makes this much more meaningful: https://www.youtube.com/watch?v=hV41QEKiMlM

[0] It significantly diverges from the terminology used in fields such as physics. ML models are de facto weakly emergent by nature of composition. But the ML definition can entirely be satisfied by "Information was passed to the model but I wasn't aware of it" (again, same problem: exhaustive testing)

[1] (2742 citations) https://arxiv.org/abs/2206.07682

[2] (447 citations) https://arxiv.org/abs/2304.15004


Thanks a lot for the detailed reply, it was better than I had hoped for :)

So knowledge transfer is something much more specific and narrow than I thought. Models don't transfer concepts by generalization; they compress knowledge instead. I assume the difference is that generalization is fluid, while compression is static, like a dictionary where each key has a probability of being chosen and all the relationships are frozen. The only generalization that happens is whatever the training method itself expresses, since the training method freezes its "model of the world" into the weights, so to speak? So if the training method itself cannot generalize, only compress, why would the model it produces? Is that understood correctly?

Does there exist a computational model that can be used to analyse a training method and put a bound on the expressiveness of the resulting model?

It's fascinating that the emergent abilities of models disappear if you measure them differently. I guess the difference is that "emergent abilities" are kind of nonsensical, since they come with no explanation of causality (i.e. it "just" happens), while seeing the model get linearly better with training fits into a much saner framework. That is, like you said, when your success metric is discrete, you see the model itself as discrete, and it hides the continuous hill climbing the model would otherwise be seen to exhibit under a non-discrete metric.

But the model still gets better over time, so would you expect it to get progressively worse on a more general metric, or does this only relate to the spikes in the graph they talk about? I.e., they answer why jumps in performance are not emergent, but they don't answer why performance keeps increasing, even if linearly, and whether it is detrimental to other, less related tasks?

And if you wanted to test "emergence", wouldn't it be more interesting to test the model on tasks much more unrelated to the task at hand? That would test generalization, closer to how we see it in humans, so it wouldn't really be emergence but generalization of concepts?

It makes sense that it is more straightforward to refute a claim by contradiction. Would it be good practice for papers to try to refute their own claims by contradiction first? I guess that would save a lot of time.

It's interesting about the knowledge leakage, because I was thinking about the concept of world simulations and using models to learn about scenarios through simulation and consequence. But the act of creating a model to perceive the world taints the model itself with bias, so the difficulty lies in creating a model that can rearrange itself to get rid of incorrect assumptions while shedding its initial inherent bias. I thought about models which can create other models, etc., but then how does the model itself measure success? If everything is changing, then so is the metric, so the model could decide to change what it measures as well. I thought about hard-coding a metric into the model, but what if the metric I choose is bad? Then we are stuck with the same problem of bias. So it seems like there are only two options: it either converges toward total uncontrollability or it is inherently biased; there doesn't seem to be any in-between?

I admit I'm trying to learn things about ML because I just find general intelligence research fascinating (neuroscience as well), but the more I learn, the more I realize I should really go back to the fundamentals and build up. Because even something that seems to make sense on a surface level really has a lot of meaning behind it, and needs a well-built intuition not from a practical level, but from a theoretical level.

In the papers I've read and found interesting, there's always the right combination of creativity in thinking. Sometimes my intuition/curiosity about things proves right, but I lack the deeper understanding, which can lead to false confidence in results.


Well fuck... My comment was too long... and it doesn't get cached -___-

I'll come back and retype some of what I said but I need to do some other stuff right now. So I'll say that you're asking really good questions and I think you're mostly understanding things.

So, to give you very quick answers:

Yes, things are frozen. There's active/online learning but even that will not solve all the issues at hand.

Yes, we can put bounds. Causal models naturally do this, but statistics is all about this too. Randomness is a measurement of uncertainty. Note that causal models are essentially perfect embeddings: if you've captured all causal relationships, you gain no more value from additional information, right?
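
To give one flavor of the "statistics is all about this" part (this is not a bound on expressiveness per se, just the simplest kind of guarantee statistics hands you): a Hoeffding-style bound on how far an empirical accuracy can plausibly sit from the true one, given the sample size.

  import math

  def hoeffding_gap(n_samples, delta=0.05):
      """With probability >= 1 - delta, |empirical error - true error| <= this gap
      (for a single fixed model evaluated on n i.i.d. samples)."""
      return math.sqrt(math.log(2 / delta) / (2 * n_samples))

  for n in (100, 1_000, 10_000):
      print(n, round(hoeffding_gap(n), 3))
  # 100 -> ~0.136, 1000 -> ~0.043, 10000 -> ~0.014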

Also note that we have to be very careful about assumptions. It is always important to uncover what assumptions have been made and what the implications are. This is useful in general problem solving and applies to anything in your life, not just AI/ML/coding. Unfortunately, assumptions are almost never explicitly stated, so you've got to go hunting.

See how physics defines strong emergence and weak emergence. There are no known strongly emergent phenomena, and we generally believe they do not exist. As for weak emergence, well, it's rather naive to discuss it in the context of ML when we're dedicating so little time and effort to interpretation, right? That's kind of the point I was making previously about not being able to differentiate an emergent phenomenon from not knowing we gave the model the information.

For the "getting better" it is about the spikes. See the first two figures and their captions in the response paper.

More parameters do help, btw, but make sure you distinguish between a problem being easier to solve and a problem not being solvable at all. The latter is rather hard to show. The paper provides strong evidence that the underlying issue is about ease of problem solving rather than incapacity.

Proof is hard. There's nothing wrong with being empirical, but we need to understand that this is a crutch. It is evidence, not proof. We leaned on it because we needed to start somewhere. But as progress is made, the metrics and evaluations must advance too; it gets exponentially harder to evaluate as progress is made.

I do not think the answer is to push everyone in ML into theory first and act like physicists. Rather, we should recognize the noise and not lock others out from researching other ideas. The review process has been contaminated and we've lost sight. I'd say the problem is that we look at papers as if we're looking at products. In reality, papers need to be read with the experimental framework in mind: what question is being addressed, are variables properly isolated, and do the results make a strong case for the conclusion? If we're benchmark chasing we aren't doing this, and we're giving a massive advantage to the "gpu rich", who can hyperparameter-tune their way to success. We're missing a lot of understanding because of this. You don't need state of the art to prove a hypothesis, nor to make improvements on architectures or in our knowledge. Benchmarks are very lazy.

For information leakage, you can never remove the artist from the art, right? They always leave part of themselves. That's okay, but we must be aware of the fact so we can properly evaluate.

Take the passion, and dive deep. Don't worry about what others are doing, and pursue your interests. That won't make you successful in academia, but it is the necessary mindset of a researcher. Truth is no one knows where we're going and which rabbit holes are dead ends (or which look like dead ends but aren't). It is good to revisit because you table questions when learning, but then we forget to come back to them.

  > needs a well-built intuition not from a practical level, but from a theoretical level.
The magic is at the intersection. You need both and you cannot rely on only one. This is a downfall in the current ML framework and many things are black boxes only because no one has bothered to look.


