
Yes, there is hope for a high-level heuristic understanding. Here's my attempt to explain in more familiar terms.

They train a new neural network from scratch for each problem, using only the data for that problem. The loss function does two things: it pushes the network to map the example inputs to the example outputs, and it keeps the weights small so the network stays as simple as possible. The hope is that a simple function which fits the sample input/output pairs will also do the right thing on the test input. It works 20-30% of the time.
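To make that concrete, here is a minimal sketch of the idea (mine, not the authors' actual code). It assumes the problem is given as a few (input, output) example pairs encoded as flat tensors, uses a tiny MLP, and implements "keep the weights small" as an explicit L2 penalty; the real work almost certainly uses a different architecture and objective, so treat the names and hyperparameters as illustrative.

    # Sketch: train a fresh, small network per problem, fitting the demo
    # pairs while penalizing large weights (a crude "simplicity" prior).
    import torch
    import torch.nn as nn

    def solve_one_problem(demo_pairs, test_input, l2_weight=1e-3, steps=2000):
        in_dim = demo_pairs[0][0].numel()
        out_dim = demo_pairs[0][1].numel()

        # A brand-new network for this one problem only.
        net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)

        xs = torch.stack([x.flatten().float() for x, _ in demo_pairs])
        ys = torch.stack([y.flatten().float() for _, y in demo_pairs])

        for _ in range(steps):
            opt.zero_grad()
            pred = net(xs)
            fit_loss = nn.functional.mse_loss(pred, ys)                    # map inputs to outputs
            simplicity = sum((p ** 2).sum() for p in net.parameters())     # keep weights small
            (fit_loss + l2_weight * simplicity).backward()
            opt.step()

        # Apply the learned (hopefully simple) function to the test input.
        with torch.no_grad():
            return net(test_input.flatten().float())

The weight penalty is what plays the role of "compression" here: among all networks that fit the few examples, it biases training toward the one with the smallest weights, which is a rough stand-in for the simplest hypothesis.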




Great explanation, thanks. I have some followups if you have the time!

a.) Why does this work as well as it does? Why does compression/fewer-parameters encourage better answers in this instance?

b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?

c.) It works 20-30% of the time - do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?


Boo, go read the other comments that explain all of this instead of wasting people's time.


> I have some followups *if you have the time*

Emphasis mine. No one should feel obligated to answer my questions. I had hoped that was obvious.



