Ability to win a gold medal as if they were scored similarly to how humans are scored?
or
Ability to win a gold medal as determined by getting the "correct answer" to all the questions?
These are two subtly but importantly different questions. In these kinds of math exams, how you get to the answer matters more than the answer itself; i.e., you could not get high marks through divination. To add some clarity, the latter would be like testing someone's ability to code by only looking at their results on some test functions (oh wait... that's how we evaluate LLMs...). It's a good signal, but it is far from a complete answer. It very much matters how the code generates the answer. Certainly you wouldn't accept code that does a bunch of random computations before divining an answer.
The paper's answer to your question (assuming scored similarly to humans) is "Don’t count on it". Not a definitive "no" but they strongly suspect not.
The type of reasoning used by the OP and the linked paper obviously does not work. The observable reality is that LLMs can do mathematical reasoning. A cursory interaction with state-of-the-art LLMs makes this evident, as does their IMO gold medal, scored the way humans are. You cannot counter observable reality with generic theoretical considerations about Markov chains or pretraining scaling laws or floating point precision. The irony is that LLMs can explain why that type of reasoning is faulty:
> Any discrete-time computation (including backtracking search) becomes Markov if you define the state as the full machine configuration. Thus “Markov ⇒ no reasoning/backtracking” is a non sequitur. Moreover, LLMs can simulate backtracking in their reasoning chains. -- GPT-5
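To make the quoted point concrete, here is a minimal sketch (not taken from the thread or the paper) of a backtracking search written as a Markov process. The N-queens puzzle, the `step` and `trajectory` names, and the choice of a tuple of column indices as the state are all illustrative assumptions. The next state is a function of the current state alone, yet the trajectory plainly backtracks.

```python
from typing import List, Optional, Tuple

# Illustrative assumption: the "full machine configuration" is the tuple of
# queen columns placed so far, one per row. Nothing else is remembered.
State = Tuple[int, ...]


def is_safe(cols: State) -> bool:
    """Return True if the most recently placed queen attacks no earlier queen."""
    row, col = len(cols) - 1, cols[-1]
    return all(
        c != col and abs(c - col) != row - r
        for r, c in enumerate(cols[:-1])
    )


def step(state: State, n: int) -> Optional[State]:
    """One transition. The result depends only on `state` (the Markov property)."""
    if len(state) == n:
        return state  # solved: absorbing state
    # Try to extend the partial placement by one row.
    for col in range(n):
        candidate = state + (col,)
        if is_safe(candidate):
            return candidate
    # Dead end: backtrack by popping rows until a larger column fits.
    while state:
        last, state = state[-1], state[:-1]
        for col in range(last + 1, n):
            candidate = state + (col,)
            if is_safe(candidate):
                return candidate
    return None  # search space exhausted


def trajectory(n: int) -> List[State]:
    """Iterate `step` from the empty board and record every visited state."""
    states: List[State] = [()]
    while len(states[-1]) < n:
        nxt = step(states[-1], n)
        if nxt is None:
            break
        states.append(nxt)
    return states


if __name__ == "__main__":
    for s in trajectory(4):
        print(s)
```

For `n = 4` the printed trajectory grows to `(0, 3, 1)`, falls back to `(1,)`, and ends at `(1, 3, 0, 2)`. Each line is computed from the previous line only.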
> The observable reality is that LLMs can do mathematical reasoning
I still can't get these machines to reliably perform basic subtraction[0]. The result is stochastic, so I can get the right answer, but I have yet to reproduce one where the actual logic is correct[1,2]. Both [1,2] make the same mistake, and in [2] you see it just say "fuck it, skip to the answer".
> You cannot counter observable reality
I'd call [0,1,2] "observable". These types of errors are quite common, so maybe I'm not the one with lying eyes.
Why don't you use a state-of-the-art model? Are you scared it will get it right? Or are you just not aware of reasoning models? In that case you should get to know the field.
Sorry, I've just been hearing this response for years now... GPT-5 not SOTA enough for you all now? I remember when people told me to just use 3.5
- Gemini 2.5 Pro[0], the top model on LLM Arena. This SOTA enough for you? It even hallucinated Python code!
- Claude Opus 4.1, sharing that chat shares my name, so here's a screenshot[1]. I'll leave that one for you to check.
- Grok4 getting the right answer but using bad logic[2]
- Kimi K2[3]
- Mistral[4]
I'm sorry, but you can fuck off with your goal post moving. They all do it. Check yourself.
> I am being serious
Don't lie to yourself, you never were
People like you have been using that copy-paste piss-poor logic since the GPT-3 days. The same exact error existed since those days on all those models just as it does today. You all were highly disingenuous then, and still are now. I know this comment isn't going to change your mind because you never cared about the evidence. You could have checked yourself! So you and your paperclip cult can just fuck off
That's very weird. Before I wrote my comment I asked gpt5-thinking (yes, once) and it nailed it. I just assumed the rest would get it as well. gemini-2.5 is shocking (the code!). I hereby give you leave to be a curmudgeon for another year...
Try a few times and it'll happen. I don't think it took me more than 3 tries on any platform.
To convince me it is "reasoning", it needs to get the answer right consistently. Most attempts were actually about getting it to show its results. But pay close attention. GPT got the answer right several times but through incorrect calculations. Go check the "thinking" and see if it does an 11-9=2 calculation somewhere; I saw this in >50% of the attempts. You should be able to reproduce my results in <5 minutes.
Forgive my annoyance, but we've been hearing the same argument you've made for years[0,1,2,3,4]. We're talking about models that have been reported as operating at "PhD Level" since the previous generation. People have constantly been saying "But I get the right answer" or "if you use X model it'll get it right" while missing the entire point. It never mattered if it got the answer right once; it matters that it can do it consistently. It matters how it gets the answer if you want to claim reasoning. There is still no evidence that LLMs can perform even simple math consistently, despite years of such claims[5].
[5] Don't let your eyes trick you, not all those green squares are 100%... You'll also see many "look X model got it right!" in response to something tested multiple times... https://x.com/yuntiandeng/status/1889704768135905332
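The distinction drawn above between getting the answer right once and getting it right consistently can be made concrete. A minimal sketch follows; the `score_attempts` helper and the sample attempts are hypothetical and are not real model outputs.

```python
from typing import Dict, Sequence, Union


def score_attempts(
    attempts: Sequence[str], expected: str
) -> Dict[str, Union[bool, float]]:
    """Summarize repeated attempts at one question.

    `any_correct` is the "look, it got it right!" criterion.
    `all_correct` is the consistency criterion.
    Note: this only compares final answers; it says nothing about whether
    the intermediate logic was sound.
    """
    if not attempts:
        raise ValueError("need at least one attempt")
    correct = [a.strip() == expected for a in attempts]
    return {
        "any_correct": any(correct),
        "all_correct": all(correct),
        "pass_rate": sum(correct) / len(correct),
    }


if __name__ == "__main__":
    # Made-up final answers for illustration only.
    sample = ["2", "3", "2", "2"]
    print(score_attempts(sample, expected="2"))
```

A single right answer is enough to set `any_correct`, while `all_correct` stays false until every attempt agrees with the expected result.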
There should be no need whatsoever to convince your competitors and/or bureaucrats that allowing your new connector to be produced is in their interest. Only one should be convinced: the person buying the device.
We tried that for 40 years. The result is drawers full of chargers.
But clearly there is a price for standardisation: it makes progress slower. On the other hand, it makes everyone's lives easier. Just as with, e.g., electrical outlets in the house, there is a time for exploration and innovation, and there is a time for standardisation. And we are ready for standardisation now; USB-C is good enough.
USB-C is absolutely not good enough. The connectors are often incompatible due to tiny manufacturing tolerances; cables from different manufacturers often fall out of the port after longer-term use or don't make good contact, so you get flaky charging; and cables and connectors that look the same are actually incompatible because they differ in whether they support USB 2/3/4 or Thunderbolt, whether DisplayPort/HDMI alt mode is supported, etc. This small short-term gain at the cost of locking in USB-C forever was a terrible idea, brought to you by the same hypercompetent group that mandated cookie banners.
They were mandated by the EU. You don't get to pass crap laws of the form "show a banner or do {vague/impossible/unacceptable thing}" and then complain when 100% of people show a banner. That kind of inane immaturity is why the EU is so far behind and falling further.
Please don't fulminate on HN. You may not owe cookie banners better, but we're trying for a better style of conversation here. Please make an effort to observe the guidelines, which seek to make HN a place for curious conversation, not rage.
> We tried that for 40 years. The result is drawers full of chargers.
Which is fine? The industry eventually converged on just a handful of common standards on its own.
You can't innovate without being able to experiment, which is only possible if there are actual people using your product. Thinking that a committee of bureaucrats can replace that is silly.
One standard for chargers is the only acceptable outcome and it wouldn't have gotten there without regulation.
What need is there to experiment with chargers? Wire go in, power go through - it's really not that complicated, the only important thing is standardization.
The "bureaucrats" are a proxy for the person buying the device. That's literally the point of representative democracy. The average person doesn't want to make a million decisions on technical standards, so they elect somebody they trust to make them for them.
This visualization is wildly inaccurate. The supposed 1000 pixels are actually 100x100 pixels, which is 10,000, not 1000. Secondly, on many screens they are not actually physical pixels. For example, on a MacBook Pro you're likely seeing 40,000 physical pixels.
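The arithmetic behind that correction, as a short sketch. The device pixel ratio of 2 is an assumption about a typical high-density display; actual values vary by device and scaling setting, and the `physical_pixels` helper is hypothetical.

```python
def physical_pixels(css_width: int, css_height: int, device_pixel_ratio: float = 1.0) -> int:
    """Count the hardware pixels covered by a box measured in CSS pixels."""
    return round(css_width * device_pixel_ratio) * round(css_height * device_pixel_ratio)


if __name__ == "__main__":
    # A 100x100 box at a 1:1 pixel ratio.
    print(physical_pixels(100, 100))       # 10000
    # The same box on a display assumed to use a 2x pixel ratio.
    print(physical_pixels(100, 100, 2.0))  # 40000
```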
> and the main pressure against that is for companies or people to leave.
Has there been any serious research in this area that supports that conclusion? My impression, which I admit is completely uninformed, is that we often talk about companies leaving due to high tax burdens, but that it rarely happens. It's a political signal more than a factual systemic driver.
Sure, a bunch of companies have relocated to tax havens, but we're not going to solve that by regressing to a 2% universal tax rate.
A country recognizes that the rate of company creation has gone down (or some similar metric). They identify the tax rate as a reason for this. They want the tax rate to be lower to ameliorate this. They leave the agreement.
Now presumably there are penalties or such in place for this type of agreement, so it would need to be weighed as onerous enough to accept any such penalties. If it is just one country that feels this way then it might be a non-starter, but if the global minimum tax gets to a point where many countries feel this way, it would probably be viable to coordinate to leave the agreement all at once, with the remainers having little power at that point.
Citation needed, corporate taxes have been going down for decades.
> companies or people to leave.
"We can't ever tax anyone because else they would just leave; ergo nothing can or should be done about rampant inequality" is not only false, it is extremely dangerous and accelerates the fall of our democracies.
Yes and that is bad, because cartels are bad. Competition between political systems is good, for much the same reason that competition between companies is good.
How does it hurt investment? Those tiny nations are only helping eliminate an inefficient form of taxation. The main problem is that only multinationals can make use of it.
Hiding huge amounts of money in tax havens is actively detrimental to the economy. I believe the goal of any economy should be to better our lives, not hoard wealth and sit on it.
Without taxation, the infrastructure needed to maintain a healthy economy is unsustainable. We need to ensure that the benefit companies get from public services is paid back so it can be reinvested.
Money is not "hidden" or "hoarded" in tax havens, nor does hoarding money affect the economy negatively. Taxation is necessary for infrastructure (though that is actually a small fraction: about 3% of US federal taxes goes to infrastructure), and I did not say or imply that taxation is unnecessary. The question of what the best level of taxation is, and what the best place to levy those taxes is, cannot be decided based on high level slogans. One thing is clear though: standardizing tax rates is bad because it removes the competitive aspect between countries. It is good if people and companies are able to move to the places that give them the best public services for the least cost, for the same reason that it is good that salaries are not standardized between companies, and the same reason why it is good that airline ticket prices are no longer standardized by the IATA and CAB.
Oh please, as far as the USA goes, taxes went down, especially for companies and the rich. And the country is in the process of creating a new massive deficit just through a massive tax cut.
I don't understand how east-west arrays differ much from just a flat area. At the end of the day, don't they capture all sunlight in some large square? The east-west array only captures a bit differently around the outer edges. Can somebody explain? Is solar panel efficiency that dependent on incidence angle?
The universe is already modeled that way. Differential equations are a kind of continuous time and space version of cellular automata, where the next state at a point is determined by the infinitesimally neighboring states.
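One way to see the analogy, as a hedged sketch rather than a claim about physics: discretizing the 1D heat equation u_t = α u_xx with an explicit finite-difference scheme gives an update rule in which each cell's next value depends only on itself and its two neighbours, much like a cellular automaton with real-valued states. The `heat_step` name, the periodic boundary, and the parameter `r = α·Δt/Δx²` are choices made for this illustration.

```python
from typing import List


def heat_step(u: List[float], r: float) -> List[float]:
    """Advance the discretized heat equation by one time step.

    Each new cell value is a local rule over (left, self, right), with a
    periodic boundary. The explicit scheme is stable for 0 <= r <= 0.5.
    """
    n = len(u)
    return [
        u[i] + r * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


if __name__ == "__main__":
    state = [0.0, 0.0, 1.0, 0.0, 0.0]  # a single hot cell
    for _ in range(3):
        print(state)
        state = heat_step(state, r=0.25)
```

Shrinking the cell size and time step together is what turns this discrete local rule back into the differential equation.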
Nice post. Would the larger amount of code result in different performance in a scenario where other code is being run as well, or would the instruction cache be large enough to make this a non-issue?