Is there a metric I can look at in engine evaluations to determine when a situation is "risky" for white or black (e.g., the situation above) even if it looks equal with perfect play?
I've always been interested in understanding situations where this is the case (and the opposite, where the engine favours one side but it seems to require a long, hard-to-find sequence of moves).
Playing out the top lines helps if equality requires perfect play from one side.
You can measure the sharpness of the position, as in section 2.3 ("Complexity of a position") of this paper. They find their metric correlates with human performance.
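For anyone who wants to play with something similar: I don't remember the paper's exact formula, but one classic proxy in that spirit (Guid and Bratko's complexity measure, if I recall it right) accumulates the eval swing every time a deeper search changes its mind about the best move. A rough python-chess sketch, with the engine path and depth range as placeholders:

```python
import chess
import chess.engine

def complexity(board, engine, max_depth=16):
    """Crude complexity proxy: sum up the eval swing every time the
    engine changes its preferred move as search depth increases."""
    total = 0
    prev_move, prev_cp = None, None
    for depth in range(4, max_depth + 1):
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        if "pv" not in info:
            continue
        move = info["pv"][0]
        cp = info["score"].pov(board.turn).score(mate_score=10000)
        if prev_move is not None and move != prev_move:
            total += abs(cp - prev_cp)
        prev_move, prev_cp = move, cp
    return total

# engine path is a placeholder
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
print(complexity(chess.Board(), engine))
engine.quit()
```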
I think this is something a bit different. That sort of assessment is going to find humans perform poorly in extremely sharp positions with lots of complicated lines that are difficult to evaluate. And that is certainly true. A tactical position that a computer can 'solve' in a few seconds can easily be missed by even very strong humans.
But the position Ding was in was neither sharp nor complex. A good analog to the position there is the rook + bishop v rook endgame. With perfect play that is, in most cases, a draw - and there are even formalized drawing techniques in any endgame text. But in practice it's really quite difficult, to the point that even grandmasters regularly lose it.
In those positions, on almost every move, nearly any move holds the draw. But the side with the bishop does have ways to inch up the pressure, so the difficulty is making sure you recognize when you finally reach one of those moves where you actually need to deal with a concrete threat. The position Ding forced was very similar.
Almost any move, on almost every turn, led to a draw - until it didn't. Gukesh had all sorts of ways to try to prod at Ding's position and make progress - prodding Ding's bishop, penetrating with his king, maneuvering his bishop to a stronger diagonal, cutting off Ding's king, and of course eventually pushing one of the pawns. He was going to be able to play for hours, constantly prodding, while Ding would have to stay 100% alert for the moment a critical threat emerged.
And this is all why Ding lost. His final mistake looks (and was) elementary, and he noticed it immediately after moving - but the reason he made that mistake is that he was thinking about how to parry the countless other dangerous threats, and he simply missed one. This is why almost everybody was shocked that Ding went for this endgame. It's just so dangerous in practical play, even if the computer can easily show you a zillion ways to draw it.
This is really interesting, because I ran into a Pokémon bot the other day whose training led it to calibrate to a 50% winrate at all levels of play on Pokémon Showdown. It was a complete accident.
It's not hard to make a chess bot that plays at a 1300 strength, i.e. its rating would converge to 1300 if it were allowed to compete. But it will not play like a 1300-rated human. It would play like a superhuman genius on most moves and then make beginner-level blunders at random moments.
Making one that realistically plays like a human is an unsolved problem.
Of course, you are right. But (the linked site) at least has a bot that plays the opening exactly like a human of a chosen rating. It stops working after the opening stage (since it just copies moves from humans in the lichess game database), but it is still very impressive. For later game stages, some other method would have to be used (unless we play multiple orders of magnitude more games on lichess).
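If anyone wants to tinker with the same idea, Lichess exposes its opening explorer as a public API that can be queried per rating bucket, and a reply can be sampled in proportion to how often humans actually played each move. A rough sketch (endpoint and parameter names are from memory, so check the API docs):

```python
import random
import requests  # third-party; pip install requests

def human_book_move(fen, rating_bucket=1400):
    """Sample a move in proportion to how often humans in the given
    rating bucket played it, via the Lichess opening explorer."""
    r = requests.get(
        "https://explorer.lichess.ovh/lichess",
        params={"variant": "standard", "fen": fen, "ratings": rating_bucket},
        timeout=10,
    )
    r.raise_for_status()
    moves = r.json().get("moves", [])
    if not moves:
        return None  # out of book
    weights = [m["white"] + m["draws"] + m["black"] for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]["san"]
```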
Now that I think about it, I remember the people in the AlphaGo documentary talking about the bot giving each of its moves two percentages: how high the move's winning chance was, and how likely a human would have been to make the same move. I wonder why they never showed what a full game of the most human-like moves from AlphaGo would look like. Maybe it would actually have worked, by feeding it all the pro games in existence and training it to play the most human-likely moves instead of the highest-win-probability moves like they did in the end.
I think this can be achieved with some ease with a machine learning model. You will have to train it on games between 1300-rated players and below. A transformer model might work even better in terms of the evenness of play (behaving like a 1300 rated player throughout the game).
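As a starting point for the training data, the lichess PGN dumps can be filtered to the target rating band with python-chess before any model training; something like this, relying on the standard WhiteElo/BlackElo headers:

```python
import chess.pgn

def games_in_band(pgn_path, lo=1200, hi=1400):
    """Yield games from a PGN dump where both players sit in [lo, hi]."""
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            try:
                white = int(game.headers.get("WhiteElo", "0"))
                black = int(game.headers.get("BlackElo", "0"))
            except ValueError:
                continue  # missing or non-numeric ratings
            if lo <= white <= hi and lo <= black <= hi:
                yield game
```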
Take the computer which beats Magnus and restrain it to never make the best move in a position. Expand this to N best moves as needed to reach 1300 rating.
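In python-chess terms the idea would look roughly like this; N and the search depth are knobs you'd have to tune against real ratings:

```python
import random
import chess
import chess.engine

def handicapped_move(board, engine, n=4, depth=12):
    """Ask for the engine's top-n moves and deliberately avoid its first
    choice (unless it is the only move available)."""
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=n)
    candidates = [info["pv"][0] for info in infos]
    if len(candidates) > 1:
        candidates = candidates[1:]  # drop the engine's best move
    return random.choice(candidates)
```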
Even 1300s sometimes make the best move. Sometimes the best move is really easy to see or even mandatory, like if you are in check and MUST take that checking piece. Sometimes the best move is only obvious if you can look 20 moves ahead. Sometimes the best move is only obvious if you can look 5 moves ahead, but the line is so forcing that even 1300s can look that far ahead.
Despite decades of research, nobody has found a good way to make computers play like humans.
Then I can't refrain from asking: what's the style of LLMs? For example ChatGPT, which is apparently rated around 1800? That should be completely different from that of a classic chess engine.
LLMs can be trained on chess games, but the tree of possible board states branches so fast that for any given position there is simply very little training data available. Even the billions of games played on chess.com and lichess are only a drop in the bucket compared to how many possible board states there are. This would have to be split further by rating range, so the amount of games for any given rating range would be even lower.
This means that the LLM does not actually have a lot of training data available to learn how a 1300 would play, and subsequently does a poor job of imitating it. There are a bunch of papers available online if you want more info.
LLMs already do play at elo ~1400-1800. The question was what their style feels like to someone who can appreciate the difference between a human player and a chess engine (and the different styles of different human players).
I can’t speak for ChatGPT, but your intuition is correct that LLMs tend to play more like “humans” than Stockfish or other semi-brute force approaches.
You've identified a potential strategy by which a computer can play like a 1300-rated player, but not one where it will "play like a 1300-rated human". Patzers can still find and make moves in your set of N (if only by blind chance).
Yeah, you would have to weight the moves based on how "obvious" each one is - things like how active the piece has been, how many moves until it leads to winning material, or other such 'bad habits' humans fall for.
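One crude way to sketch that weighting: score each candidate move by features a human latches onto (captures, checks, the piece that just moved) and sample accordingly, possibly restricted to the engine's top-N candidates from the idea above. The feature weights here are completely made up:

```python
import math
import random
import chess

def obviousness(board, move, last_to_square=None):
    """Toy 'how likely is a human to consider this move' score."""
    score = 1.0
    if board.is_capture(move):
        score += 2.0   # captures jump out at you
    if board.gives_check(move):
        score += 1.5   # so do checks
    if last_to_square is not None and move.from_square == last_to_square:
        score += 1.0   # habit of moving the same active piece again
    return score

def sample_humanlike(board, last_to_square=None, temperature=1.0):
    """Pick a legal move with probability rising with its 'obviousness'."""
    moves = list(board.legal_moves)
    weights = [math.exp(obviousness(board, m, last_to_square) / temperature)
               for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]
```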
I think there's a real difference between "a computer" (in this context meaning an algorithm written by a human, possibly calibrated with a small number of parameters but not trained in any meaningful sense) and a "chess model", which works as you describe.
I think the chess model would be successful at producing the desired outcome but it's not as interesting. There's something to be said for being able to write down in precise terms how to play imperfectly in a manner that feels like a single cohesive intelligence strategizing against you.
Yes, the Leela team has worked on a feature they call Contempt.
(Negative contempt in this case would make the engine seek out less sharp play from white's perspective.)
In the first link the author talks about using contempt to seek out/avoid sharp lines.
lc0 and nibbler are free, so feel free to try it out if curious.
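For reference, you can drive lc0 from python-chess and, if your build exposes it, set the contempt option there. Option names and units vary between lc0 versions, so treat this as a sketch and check your version's docs:

```python
import chess
import chess.engine

# Binary path is a placeholder; the "Contempt" option name/units depend
# on the lc0 version, so this may need adjusting.
engine = chess.engine.SimpleEngine.popen_uci("lc0")
try:
    engine.configure({"Contempt": -100})  # negative: steer toward quieter play
except chess.engine.EngineError:
    print("This lc0 build does not expose a Contempt option")

board = chess.Board()
print(engine.analyse(board, chess.engine.Limit(nodes=10_000))["score"])
engine.quit()
```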
Humans cannot out-FLOP a computer, so they need to use patterns (like an LLM). To get the human perspective, the engine would need to do something similar.
There are several neural network based engines these days, including one that does exclusively what you describe (i.e. "patterns only", no calculation at all), and one that's trained on human games.
Even Stockfish uses a neural network these days by default for its positional evaluation, but it's relatively simple/lightweight in comparison to these, and it gains its strength from being used as part of deep search, rather than using a powerful/heavy neural network in a shallow tree search.
Have you tried Maia? I haven't myself (there isn't one in my ballpark level yet), but supposedly it plays more human due to being trained mostly on human play, not engine evaluations or self-play.
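For anyone curious: the Maia networks are ordinary lc0 weight files, and the authors recommend running them with a single-node search so the raw policy picks the move. Via python-chess that looks roughly like this (binary and weights paths are placeholders):

```python
import chess
import chess.engine

# lc0 binary and Maia weights paths are placeholders
engine = chess.engine.SimpleEngine.popen_uci(["lc0", "--weights=maia-1500.pb.gz"])
board = chess.Board()
result = engine.play(board, chess.engine.Limit(nodes=1))  # policy-only move
print(board.san(result.move))
engine.quit()
```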
I think that's not quite the point. Leela has an advantage over AB chess engines, where it has multi-PV for "free", meaning it will evaluate multiple lines by default at no cost to performance (traditional engines, like Stockfish, will lose elo with multi-PV). This allows us to know at a glance if a position is "draw/win with perfect play" or if there is margin for error. If Leela shows multiple moves where one side maintains a winning advantage/losing disadvantage/equality, we can use that as a computer-based heuristic to know if a position is "easy" to play or not.
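As a rough heuristic, that margin for error can be read straight off a multi-PV analysis: count how many of the top-k moves stay within some threshold of the best line. A python-chess sketch (the threshold and node count are arbitrary, and as noted this is cheap for Leela but costs a traditional engine some strength):

```python
import chess
import chess.engine

def margin_for_error(board, engine, k=5, threshold_cp=50, nodes=100_000):
    """Number of top-k moves whose eval stays within threshold_cp of the
    best line for the side to move."""
    infos = engine.analyse(board, chess.engine.Limit(nodes=nodes), multipv=k)
    scores = [i["score"].pov(board.turn).score(mate_score=10000) for i in infos]
    best = scores[0]
    return sum(1 for s in scores if best - s <= threshold_cp)
```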
Yes and no – the number of playable lines does not necessarily tell us how "obvious" those lines are to find for a human.
To give a trivial example, if I take your queen, then recapturing my queen is almost always the single playable move. But it's also a line that you will easily find!
Conversely, in a complex tactical position, (even) multiple saving moves could all be very tricky for a human to calculate.
I wonder if there's a combined metric that could be calculated. Depth of the line certainly would be impactful: a line that only works if you find five "only moves" in a row is harder to find than a single-move line. "Quiet" moves are probably harder to find than captures or direct attacks. Backwards moves are famously tricky to spot. Etc.
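A toy version of such a combined metric might score a principal variation by how many quiet, backward, or deep moves the side in question has to find; the weights below are invented, purely to illustrate the shape of the idea:

```python
import chess

def backward(board, move):
    """True if the side to move retreats toward its own back rank."""
    delta = chess.square_rank(move.to_square) - chess.square_rank(move.from_square)
    return delta < 0 if board.turn == chess.WHITE else delta > 0

def line_difficulty(board, pv):
    """Toy 'hard for a human to find' score for a line, judged from the
    perspective of the side to move at the start of the line."""
    score = 0.0
    b = board.copy()
    for ply, move in enumerate(pv):
        if ply % 2 == 0:                      # only the moves *we* must find
            if not b.is_capture(move) and not b.gives_check(move):
                score += 1.0                  # quiet moves are harder to spot
            if backward(b, move):
                score += 1.5                  # backward moves are famously tricky
            score += 0.2 * (ply // 2)         # deeper only-moves count for more
        b.push(move)
    return score
```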
And also, humans vary wildly in their thinking and what's "obvious" to them. I'm about 1950 and am good in openings and tactics (but not tactics for the opponent). Others around the same rating are much worse than that but they understand positional play much better - how to use weak squares, which pieces to exchange and so on. To me that's a kind of magic.
Not really because it’s subjective to the level of player. What’s a blunder to a master player might only be an inaccuracy to a beginner. The same applies for higher levels of chess player. I’ve watched GothamChess say “I’ve no idea why <INSERT GM> made this move but it’s the only move,” then Hikaru Nakamura will rattle off a weird 8-move sequence to explain why it’s a major advantage despite no pieces being lost. Stockfish is a level above even Magnus if given enough depth.
> Stockfish is a level above even Magnus if given enough depth.
"a level" and "if given enough depth" are both underselling it. Stockfish running on a cheap phone with equal time for each side will beat Magnus 100 games in a row.
I believe it’s something like 500 elo points difference at this point between Magnus and Stockfish running on cheap hardware. Computers are so strong the only way to measure their strength is against other, weaker computers, and so on until you get to engines that are mere “grandmaster” strength.
Bear in mind that, beyond the "top" elo ratings, it's purely an estimate of relative strength. The gap between a GM and me is far greater than the gap between a GM and Stockfish, even if the stated elo difference is the same.
By this I mean, you can give me a winning position against Magnus and I’ll still lose. Give a winning position to Magnus vs Stockfish and he might draw or even win.
True, what is considered a “winning” position is different at different elo levels. The better someone is, the smaller their mistakes are relative to perfect play.
I wish top players like Magnus would do more exhibition games against top computers. They don’t have to all start with equal material or an equal position.
That's fair, I was leaving wiggle room for things like being able to force the engine into doing stupid things, like sacrificing all its pieces to avoid stalemate.
Maybe the difference between the eval of the best move vs the next one(s)? An "only move" situation would be more risky than when you have a choice between many good moves.
That's it exactly. Engines will often show you at least 3 lines, each with their evaluation, and you can often judge the difficulty just from that delta between the 1st and 2nd best move. With some practical chess experience you can also "feel" how natural or esoteric the best move is.
In the WCC match between Caruana and Carlsen, they reached one difficult endgame where Carlsen (the champion) made a move that engines flagged as a "blunder" because it allowed a theoretical checkmate in something like 36(!) moves, but no commentator took it seriously, as there was "no way" a human would be able to spot the chance and calculate it correctly under the clock.
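With python-chess, that 1st-to-2nd-best delta can be computed from a two-line multi-PV analysis; something like:

```python
import chess
import chess.engine

def only_move_gap(board, engine, depth=20):
    """Centipawn gap between the best and second-best move for the side
    to move; a large gap hints at an 'only move' situation."""
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
    if len(infos) < 2:
        return None  # literally only one legal move
    scores = [i["score"].pov(board.turn).score(mate_score=10000) for i in infos]
    return scores[0] - scores[1]
```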
Not necessarily. If that "only move" is obvious, then it's not really risky. Like if a queen trade is offered and the opponent accepts, then typically the "only move" that doesn't massively lose is to capture back. But that's extremely obvious, and doesn't represent a sharp or complex position.
Those are tree-search techniques; they are not metrics to assess the "human" complexity of a line. They could be used for that purpose, but out of the box they just give you a winning probability.
If this is a faithful reimplementation of the AlphaZero algorithm (and I haven't looked through the code to confirm whether or not it is) then you'd expect equal performance to the published results after enough iterations of training. But the author probably doesn't have the resources to train agents on the same scale as Google did, and so performance in your own usage would largely come down to how long you can afford to train for.