
Is there a metric I can look at in engine evaluations to determine when a situation is "risky" for white or black (e.g., the situation above) even if it looks equal with perfect play?

I've always been interested in understanding situations where this is the case (and the opposite, where the engine favours one side but it seems to require a long, hard-to-find sequence of moves).

Playing out the top lines helps if equality requires perfect play from one side.


You can measure the sharpness of the position, as in section 2.3 of this paper, "Complexity of a position". They find their metric correlates with human performance.

https://en.chessbase.com/news/2006/world_champions2006.pdf


I think this is something a bit different. That sort of assessment is going to find humans perform poorly in extremely sharp positions with lots of complicated lines that are difficult to evaluate. And that is certainly true. A tactical position that a computer can 'solve' in a few seconds can easily be missed by even very strong humans.

But the position Ding was in was neither sharp nor complex. A good analog to the position there is the rook + bishop v rook endgame. With perfect play that is, in most cases, a draw - and there are even formalized drawing techniques in any endgame text. But in practice it's really quite difficult, to the point that even grandmasters regularly lose it.

In those positions, on most moves, nearly any move is a draw. But the side with the bishop does have ways to ratchet up the pressure, and so the difficulty is making sure you recognize when you finally reach one of those moments where you actually need to deal with a concrete threat. The position Ding forced was very similar.

On most moves, nearly any move led to a draw - until it didn't. Gukesh had all sorts of ways to try to prod at Ding's position and make progress - prodding Ding's bishop, penetrating with his king, maneuvering his bishop to a stronger diagonal, cutting off Ding's king, and of course eventually pushing one of the pawns. He was going to be able to play for hours, just constantly prodding, while Ding would have to stay 100% alert to when a critical threat emerged.

And this is all why Ding lost. His final mistake looks (and was) elementary, and he noticed it immediately after moving - but the reason he made that mistake is that he was thinking about how to parry countless other dangerous threats, and he simply missed one. This is why nearly everybody was shocked that Ding went for this endgame. It's just so dangerous in practical play, even if the computer can easily show you a zillion ways to draw it.


Nice paper. I'd like it if someone re-ran the numbers using modern chess engines… the engine they used is exceedingly weak by modern standards.


Actually, it is so weak that it would be stomped 1000:0 by modern engines. I like the methodology too, but the conclusions are not defensible.


Making a computer play like a 1300-rated human is harder than making a computer beat Magnus Carlsen.


This is really interesting because I ran into a Pokémon bot the other day where its training led to a calibration of 50% winrate at all levels of play on Pokémon Showdown. It was a complete accident.


It's not hard to make a chess bot that plays at a 1300 strength, i.e. its rating would converge to 1300 if it were allowed to compete. But it will not play like a 1300-rated human. It would play like a superhuman genius on most moves and then make beginner-level blunders at random moments.

Making one that realistically plays like a human is an unsolved problem.


Of course, you are right. But (the linked site) at least has a bot that plays the opening exactly like a human of a chosen rating. It stops working after the opening stage (since it just copies moves from humans in the lichess game database), but it is still very impressive. For later game stages, some other method would have to be used (unless we play multiple orders of magnitude more games on lichess).

Now that I think about it, I remember the people in the AlphaGo documentary talking about the bot giving its moves two percentage scores: how high the move's winning % was, and how high the % chance was that a human would have made the same move it just played. I wonder why they never showed what a full game of the most human-like moves from AlphaGo would look like. Maybe it would actually have worked, by feeding it all the pro games in existence and training it to play the highest human % move instead of the highest-win-probability moves like they did in the end.

https://www.chessassess.com/openings


So like a 1500 rated human?


I think this can be achieved with some ease with a machine learning model. You will have to train it on games between 1300-rated players and below. A transformer model might work even better in terms of the evenness of play (behaving like a 1300 rated player throughout the game).
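
If you want to play with that idea, here's a rough sketch of the data side using python-chess (the PGN path is a placeholder, and the actual model, transformer or otherwise, is left out):

    # Sketch: build (position, move-played) pairs from games between ~1300-rated
    # players, e.g. from a lichess PGN dump with WhiteElo/BlackElo headers.
    # The model itself (a small transformer over these pairs) is out of scope here.
    import chess.pgn

    def rated_positions(pgn_path, lo=1200, hi=1400):
        with open(pgn_path) as f:
            while True:
                game = chess.pgn.read_game(f)
                if game is None:
                    break
                try:
                    w = int(game.headers.get("WhiteElo", 0))
                    b = int(game.headers.get("BlackElo", 0))
                except ValueError:
                    continue  # skip games with unknown ratings ("?")
                if not (lo <= w <= hi and lo <= b <= hi):
                    continue
                board = game.board()
                for move in game.mainline_moves():
                    # (FEN before the move, move actually played) = one example
                    yield board.fen(), move.uci()
                    board.push(move)

    # e.g. inspect a few training pairs:
    for fen, uci in rated_positions("lichess_db.pgn"):  # placeholder path
        print(fen, uci)
        break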


> I think this can be achieved with some ease with a machine learning model.

What evidence led you to think that, and how surprised would you be to be wrong?



ah that makes sense. thanks!


Playing a chess bot that works this way feels like playing a Magnus Carlsen who's trying to let you win.


But that doesn’t imply that that bot played like an average human.

Making a computer have a 50% score against a 1300-rated human is way easier than making it play like a 1300-rated human.

For the former, you can take a top-of-the-line program and have it flip a coin at the start of each game to decide whether to play a random move on every move or not.
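
Something like this, roughly (python-chess sketch; the engine path, the 0.1s move time, and using plain Stockfish as the stand-in opponent are all just assumptions):

    # Toy "coin flip" bot: once per game, decide whether to play pure Stockfish
    # moves or uniformly random legal moves for the whole game. It can score
    # around 50% against many opponents while playing nothing like them.
    import random
    import chess
    import chess.engine

    def coin_flip_move(board, engine, plays_randomly):
        if plays_randomly:
            return random.choice(list(board.legal_moves))
        return engine.play(board, chess.engine.Limit(time=0.1)).move

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption
    plays_randomly = random.random() < 0.5                     # the per-game coin flip
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:   # the coin-flip bot plays White
            move = coin_flip_move(board, engine, plays_randomly)
        else:                           # stand-in opponent: plain Stockfish
            move = engine.play(board, chess.engine.Limit(time=0.1)).move
        board.push(move)
    print(board.result())
    engine.quit()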


Definitely, but it seems like it's now possible: https://www.maiachess.com/


Take the computer which beats Magnus and restrain it to never make the best move in a position. Expand this to N best moves as needed to reach 1300 rating.
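
Roughly like this with python-chess and MultiPV (N, the depth, and the engine path are guesses, and as the replies point out, it still won't feel like a 1300-rated human):

    # Handicap sketch: analyse with MultiPV and play a random move from the
    # 2nd..Nth best lines, never the engine's top choice.
    import random
    import chess
    import chess.engine

    def handicapped_move(board, engine, n=4, depth=16):
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=n)
        candidates = [info["pv"][0] for info in infos if "pv" in info]
        if len(candidates) > 1:
            candidates = candidates[1:]   # drop the best move
        return random.choice(candidates)

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption
    board = chess.Board()
    print(board.san(handicapped_move(board, engine)))
    engine.quit()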


Even 1300s sometimes make the best move. Sometimes the best move is really easy to see or even mandatory, like if you are in check and MUST take that checking piece. Sometimes the best move is only obvious if you can look 20 moves ahead. Sometimes the best move is only obvious if you can look 5 moves ahead, but the line is so forcing that even 1300s can look that far ahead.

Despite decades of research, nobody has found a good way to make computers play like humans.


Then I can't refrain from asking: what's the style of LLMs like? For example ChatGPT, which is apparently rated around 1800? It should be completely different from that of a classic chess engine.


LLMs can be trained on chess games, but the tree of possible board states branches so fast that for any given position there is simply very little training data available. Even the billions of games played on chess.com and lichess are only a drop in the bucket compared to how many possible board states there are. This would have to be split further by rating range, so the amount of games for any given rating range would be even lower.

This means that the LLM does not actually have a lot of training data available to learn how a 1300 would play, and subsequently does a poor job at imitating it. There are a bunch of papers available online if you want more info.


LLMs already do play at elo ~1400-1800. The question was what their style feels like to someone who can appreciate the difference between a human player and a chess engine (and the different styles of different human players).


I can’t speak for ChatGPT, but your intuition is correct that LLMs tend to play more like “humans” than Stockfish or other semi-brute force approaches.


ChatGPT will hallucinate and make impossible/invalid moves frequently, so I don't see how it could have a chess rating


That's not the case. Depending on the version, (Chat)GPT seems to be able to play between ~1400 and ~1800 elo, very rarely making invalid moves.


You've identified a potential strategy by which a computer can play like a 1300-rated player, but not one where it will "play like a 1300-rated human". Patzers can still find and make moves in your set of N (if only by blind chance).


Yeah, you would have to weigh the moves based on how "obvious" they are, such as how active the piece has been, how many moves until it leads to winning material, or other such 'bad habits' humans fall for.


This won't work. With that strategy, you can make a computer play like a 1300-rated player, but not like a 1300-rated human player.


That's kind of what they do for "training" bots and it produces something which plays NOTHING like a 1300-rated human.


I assume you could just give the computer a large set of 1300 rated games and train it to predict moves from that set :)


I think there's a real difference between "a computer" (in this context meaning an algorithm written by a human, possibly calibrated with a small number of parameters but not trained in any meaningful sense) and a "chess model" which works as you describe.

I think the chess model would be successful at producing the desired outcome but it's not as interesting. There's something to be said for being able to write down in precise terms how to play imperfectly in a manner that feels like a single cohesive intelligence strategizing against you.


The metric is to play the position against Stockfish. If you draw it again and again, it is trivial; otherwise, it's not so simple :-)
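
You can't automate the human half of that, but a rough proxy is to let a strength-limited engine defend the position over and over against full-strength Stockfish and count the draws (sketch with python-chess; the FEN, the Elo cap, and which side the limited engine plays are all placeholders):

    # Rough proxy for "play it out against Stockfish": a strength-limited engine
    # defends repeatedly vs. full strength; the draw rate hints at how forgiving
    # the position is in practice.
    import chess
    import chess.engine

    FEN = "8/8/4k3/8/8/4K3/4R3/4r3 w - - 0 1"   # placeholder; use your position

    def defend_once(fen, elo=1700, movetime=0.1):
        strong = chess.engine.SimpleEngine.popen_uci("stockfish")
        weak = chess.engine.SimpleEngine.popen_uci("stockfish")
        weak.configure({"UCI_LimitStrength": True, "UCI_Elo": elo})
        board = chess.Board(fen)
        try:
            while not board.is_game_over():
                # here the limited engine plays White; adapt to the defending side
                side = weak if board.turn == chess.WHITE else strong
                board.push(side.play(board, chess.engine.Limit(time=movetime)).move)
            return board.result()
        finally:
            strong.quit()
            weak.quit()

    results = [defend_once(FEN) for _ in range(20)]
    print(results.count("1/2-1/2"), "draws out of", len(results))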


Yes, the Leela team has worked on a term they call Contempt. (Negative contempt in this case would make the engine seek out less sharp play from White's perspective.) In the first link the author talks about using contempt to seek out or avoid sharp lines. lc0 and Nibbler are free, so feel free to try it out if you're curious.

https://github.com/LeelaChessZero/lc0/pull/1791#issuecomment... https://lczero.org/blog/2023/07/the-lc0-v0.30.0-wdl-rescale/...
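
If you want to experiment from a script rather than Nibbler, something like this might work with python-chess, though I'm not sure of the exact option names/units in every lc0 build, so the code checks what the engine exposes first (the lc0 path and the -100 value are guesses):

    # Sketch: nudge lc0 away from sharp play via its contempt-related options.
    # Option names and units vary by lc0 version (see the linked PR/blog), so
    # check what your build actually supports before configuring.
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("lc0")  # assumes lc0 on PATH
    print(sorted(engine.options))                        # see what's available

    if "Contempt" in engine.options:
        engine.configure({"Contempt": -100})  # guess: negative = prefer drawish lines

    board = chess.Board()
    info = engine.analyse(board, chess.engine.Limit(nodes=10000))
    print(info["score"])
    engine.quit()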


You can evaluate on lower depth/time.

But even that isn't a good proxy.

Humans cannot out-FLOP a computer, so they need to use patterns (like an LLM). To get the human perspective, the engine would need to do something similar.
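
For what it's worth, here's what the low-depth idea looks like in code, if you want to see where it falls short yourself (python-chess sketch; the depths and engine path are arbitrary):

    # Shallow vs. deep eval gap as a rough "hidden danger" signal: a big gap
    # suggests the position hides something a quick look would miss.
    import chess
    import chess.engine

    def eval_cp(engine, board, depth):
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        return info["score"].pov(board.turn).score(mate_score=100000)

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path is an assumption
    board = chess.Board()                                      # or chess.Board(fen)
    shallow = eval_cp(engine, board, depth=6)
    deep = eval_cp(engine, board, depth=22)
    print("shallow:", shallow, "deep:", deep, "gap:", abs(deep - shallow))
    engine.quit()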


There are several neural network based engines these days, including one that does exclusively what you describe (i.e. "patterns only", no calculation at all), and one that's trained on human games.

Even Stockfish uses a neural network these days by default for its positional evaluation, but it's relatively simple/lightweight in comparison to these, and it gains its strength from being used as part of deep search, rather than using a powerful/heavy neural network in a shallow tree search.

[1] https://arxiv.org/html/2402.04494v1

[2] https://www.maiachess.com/


Definitely. And Google's AlphaZero did it years ago.

I don't think the patterns are very human, but they are very cool.


Have you tried Maia? I haven't myself (there isn't one in my ballpark level yet), but supposedly it plays more human due to being trained mostly on human play, not engine evaluations or self-play.


I have not.

Thank you.



This is great, but I think that % is about the "correctness" of the move, not how likely it is to be played next.


I think that's not quite the point. Leela has an advantage over AB chess engines, where it has multi-PV for "free", meaning it will evaluate multiple lines by default at no cost to performance (traditional engines, like Stockfish, will lose elo with multi-PV). This allows us to know at a glance if a position is "draw/win with perfect play" or if there is margin for error. If Leela shows multiple moves where one side maintains a winning advantage/losing disadvantage/equality, we can use that as a computer-based heuristic to know if a position is "easy" to play or not.
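
As a concrete heuristic, you can count how many of the top-k lines stay within some tolerance of the best one (python-chess sketch; k, the 50cp tolerance, the depth, and the engine path are all just guesses):

    # Margin-for-error heuristic: how many of the top-k moves keep the eval
    # within tol_cp centipawns of the best move? More = easier to play.
    import chess
    import chess.engine

    def playable_moves(engine, board, k=8, tol_cp=50, depth=18):
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=k)
        scores = [i["score"].pov(board.turn).score(mate_score=100000) for i in infos]
        best = scores[0]                       # MultiPV results come best-first
        return sum(1 for s in scores if best - s <= tol_cp)

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # or "lc0"
    board = chess.Board()
    print("moves within tolerance of best:", playable_moves(engine, board))
    engine.quit()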


Yes and no – the number of playable lines does not necessarily tell us how "obvious" those lines are to find for a human.

To give a trivial example, if I take your queen, then recapturing my queen is almost always the single playable move. But it's also a line that you will easily find!

Conversely, in a complex tactical position, (even) multiple saving moves could all be very tricky for a human to calculate.


I wonder if there's a combined metric that could be calculated. Depth of the line certainly would be impactful. A line that only works if you find five consecutive only moves is harder to see than a single-move line. "Quiet" moves are probably harder to find than captures or direct attacks. Backwards moves are famously tricky to spot. Etc.
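
A toy version of such a combined score for a single candidate move might look like this (the features follow the list above, but the weights are completely made up; a real metric would be fit against human game data, Maia-style):

    # Toy "findability" score for one move: forcing moves score higher,
    # backward moves and long forced follow-ups score lower.
    import chess

    def findability(board, move, line_length):
        score = 0.0
        if board.is_capture(move) or board.gives_check(move):
            score += 1.0                 # forcing moves are easier to spot
        piece = board.piece_at(move.from_square)
        if piece is not None:
            forward = 1 if piece.color == chess.WHITE else -1
            delta = chess.square_rank(move.to_square) - chess.square_rank(move.from_square)
            if forward * delta < 0:
                score -= 0.5             # backward moves are famously hard to see
        score -= 0.2 * line_length       # long only-move sequences are harder
        return score

    board = chess.Board()
    print(findability(board, chess.Move.from_uci("g1f3"), line_length=1))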


And also, humans vary wildly in their thinking and what's "obvious" to them. I'm about 1950 and am good in openings and tactics (but not tactics for the opponent). Others around the same rating are much worse than that but they understand positional play much better - how to use weak squares, which pieces to exchange and so on. To me that's a kind of magic.


Not really, because it's relative to the level of the player. What's a blunder to a master player might only be an inaccuracy to a beginner. The same applies at higher levels of play. I've watched GothamChess say “I’ve no idea why <INSERT GM> made this move but it’s the only move,” then Hikaru Nakamura will rattle off a weird 8-move sequence to explain why it's a major advantage despite no pieces being lost. Stockfish is a level above even Magnus if given enough depth.


> Stockfish is a level above even Magnus if given enough depth.

"a level" and "if given enough depth" are both underselling it. Stockfish running on a cheap phone with equal time for each side will beat Magnus 100 games in a row.


I believe it’s something like 500 elo points difference at this point between Magnus and Stockfish running on cheap hardware. Computers are so strong the only way to measure their strength is against other, weaker computers, and so on until you get to engines that are mere “grandmaster” strength.


Bear in mind that, beyond the “top” elo ratings, it's purely an estimate of relative strength. The gap between a GM and me is far greater than the gap between a GM and Stockfish, even if the stated elo difference is the same.

By this I mean, you can give me a winning position against Magnus and I’ll still lose. Give a winning position to Magnus vs Stockfish and he might draw or even win.


True, what is considered a “winning” position is different at different elo levels. The better someone is, the smaller their mistakes are relative to perfect play.

I wish top players like Magnus would do more exhibition games against top computers. They don’t have to all start with equal material or an equal position.


That’s fair, I was leaving wiggle room for things like being able to force the engine into doing stupid things like sacrifice all its pieces to avoid stalemate.


Maybe the difference between the eval of the best move vs the next one(s)? An "only move" situation would be more risky than when you have a choice between many good moves.
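
That gap is easy to read straight off a MultiPV analysis (python-chess sketch; engine path and depth are assumptions):

    # "Only move" signal: centipawn gap between the best and second-best line.
    # A huge gap means one saving move; a small gap means several playable choices.
    import chess
    import chess.engine

    def best_move_gap(engine, board, depth=18):
        infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=2)
        s = [i["score"].pov(board.turn).score(mate_score=100000) for i in infos]
        return s[0] - s[1] if len(s) > 1 else None   # None: only one legal move

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    print("gap (cp):", best_move_gap(engine, chess.Board()))
    engine.quit()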


That's it exactly. Engines will often show you at least 3 lines, each with its evaluation, and you can often judge the difficulty just from that delta from 1st to 2nd best move. With some practical chess experience you can also "feel" how natural or esoteric the best move is.

In the WCC match between Caruana and Carlsen, they were in one difficult endgame where Carlsen (the champion) made a move and the engines called it a "blunder" because there was a theoretical checkmate in something like 36(!) moves, but no commentator took it seriously, as there was "no way" a human would be able to spot the chance and calculate it correctly under the clock.


Not necessarily. If that "only move" is obvious, then it's not really risky. Like if a queen trade is offered and the opponent accepts, then typically the "only move" that doesn't massively lose is to capture back. But that's extremely obvious, and doesn't represent a sharp or complex position.


Yes, it's called Monte Carlo Tree Search (MCTS, used by AlphaZero), as opposed to alpha-beta search (which is what classical chess engines use).


Those are tree search techniques, they are not metrics to assess the "human" complexity of a line. They could be used for this purpose but out of the box they just give you winning probability


If multiple lines have equal-ish winning probability, rather than a single line, then you can sort of translate it to "human" complexity.


Not really – that's the point, engines, for all their awesomeness, just do not know how to assess the likelihood of "human" mistakes.


This is literally the best comment I've seen today!


Do you have evaluations for how well the trained agents do (e.g. for chess, go, etc)?


If this is a faithful reimplementation of the AlphaZero algorithm (and I haven't looked through the code to confirm whether or not it is) then you'd expect equal performance to the published results after enough iterations of training. But the author probably doesn't have the resources to train agents on the same scale as Google did, and so performance in your own usage would largely come down to how long you can afford to train for.


This is not true for all crypto coins. A small proportion of mining hash functions are deliberately designed to thwart ASICs, e.g. https://github.com/tevador/RandomX. It's a cat-and-mouse game, though: CryptoNight was designed to fulfill the same role, but ultimately failed: https://monerodocs.org/proof-of-work/cryptonight/.


The linked paper (https://www.nature.com/articles/s41586-023-06734-w) in the quoted tweet appears to be from the Ceder Group at UC Berkeley, not DeepMind. Is there a different link I'm missing?


As the paper notes, two of the authors are from DeepMind, the chemicals they are synthesizing are those predicted to be stable by a DeepMind model, and the entire work arises out of an extensive collaboration between DeepMind and the Berkeley lab: https://www.nature.com/articles/d41586-023-03745-5


Missed that -- thanks!

