I am curious why the author chose a genetic algorithm rather than standard backprop to distill the evil. Logistic regression seems like a pretty reasonable choice and it’ll be a lot faster than a genetic algorithm. Add an L1 penalty for sparsity.
In the past I’ve tried distilling not just the eval function but the outcome of the search (something like depth 20) in a neural net. It kind of works, but not very well until the net is pretty big. Deepmind did something similar.
Reading over your article again, it seems like you made the same “mistake” as I did. In other words, the evals you’re seeing in the Lichess PGN aren’t the raw outputs of the Stockfish evaluation function, they’re the output of the search, which is a highly nonlinear function of millions of evals. If I recall from their docs, it’s something like depth 18, so your chances of distilling it with 20k parameters is essentially zero.
(I put mistake in quotes here because Deepmind showed that with a server farm of TPUs it is possible to distill the search as well. So it’s not impossible per se.)
But that’s ok! You’ve created a super fast evaluation function. Instead of trying to distill the depth 18 output, it will be much more realistic to distill the depth zero output, the NNUE. If you rerun those FENs in Stockfish you can pretty quickly create an easier dataset.
As part of my undergrad work, I used similar principles to the article (steady state genetic algorithms) to create a bot capable of playing Reversi where the fitness function was loosely defined as a set of "weights" on each square across the 8x8 board. These were used as part of the evaluation function in the classic minimax algorithm.
We trained over the course of 5-6 days, and the end generation was capable of beating an intermediate player but got completely destroyed by experts. It was a fun project!
I especially enjoyed the link to The Bitter Lesson by Rich Sutton, which I hadn't read before. Now I wonder what "discoveries" have been built into today's AI models and how they might come to be detrimental.
Probably LLM maximalist ideas that suggest infinite “scaling laws” for LLMs (they are not laws), leading to ridiculous conclusions like building a $1 trillion cluster is the fastest way to AGI. People like Leopold Aschenbrenner are in this camp.
Imagine if LLMs were the only way we had to play chess. You’d need a centralized server and peak performance wouldn't even best a grandmaster. You spend $1 trillion building a super cluster because that’s all you know.
< - - This is where AI is today.
And then some startup creates Stockfish, a chess engine better than any LLM or grandmaster and can run on a smartphone.
In the past, yes. That was essentially the approach up until the 1980s. Computers were too slow to run millions of evaluations per move, so you got a lot of efforts in a more heuristic-driven direction.
Examples of heuristic engines are Hans Berliner’s CAPS-II and PARADISE by David Wilkins. PARADISE used a database of 200 rules.
There are also engines that try to retroactively apply rules to Stockfish evals but they’re fairly confusing in my experience. It’s like being told the answer and having to come up with a reason after the fact.
All this I guess comes after laying the ground work, like implementing a bitboard representation or something else, and implementing the logic for being able to tell, whether a move is valid or invalid. That in itself is already a lot of work. iiuc the idea here is "merely" writing the engine, and take everything else as a given.
The game’s implementation itself was furnished with the competition by Sebastian Lague.
I completely agree that writing the move logic, validation, etc… is a difficult undertaking especially when it comes to optimisation which is what allows the bots built on top to perform well.
That makes sense for such a competition. Thanks for making this even clearer!
Of course another interesting competition could be to develop _all_ of a chess game and engine and see how low in code people can go. But that would be an entirely different competition.
There are open components for generic purposes, e.g. draw the board given a PGN, show the moves, check if moves are legal and so on, but they aren't suitable for engine internals. There are several tricks with bits to store positions, hashing schemes for transposition tables and the likes that are key to unlock the maximum performance possible, and they are usually tied to the engine implementation.
This is something that's been on my bucket list for a while now, although I'd do it using "normal" programming instead of DL.
i feel like if you can bruteforce every possible board configuration for the next 3 turns and then make the move that leads to more desirable outcomes, that ought to be enough to thrash most amateurs. Probably not good enough to beat somebody who actually understands chess tactics on a deeper level, but I expect most non-masters are just "winging it" and making things up as they go along, so a machine that can think farther ahead than you would win more often than not.
But like I said, this is all just me fantasizing about a project I haven't even started yet so my hypothesis could easily be wrong.
> bruteforce every possible board configuration for the next 3 turns and then make the move that leads to more desirable outcomes
I'd recommend familiarizing yourself with the minimax algorithm - since that's exactly what it does where "desirable outcome" is essentially the tunable evaluation aspect of the equation.
I did exactly this when the Laser chess variant was posted on Hacker News about last year. It does indeed work pretty well. Players eventually learn patterns that look good in 3 moves but not with a greater depth, though.
It's how real chess engines like Stockfish work, but they are highly optimized so they can search extreme depths.
> make the move that leads to more desirable outcomes
This is kind of begging the question, right? How do you know what a “desirable” outcome is, other than mate-in-N positions? (In your case, N<=3.) The point of an evaluation function is to take a position and return a value which describes how desirable it is.
It's basically just a depth-first tree traversal with some heuristics to weigh different turns against each other (where a turn is defined as a possible board configuration that can be reached from the current configuration).
First you'd have a heuristic that ranks outcomes from turn 3, which would just mean weighing the pieces the computer lost against the pieces the opponent lost. Then you rank each turn-2 node based on how desirable the turn-3 leaf-nodes it leads to are and how likely the opponent is to actually make the moves that lead to it. Then you'd do the same for turn-1 and then pick the move that scored highest.
yeah, maybe. nothing's ever as easy in reality as on paper, but i've always considered my pet projects as long-term endeavors. i can keep working on the same thing for years at a time, and the reason i haven't actually done the chess thing despite often fantasizing about it is that i don't know if i can commit to it without adversely impacting my other pet projects.
> To be exact every participant has a maximum of 1024 tokens at his disposal to craft the best chess bot they can.
I'd be quite tempted to try linear genetic programming with a variant of a Brainfuck-style UTM for this kind of constraint. Assuming 1024 tokens = 1024 bytes = 1024 instructions, I think there is some degree of performance that is feasible.
This is really interesting to me because hand-coding BF to do something like chess is absolutely beyond me. I wouldn't even know where to begin. A LGP-coded BF program would almost certainly outperform anything I could design with my own hands.
This is highly specific to C# which was the language imposed on all participants.
But I agree that some languages might be especially adept to these kinds of tasks and it would be interesting to see which.
They usually struggle to always generate valid moves.
And that's pivotal.
If you have a program which always makes valid moves and gives up when it has lost you wrote a proper chess playing program. It may play badly, but it plays.
Your question sounds flippant but it’s actually quite deep. Why aren’t LLMs good at chess? The answer is likely that to be good at chess at some level requires search (and probably some ordering constraint on your evaluation function too that I’m not smart enough to figure out).
LLMs aren’t searching. They are memorizing. This explains their extremely poor performance on out of domain positions, whereas Stockfish can easily understand them.
I think chess is a great example to use to demonstrate that LLMs don’t actually know anything, and will be entirely unsuitable to replace humans on anything except thoroughly solved and documented problems.
In the past I’ve tried distilling not just the eval function but the outcome of the search (something like depth 20) in a neural net. It kind of works, but not very well until the net is pretty big. Deepmind did something similar.