I also made some pages (around a decade ago) which implement various optimisation algorithms, including GA (with just mutation, with just crossover, and with both). The task they're solving is less interesting though: find the brightest point on a randomly-generated image. In fact, the task is so low-dimensional that plain enumeration is probably the best choice!
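For a sense of scale, a toy version of that task looks something like this. This is just a mutation-only sketch of the idea, not the code from those pages:

```python
import random

# Toy task: find the brightest pixel on a randomly generated greyscale "image"
# using a mutation-only GA (essentially a population of hill climbers).
W, H = 64, 64
image = [[random.random() for _ in range(W)] for _ in range(H)]

def fitness(pos):
    x, y = pos
    return image[y][x]

def mutate(pos):
    x, y = pos
    return (min(W - 1, max(0, x + random.randint(-3, 3))),
            min(H - 1, max(0, y + random.randint(-3, 3))))

population = [(random.randrange(W), random.randrange(H)) for _ in range(20)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                         # truncation selection
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best = max(population, key=fitness)
print(best, fitness(best))

# Enumeration is only W*H = 4096 evaluations here, which is why it wins:
# print(max(((x, y) for y in range(H) for x in range(W)), key=fitness))
```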
I think it's interesting and deserves a longer dive, particularly in car representation and crossover/mutation functions, as well as:
> The optimality of a particular candidate solution (the fitness function) is determined by how far it travels before a) a mass touches the ground or b) time runs out.
How do you rank the millions of possible solutions that immediately die upon touching the ground?
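My read of the quoted fitness is that instant deaths need no special handling, because distance travelled is a continuous score: a car whose mass touches the ground on frame 1 simply scores ~0 and sorts below anything that rolls even a little. A rough sketch, where `step()` and the state fields are hypothetical stand-ins rather than the article's actual code:

```python
MAX_FRAMES = 60 * 30  # assumed limit, e.g. 30 seconds at 60 fps

def evaluate(car, step, initial_state):
    """Distance travelled before (a) a mass touches the ground or (b) time runs out."""
    state = initial_state
    for _ in range(MAX_FRAMES):
        state = step(car, state)               # hypothetical physics step
        if state["mass_touching_ground"]:      # condition (a)
            break
    return state["distance_travelled"]         # condition (b): time ran out

# Sorting the population then needs no special case for early deaths:
# ranked = sorted(population, key=lambda car: evaluate(car, step, init), reverse=True)
```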
Is there a reason that everyone seems to train QWOP RL on the actual game rather than reverse engineering the model? I can't find a single gym which implements the game physics itself in Python; they all rely on Node or on interacting with the browser itself. (Even the most feature-rich, popular one: https://pypi.org/project/qwop-gym/)
At a certain level you can't guarantee the outcome of calculations will be 1:1. For example, even though Python and Node both use IEEE 754 doubles, maths-library implementations and operation ordering can differ, so "identical" simulations will drift apart over time.
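A small illustration of why ordering alone matters: floating-point addition isn't associative, so per-step differences compound over thousands of frames.

```python
# Two mathematically identical results that differ in the last bit.
xs = [0.1] * 10
print(sum(xs))               # 0.9999999999999999
print(0.1 * 10)              # 1.0
print(sum(xs) == 0.1 * 10)   # False
```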
Sure, but it should be possible to run enough tests to show the error over X frames doesn't cross Y. Then you can learn on that model and validate on the real thing (and use it to fine-tune if needed). The chosen action has very little effect after a few frames anyway.
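As a sketch of that validation idea: replay identical action sequences through the real game and the re-implemented physics, and check that the divergence over X frames stays below Y. The two step functions here are hypothetical hooks you'd have to provide (e.g. the browser-driven game vs. a Python port).

```python
import numpy as np

def max_divergence(actions, real_step, model_step, real_state, model_state, x_frames=120):
    """Worst-case state divergence between the two simulators over x_frames."""
    worst = 0.0
    for action in actions[:x_frames]:
        real_state = real_step(real_state, action)
        model_state = model_step(model_state, action)
        worst = max(worst, float(np.linalg.norm(np.asarray(real_state) - np.asarray(model_state))))
    return worst

# e.g. assert max_divergence(seq, real_step, model_step, s0, s0) < Y for many sampled sequences
```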
In this same vein, don't go down the rabbit hole of image libraries. PIL resizing is different from PyTorch resizing. The effects are actually large enough to affect your models. Yes, even one trained at fp16. It can also result in models being displaced on leaderboards just from changing the preprocessing at evaluation time.
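You can see the mismatch directly; here's a quick comparison of the "same" bilinear downscale through both libraries (Pillow's resize is antialiased, `F.interpolate` is not unless you ask for it):

```python
import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

# PIL bilinear downscale (antialiased)
pil = np.asarray(Image.fromarray(img).resize((64, 64), Image.BILINEAR), dtype=np.float32)

# PyTorch bilinear downscale (no antialiasing by default)
t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float()
tor = F.interpolate(t, size=(64, 64), mode="bilinear", align_corners=False)
tor = tor.squeeze(0).permute(1, 2, 0).numpy()

print(np.abs(pil - tor).max())  # far from zero -- easily tens of intensity levels
```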
I think a lot of people, including a lot of ML researchers, would be surprised by these subtle effects and their influence. It really makes metrics even fuzzier than they already are (which is already pretty fuzzy now that we're in high quality domains). But at the same time it __shouldn't__ surprise people given how strong the lottery effect is. Evaluation is just fucking hard, but we all like to get caught up in leaderboards. You just can't quantify concepts like "most realistic image" or "most realistic language." Hell, that's a big reason why we invented RLHF.
I'm not sure how this is a rabbit hole. If your model specializes to a specific resizing algorithm then there is your opportunity to generalize the model over resizing algorithms. In fact, simply duplicating your data set by using multiple resizing algorithms should improve performance even if you only end up using a single one.
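A sketch of that suggestion: treat the resize backend itself as an augmentation, so the model never specialises to one interpolation kernel. The function name and the particular backends here are just illustrative choices.

```python
import random
import numpy as np
from PIL import Image
import torch
import torch.nn.functional as F

def random_resize(img_uint8, size=(224, 224)):
    """Resize an HWC uint8 image with a randomly chosen backend/kernel."""
    choice = random.choice(["pil_bilinear", "pil_bicubic", "torch_bilinear"])
    if choice == "pil_bilinear":
        return np.asarray(Image.fromarray(img_uint8).resize(size, Image.BILINEAR))
    if choice == "pil_bicubic":
        return np.asarray(Image.fromarray(img_uint8).resize(size, Image.BICUBIC))
    t = torch.from_numpy(img_uint8).permute(2, 0, 1).unsqueeze(0).float()
    t = F.interpolate(t, size=size, mode="bilinear", align_corners=False)
    return t.squeeze(0).permute(1, 2, 0).clamp(0, 255).byte().numpy()
```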
Sure, that's a reasonable idea, but it isn't as easy as "la de da, add augmentation". I see this effect in models that already use noise injection, btw. I should also add that certain resizing operations have caused my models to collapse rather than improve. It is not uncommon to see additional augmentation __HURT__ models rather than help them.
But I think what you're missing is that these effects are not well known. How many works do you see use PIL-based resizing? How many use noise injection? Any augmentation? Are people considering that CUDA FMA is different from CPU FMA? That you get different results on different chipsets? I have no idea about TPUs, but I bet you there are differences. The rabbit hole I'm describing is about more than the resizing operation; these effects exist everywhere.
What happens is that different works end up with different measurements and it becomes non-trivial to compare them. Remember, in the paper you only see the benchmarks, not the code, and you just have to trust them. But I can tell you that this particular mistake is the norm rather than the exception, and not in a consistent way. Say you write a generative model in TensorFlow and the same one in PyTorch: you get different results when measuring FID even if the model weights are identical.

It's rather reasonable for a researcher/user to believe that the underlying libraries are implementing things in the best way; that's the point of these libraries in the first place. It's also not unreasonable for users to think that these effects constitute noise and that an iterative converging algorithm will find a signal through that noise.

Remember, these are effects that can cause changes in leaderboard positions and thus cause your work to get rejected. It's the same work and has the same value, but if you only look at leaderboards you aren't comparing properly. It wouldn't have been a big deal 4-5 years ago when our models were much worse, but now that we're pushing the limits of any metric, these types of effects can dominate.
The point is that there are extremely subtle effects going on that any reasonable person would not assume are happening. I'm willing to bet that you didn't know these resizing differences existed until my previous comment. Brushing it off with the obviousness of hindsight is a terrible way to measure a priori understanding. It is the whole problem of not knowing what you don't know. We live in a specialized world; it is not expected that an ML researcher knows all the nuances of a specific library, CUDA versions, compiler versions, and so on. Similarly, it's not expected for programming-language people to even know ML. You'd have to do like 5 PhDs before getting anywhere!
The tl;dr is: just be cautious and take to heart that evaluation is hard and full of traps. A callous approach to evaluation, assuming everything is "obvious" or "simple," will quickly lead you into errors.
Hell you could at the very least download the game and hook into it with a cheat engine or something. These researchers are just playing online through a clicker bot.
That's a lot of hacks to avoid hacking into the .swf
I mean, damn, you'd think they'd come up with a better methodology than having a Python bot playing online, in real time, one instance at a time, with 50ms maximum speed, OCRing the screen to read the score.
A bit strange that they write that timing is essential, yet don't actually allow the GA to experiment with timing. They could have added to their representation a symbol for "no change". Surprised they didn't.
They actually did: one of the approaches they tested involved holding down combinations of keys for variable durations (the "Representation 4: Bitmask-Duration Sequence"), however they didn't get any viable runners using that approach. Their conclusion was that under the constraints they were operating with (i.e. only 1,000 fitness evaluations etc.), the solution space was too broad for that approach to find any viable strategies. The main takeaway for me was that this form of genetic algorithm works best & converges most quickly the simpler you can make the search space, as the sketch below suggests.
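For intuition, here is roughly what a bitmask-duration genome could look like: each gene is a 4-bit key mask over Q/W/O/P plus a hold time in frames. The field names and ranges are my assumptions, not the article's code.

```python
import random

GENOME_LEN = 40
MAX_HOLD = 30  # frames to hold a key combination (assumed range)

def random_gene():
    return (random.randrange(16), random.randint(1, MAX_HOLD))  # (QWOP bitmask, duration)

def random_genome():
    return [random_gene() for _ in range(GENOME_LEN)]

def mutate(genome, rate=0.05):
    out = []
    for mask, hold in genome:
        if random.random() < rate:
            mask = random.randrange(16)                                  # re-roll which keys are down
        if random.random() < rate:
            hold = max(1, min(MAX_HOLD, hold + random.randint(-5, 5)))   # nudge the timing
        out.append((mask, hold))
    return out

# Adding duration multiplies the search space by ~MAX_HOLD per gene, which is
# consistent with 1,000 fitness evaluations not being enough to explore it.
```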
https://peteshadbolt.co.uk/posts/ga/
Now back to life thanks to Ruffle.