Many people are criticising their backtest but I don't understand why. Their test data comes strictly after their training data, the split respects time ordering, and there's no overlap. They can't overfit to their test data. In any other area of ML this would be an acceptable scheme, so why is it unacceptable here?
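For concreteness, this is the kind of split I mean (a minimal sketch; the file name, column name and 80/20 cutoff are my own illustration, not from their paper):

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
prices = pd.read_csv("prices.csv", parse_dates=["timestamp"],
                     index_col="timestamp").sort_index()

# Chronological split: train on the first 80% of the history,
# test on the remaining 20%. No shuffling, no overlap.
cutoff = prices.index[int(len(prices) * 0.8)]
train = prices[prices.index < cutoff]
test = prices[prices.index >= cutoff]
```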
What people are complaining about is not the overfitting, but the unrealistic assumptions in the backtest. In the real world there is slippage, latency/jitter, special regimes around market open, hidden orders, market impact, front-running, variable fees, and all kinds of other complexities. Their assumed transaction costs are apparently also unrealistic. Sophisticated simulators used in professional trading firms can account for such things to some extent, but most academic papers conveniently ignore these complexities and just assume they can trade at whatever price the data tells them. It's completely unrealistic.
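To make the point concrete, here's a toy sketch of how even modest frictions can erase a small per-trade edge. Every number here (the 2 bp edge, the cost figures) is invented for illustration and has nothing to do with their actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strategy: 1,000 trades with a small average edge of 2 bps per trade.
gross_returns = rng.normal(loc=0.0002, scale=0.002, size=1000)

# Frictions a naive backtest ignores (illustrative magnitudes, not data):
half_spread = 0.0001   # 1 bp paid crossing the spread
fees = 0.00005         # 0.5 bp exchange/broker fees
slippage = 0.0001      # 1 bp average adverse fill vs. the printed price

net_returns = gross_returns - (half_spread + fees + slippage)

print(f"gross mean per trade: {gross_returns.mean():.6f}")
print(f"net mean per trade:   {net_returns.mean():.6f}")
# A 2 bp gross edge becomes roughly -0.5 bp net: the "profitable"
# strategy loses money once these frictions are applied.
```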
To answer your original question about overfitting: they can still overfit to test data by running a lot of experiments with different hyperparameters, architectures and subsets of the data, and only reporting what worked. There are also subtler ways that test data can leak into training data (see the book Advances in Financial Machine Learning for a good overview). You can already suspect this is happening just from the variance in their results and trades. They also don't compare against any baselines. It's entirely possible that the results are just noise and that they simply didn't report the experiments that failed. Of course, you can't prove this without an exact log of everything they ever did to the data. But again, that's not the main issue here.
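A quick way to see how "run many experiments, keep the winner" overfits the test set: simulate many strategies with zero true edge, evaluate all of them on the same test period, and report only the best. This is a pure simulation, not their code:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_strategies = 252, 200

# 200 strategies with zero true edge: daily returns are pure noise.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualised Sharpe ratio of each strategy over the same "test" year.
sharpes = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

print(f"best Sharpe out of {n_strategies} tries: {sharpes.max():.2f}")
# This typically prints a Sharpe well above 2, even though every
# strategy is random. Reporting only the winner overfits the test set.
```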
> they can still overfit to test data by running a lot of experiments with different hyperparameters, architectures and subsets of the data, and only reporting what worked.
But this is a different accusation from accidentally overfitting or leaking, i.e. it would mean that they were dishonest and cherry-picked their data in such a way that it hides the overfitting and leakage. This criticism can be levelled at every ML paper, but in this case they detail their architecture, provide the code, and provide a Jupyter notebook so people can try it themselves.
> just assume they can trade at whatever price the data tells them. It's completely unrealistic.
I think that's a fair assumption for highly liquid markets and relatively small trades, and if it's a fair assumption then all of your criticisms (slippage etc.) don't apply to an extent that would break the approach. Also, if the approach works, then trade size (fees aside) and being front-run won't matter either, because presumably large HFT firms could use it themselves.
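As a back-of-the-envelope check of that "liquid market, small size" claim, you can compare the claimed per-trade edge to the round-trip cost of trading. All numbers and the 5x rule of thumb below are hypothetical:

```python
# Rough heuristic: ignoring frictions is tolerable when the per-trade
# edge dwarfs the round-trip cost. All figures are illustrative.
half_spread_bps = 0.5      # e.g. a large-cap equity or major FX pair
fees_bps = 0.2             # exchange/broker fees per side
round_trip_cost_bps = 2 * (half_spread_bps + fees_bps)

edge_per_trade_bps = 10.0  # hypothetical edge claimed by a model

print(f"round-trip cost: {round_trip_cost_bps:.1f} bps")
print(f"claimed edge:    {edge_per_trade_bps:.1f} bps")
if edge_per_trade_bps > 5 * round_trip_cost_bps:  # arbitrary 5x margin
    print("frictionless-fill assumption is roughly tolerable")
else:
    print("frictions are doing a lot of work in this backtest")
```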
Overall I think your criticisms are valid, but imo they don't invalidate a promising approach; they're just the next thing to test.