It's quite unreasonable. He could have optimized it more for fooling humans in gibberish generation, but that would not show the general effectiveness of the approach. The power shows (quantifiably) in compression: 1.57 bits per character on Wikipedia is quite hard to beat. Of course, Markov chains are essentially universal models, so the training algorithm is the crucial distinction.
I believe Markov chains as a model quickly become inefficient (especially memory-wise) as you increase the complexity (long-range correlations) of your predictions. It's an unnecessarily restrictive model for high-complexity behavior that state-of-the-art RNNs skip entirely.
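A rough back-of-the-envelope illustration (my numbers, not from the post): an order-n character table has to reserve an entry for every possible context, so its size explodes with context length, while an RNN carries a fixed-size hidden state no matter how much history it summarizes.

    # Rough illustration (made-up alphabet size): how many entries an order-n
    # character Markov table needs, versus an RNN's fixed-size hidden state.
    ALPHABET = 96  # roughly the printable ASCII characters

    def markov_table_entries(order):
        # one row per possible context of `order` characters,
        # one column per possible next character
        return ALPHABET ** order * ALPHABET

    for order in range(1, 6):
        print(f"order {order}: {markov_table_entries(order):,} entries")
    # order 5 is already ~7.8e11 entries, and most real long-range
    # dependencies in text need far more than 5 characters of context.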
There's very little difference between a contextual predictive model like this and the guts of a compressor.
If your prediction is good enough that you can always come up with two possible predictions for each character, each of which has a 50% chance of being correct, then obviously you can compress your input down to one bit per character by storing just enough information to tell you which choice to pick. More generally, you can use arithmetic coding to do the same thing with an arbitrary set of letter probabilities, which is exactly what you get as the output of a neural network.
When the blog post says the model achieved a performance of "1.57 bits per character", that's just another way of saying "if we used the neural network as a compressor, this is how well it would perform."
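A toy calculation making that equivalence concrete (the probabilities here are made up, not the model's): an ideal arithmetic coder spends -log2(p) bits on a character the model assigned probability p, so averaging that over the text gives exactly the bits-per-character figure.

    import math

    # Made-up example: the probability the model assigned to each character
    # that actually occurred in some stretch of text.
    probs = [0.5, 0.25, 0.9, 0.1, 0.5]

    # An ideal arithmetic coder charges -log2(p) bits for a symbol of probability p.
    bits = [-math.log2(p) for p in probs]
    print(sum(bits) / len(bits))  # ~1.49 bits per character for this toy sequence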
It's a compression of Wikipedia in the sense that the NN generates probability estimates for the next character given the previous ones; the gibberish is simply the result of greedily asking the NN, over and over, what the most likely next character is. However, plug it into an arithmetic coder and start feeding in an actual Wikipedia corpus, and hey presto: a pretty high-performance Wikipedia compressor, one that works well on Wikipedia text but not so well on other texts (like this one, with its lack of brackets).
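For the curious, here's a bare-bones sketch of that plumbing, with a dummy uniform model standing in for the NN; a real arithmetic coder emits bits incrementally and renormalizes the interval to keep finite precision, which this deliberately skips.

    import math

    # Shrink a real-number interval according to the model's probability for
    # each observed character; the narrower the final interval, the more bits
    # it takes to name a point inside it.
    def encoded_bits(text, model):
        low, high = 0.0, 1.0
        for i, ch in enumerate(text):
            probs = model(text[:i])              # P(next char | characters so far)
            width = high - low
            cum = 0.0
            for c in sorted(probs):              # carve [low, high) into sub-intervals
                if c == ch:
                    low, high = low + width * cum, low + width * (cum + probs[c])
                    break
                cum += probs[c]
        return math.ceil(-math.log2(high - low))

    # Dummy stand-in for the NN: a uniform distribution over a four-letter alphabet.
    uniform = lambda context: {c: 0.25 for c in "abcd"}
    print(encoded_bits("abca", uniform))         # 8 bits, i.e. 2 bits/char for 4 equiprobable symbols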