I've just been reading through your methodology in the first part of your article (dealing with prose works like Twain, Austen, etc.), where you mention that you strip off the beginnings of Project Gutenberg books, which contain boilerplate.
I'd like to suggest that you also strip the ends of the books, as they contain boilerplate too. In addition, I'd suggest stripping out introductions. The early results you got from Shakespeare sounded like they may have been drawn partly from the introductions, which weren't written by Shakespeare at all, but by much later authors.
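For what it's worth, here is a rough sketch of the kind of stripping I have in mind (Python is assumed, along with the usual "*** START OF ..." / "*** END OF ..." marker lines that Project Gutenberg puts around the actual text; the regexes are illustrative and may need adjusting for older files):

```python
import re

# Illustrative helper: cut away the Project Gutenberg header and footer,
# assuming the standard "*** START OF ..." / "*** END OF ..." marker lines.
START_RE = re.compile(r"\*\*\*\s*START OF (THIS|THE) PROJECT GUTENBERG EBOOK.*")
END_RE   = re.compile(r"\*\*\*\s*END OF (THIS|THE) PROJECT GUTENBERG EBOOK.*")

def strip_gutenberg_boilerplate(text):
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin_idx = start.end() if start else 0      # keep everything if no marker found
    end_idx = end.start() if end else len(text)
    return text[begin_idx:end_idx].strip()
```

Stripping introductions is harder to automate, since they aren't delimited consistently; a crude heuristic is to drop everything before the first chapter or act heading, but that probably needs eyeballing per book.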
I also noticed that you ran out of memory at one point and reduced the neuron count as a result. You might want to consider doing some quick runs on AWS (or one of its competitors), where you can get plenty of memory (and faster machines as well). That way you won't have to compromise your NN architecture for lack of resources.
Something else to consider is using other optimization techniques, such as genetic algorithms (GA) or genetic programming (GP), to optimize the NN architecture or NN parameters, and perhaps having multiple NNs vote on the results. Such metaheuristic and ensemble techniques have shown promising results.
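As a purely illustrative example of the GA idea, something like the following could evolve a few hyperparameters, where `train_and_score` is a hypothetical (and expensive) function that trains a network with the given settings and returns its validation loss:

```python
import random

# Toy genetic algorithm over a small hyperparameter search space.
# `train_and_score(individual)` is a hypothetical function: it trains a network
# with the given settings and returns the validation loss (lower is better).
SEARCH_SPACE = {
    "hidden_size":   [256, 512, 1024],
    "num_layers":    [1, 2, 3],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def random_individual():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(individual):
    child = dict(individual)
    key = random.choice(list(SEARCH_SPACE))       # re-roll one hyperparameter
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def evolve(train_and_score, generations=10, population_size=8):
    population = [random_individual() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_score)  # fittest (lowest loss) first
        parents = ranked[: population_size // 2]          # truncation selection
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children
    return min(population, key=train_and_score)
```

The ensemble/voting idea is even simpler: train several networks independently (possibly the survivors of the GA) and average their predicted probabilities at test time.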
Yet another thing to consider is a technique called Dynamic Subset Selection, which effectively trains on the most difficult portions of the training data. I have not used this technique with NNs, but it has worked well with GP, and it saves a lot of time.
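The core idea translates naturally into loss-weighted minibatch sampling; here is a loose sketch (numpy assumed; `per_example_losses` is a hypothetical array holding the current loss for every training example, and real DSS also weights by how long an example has gone unselected, which this omits):

```python
import numpy as np

# Loose sketch of the Dynamic Subset Selection idea adapted to minibatching:
# sample the next training subset in proportion to each example's current loss,
# so the hardest examples are revisited most often.
def select_hard_subset(per_example_losses, subset_size, rng=None):
    rng = rng or np.random.default_rng()
    weights = per_example_losses / per_example_losses.sum()
    return rng.choice(len(per_example_losses), size=subset_size,
                      replace=False, p=weights)
```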
If I were to revisit those specific experiments, I wouldn't use AWS, as it is very expensive. Fortunately, I now have my own workstation with 2×1080 Ti GPUs (which have ~5× more VRAM than the mobile GPU I was using at the time, IIRC).
There are a lot of hyperparameter optimization methods, but hyperparameter optimization is only worthwhile if you can afford many runs, and it usually delivers relatively small gains compared to scaling up your model/dataset. Right now, it seems like a better approach to continue scaling up the Transformer and/or switch to Transformer-XL than to attempt hyperparameter tuning of GPT-2-small finetuning.
And there is so much that could be done. I have a laundry list here: https://www.gwern.net/RNN-metadata#improvements