If I were to revisit those specific experiments, I wouldn't use AWS, as it is very expensive. Fortunately, I now have my own workstation with 2x1080 Tis (which, IIRC, have ~5x more VRAM than the mobile GPU I was using at the time).
There are a lot of hyperparameter optimization (HO) methods, but HO is only worthwhile if you can afford many runs, and it usually delivers relatively small gains compared to scaling up your model/dataset. Right now, it seems like a better approach to keep scaling up the Transformer and/or switch to Transformer-XL than to attempt hyperparameter tuning of GPT-2-small finetuning.
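To make the cost concrete, here is a minimal sketch of the simplest HO method, random search; `finetune_and_eval` is a hypothetical stand-in (not a real API) for a single GPT-2-small finetuning run returning validation loss. The point is that each trial is a full training run, which is why HO only pays off if you can afford many of them:

```python
import random

def finetune_and_eval(learning_rate: float, batch_size: int, warmup_steps: int) -> float:
    """Hypothetical stand-in: run one finetuning job and return validation loss."""
    # Replace with a real train/eval call; a dummy score keeps the sketch runnable.
    return random.random()

# Assumed search space for illustration only.
search_space = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size":    [1, 2, 4, 8],
    "warmup_steps":  [0, 100, 1000],
}

best = None
for trial in range(20):  # 20 trials = 20 full finetuning runs, already expensive
    config = {k: random.choice(v) for k, v in search_space.items()}
    loss = finetune_and_eval(**config)
    if best is None or loss < best[0]:
        best = (loss, config)

print("best validation loss:", best[0], "with config:", best[1])
```

Fancier methods (Bayesian optimization, Hyperband, PBT) reduce the number of trials needed, but they don't change the basic economics: dozens of full runs for a modest improvement, versus spending the same compute on a bigger model or more data.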