Are there any papers benchmarking a transformer NN architecture in comparison to something like a pointer-generator network? I'm doing a bit of work in this area (i.e. reimplementing papers), and I'm curious if GPT2-like models can derive greater semantic meaning.