Probably from multiple modalities, as well as from extending the sequence lookback length further and further.
They have low perplexity now, but the perplexity achievable when predicting the next word on page 365 of a book, while attending over the previous 364 pages, should be lower still and allow even more complexity to emerge.
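For anyone unsure what "perplexity" measures here, a minimal sketch: it is the exponential of the average negative log-probability the model assigns to the correct next tokens. The function name and the probability values below are made up purely for illustration; they are not measurements from any real model.

```python
import math

def perplexity(token_probs):
    """Return exp(mean negative log-probability) over the given tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities assigned to the same four correct next tokens.
short_context = [0.25, 0.25, 0.25, 0.25]  # assumed: model sees only a short window
long_context = [0.5, 0.5, 0.5, 0.5]       # assumed: model attends over the whole book

print(perplexity(short_context))
print(perplexity(long_context))
```

If a longer lookback lets the model put more probability on the right token, the second number comes out lower, which is the effect described above.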