I wouldn't look for hidden reasons. Recent image generators are already very good at generating faces (thanks to CelebA-like datasets and the work of early researchers), so the emphasis has shifted to the model's multimodality within a domain. Almost every picture there demonstrates some aspect of it: some contain legible text (older models used to output gibberish instead of letters), others make humorous references to older images (for example, a cosmonaut on a pig).