The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.
The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted proto-typically? Could try with a different, less similar sentence