Prior generations usually take fewer steps than vanilla SDXL to reach the same quality.
But yeah, the inference speed improvement is mediocre (at least until I take a look at exactly what computation is performed, so I can form a more informed opinion on whether it's an implementation issue or a model issue).
The prompt alignment should be better, though. It looks like the model has more parameters dedicated to text conditioning.
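For what it's worth, here's a minimal sketch of how one might profile what the pipeline actually computes, assuming a diffusers-style pipeline (the model id and step count here are placeholders; the same harness applies to whichever model is under test):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from diffusers import StableDiffusionXLPipeline  # assumption: diffusers-style pipeline

# Placeholder model id; swap in whichever pipeline is being compared against SDXL.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    pipe("a photo of a cat", num_inference_steps=20)

# Sort by CUDA time to see which ops dominate the denoising loop.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```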
In my observation, it yields amazing perf at higher batch sizes (4, or better yet, 8). I assume this is due to memory bandwidth and the constrained latent space helping.
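A quick way to check that per-image throughput claim at different batch sizes (a sketch assuming a diffusers-style pipeline; the model id, prompt, and step count are placeholders):

```python
import time

import torch
from diffusers import StableDiffusionXLPipeline  # assumption: diffusers-style pipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
for batch in (1, 2, 4, 8):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe([prompt] * batch, num_inference_steps=20)
    torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    print(f"batch={batch}: {elapsed:.1f}s total, {elapsed / batch:.2f}s per image")
```

If per-image time keeps dropping as batch size grows, the workload is memory-bandwidth-bound rather than compute-bound, which would fit the constrained-latent-space explanation.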