I think it must be getting slammed; I was able to get a couple of descriptions out of it, but I hit the above error probably twice as often.
Rekognition released a similar image-to-text API, and it's much more reliable than this. At least the demo works smoothly and responds fast.
https://rekognition.com/demo/concept
Even leaving aside the reliability issue (which can be chalked up to the fact that this one is a demo of a non-commercial project that got overloaded), you're comparing two entirely different things.
For this image, the University of Toronto software generates sentences like "a cow is standing in the grass by a car", whereas Rekognition only produces a ranked list of categories ("sports_car", "car_wheel", etc.).
The errors are fascinating. "a cow and a car are looking at the camera." "a band plays a group of music [...]". You could almost call them metaphors instead of errors.
The demo is clearly designed for the small community of machine learning researchers to play around with, so they can better evaluate the authors' papers. The authors aren't selling a product and probably have a hard time justifying the use of a lot of computing resources to host the demo. Furthermore, the models are probably optimized for result quality, not speed.