When you talk about mapping patterns of input pixels to discrete classes, I don't think that's entirely what people do. We have the ability to make "distributed representations" of concepts, e.g. word2vec, GloVe, etc., which capture the idea that a sheep is pretty similar to a dog. The classes are far from discrete.
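You can see the "sheep is pretty similar to a dog" effect yourself with off-the-shelf GloVe vectors, e.g. via gensim (the first call downloads the pretrained vectors; the exact similarity numbers will depend on which vector set you load):

    import gensim.downloader

    # 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
    glove = gensim.downloader.load("glove-wiki-gigaword-50")

    print(glove.similarity("sheep", "dog"))  # fairly high cosine similarity
    print(glove.similarity("sheep", "car"))  # much lower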
I'm pretty sure people train image recognizers to output these representations, which can encode states like 51% certainty dog, 48% certainty sheep, 1% other; if you aren't sure, take the best choice.
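A minimal sketch of what I mean, assuming the recognizer's last layer has been trained to regress onto the class's word vector rather than a one-hot label. Everything here is made up for illustration (the temperature of 10, the fake embedding standing in for a network's output), but it shows how an ambiguous embedding naturally turns into "51% dog, 48% sheep" style certainties:

    import numpy as np
    import gensim.downloader

    glove = gensim.downloader.load("glove-wiki-gigaword-50")
    class_names = ["dog", "sheep", "cat"]
    class_vecs = np.stack([glove[n] for n in class_names])           # (3, 50)
    class_vecs /= np.linalg.norm(class_vecs, axis=1, keepdims=True)  # unit-normalize

    def classify(embedding):
        """Turn a predicted 50-d image embedding into per-class certainties."""
        v = embedding / np.linalg.norm(embedding)
        sims = class_vecs @ v                 # cosine similarity to each class vector
        probs = np.exp(10 * sims)             # softmax over similarities,
        probs /= probs.sum()                  # temperature picked arbitrarily
        return dict(zip(class_names, probs))

    # Stand-in for a network output: a noisy embedding halfway between
    # "dog" and "sheep" (a real model would produce this from the image).
    fake = 0.5 * (glove["dog"] + glove["sheep"]) + 0.05 * np.random.randn(50)
    print(classify(fake))  # roughly equal dog/sheep certainty, cat near zero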
It's such an intuitive idea to combine these things that if it hasn't happened in the literature yet (I only looked for a minute), it's because 1000 people have tried it, failed to improve the state of the art, and didn't publish.
On the other hand, we're generally pretty bad at inferring and generalizing 3D structure from images, so I tend to blame that.