People rarely train on billions of images; we're usually around the scale of ~1 million. This already works quite well in many respects. A back-of-the-envelope calculation assuming about 10fps vision gives ~1B images by the age of 5. And humans aren't necessarily starting from scratch the way our machine learning systems do.
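A rough sanity check of that back-of-the-envelope number (the ~12 waking hours per day is my assumption, not something stated above):

```python
# Rough sanity check of the ~1B-images-by-age-5 estimate.
# Assumptions (mine): 10 frames/second of visual input, ~12 waking hours/day.
fps = 10
waking_hours_per_day = 12
seconds_awake_per_year = waking_hours_per_day * 3600 * 365
frames_by_age_5 = fps * seconds_awake_per_year * 5
print(f"{frames_by_age_5:,}")  # 788,400,000 -> on the order of 1B frames
```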
It's not clear that people can compute what an object would look like from different viewing angles; even if they could, it's not clear you would want to in an application, and even if you did, there's quite a bit of work on this (e.g. many related papers here http://www.arxiv-sanity.com/1511.06702v1). At least so far I'm not aware of convincing results suggesting that doing so improves recognition performance (which in most applications is what people care about).
That's not a fair comparison. By that time the toddler can also ask questions, generate new labels using adjectives, label novel instances as compositions of previously acquired knowledge, and generate sentences representing complex internal states. They are not limited to observed labels. In fact there is very little supervised learning in the form of [item, label, loss]. Beyond that, with enough stimulation and simply from interacting with each other, children can even spontaneously generate languages with complex grammar, without labeled supervision.
They'd also have gained the ability to do very (seriously) difficult things like walking, climbing objects, and picking things up and throwing them, along with the rudiments of folk physics. They'd have some rudimentary ability to model other agents.
It's good to be happy with current progress, and I do not suffer from the AI effect, but being too lenient can hamper creativity and impede progress by obscuring limitations.
Even if we assume that a 5-year-old has seen 1000-1500 pictures of, say, cats in their lifetime, that is still far fewer than the number of images required to train a CNN to label them as accurately as a human can.
And of course, I am not talking about just viewing angles. There are several other factors; I only mentioned the ones I could think of.
A human is very good at one-shot learning, but CNNs are actually not too terrible either (this is also an active area of research, e.g. see http://www.arxiv-sanity.com/1603.05106v2). A human might take advantage of good initialization while CNNs start from scratch. A human might have ~1B images by age 5 (CNNs get ~1M) of continuous RGBD video, possibly taking advantage of active learning (while CNNs see disconnected samples, which has its pros and cons, mostly cons). i.e. CNNs are disadvantaged in several respects but still do quite well.
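As a rough illustration of what "not too terrible" one-shot behavior can look like (a generic baseline sketch, not the method in the linked paper): nearest-neighbor matching on CNN embeddings, where `embed` here is only a stand-in for a pretrained feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)
_proj = rng.standard_normal((64, 32 * 32 * 3))  # toy stand-in for a trained network

def embed(image):
    """Stand-in for a pretrained CNN feature extractor (hypothetical).
    Here it's a fixed random projection of the flattened pixels so the sketch
    runs; in practice you'd use e.g. a network's penultimate-layer features."""
    return _proj @ np.asarray(image, dtype=np.float64).ravel()

def one_shot_classify(query_image, support_images, support_labels):
    """Label a query image given one example per class, by cosine similarity
    between embeddings (a common one-shot baseline)."""
    q = embed(query_image)
    q /= np.linalg.norm(q)
    best_label, best_sim = None, -np.inf
    for img, label in zip(support_images, support_labels):
        s = embed(img)
        s /= np.linalg.norm(s)
        sim = float(q @ s)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Usage: one 32x32 RGB example per class, then classify a new image.
cat, dog, query = (rng.random((32, 32, 3)) for _ in range(3))
print(one_shot_classify(query, [cat, dog], ["cat", "dog"]))
```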
It depends on what we mean by vision. Crows, for example, can do the sort of low-level things CNNs are capable of. But for full visual comprehension, they are actively making predictions about physics from a probabilistic world model (learned in part from causal interventions) that feeds back into perception.
I've not yet looked carefully into it, but I expect that sort of feedback should drastically reduce the amount of required raw data. Machines might not (at first) get to build predictive models from interactions, but even our best approaches to transfer and multi-task learning are very constrained compared to the free-form multimodal integrative learning a parrot is capable of, with very little energy spent.
This is good, it means there are still a lot of exciting things left to work out.
And that points to some advantages we still need to do a lot of work on. Mammals and birds are able to learn online from a few examples per instance, adapt to shifts in the underlying distributions relatively quickly, and do so unsupervised.
This is the right perspective. It seems the OP believes actual photographs are privileged in some way. In reality, any visual input from our eyes counts as training data, as you said.
You seem to forget that the photos are labelled, which counts as supervised learning. What we humans excel at is unsupervised learning, which is difficult for machines. But yes, I agree that humans have the advantage of continuous video access.
Disclaimer: I don't really know what I'm talking about, this is just what I think:
Since we know that things usually don't change instantly (for example, a cat won't suddenly change into a dog), if we assume 10fps vision, 1500 pictures of cats would mean looking at a cat for 2 and a half minutes total in 5 years. And since we know cats won't change into something else, if we see a cat walking somewhere, we'll still know it's a cat, giving us the labels we need for the training.
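In other words, temporal continuity turns one label into many. A toy sketch of that idea (my own illustration, not a real training pipeline), where a single label applies to every frame of the continuous clip it came from:

```python
def propagate_label(clip_frames, label):
    """Toy version of the idea above: objects don't change identity between
    adjacent frames, so one label for a continuous clip labels every frame."""
    return [(frame, label) for frame in clip_frames]

# 2.5 minutes of watching a cat at 10 fps -> 1500 labeled frames "for free"
clip = [f"frame_{i}" for i in range(10 * 150)]
labeled = propagate_label(clip, "cat")
print(len(labeled))  # 1500
```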
I think that if we assume 30fps (which still seems kind of low), and we assume that the human looks at a cat for 15 minutes (which still isn't much), that's already 27,000 pictures.
But I argue that those are pictures of the same cat, and most of the images will be very, very similar. Of those thousands of pictures, only a few noticeably different ones would matter.
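To make that concrete (a sketch of my own with an arbitrary threshold, not anything from the thread): greedily keep only the frames that differ noticeably from the frames already kept, and thousands of near-identical frames collapse to a handful.

```python
import numpy as np

def keep_distinct_frames(frames, threshold=0.1):
    """Greedily keep frames that differ noticeably from every frame kept so far.
    `frames`: array of shape (n, H, W, C) with values in [0, 1].
    The mean-absolute-difference threshold of 0.1 is arbitrary."""
    kept = []
    for frame in frames:
        if all(np.mean(np.abs(frame - k)) >= threshold for k in kept):
            kept.append(frame)
    return kept

# Toy stand-in: 100 nearly identical "frames" of the same cat.
frames = 0.5 + 0.01 * np.random.default_rng(0).random((100, 8, 8, 3))
print(len(keep_distinct_frames(frames)))  # 1 -- near-duplicates add little
```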