People rarely train on billions of images; we're usually around the scale of ~1 million. This already works quite well in many respects. A back-of-the-envelope calculation assuming about 10fps vision gives ~1B images by the age of 5. And humans aren't necessarily starting from scratch the way our machine learning systems do.
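A rough sanity check of that back-of-the-envelope number (the ~12 waking hours per day is my assumption, not something stated above):

```python
# Rough sanity check of the ~1B-images-by-age-5 estimate.
# Assumptions (mine): 10 frames/second of visual input, ~12 waking hours/day.
fps = 10
waking_hours_per_day = 12
seconds_awake_per_year = waking_hours_per_day * 3600 * 365
frames_by_age_5 = fps * seconds_awake_per_year * 5
print(f"{frames_by_age_5:,}")  # 788,400,000 -> on the order of 1B frames
```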
It's not clear that people can compute what an object would look like from different viewing angles; even if they could, it's not clear you would want to in an application, and even if you did, there's quite a bit of work on this (e.g. many related papers here http://www.arxiv-sanity.com/1511.06702v1). At least so far I'm not aware of convincing results suggesting that doing so improves recognition performance (which in most applications is what people care about).
That's not a fair comparison. By that time the toddler can also ask questions, generate new labels using adjectives, label novel instances as compositions of previously acquired knowledge, and generate sentences representing complex internal states. They are not limited to observed labels. In fact there is very little supervised learning in the form of [item, label, loss]. Beyond that, with enough stimulation and simply from interacting with each other, children can even spontaneously generate languages with complex grammar, without labeled supervision.
They'd also have gained the ability to do very (seriously) difficult things like walking, climbing objects, and picking things up and throwing them, along with the rudiments of folk physics. They'd have some rudimentary ability to model other agents.
It's good to be happy with current progress, and I do not suffer from the AI effect, but being too lenient can hamper creativity and impede progress by obscuring limitations.
Even if we assume that a 5-year-old has seen 1000-1500 pictures of, say, cats in their lifetime, that is still far fewer than the number of images required to train a CNN to label them as accurately as a human can.
And of course, I am not talking about just viewing angles. There are several other factors; I only mentioned the ones I could think of.
A human is very good at one-shot learning, but CNNs are actually not too terrible either (this is also an active area of research, e.g. see http://www.arxiv-sanity.com/1603.05106v2). A human might take advantage of good initialization while CNNs start from scratch. A human might have ~1B images by age 5 (CNNs get ~1M) of continuous RGBD video, possibly taking advantage of active learning (while CNNs see disconnected samples, which has its pros and cons, mostly cons). i.e. CNNs are disadvantaged in several respects but still do quite well.
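As a rough illustration of what "not too terrible" one-shot behavior can look like (a generic baseline sketch, not the method in the linked paper): nearest-neighbor matching on CNN embeddings, where `embed` here is only a stand-in for a pretrained feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)
_proj = rng.standard_normal((64, 32 * 32 * 3))  # toy stand-in for a trained network

def embed(image):
    """Stand-in for a pretrained CNN feature extractor (hypothetical).
    Here it's a fixed random projection of the flattened pixels so the sketch
    runs; in practice you'd use e.g. a network's penultimate-layer features."""
    return _proj @ np.asarray(image, dtype=np.float64).ravel()

def one_shot_classify(query_image, support_images, support_labels):
    """Label a query image given one example per class, by cosine similarity
    between embeddings (a common one-shot baseline)."""
    q = embed(query_image)
    q /= np.linalg.norm(q)
    best_label, best_sim = None, -np.inf
    for img, label in zip(support_images, support_labels):
        s = embed(img)
        s /= np.linalg.norm(s)
        sim = float(q @ s)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Usage: one 32x32 RGB example per class, then classify a new image.
cat, dog, query = (rng.random((32, 32, 3)) for _ in range(3))
print(one_shot_classify(query, [cat, dog], ["cat", "dog"]))
```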
It depends on what we mean by vision. Crows, for example, can do the sort of low-level things CNNs are capable of. But for full visual comprehension, they are actively making predictions about physics from a probabilistic world model (learned in part from causal interventions) that feeds back into perception.
I've not yet looked carefully into it, but I expect that sort of feedback should drastically reduce the amount of required raw data. Machines might not (at first) get to build predictive models from interactions, but even our best approaches to transfer and multi-task learning are very constrained compared to the free-form multimodal integrative learning a parrot is capable of, with very little energy spent.
This is good, it means there are still a lot of exciting things left to work out.
And that points to some advantages we still need to do a lot of work on. Mammals and birds are able to learn online from a few examples per instance, adapt to shifts in the underlying distributions relatively quickly, and do so unsupervised.
This is the right perspective. It seems the OP believes actual photographs are privileged in some way. In reality, any visual input from our eyes counts as training data, as you said.
You seem to forget that the photos are labelled, which counts as supervised learning. What we humans excel at is unsupervised learning, which is difficult for machines. But yes, I agree that humans have the advantage of continuous video access.
Disclaimer: I don't really know what I'm talking about, this is just what I think:
Since we know that things usually don't change instantly (for example, a cat won't suddenly change into a dog), if we assume 10fps vision, 1500 pictures of cats would mean looking at a cat for 2 and a half minutes total in 5 years. And since we know cats won't change into something else, if we see a cat walking somewhere, we'll still know it's a cat, giving us the labels we need for the training.
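In other words, temporal continuity turns one label into many. A toy sketch of that idea (my own illustration, not a real training pipeline), where a single label applies to every frame of the continuous clip it came from:

```python
def propagate_label(clip_frames, label):
    """Toy version of the idea above: objects don't change identity between
    adjacent frames, so one label for a continuous clip labels every frame."""
    return [(frame, label) for frame in clip_frames]

# 2.5 minutes of watching a cat at 10 fps -> 1500 labeled frames "for free"
clip = [f"frame_{i}" for i in range(10 * 150)]
labeled = propagate_label(clip, "cat")
print(len(labeled))  # 1500
```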
I think that if we assume 30fps (which still seems kind of low), and we assume that the human looks at a cat for 15 minutes (which still isn't much), that's already 27,000 pictures.
But I argue that those are pictures of the same cat, and most of the images will be very, very similar. Of those thousands of pictures, only a few noticeably different ones would matter.
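To make that concrete (a sketch of my own with an arbitrary threshold, not anything from the thread): greedily keep only the frames that differ noticeably from the frames already kept, and thousands of near-identical frames collapse to a handful.

```python
import numpy as np

def keep_distinct_frames(frames, threshold=0.1):
    """Greedily keep frames that differ noticeably from every frame kept so far.
    `frames`: array of shape (n, H, W, C) with values in [0, 1].
    The mean-absolute-difference threshold of 0.1 is arbitrary."""
    kept = []
    for frame in frames:
        if all(np.mean(np.abs(frame - k)) >= threshold for k in kept):
            kept.append(frame)
    return kept

# Toy stand-in: 100 nearly identical "frames" of the same cat.
frames = 0.5 + 0.01 * np.random.default_rng(0).random((100, 8, 8, 3))
print(len(keep_distinct_frames(frames)))  # 1 -- near-duplicates add little
```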