Learning to Predict Depth on the Pixel 3 Phones (googleblog.com)
130 points by SirVeza on Dec 1, 2018 | 19 comments



Non-linear combination of PDAF information, along with the contents of the images themselves, is an interesting demonstration of the power of neural networks. The general concept of non-linear combination is likely how our brains work too.
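
Just to make the "non-linear combination" part concrete, here's a toy sketch (PyTorch; every layer size is made up by me and this has nothing to do with Google's actual model): the network simply sees the RGB frame and the two PDAF half-images stacked together, and the non-linearities let it mix the two cues however it likes.

  import torch
  import torch.nn as nn

  class ToyDepthNet(nn.Module):
      def __init__(self):
          super().__init__()
          # 3 RGB channels + 2 PDAF half-images stacked as extra input channels
          self.encoder = nn.Sequential(
              nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
          )
          self.head = nn.Conv2d(64, 1, 1)  # one depth value per pixel

      def forward(self, rgb, pdaf):
          x = torch.cat([rgb, pdaf], dim=1)  # fuse both cues early
          return self.head(self.encoder(x))  # ReLUs make the combination non-linear

  net = ToyDepthNet()
  depth = net(torch.rand(1, 3, 64, 64), torch.rand(1, 2, 64, 64))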

Given that they built just one testing rig, and they use information like "commonly known object sizes" for depth, I wonder if they are using any transfer learning at all? I imagine whoever is bringing that rig around can't look at all objects, and transferring the knowledge of "common object sizes" and even what an object is can be hugely valuable for such a project.

I bet this is also a sneak peek into how powerful Waymo's self-driving cars can be. Not only do such cars have access to human data (sights, sounds), they can have way more sensors whose outputs a neural network can combine to predict things way better than humans. For example, a NN could infer the chance a pedestrian might veer from the sidewalk into the street by analyzing:

- The person's psychological state, using a high-res camera on the face.

- Using chromographs to see if the pedestrian is drunk.

- Infrared cameras to see the person's temperature / if she's excited, etc.

- Limb modeling and likely walking paths.


I didn't realize that the word "bokeh" is fairly new to English-language discussion of photography -- only since about 1998 ( https://en.wikipedia.org/wiki/Bokeh ).

I had this series of thoughts about smartphone camera "portrait mode" which were like this:

(1) Ha, pretty funny that we're spending all this human effort and R&D time to try to realistically simulate what is essentially a defect in camera lenses: a depth of field too shallow to accurately represent everything in front of them.

(2) But wait, isn't that snobbish and a little shortsighted? (ha ha). What is a camera for anyway: is it for making art, or for faithfully recording what it sees, or something else? There are whole academic disciplines about this, hmm, okay.

(3) Well, no, I mean, a photo which has been selectively blurred like this has _less information_ than we started with, so it's objectively telling us less, and that's why it's funny, that we are spending so much time removing information.

(4) But maybe that's just the same as editing. Surely a good editor of texts takes a lot of information and removes information to make a better story; in the same way surely a "portrait mode" is really about figuring out what the most important parts are of a picture and selectively obscuring the rest, so that the 'good stuff' can shine through. That's important and hard; with an 85mm lens at f1.8 you'd do it by controlling focus carefully, but doing it after the fact is definitely interesting.

So okay, the reason this feature is interesting (and hard) is that software is trying to guess what parts of a picture are noise that should be gently deemphasized.

Now, it seems like the best way that software companies can think of to deemphasize unimportant parts of a picture is to mimic what a physical camera does with a portrait lens -- identify the parts not on the chosen focal plane, pretend those parts were out of focus, and blur the areas that "should have been" out of focus.
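
To make that concrete, a crude sketch of the depth-based blur idea (NumPy/SciPy; this is my own toy approximation, not the actual pipeline -- a real implementation would presumably use nicer blur kernels and handle occlusion edges much more carefully):

  import numpy as np
  from scipy.ndimage import gaussian_filter

  def fake_bokeh(image, depth, focal_depth, max_sigma=8.0, n_layers=6):
      """image: HxWx3 float array; depth: HxW, same units as focal_depth."""
      out = image.copy()
      edges = np.linspace(depth.min(), depth.max() + 1e-6, n_layers + 1)
      spread = edges[-1] - edges[0]
      for lo, hi in zip(edges[:-1], edges[1:]):
          mask = (depth >= lo) & (depth < hi)        # one depth slice at a time
          # blur radius grows with distance from the chosen focal plane
          sigma = max_sigma * abs((lo + hi) / 2 - focal_depth) / spread
          blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))
          out[mask] = blurred[mask]
      return out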

Now that they have shipped products that do that I'd be interested to see what the folks at Apple & Android do next. Do their teams still exist? Have they thought about how they would emphasize the important parts of a portrait if they weren't required to simulate the mechanical effect of a camera lens? What else could they do, maybe a little more creatively, to make portraits shine?

(And have they reinvented anything that Photoshop plugin makers have been doing for a decade?)


It's important to keep in mind that while this seems to be approached as "simulating" effects traditionally achieved with the mechanics of a camera, it has a basis in fundamental optics. Creating a sense of focus in a portrait with a camera lens is analogous to how the optics in our eyes work. Thus, this particular effect is more natural than some of the more Photoshop-esque techniques.

A good example of "what else" might be the color pop feature in the Google Photos app. It uses the depth information to selectively decolor the "out of focus" portions of the photo: https://www.androidpolice.com/2018/05/11/google-photos-color...
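
Roughly, an effect like that boils down to something like this (my own toy version, assuming you already have a subject mask from the depth/segmentation step -- not how the Photos app actually implements it):

  import numpy as np

  def color_pop(image, subject_mask):
      """image: HxWx3 float in [0, 1]; subject_mask: HxW float in [0, 1]."""
      grey = image.mean(axis=2, keepdims=True).repeat(3, axis=2)
      m = subject_mask[..., None]             # broadcast the mask over channels
      return m * image + (1.0 - m) * grey     # colour subject, greyscale background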


Yeah that's a good point. Until a month ago when I got an eye exam and glasses, I didn't really understand that the way a telephoto lens worked at wide apertures was very similar to the way the eye was supposed to work at _most_ distances.

(I was familiar with focusing on one thing and other things being out of focus, but with lens corrections this now works much better at 10+ meter distances, which is something I really didn't appreciate until I tried it.)


All good points. One point you touched on that I'd like to highlight and tie back to your earlier point about removing information is that traditional cameras are bending light rays along paths that smaller phone lenses simply can't, and then they trivially record the information about how light behaved in that moment.

To simulate this given only a 2D projection of some light at a certain point means trying to reconstruct the scene, applying (probably flawed, and certainly imprecise) physical models of how we understand light to behave, and then guessing how different the image would have turned out in the presence of a theoretical lens affecting the incoming rays.

In other words, there IS missing information, a LOT of it. And approximating that information to any degree of accuracy after the fact is intractable, if not impossible. So the fact that we can train a machine to generate a result this believable is just...unfathomably amazing. A human brain, with its inherent understanding of how the physical world behaves, really is the only other thing that could feasibly accomplish a remotely similar task.


> Now that they have shipped products that do that I'd be interested to see what the folks at Apple & Android do next.

Have you seen Night Sight on Pixel phones? It’s pretty amazing.


> What else could they do, maybe a little more creatively, to make portraits shine?

Neural nets can also colorise or even transform photos into the style of paintings. They can pick the best image out of a series of shots. They can remove noise, watermarks and even reconstitute arbitrary missing pieces of an image. They can remove whole sections of an image, bringing far-away objects closer or shortening empty spaces between objects, in a seamless fashion. They can imagine things that are not in the training set (GANs). They can take your face and transfer it (animate it) onto another person in a video. They can turn summer into winter and vice-versa, in a realistic way.

Really, it's an explosion of things we can do now that were impossible before.


I think if this feature were perfected, it would be far better than what an SLR does. I would much rather take the picture with as much information as possible, and blur to make the subject stand out later. It's kind of the same reason people shoot in raw.


> It's kind of the same reason people shoot in raw

I'm certainly in this camp. State-of-the-art UltraHD 60 FPS HDR looks terrific:

https://www.youtube.com/watch?v=qO6-1u0wfPk

Wish there was a simple way to just turn off everything in mobile cameras and instantly enter raw mode at highest resolution.


> Well, no, I mean, a photo which has been selectively blurred like this has _less information_ than we started with, so it's objectively telling us less, and that's why it's funny, that we are spending so much time removing information.

You are confusing entropy with information. You can increase the information in an image by selectively removing everything that isn't relevant to some aspect of its viewing. When a photo focuses on a subject, our visual and cognitive systems don't have to do that work themselves.


Ahhh coool. I didn't know they had this PDAF thing to get stereo images. I remember when they released their first round of this for the old cameras, where you were required to smoothly swipe the camera over to the side and (I'm assuming) they used optical flows + accelerometer data to get two source locations for stereo stuff.

Interesting application of object detection/segmentation to make up for a weak depth map from the PDAF, but I have to wonder: if you're constraining yourself to portraits and are doing a pretty good job of detecting the person, how well could you do without the stereo depth data at all?


A fun experiment would be to try to build something that predicts complete depth maps from single images alone (i.e., train on images with stereo depth maps, then try to predict them from just one side).
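
Sketch of what that training setup could look like (PyTorch; the tiny model and the L1 loss are placeholders I picked, not any particular paper's recipe -- the point is just that the stereo-derived depth maps become the training target for a single-image network):

  import torch
  import torch.nn as nn

  model = nn.Sequential(                  # stand-in for a real encoder-decoder
      nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
      nn.Conv2d(32, 1, 3, padding=1),
  )
  opt = torch.optim.Adam(model.parameters(), lr=1e-4)

  def train_step(image, stereo_depth):
      """image: Nx3xHxW; stereo_depth: Nx1xHxW (the 'ground truth' to mimic)."""
      pred = model(image)                 # depth predicted from one image only
      loss = nn.functional.l1_loss(pred, stereo_depth)
      opt.zero_grad()
      loss.backward()
      opt.step()
      return loss.item()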


This is called depth estimation from a single image. There's ongoing research; here's one example: [0]

Using only the segmented 2D image is what the selfie cam does by the way, at least on Pixel 2 [1].

[0] https://papers.nips.cc/paper/5539-depth-map-prediction-from-...

[1] https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-a...


Cool!


It would be fun to estimate depth maps from single images.

And for the next act, estimate scene graphs from a single image, where both what is being looked at and where it sits are estimated. For example, a fingernail connected to a jointed finger, connected to a hand, to an arm, to a body and so on. With dimensions, locations and rotations relative to each other.

One could render a humanoid model at 120 frames a second (or whatever the GPU would allow) using a set of estimated scene graphs, trying to minimize the error between the image and the rendering.

Add more models (TVs, cars, signs, letters, etc.) over time if the technique shows promise.
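
A toy render-and-compare loop in that spirit (pure illustration of the idea, nothing humanoid about it: I "render" a single Gaussian blob from three pose parameters and gradient-descend on the pixel error against a target image; a real system would swap in a differentiable renderer and an articulated model):

  import torch

  def render(pose, size=64):
      """pose = (x, y, radius); returns a size x size greyscale 'rendering'."""
      ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                              torch.arange(size, dtype=torch.float32),
                              indexing="ij")
      return torch.exp(-((xs - pose[0])**2 + (ys - pose[1])**2) / (2 * pose[2]**2))

  target = render(torch.tensor([40.0, 20.0, 6.0]))   # stands in for the photo
  pose = torch.tensor([32.0, 32.0, 10.0], requires_grad=True)
  opt = torch.optim.Adam([pose], lr=0.5)
  for _ in range(500):                               # minimise rendering error
      loss = ((render(pose) - target) ** 2).mean()
      opt.zero_grad()
      loss.backward()
      opt.step()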


They don't use stereo depth data for the front facing camera.


I'm waiting for light-field cameras: https://en.wikipedia.org/wiki/Light-field_camera


Me too. Apparently the Lytro folks were absorbed by Google.


Can anyone explain to me why I’m not allowed to zoom in on this web page in Chrome on iOS? Edit: works in Safari.



