3D reconstruction from a single image. They stress the examples are not curated, and... well, gosh darnit, it appears to work.
If it runs fast enough I wonder whether one could just drive around with a webcam, generate these 3D models on the fly, and even import them into a sort of GTA-type simulation/game engine in real time. (To generate a novel view, Zero-1-to-3 takes only 2 seconds on an RTX A6000 GPU.)
This research is based on work partially supported by:
- Toyota Research Institute
- DARPA MCS program under Federal Agreement No. N660011924032
- NSF NRI Award #1925157
Oh, huh. Interesting.
Future Work
From objects to scenes:
Generalization to scenes with complex backgrounds remains an important challenge for our method.
From scenes to videos:
Being able to reason about geometry of dynamic scenes from a single view would open novel research directions --
such as understanding occlusions and dynamic object manipulation.
A few approaches for diffusion-based video generation have been proposed recently and extending them to 3D would be key to opening up these opportunities.
Seems like there is a bit of a gap between “runs at 0.5 fps on a $7000 workstation-grade GPU with 48GB of VRAM” and consumer applications.
With the fairly shallow slope of the GPU performance curve over time, I don't see them just Moore's-Lawing out of it either. This would need two, maybe three orders of magnitude more performance.
Of course there is a gap. This is at the exploratory proof of concept stage. The fact that it works at all is what is interesting.
Furthermore, once you've identified the make and model of the car, its relative position in 3D, any anomalies -- that ain't just a Ford pickup, it is loaded with cargo that overhangs in a particular way -- its velocity, etc., I'm quite sure that extrapolating additional information from the subsequent frames will be significantly cheaper, as you don't have to generate a 3D model from scratch each time.
I think this is a viable exploratory path forward.
Make it work <- you are here
Make it work correctly
Make it work fast
Computer power goes up exponentially thanks to Moore's law. Sprinkle some software optimisations on top, and it's conceivable that this could be running at interactive framerates on consumer GPUs within 5-10 years.
Until we break the speed of light, I’m very bearish on cloud gaming. It just feels so bad. You’ve got like 9 layers of latency between you and the screen.
It's quite noticeable actually, and it adds up; it's not just an extra 20ms.
For casual gamers and turn-based games maybe it could work, as a niche. For FPS, multiplayer, ARPG, and so on, it's a dealbreaker; anything over 100ms feels too sluggish.
We should be happy we have so much autonomy with our own hardware; I don't want some big cloud company to be able to tell me what I can play and render, unless we want the "you will own nothing and be happy" meme to become reality.
Actually, in my testing, JRPGs and other turn-based games were amongst the worst, because there is so much "management" (inventory, loot, gear, etc.) and the extra lag really throws you off.
A wireless controller ALONE is already over 20ms, and that's before you touch the network, actually do anything with that input, or wait for the display to redraw…
At a 20ms total round trip, that only buys you about a 1500 mile radius, again completely ignoring all other latencies.
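Back-of-envelope, assuming signals travel at the speed of light (or about two-thirds of that in fiber) and perfectly straight routes:

```python
# Rough sanity check on the radius a latency budget buys you.
# Assumes straight-line routing, which is optimistic for real networks.
C_MILES_PER_SEC = 186_282          # speed of light in vacuum, miles per second
FIBER_FRACTION = 2 / 3             # typical propagation speed in optical fiber

round_trip_budget_s = 0.020        # 20 ms total round trip
one_way_s = round_trip_budget_s / 2

radius_vacuum = C_MILES_PER_SEC * one_way_s                 # ~1860 miles
radius_fiber = C_MILES_PER_SEC * FIBER_FRACTION * one_way_s # ~1240 miles

print(f"vacuum: {radius_vacuum:.0f} mi, fiber: {radius_fiber:.0f} mi")
```

So ~1500 miles is the right order of magnitude, and real routes are longer than straight lines, which only shrinks the radius further.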
If I've been reading it correctly, the power of ChatGPT is in the training and data, not necessarily the algorithm.
And I'm not sure if it's technically possible for one AI to train another AI with the same algorithm and have better performance. Although I could be wrong about any and everything. :-)
An LLM by itself could generate data and code and iterate on its training process, so it could create another LLM from scratch. There is a path to improve LLMs without organic text: connect them to real systems and allow them feedback. They can learn from the feedback on their actions. It could be as simple as a Python execution environment, a game, a simulator, other chat bots, or a more complex system like real-world tests.
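As a toy illustration of the "Python execution environment as feedback" idea -- `ask_llm` below is a hypothetical stand-in for whatever model API you have, not a real library call:

```python
import subprocess
import tempfile

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to some LLM API."""
    raise NotImplementedError

def solve_with_feedback(task: str, max_rounds: int = 5) -> str:
    """Let the model write code, run it, and see its own errors."""
    prompt = f"Write a Python script that does the following:\n{task}\n"
    code = ""
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # executed cleanly; could also check outputs against tests
        # Feed the error back in so the next attempt can correct itself.
        prompt += f"\nYour last attempt failed with:\n{result.stderr}\nPlease fix it.\n"
    return code
```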
I know that NVidia is using AI that is running on NVidia chips to create new chips that they then run AI on.
All you have left to do is to AI the process of training AI, kind of like building a lathe by hand makes a so-so lathe but that so-so lathe can then be used to build a better and more accurate lathe.
I actually love this analogy. People tend to not appreciate just how precise modern manufacturing equipment is.
All of that modern machinery was essentially bootstrapped off a couple of relatively flat rocks. It's going to be interesting to see where this LLM stuff goes when the feedback loop is this quick and so much brainpower is focused on it.
One of my sneaking suspicions is that Facebook/Google/Amazon/Microsoft/etc. would have been better off keeping employees on the books, if for no other reason than keeping thousands of skilled developers occupied, rather than cutting loose thousands of people during a time of rapid technological progress who now have an axe to grind.
It is a nice analogy because you can expand it really to all history of technological progress. Tools help make tools - all the way back to obsidian daggers and sticks.
NeRFs are a form of inverse renderer; this paper uses Score Jacobian Chaining[0] instead. Model reconstruction from NeRFs is also an active area of research. Check out the "Model Reconstruction" section of Awesome NeRF[1].
From the SJC paper:
> We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as function f with parameters θ, i.e., x = f(θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on image x into a gradient on the parameter θ.
> Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.
Interpretation: they parameterize the 3D asset as a voxel grid, render it from multiple viewpoints with a differentiable renderer (the volume rendering function), and push the 2D diffusion model's image-space gradients on those rendered views back through the renderer to update the voxels.
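A toy sketch of that chain-rule idea in PyTorch, heavily simplified: the renderer and the image-space "score" below are placeholders (a real system uses proper volume rendering and a pretrained 2D diffusion model), so this only shows how autograd pushes an image gradient back onto voxel parameters.

```python
import torch

# Toy 3D asset: a small density voxel grid we want to optimize.
voxels = torch.zeros(32, 32, 32, requires_grad=True)

def render(vox, view):
    """Placeholder differentiable renderer: orthographic 'integration' along one axis.
    A real system would use volume rendering with camera poses."""
    return torch.sigmoid(vox).mean(dim=view)   # a (32, 32) "image"

def image_score(img):
    """Placeholder for the 2D diffusion model's gradient on the image.
    Here: a dummy target pulls the image toward a bright square."""
    target = torch.zeros_like(img)
    target[8:24, 8:24] = 1.0
    return target - img

opt = torch.optim.Adam([voxels], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    for view in (0, 1, 2):                   # aggregate gradients over multiple viewpoints
        img = render(voxels, view)
        grad_img = image_score(img).detach() # gradient "on the image x"
        # Chain rule: the Jacobian dx/dtheta is handled by autograd, so backward()
        # converts the image-space gradient into a gradient on the voxels.
        img.backward(gradient=-grad_img)
    opt.step()
```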
I won't lie... ZBrush is brutally hard. I got a subscription for work and only used it for one paid job, ever. But it's super satisfying if you just want to spend
Sunday night making a clay elephant or rhinoceros, and drop $20 to have the file printed out and shipped to you by Thursday.
I've fed lots of my sculpture renderings to Dali and gotten some pretty cool 2D results... but nothing nearly as cool as the little asymmetrical epoxy sculptures I can line up on the bookshelf...
People are definitely building at a high pace, but for what it's worth, this isn't the first work to tackle this problem, as you can see from the references. The results are impressive though!
Image classification is still a difficult task, especially if there are only a few examples. Training a high-resolution, 1000-class ImageNet classifier on 1M+ images from scratch is a drag involving hundreds or thousands of GPU hours. You can do low-resolution classifiers more easily, but they're less accurate.
There are tricks to do it faster, but they all involve using other vision models that themselves took just as long to train.
But can't something like GPT help here? For example you show it a picture of a cat, then you say "this is a cat; cats are furry creatures with claws, etc." and then you show it another image and ask if it is also a cat.
You are humanizing token prediction. The multimodal text-vision models were all established using a scaffold of architectures that unified text-token and vision-token similarity, e.g. BLIP-2 [1]. It's possible that a model using unified representations might be able to establish that the set of visual tokens you are searching for corresponds to some set of text tokens, but only if the pretrained weights of the vision encoder are able to extract the features corresponding to the object you are describing to the vision model.
And the pretrained vision encoder will at some point have been trained to align text and image embeddings (maximizing cosine similarity for matching pairs) on some training set, so it really depends on what exactly that training set had in it.
This paper https://cv.cs.columbia.edu/sachit/classviadescr/ (from the same lab as the main post, funnily) does something along those lines with GPT. It shows for things that are easy to describe like Wordle ("tiled letters, some are yellow and green") you can recognize them with zero training. For things that are harder to describe we'll probably need new approaches, but it's an interesting direction.
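Roughly, that line of work scores an image against text descriptions of the class rather than the class name alone. A hedged sketch using OpenAI's `clip` package -- the descriptor strings and prompt template here are made up for illustration, not taken from the paper:

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# Hypothetical hand-written descriptors; the paper generates these with GPT.
descriptors = {
    "wordle": ["a grid of tiled letters", "some tiles are yellow and green"],
    "crossword": ["a grid of black and white squares", "numbered clues"],
}

image = preprocess(Image.open("screenshot.png")).unsqueeze(0)
with torch.no_grad():
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = {}
    for name, descs in descriptors.items():
        toks = clip.tokenize([f"{name}, which has {d}" for d in descs])
        txt = model.encode_text(toks)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores[name] = (img_emb @ txt.T).mean().item()  # average over descriptors

print(max(scores, key=scores.get))
```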
If you have a few examples you can use an already trained encoder (like the CLIP image encoder) and train an SVM on the embeddings; no need to train a neural network.
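That recipe is only a few lines, e.g. with OpenAI's `clip` package plus scikit-learn; the file paths and labels below are placeholders:

```python
import clip
import torch
from PIL import Image
from sklearn.svm import SVC

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed(paths):
    """Encode images with a frozen CLIP image encoder."""
    ims = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(ims)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# Placeholder few-shot training set: a handful of labeled example images.
train_paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
train_labels = ["cat", "cat", "dog", "dog"]

clf = SVC(kernel="linear")                 # a linear SVM on frozen embeddings
clf.fit(embed(train_paths), train_labels)

print(clf.predict(embed(["mystery.jpg"])))
```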
That's honestly extremely impressive. I do hope that the "in the wild" examples aren't completely curated and are actually being rendered on the fly (they appear to be, but it's hard for me to tell if that's truly the case). Pretty cool to see, however.
They are precomputed: "Note that the demo allows a limited selection of rotation angles quantized by 30 degrees due to limited storage space of the hosting server." But I don't think they are curated; the seeds probably correspond to the seeds of the live demo you can host yourself (they released the code and the models).
This keeps making me think of my project, where we take multiple photos from the same angle with moving lights to rebuild the 3D model. We are not using AI, just optics research like in [1]. We applied that to art at [2].
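That setup (fixed camera, moving lights) sounds like classic photometric stereo. Without knowing exactly what [1] does, here is the textbook Lambertian version as a rough numpy sketch, assuming the light directions are known and calibrated:

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """images: (K, H, W) grayscale shots from a fixed camera under K known lights.
    light_dirs: (K, 3) unit light direction per shot.
    Returns per-pixel surface normals and albedo (Lambertian assumption)."""
    K, H, W = images.shape
    I = images.reshape(K, -1)                            # (K, H*W) intensities
    # Lambertian model: I = L @ (albedo * normal); solve least squares per pixel.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.reshape(3, H, W), albedo.reshape(H, W)
```

The recovered normals then get integrated into a depth map or mesh in a separate step.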
So the business model there is: scanner + paper shredder + NFT = $$$?
How many people have taken you up on that offer? Unless it's a shitty/low-effort painting, it seems insane to me that anyone would destroy their artwork in exchange for an NFT of that same artwork.
What is insane for you could be completely different for others: we were at the last Miami Art Week and Art Basel and we didn't have enough time for the number of artists that wanted to be part of the process. Will expand more later (working now), but you can see AP coverage here [1].
It is also important to highlight that we are doing this project at our own risk, with our own money; we have built the hardware and software and are not charging artists for the process. Only the primary market sale is split: 85% for the artists and the rest for the project. Pretty generous in this risky market.
> we were at the last Miami Art Week and Art Basel and we didn't have enough time for the number of artists that wanted to be part of the process. Will expand more later
Please also include the number of those people who actually understand what an NFT is. As a native Miamian, I can guarantee you not a single one does. This city has always been a magnet for the get rich quick scheme types, and crypto is a good match for that because it's harder for a layman to grasp the scam part.
We should start by talking about what an NFT and a POAP mean to you. What we do is a new concept where you can prove the physical object no longer exists and is now digital. The NFT is part of this experiment. It is an experiment for us and for the artists.
While this is cool, this is not meant to target "game ready". For games and CGI, there's no reason to limit yourself to a single image. Photogrammetry is already extensively used, and it involves using tens or hundreds of images of the object to scan. Using many images as an input will obviously always be superior to a single one, as a single image means it has to literally make up the back side, and it has no parallax information.
You appear to be thinking about scanning a physical object, whereas zero-shot one image to 3D object would be vastly more useful with a single (possibly AI-generated or AI-assisted) illustration. You get a 3D model in seconds at essentially zero cost, can iterate hundreds of times in a single day.
What if I have a dynamically generated character description in my game’s world, generate a portrait for them using StableDiffusion and then turn that into a 3d model that can be posed and re-used?
For printing parts, precision matters since they likely need to fit with something else. You’ll want to be able to edit dimensions on the model to get the fit right.
So maybe someday, but I think it would have to be a project that targets CAD.
Unlikely. The front bumper of a car you are following has zero value for your ego vehicle's safety. Most of the optimization of FSD is in removing extra data to improve the latency of the mapping loop.
But it seems like the main problem for self-driving is accurately understanding the world around the car. The actual driving is pretty easy. Being able to see a partial object and understand what the rest of it is is very useful to human drivers.
This is insanely impressive, looking at the 3D reconstruction results. If I'm not mistaken, occlusions are where a lot of attention is being placed in pose estimation problems, and if there are enough annotated environmental spaces to create ground truths, you could probably add environment reconstruction to pose reconstruction. What's nice there is that if you have multiple angles of an environment from a moving camera in a video, you can treat each previous frame as a prior, which helps with prediction time and accuracy.
You can obtain a 3D object, but it's more useful for the novel views than the object, because the object isn't very good and probably needs some processing. See the bottom of the paper.
> We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction.
Here they list "GT Mesh", "Ours", "Point-E", and "MCC". Does anyone know what technique "GT mesh" refers to? Is it simply the original mesh that generated the source image?
Haha, I am sorry. I spit my coffee reading this. It is ofc totally OK to not know what ground truth means, but the irony was too funny. Yes, ground truth will always be superior compared to anything else :)!
Ground truth will always be superior on the "does this match the ground truth?" metric, but that's often just a proxy for output quality and the model will be judged differently once deployed (e.g. "do human users like this?")
That's something to be aware of, especially when you're using convenience data of unknown quality to evaluate your model – many research datasets scraped off the internet with little curation and labeled in a rush by low-paid workers contain a lot of SEO garbage and labeling errors.
I always wanted to meet the team behind Ground Truth. It’s truly remarkable what they have built. Every time AI models show up, these guys outperform them on every metric.
Anyone have any contacts? They seem to be extremely elusive
“Ground truth” doesn’t refer to a particular algorithm; it refers to the ideal benchmark of what a perfect performance would look like, which they’re grading against.
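For what it's worth, single-view reconstruction papers typically score the predicted geometry against that GT mesh with something like Chamfer distance between sampled point clouds. A naive numpy sketch (real evaluations use KD-trees or GPU batching):

```python
import numpy as np

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance between two point clouds of shapes (N, 3) and (M, 3).
    Naive O(N*M) pairwise version for illustration."""
    d2 = np.sum((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```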
NeRF models are trained on several views with known location and viewing direction. This model takes one image (and you don't need to train a model for each object).
Not just likely, it does. Try out the demo and see, e.g. what the backside of their Pikachu toy looks like. Or a little simpler, the paper has an example (the demo also has this) of the back of a car under different seeds.
I wonder if this type of thing could be adapted into a vision system for a robot, so it would locate the camera and reconstruct an entire scene from a series of images as the robot moves around.
It probably has a ways to go to get there, but being able to do robust SLAM etc. with just a single camera would make things much less expensive.
If you look at the example meshes, it doesn't seem very likely that it would be better than manually creating them, unless you're okay with lumpy parts that aren't exactly the right size. This is too early for it to not require a lot of cleanup to be usable.
It's hard to tell for certain from the paper without going deep into the code, but it seems they created the new model the same way the depth-conditioned SD models were made, i.e. a normal finetune.
It might be possible to create an "original view + new angle" conditioned model much more easily by taking the ControlNet/T2I-Adapter/GLIDE route, where you freeze the original model.
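A very rough sketch of that freeze-the-base idea, with a toy stand-in for the UNet rather than the real SD/ControlNet code, just to show which parameters stay frozen and where the control features get injected:

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Stand-in for a pretrained denoiser; in reality this is the SD UNet."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(ch, 3, 3, padding=1))
    def forward(self, x, extra=None):
        h = self.body[0](x)
        if extra is not None:
            h = h + extra                # inject the control branch's features
        return self.body[2](self.body[1](h))

class ControlBranch(nn.Module):
    """Trainable branch conditioned on the reference view + a target-pose map."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3 + 4, ch, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))
        nn.init.zeros_(self.net[-1].weight)  # zero-init so training starts as a no-op
        nn.init.zeros_(self.net[-1].bias)
    def forward(self, ref_view, pose_map):
        return self.net(torch.cat([ref_view, pose_map], dim=1))

base = ToyUNet()
for p in base.parameters():
    p.requires_grad_(False)              # freeze the pretrained model
control = ControlBranch()                # only this part gets gradient updates
opt = torch.optim.Adam(control.parameters(), lr=1e-4)
```

The zero-initialized last layer means the control branch starts out contributing nothing to the frozen model, which is the trick that makes this route stable to train.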
Text-to-3D seems close to being solved.
It also makes me think an "original character image + new pose" conditioned model would work quite well.