3D reconstruction from a single image. They stress the examples are not curated, and... well, gosh darnit, it appears to work.
If it runs fast enough I wonder whether one could just drive around with a webcam, generate these 3D models on the fly, and even import them into a sort of GTA-type simulation/game engine in real time. (To generate a novel view, Zero-1-to-3 takes only 2 seconds on an RTX A6000 GPU.)
This research is based on work partially supported by:
- Toyota Research Institute
- DARPA MCS program under Federal Agreement No. N660011924032
- NSF NRI Award #1925157
Oh, huh. Interesting.
Future Work
From objects to scenes:
Generalization to scenes with complex backgrounds remains an important challenge for our method.
From scenes to videos:
Being able to reason about geometry of dynamic scenes from a single view would open novel research directions --
such as understanding occlusions and dynamic object manipulation.
A few approaches for diffusion-based video generation have been proposed recently and extending them to 3D would be key to opening up these opportunities.
Seems like there is a bit of a gap between “runs at 0.5 fps on a $7000 workstation-grade GPU with 48GB of VRAM” and consumer applications.
With the fairly shallow slope of the GPU performance curve over time, I don't see them just Moore's-Lawing out of it either. This would need two, maybe three orders of magnitude more performance.
Of course there is a gap. This is at the exploratory proof of concept stage. The fact that it works at all is what is interesting.
Furthermore, once you've identified the make and model of the car, its relative position in 3D, any anomalies -- that ain't just a Ford pickup, it is loaded with cargo that overhangs in a particular way -- its velocity, etc., I'm quite sure that extrapolating additional information from the subsequent frames will be significantly cheaper, as you don't have to generate a 3D model from scratch each time.
I think this is a viable exploratory path forward.
Make it work <- you are here
Make it work correctly
Make it work fast
Computer power goes up exponentially thanks to Moore's law. Sprinkle some software optimisations on top, and it's conceivable that this could be running at interactive framerates on consumer GPUs within 5-10 years.
Until we break the speed of light, I’m very bearish on cloud gaming. It just feels so bad. You’ve got like 9 layers of latency between you and the screen.
It's quite noticeable actually, and it adds up; it's not just an extra 20ms.
For casual gamers and turn-based games maybe it could work, as a niche. For FPS, multiplayer, ARPG, and so on, it's a dealbreaker; anything over 100ms feels too sluggish.
We should be happy we have so much autonomy with our own hardware; I don't want some big cloud company to be able to tell me what I can play and render, unless we want the "you will own nothing and be happy" meme to become reality.
Actually, in my testing, JRPGs and other turn-based games were amongst the worst, because there is so much "management" (inventory, loot, gear, etc.) and the extra lag really throws you off.
A wireless controller ALONE is already over 20ms, and that's before you touch the network, actually do anything with that input, or wait for the display to redraw…
At a 20ms total round trip, that only buys you about a 1500 mile radius, again completely ignoring all other latencies.
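Back-of-envelope, assuming signals travel at the speed of light (or about two-thirds of that in fiber) and perfectly straight routes:

```python
# Rough sanity check on the radius a latency budget buys you.
# Assumes straight-line routing, which is optimistic for real networks.
C_MILES_PER_SEC = 186_282          # speed of light in vacuum, miles per second
FIBER_FRACTION = 2 / 3             # typical propagation speed in optical fiber

round_trip_budget_s = 0.020        # 20 ms total round trip
one_way_s = round_trip_budget_s / 2

radius_vacuum = C_MILES_PER_SEC * one_way_s                 # ~1860 miles
radius_fiber = C_MILES_PER_SEC * FIBER_FRACTION * one_way_s # ~1240 miles

print(f"vacuum: {radius_vacuum:.0f} mi, fiber: {radius_fiber:.0f} mi")
```

So ~1500 miles is the right order of magnitude, and real routes are longer than straight lines, which only shrinks the radius further.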
If I've been reading it correctly, the power of ChatGPT is in the training and data, not necessarily the algorithm.
And I'm not sure if it's technically possible for one AI to train another AI with the same algorithm and have better performance. Although I could be wrong about any and everything. :-)
An LLM by itself could generate data and code and iterate on its training process, so it could create another LLM from scratch. There is a path to improve LLMs without organic text: connect them to real systems and allow them feedback. They can learn from the feedback on their actions. It could be as simple as a Python execution environment, a game, a simulator, other chat bots, or a more complex system like real-world tests.
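As a toy illustration of the "Python execution environment as feedback" idea -- `ask_llm` below is a hypothetical stand-in for whatever model API you have, not a real library call:

```python
import subprocess
import tempfile

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to some LLM API."""
    raise NotImplementedError

def solve_with_feedback(task: str, max_rounds: int = 5) -> str:
    """Let the model write code, run it, and see its own errors."""
    prompt = f"Write a Python script that does the following:\n{task}\n"
    code = ""
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # executed cleanly; could also check outputs against tests
        # Feed the error back in so the next attempt can correct itself.
        prompt += f"\nYour last attempt failed with:\n{result.stderr}\nPlease fix it.\n"
    return code
```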
I know that NVidia is using AI that is running on NVidia chips to create new chips that they then run AI on.
All you have left to do is to AI the process of training AI, kind of like building a lathe by hand makes a so-so lathe but that so-so lathe can then be used to build a better and more accurate lathe.
I actually love this analogy. People tend to not appreciate just how precise modern manufacturing equipment is.
All of that modern machinery was essentially bootstrapped off a couple of relatively flat rocks. It's going to be interesting to see where this LLM stuff goes when the feedback loop is this quick and so much brainpower is focused on it.
One of my sneaking suspicions is that Facebook/Google/Amazon/Microsoft/etc. would have been better off keeping employees on the books, if for no other reason than keeping thousands of skilled developers occupied, rather than cutting loose thousands of people during a time of rapid technological progress who now have an axe to grind.
It is a nice analogy because you can expand it really to all history of technological progress. Tools help make tools - all the way back to obsidian daggers and sticks.
NeRFs are a form of inverse renderer; this paper uses Score Jacobian Chaining[0] instead. Model reconstruction from NeRFs is also an active area of research. Check out the "Model Reconstruction" section of Awesome NeRF[1].
From the SJC paper:
> We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as function f with parameters θ, i.e., x = f(θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on image x into a gradient on the parameter θ.
> Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.
Interpretation: they parameterize the 3D asset as a voxel grid, render it from multiple viewpoints with a differentiable renderer (the volume rendering function), and push the 2D diffusion model's image-space gradients on those rendered views back through the renderer to update the voxels.
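A toy sketch of that chain-rule idea in PyTorch, heavily simplified: the renderer and the image-space "score" below are placeholders (a real system uses proper volume rendering and a pretrained 2D diffusion model), so this only shows how autograd pushes an image gradient back onto voxel parameters.

```python
import torch

# Toy 3D asset: a small density voxel grid we want to optimize.
voxels = torch.zeros(32, 32, 32, requires_grad=True)

def render(vox, view):
    """Placeholder differentiable renderer: orthographic 'integration' along one axis.
    A real system would use volume rendering with camera poses."""
    return torch.sigmoid(vox).mean(dim=view)   # a (32, 32) "image"

def image_score(img):
    """Placeholder for the 2D diffusion model's gradient on the image.
    Here: a dummy target pulls the image toward a bright square."""
    target = torch.zeros_like(img)
    target[8:24, 8:24] = 1.0
    return target - img

opt = torch.optim.Adam([voxels], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    for view in (0, 1, 2):                   # aggregate gradients over multiple viewpoints
        img = render(voxels, view)
        grad_img = image_score(img).detach() # gradient "on the image x"
        # Chain rule: the Jacobian dx/dtheta is handled by autograd, so backward()
        # converts the image-space gradient into a gradient on the voxels.
        img.backward(gradient=-grad_img)
    opt.step()
```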
I won't lie... ZBrush is brutally hard. I got a subscription for work and only used it for one paid job, ever. But it's super satisfying if you just want to spend
Sunday night making a clay elephant or rhinoceros, and drop $20 to have the file printed out and shipped to you by Thursday.
I've fed lots of my sculpture renderings to Dali and gotten some pretty cool 2D results... but nothing nearly as cool as the little asymmetrical epoxy sculptures I can line up on the bookshelf...
People are definitely building at a high pace, but for what it's worth, this isn't the first work to tackle this problem, as you can see from the references. The results are impressive though!
Image classification is still a difficult task, especially if there are only a few examples. Training a high-resolution, 1000-class ImageNet classifier on 1M+ images from scratch is a drag involving hundreds or thousands of GPU hours. You can do low-resolution classifiers more easily, but they're less accurate.
There are tricks to do it faster, but they all involve using other vision models that themselves took just as long to train.
But can't something like GPT help here? For example you show it a picture of a cat, then you say "this is a cat; cats are furry creatures with claws, etc." and then you show it another image and ask if it is also a cat.
You are humanizing token prediction. The multimodal text-vision models were all established using a scaffold of architectures that unified text-token and vision-token similarity, e.g. BLIP-2 [1]. It's possible that a model using unified representations might be able to establish that the set of visual tokens you are searching for corresponds to some set of text tokens, but only if the pretrained weights of the vision encoder are able to extract the features corresponding to the object you are describing to the vision model.
And the pretrained vision encoder will at some point have been trained to align text and image embeddings (maximizing cosine similarity for matching pairs) on some training set, so it really depends on what exactly that training set had in it.
This paper https://cv.cs.columbia.edu/sachit/classviadescr/ (from the same lab as the main post, funnily) does something along those lines with GPT. It shows for things that are easy to describe like Wordle ("tiled letters, some are yellow and green") you can recognize them with zero training. For things that are harder to describe we'll probably need new approaches, but it's an interesting direction.
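Roughly, that line of work scores an image against text descriptions of the class rather than the class name alone. A hedged sketch using OpenAI's `clip` package -- the descriptor strings and prompt template here are made up for illustration, not taken from the paper:

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# Hypothetical hand-written descriptors; the paper generates these with GPT.
descriptors = {
    "wordle": ["a grid of tiled letters", "some tiles are yellow and green"],
    "crossword": ["a grid of black and white squares", "numbered clues"],
}

image = preprocess(Image.open("screenshot.png")).unsqueeze(0)
with torch.no_grad():
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = {}
    for name, descs in descriptors.items():
        toks = clip.tokenize([f"{name}, which has {d}" for d in descs])
        txt = model.encode_text(toks)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        scores[name] = (img_emb @ txt.T).mean().item()  # average over descriptors

print(max(scores, key=scores.get))
```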
If you have a few examples you can use an already trained encoder (like the CLIP image encoder) and train an SVM on the embeddings; no need to train a neural network.
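That recipe is only a few lines, e.g. with OpenAI's `clip` package plus scikit-learn; the file paths and labels below are placeholders:

```python
import clip
import torch
from PIL import Image
from sklearn.svm import SVC

model, preprocess = clip.load("ViT-B/32", device="cpu")

def embed(paths):
    """Encode images with a frozen CLIP image encoder."""
    ims = torch.stack([preprocess(Image.open(p)) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(ims)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# Placeholder few-shot training set: a handful of labeled example images.
train_paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
train_labels = ["cat", "cat", "dog", "dog"]

clf = SVC(kernel="linear")                 # a linear SVM on frozen embeddings
clf.fit(embed(train_paths), train_labels)

print(clf.predict(embed(["mystery.jpg"])))
```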
That's honestly extremely impressive. I do hope that the "in the wild" examples aren't completely curated and are actually being rendered on the fly (they appear to be, but it's hard for me to tell if that's truly the case). Pretty cool to see, however.
They are precomputed: "Note that the demo allows a limited selection of rotation angles quantized by 30 degrees due to limited storage space of the hosting server." But I don't think they are curated; the seeds probably correspond to the seeds of the live demo you can host yourself (they released the code and the models).
This keeps making me think of my project, where we take multiple photos from the same angle with moving lights to rebuild the 3D model. We are not using AI, just optics research like in [1]. We applied that to art at [2].
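That setup (fixed camera, moving lights) sounds like classic photometric stereo. Without knowing exactly what [1] does, here is the textbook Lambertian version as a rough numpy sketch, assuming the light directions are known and calibrated:

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """images: (K, H, W) grayscale shots from a fixed camera under K known lights.
    light_dirs: (K, 3) unit light direction per shot.
    Returns per-pixel surface normals and albedo (Lambertian assumption)."""
    K, H, W = images.shape
    I = images.reshape(K, -1)                            # (K, H*W) intensities
    # Lambertian model: I = L @ (albedo * normal); solve least squares per pixel.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.reshape(3, H, W), albedo.reshape(H, W)
```

The recovered normals then get integrated into a depth map or mesh in a separate step.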
So the business model there is: scanner + paper shredder + NFT = $$$?
How many people have taken you up on that offer? Unless it's a shitty/low-effort painting, it seems insane to me that anyone would destroy their artwork in exchange for an NFT of that same artwork.
What is insane for you could be completely different for others: we were at the last Miami Art Week and Art Basel and we didn't have enough time for the number of artists that wanted to be part of the process. Will expand more later (working now), but you can see AP coverage here [1].
It is also important to highlight that we are doing this project at our own risk, with our own money; we have built the hardware and software and are not charging artists for the process. Only the primary market sale is split: 85% for the artists and the rest for the project. Pretty generous in this risky market.
> we were at the last Miami Art Week and Art Basel and we didn't have enough time for the number of artists that wanted to be part of the process. Will expand more later
Please also include the number of those people who actually understand what an NFT is. As a native Miamian, I can guarantee you not a single one does. This city has always been a magnet for the get rich quick scheme types, and crypto is a good match for that because it's harder for a layman to grasp the scam part.
We should start by talking about what an NFT and a POAP mean to you. What we do is a new concept where you can prove the physical object no longer exists and is now digital. The NFT is part of this experiment. It is an experiment for us and for the artists.
While this is cool, this is not meant to target "game ready". For games and CGI, there's no reason to limit yourself to a single image. Photogrammetry is already extensively used, and it involves using tens or hundreds of images of the object to scan. Using many images as an input will obviously always be superior to a single one, as a single image means it has to literally make up the back side, and it has no parallax information.
You appear to be thinking about scanning a physical object, whereas zero-shot one image to 3D object would be vastly more useful with a single (possibly AI-generated or AI-assisted) illustration. You get a 3D model in seconds at essentially zero cost, can iterate hundreds of times in a single day.
What if I have a dynamically generated character description in my game’s world, generate a portrait for them using StableDiffusion and then turn that into a 3d model that can be posed and re-used?
For printing parts, precision matters since they likely need to fit with something else. You’ll want to be able to edit dimensions on the model to get the fit right.
So maybe someday, but I think it would have to be a project that targets CAD.
Unlikely. The front bumper of a car you are following has zero value for your ego vehicle's safety. Most of the optimization of FSD is in removing extra data to improve the latency of the mapping loop.
But it seems like the main problem for self-driving is accurately understanding the world around the car. The actual driving is pretty easy. Being able to see a partial object and understand what the rest of it is is very useful to human drivers.
This is insanely impressive, looking at the 3D reconstruction results. If I'm not mistaken, occlusions are where a lot of attention is being placed in pose estimation problems, and if there are enough annotated environmental spaces to create ground truths, you could probably add environment reconstruction to pose reconstruction. What's nice there is that if you have multiple angles of an environment from a moving camera in a video, you can treat each previous frame as a prior, which helps with prediction time and accuracy.
You can obtain a 3D object, but it's more useful for the novel views than the object, because the object isn't very good and probably needs some processing. See the bottom of the paper.
> We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction.
Here they list "GT Mesh", "Ours", "Point-E", and "MCC". Does anyone know what technique "GT mesh" refers to? Is it simply the original mesh that generated the source image?
Haha, I am sorry. I spit my coffee reading this. It is ofc totally OK to not know what ground truth means, but the irony was too funny. Yes, ground truth will always be superior compared to anything else :)!
Ground truth will always be superior on the "does this match the ground truth?" metric, but that's often just a proxy for output quality and the model will be judged differently once deployed (e.g. "do human users like this?")
That's something to be aware of, especially when you're using convenience data of unknown quality to evaluate your model – many research datasets scraped off the internet with little curation and labeled in a rush by low-paid workers contain a lot of SEO garbage and labeling errors.
I always wanted to meet the team behind Ground Truth. It’s truly remarkable what they have built. Every time AI models show up, these guys outperform them on every metric.
Anyone have any contacts? They seem to be extremely elusive
“Ground truth” doesn’t refer to a particular algorithm; it refers to the ideal benchmark of what a perfect performance would look like, which they’re grading against.
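For what it's worth, single-view reconstruction papers typically score the predicted geometry against that GT mesh with something like Chamfer distance between sampled point clouds. A naive numpy sketch (real evaluations use KD-trees or GPU batching):

```python
import numpy as np

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance between two point clouds of shapes (N, 3) and (M, 3).
    Naive O(N*M) pairwise version for illustration."""
    d2 = np.sum((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```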
NeRF models are trained on several views with known location and viewing direction. This model takes one image (and you don't need to train a model for each object).
Not just likely, it does. Try out the demo and see, e.g. what the backside of their Pikachu toy looks like. Or a little simpler, the paper has an example (the demo also has this) of the back of a car under different seeds.
I wonder if this type of thing could be adapted into a vision system for a robot, so it would locate the camera and reconstruct an entire scene from a series of images as the robot moves around.
It probably has a ways to go to get there, but being able to do robust SLAM etc. with just a single camera would make things much less expensive.
If you look at the example meshes, it doesn't seem very likely that it would be better than manually creating them, unless you're okay with lumpy parts that aren't exactly the right size. This is too early for it to not require a lot of cleanup to be usable.
It's hard to tell for certain from the paper without going deep into the code, but it seems they created the new model the same way the depth-conditioned SD models were made, i.e. a normal finetune.
It might be possible to create an "original view + new angle" conditioned model much more easily by taking the ControlNet/T2I-Adapter/GLIDE route, where you freeze the original model.
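A very rough sketch of that freeze-the-base idea, with a toy stand-in for the UNet rather than the real SD/ControlNet code, just to show which parameters stay frozen and where the control features get injected:

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Stand-in for a pretrained denoiser; in reality this is the SD UNet."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(ch, 3, 3, padding=1))
    def forward(self, x, extra=None):
        h = self.body[0](x)
        if extra is not None:
            h = h + extra                # inject the control branch's features
        return self.body[2](self.body[1](h))

class ControlBranch(nn.Module):
    """Trainable branch conditioned on the reference view + a target-pose map."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3 + 4, ch, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))
        nn.init.zeros_(self.net[-1].weight)  # zero-init so training starts as a no-op
        nn.init.zeros_(self.net[-1].bias)
    def forward(self, ref_view, pose_map):
        return self.net(torch.cat([ref_view, pose_map], dim=1))

base = ToyUNet()
for p in base.parameters():
    p.requires_grad_(False)              # freeze the pretrained model
control = ControlBranch()                # only this part gets gradient updates
opt = torch.optim.Adam(control.parameters(), lr=1e-4)
```

The zero-initialized last layer means the control branch starts out contributing nothing to the frozen model, which is the trick that makes this route stable to train.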
Text-to-3D seems close to being solved.
It also makes me think an "original character image + new pose" conditioned model would work quite well.