Show HN: Generate Stable Diffusion scenes around 3D models (github.com/dabble-studio)
123 points by neilxm on Oct 19, 2023 | 53 comments
3D-to-photo is an open source tool for generative AI product photography that uses 3D models to allow fine camera angle control in generated images.

If you have 3D models created using the iOS 3D scanner you can upload them directly on to 3D-to-photo and describe the scene you want to create. For example:

"on a city side walk" "near a lake, overlooking the water"

Then click "generate" to get the final images.

The tech stack behind 3D-to-photo:

Handling 3D models on the web: @threejs
Hosting the diffusion model: @replicate
3D scanning apps: Shopify, Polycam3D or LumaLabsAI




Given that Stable Diffusion is designed to run on consumer hardware, without the need for a third party cloud platform, it saddens me to see that this, alongside many other similar projects, requires a third party platform for hosting the model, even for local usage. The tool itself does seem interesting though.


So the api used for inpainting is very easily swappable to a local instance. But when you say consumer hardware, I think you might be overestimating how capable most people’s computing setups are.


Isn't that to a degree a function of time?

There may be a point when that's not the case, but at present this stuff will run on hardware accessible to a consumer, assuming they're ok waiting a bit.


yea for sure. i guess i'd like this to be something usable on mobile as well. but directionally you're right. there's a lot to be gained from optimizing for local compute in many applications. something to look into.


How is this different than just photoshopping a 2d image of a 3d object onto an SD generated background? Is it just meant to let people skip the step of generating a background and compositing? (Sorry, inpainting. But the distinction seems minimal here as people have been photoshopping 3d objects believably into scenes for decades before SD came around)


Reading your question again, I should clarify - it's not quite compositing a random background under the 2d image. it's using stable diffusion to "in-paint" a background that makes sense spatially, with correct shadows, lighting and perspective to match your prompt. That being said, there's still a lot to be done to deeply integrate the 3D and 2D spaces and that's just going to be part of the ongoing exploration in this project.
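To make the in-painting step concrete, here is a minimal sketch of how a mask could be derived from the transparent-background snapshot. It is not taken from the project's code: `inpaint_mask_from_alpha` and the nested-list pixel format are hypothetical, and in a real backend an imaging library such as Pillow would hold the pixels. The sketch assumes the common convention that white marks the region to repaint and black marks the region to keep; check the convention of whichever inpainting model you call.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]


def inpaint_mask_from_alpha(pixels: List[List[RGBA]], threshold: int = 0) -> List[List[int]]:
    """Build a single-channel mask from an RGBA snapshot.

    255 marks pixels the model should generate (transparent background),
    0 marks pixels to keep (the rendered product). The white-means-repaint
    convention is an assumption, not something confirmed by the thread.
    """
    return [
        [255 if alpha <= threshold else 0 for (_r, _g, _b, alpha) in row]
        for row in pixels
    ]


# A 2x3 snapshot: one opaque product pixel on a transparent background.
snapshot = [
    [(0, 0, 0, 0), (200, 30, 30, 255), (0, 0, 0, 0)],
    [(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0)],
]
mask = inpaint_mask_from_alpha(snapshot)
```

Raising `threshold` would also hand semi-transparent edge pixels to the model, which may help it blend the product's outline into the generated scene.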


That's fair. I guess in my head they're still one and the same. This is compositing without the step of having to match shadows and find a background with a decent angle, etc. I had asked in my original post if this was to skip the step of compositing (because you would do things like fix shadows etc if you were compositing) and that's what it sounds like it is, which is cool.

Do you communicate anything about the angle or anything to SD? (outside of just giving it the image with a transparent background)

I guess, given the additional context (and vs discussing the semantics of compositing), the better question is, how does this extend the capabilities of stable diffusion inpainting?

Is it any different than just putting your 3d model into photoshop or in a 3d viewer, exporting with a transparent background, and inpainting around it?


It's not that different right now, you're right about that. It chains that sequence of steps together and I'm not really sending any metadata about the object's pose to SD, I just leave it up to the model. But the next step here is leveraging more control-net approaches, and that's when it starts to really take advantage of 3D more than just the 2D snapshot minus background.

For example, a 3D editor to allow for simple low poly style scene creation, that then serves as conditioning input to control net. For example staging a model of a chair in a sketchup model of a living room with super basic furniture elements in low poly 3D models. You pass that into SD and out comes a fully rendered image. At that point i think you could argue that stable diffusion could be used as a platform agnostic renderer like VRAY but for any 3D modelling tool.

This was my version 1 haha


That sounds really damned cool! Kind of like image bashing without having to figure out how to use blender/photoshop or something to arrange it haha


bingo


This streamlines the process of pasting a cropped render in, and Stable Diffusion frequently doesn't get the context right, so I assume that's useful for rapidly testing the shoe in different angles.


yea so as of now, it is a fairly loose integration between threejs "rendering" a 2d image of a scanned product at some orientation, and then Stable diffusion inpainting a background around it.

However, there are a number of extensions here that make the integration of 2D and 3D more interesting going forward

1. 3D models means you can relight the model before running it through in-painting. adding lights in 3D around the product, plus using a more raycasted rendering system (which is now possible in-browser) means you can control the input to SD really well.

2. This one is the most interesting piece. You can create a 3D editor to allow very simple low poly style scene creation, lets say a pedestal and a vase or something from a product photography standpoint - then pass the depth map or canny edges as conditioning through something like control net and you have a super controllable scene design tool that you can finely control - both in camera angle and perspective.
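As a rough illustration of the depth-map conditioning idea in point 2, the sketch below turns per-pixel camera distances from a 3D scene into an 8-bit grayscale depth image. Everything here is an assumption for illustration: `depth_to_conditioning` is a hypothetical helper, the input is a plain nested list rather than a real render target, and "nearer is brighter" is the convention commonly used for depth-conditioned ControlNet inputs, so verify it against the model you actually use.

```python
from typing import List

BACKGROUND = float("inf")  # hypothetical marker for pixels with no geometry


def depth_to_conditioning(depth: List[List[float]]) -> List[List[int]]:
    """Map camera distances to 0..255 with nearer surfaces brighter.

    Pixels without geometry become 0. Note that the farthest surface also
    maps to 0 in this simple sketch.
    """
    hits = [d for row in depth for d in row if d != BACKGROUND]
    if not hits:
        return [[0 for _ in row] for row in depth]
    near, far = min(hits), max(hits)
    span = far - near
    result = []
    for row in depth:
        out_row = []
        for d in row:
            if d == BACKGROUND:
                out_row.append(0)
            elif span == 0:
                out_row.append(255)  # flat scene: every surface is "nearest"
            else:
                out_row.append(round(255 * (far - d) / span))
        result.append(out_row)
    return result


# One row of a low-poly scene: pedestal at 1 m, vase at 2 m, wall at 5 m, empty sky.
conditioning = depth_to_conditioning([[1.0, 2.0, 5.0, BACKGROUND]])
```

In the browser the distances would come from the three.js depth buffer; the normalization step stays the same wherever the depth values originate.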


You can adjust the pose of a thing you are pasting.


Also this is inpainting, not compositing.


Is it really very different? Mind you just slapping the subject and a random background together would, accurately, not produce a very good result but for decades before SD existed people were able to accurately composite objects into images matching shadows and manipulating lighting to fit.

The distinction doesn't feel worth commenting about overall.


Inpainting manually is incredibly laborious. The comparison is almost like manually adding numbers on a paper spreadsheet vs excel.

Manual inpainting gives a lot of control though, something I hope these SD pipelines can improve on.


That's why I had asked if the goal of the project was just to skip the laborious parts of compositing! haha

I'm with you on the lack of control of SD. It can create some amazing stuff but so frequently it's kind of just "put in some words and click until you get something you like then pull into an image editor to fix up issues" which ends up meaning I have to do some of the labor still anyway.


> The distinction doesn't feel worth commenting about overall.

That's like saying that the distinction between the output of stable diffusion and a real artist isn't worth commenting about overall, since they're both just paintings.


In the context of whether or not a process is compositing or inpainting, that's factual. The output is a 2d plate placed in front of a background. Both achieve similar results. The level of manual (or automated) effort is not what's being discussed here.

So I'm not sure what point you're trying to make.


God we're so close to being able to feed a photo and some measurements into a program and get an accurate model out of it. I can't wait until my smartphone and a set of calipers can replace a $700 3d scanner.


For what it's worth, this doesn't look like it's anything towards that. This is just letting you manipulate a 3d model into an angle, save a 2d image of it, and put it into a generated background. This doesn't turn 2d->3d or seem to do anything 3d except allow a 3d model to be loaded for a photoshoot.


Pretty much. additionally, i think NeRF is really promising for this use case. You don't really need a 3D model per se to view a thing from different angles. The only issue is lighting - both photogrammetry and NeRF have baked lighting, which really limits what you can do with the model. But there's a bunch of work towards solving that problem, so I can see that problem going away soon!


3D Gaussian Splatting seems to be superseding NeRF

NeRF might be a dead end technique in case you're trying to keep up


Neat. I recently was having fun with doing something similar manually in Blender, generating depth map and using it in Stable Diffusion with controlnet. Results were great. My models didn't have texture though, so sd generated it. But I imagine I could go with img2img to preserve texture if I had it.


Why does it need a 3d model ? It looks like it is just doing inpainting which can be done with a single image.


Yea you don't absolutely need a 3D model. The benefit of 3D though, is you can define any camera angle and perspective of the main product. now of course you could just take a picture of a product from any angle if it were in front of you, but that's not always feasible. the use case here is scalable photo generation for ecomm stores with thousands of inventory items.

Additionally, 3d means more than just camera angle control. you can define a scene in 3D and send it into control net to produce a very specific image
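A small sketch of what "define any camera angle" can look like in practice: place the camera on an orbit around the product and generate one position per desired shot. The function names and defaults are hypothetical, and the axis convention (y up, azimuth 0 on the +z axis) is chosen to match three.js defaults rather than taken from the project.

```python
import math
from typing import Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_camera_position(azimuth_deg: float, elevation_deg: float, radius: float) -> Vec3:
    """Camera position on a sphere around the origin, y-up.

    Azimuth rotates around the vertical axis starting from +z;
    elevation tilts the camera up from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.sin(az)
    y = radius * math.sin(el)
    z = radius * math.cos(el) * math.cos(az)
    return (x, y, z)


def product_views(azimuths_deg: Iterable[float], elevation_deg: float = 15.0,
                  radius: float = 2.0) -> List[Vec3]:
    """One camera position per azimuth, all aimed at the product at the origin."""
    return [orbit_camera_position(a, elevation_deg, radius) for a in azimuths_deg]


# Eight evenly spaced shots around a single inventory item.
views = product_views(range(0, 360, 45))
```

Running the same list of views over every scanned item is one way the "thousands of inventory items" case could be batched, with each position fed to the renderer before the snapshot is taken.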


yeah that makes sense. I think the challenge is how much you can automate vs the quality.


I can imagine a few good uses for this:

1. Industrial designers and retailers can quickly flesh out a surround for their project/product in various angles and settings without needing a physical sample or a photo shoot, and without having to build scenes or match perspectives. It would be trivial to automate this into an app so these designers don't need to know anything about AI.

2. People developing content with AI currently have to deal with the subject often varying from inference to inference. With this system the designer would use a model for the subject, and then let stable diffusion make the surrounds. This would reduce the fiddly work of trying to keep the subject consistent from image to image.

3. On-the-fly imagery: Imagine a retailer has a build-to-order ordering system that can have many different options. For example we'll say it's Mr Potato Head with hundreds of different noses, shoes, arms, and clothes. A retailer's website can generate realistic imagery of the customer's specific order on the fly, based on the BTO options selected. Instead of displaying the options in a generic template view, the preview image can instead be various scenes and settings, these can also match the themes of the selected accessories. (e.g. your mr potato head has a chefs hat, so he's in a kitchen cooking. Your mr potato head has sunglasses, now he's on a beach, he's got a chefs hat and sunglasses: now he's BBQing, etc.)

4. Customised content: There are currently services where parents can customise a generic character to look like their child and order a series of books featuring their child. These are usually limited to skin colour, gender and the colour of the clothes. Using this tech the customisation and output imagery could take a significant leap.


awesome ideas! the configurability and flexibility of 3D models is a huge advantage over a pure 2D approach in these scenarios.


It doesn't appear that this project uses it, but a 3D model would give you all the other information like depth maps/normal maps that you would need to light the object itself properly. i.e. change the pixels of the object and not just draw a background around it.


That’s the idea for next steps.

Basically if you generate a backdrop and then estimate light direction you can inverse render that onto the 3d model given all the depth information you get for free from the model
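A minimal sketch of the relighting half of that idea, assuming the light direction has already been estimated from the generated backdrop (the estimation itself is not shown). It uses plain Lambertian diffuse shading, which is a stand-in technique chosen for brevity and not something the project is confirmed to use; `lambert_shade` and the `ambient` term are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / length for c in v)


def lambert_shade(albedo: Vec3, normal: Vec3, light_dir: Vec3, ambient: float = 0.1) -> Vec3:
    """Shade one surface point with Lambertian diffuse lighting.

    `normal` comes from the 3D model; `light_dir` points from the surface
    toward the light and is assumed to be estimated from the backdrop.
    """
    n = normalize(normal)
    l = normalize(light_dir)
    diffuse = max(0.0, sum(a * b for a, b in zip(n, l)))
    intensity = min(1.0, ambient + (1.0 - ambient) * diffuse)
    return tuple(channel * intensity for channel in albedo)


# A white surface facing the camera, lit head-on and then from behind.
lit = lambert_shade((1.0, 1.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
unlit = lambert_shade((1.0, 1.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

Applied per pixel with the model's normal map, this changes the pixels of the object itself instead of only drawing a background around it.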


You should be able to do relighting with ControlNet. Basically render the model to all the maps you'd use for PBR (fullbright color/depth/reflectivity/etc), train ControlNets that hold all those constant and do img2img but let it make up the background.

Though I'm not aware of anyone already doing this, since I think research has moved on to NeRF models that act on 3D scenes directly.


Automatic lifelike shadows appear to be one benefit.


Yea! Shadows on the item or shadows propagated back onto the scene


Looks pretty cool. Can anyone comment on how to hack together the opposite? That is, going from 2D object image to 3D rendering with in-painted background? Or is that not possible right now.


Do you mean transforming a sketch to a 3D-looking image (i.e. not a 3D mesh model)? If so, Stable Diffusion with control net can do that using a good prompt and the SDXL model.

Or do you mean you have an existing photo of something and would like to add a realistic setting? There are a lot of ways to do this, but probably the easiest right now is the Generative Fill feature in Adobe Photoshop (beta).


Not sure if they are planning on releasing this but you can mix an image2Nerf model (threestudio is a good repo for this) for the object 3d model, and an image2depth model like ZoeDepth to generate a 3d background.


2D to 3D networks are pretty amazing right now.

High poly meshes and low poly textures are possible.


I took a quick look at the python flask code and I’m still not sure if there’s a reason for not just using Next’s server side aspects. JS can do every operation I skimmed through

Thoughts?


Yes and no. So I did it because I honestly get annoyed with pixel operations in Node.js. The same thing in Python with the pillow library takes about a quarter of the lines of code you would need with something like sharp in Node.

The other reason I used a Python backend was that I want to extend it to more involved image processing, like producing control net inputs or post processing the end result.

Does that make sense ? Open to better ideas though


yeah it makes sense, but if you’re already offloading everything to replicate.ai it makes less sense

just extra stacks to keep track of and keep up with


yea true. some trade offs. i'm going to look into consolidating everything into just the nextjs piece.


how is the lighting on the model? I assume you can't do anything other than overcast days, because the lighting isn't specified.


So because this is a threejs canvas, we could build an interface to add lights anywhere, to adjust lighting on the model. this is an early version so it's not in there, but really easy to add in code. perhaps that will be the next feature.


You need to use your Stable Diffusion generated background environment to also generate spherical environment illumination maps so you can do image based lighting. That enables the 3D model(s) to correctly match any environment imaginable by projection mapping the environment onto the model; after which the model's materials define if that mapping is reflected, mixed with the underlying surface color, ignored, or some combination percentage. It's how integrated CGI is done in film VFX.
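The core lookup behind image-based lighting can be sketched as two small functions: reflect the view ray about the surface normal, then convert the resulting direction into coordinates on a spherical (equirectangular) environment map. The function names are hypothetical, and the mapping convention is an assumption: direction (0, 0, -1) lands at the center of the map and +y lands on the top edge. Renderers differ on this, so treat it as one possible layout.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def reflect(incident: Vec3, normal: Vec3) -> Vec3:
    """Mirror an incoming direction about a unit-length surface normal."""
    d = sum(i * n for i, n in zip(incident, normal))
    return tuple(i - 2 * d * n for i, n in zip(incident, normal))


def direction_to_equirect_uv(direction: Vec3) -> Tuple[float, float]:
    """Map a direction to (u, v) in [0, 1] on an equirectangular map.

    Assumed layout: (0, 0, -1) maps to the center (0.5, 0.5), u grows
    toward +x, and +y maps to the top row (v = 0).
    """
    x, y, z = direction
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0:
        raise ValueError("direction must be non-zero")
    x, y, z = x / length, y / length, z / length
    y = max(-1.0, min(1.0, y))  # guard asin against rounding error
    u = 0.5 + math.atan2(x, -z) / (2 * math.pi)
    v = 0.5 - math.asin(y) / math.pi
    return (u, v)


# A ray travelling straight down hits a floor facing up and bounces straight up.
bounced = reflect((0.0, -1.0, 0.0), (0.0, 1.0, 0.0))
center_uv = direction_to_equirect_uv((0.0, 0.0, -1.0))
```

The color sampled at `(u, v)` would then be mixed with the surface color according to the material, which is the "reflected, mixed, or ignored" choice described above.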


This sounds pretty cool, do you have a demo or maybe a webm to put in the README.md?


Good point, I haven't put up a web demo you can try without running the repo, but there's a video linked to the image on the very top of the README. here's the video again, i should probably make it more obvious that the image is clickable :)

https://www.youtube.com/watch?v=P3yPn92v3u8


haha a costly demo to run with the hn hug + inference costs..


yeaaa. well i could technically add a settings page to add your Replicate api token and run it on Vercel.


Wow this is insanely cool


thank you!


Need gaussian splatting integrated asap https://huggingface.co/blog/gaussian-splatting


Hah. Yep!



