I'm having a hard time finding a reference to the hardware the inference is run on. The paper mentions training was done on a single A100 GPU so I'm going to assume inference was run on that same platform. The 22fps result is somewhat meaningless without that information.
It does feel like we're getting closer and closer to being able to synthesize novel views in realtime from a small set of images at a framerate and quality high enough for use in AR, which is an interesting concept. I'd love to be able to 'walk around' in my photo library.
> I'd love to be able to 'walk around' in my photo library.
Yes this. I've been dreaming about this since I digitized my childhood photos a few years ago. There should be more than enough photos to reconstruct the entire apartment. Or my grandparents' house. Not sure though what happens if items and furniture move around between shots.
I haven't looked much into this yet and just assumed it will need a bit more time until there's a batteries-included solution I can just download and run without reading ten pages of instructions or buying a GPU cluster.
Once the Gaussian splats are computed (whether via ML or classical optimisation), they're very efficient to render (similar to the 3D meshes used in games). High fps isn't surprising.
Having said that (I have yet to read the paper), "efficiency" probably refers to the first part (calculating the Gaussians in the first place), not rendering.
They’re not “very efficient”. They have a significant amount of overdraw due to their transparency, and they're a lot less efficient if you're only considering a material-less surface representation.
They’re more efficient to capture, however. They’re also more consistent in their render time, but meshes will easily be faster in most scenes, though they scale worse with complexity.
The “efficiency” of splats is more about the material response and capturing complexity there than it is about the geometric representation.
You are correct. I was confusing this technique with Novel View Synthesis through diffusion (recent paper: https://arxiv.org/abs/2408.06157) where inference means generating frames rather than points.
The tech stack in the splat world is still really young. For instance, I was thinking to myself: “Cool, MVSplat is pretty fast. Maybe I’ll use it to get some renderings of a field by my house.”
As far as I can tell, I will need to offer a bunch of photographs with camera pose data added — okay, fair enough, the splat architecture exists to generate splats.
Now, what’s the best way to get camera pose data from arbitrary outdoor photos? … Cue a long wrangle through multiple papers. Maybe, as of today… FAR? (https://crockwell.github.io/far/). That claims up to 80% pose accuracy depending on source data.
I have no idea how MVSplat will deal with 80% accurate camera pose data… And I also don’t understand if I should use a pre-trained model from them or train my own or fine tune one of their models on my photos… This is sounding like a long project.
I don’t say this to complain, only to note where the edges are right now, and to think about the commercialization gap. There are iPhone apps that will get (shitty) splats together for you right now, and there are higher-end commercial projects like Skydio that will work with a drone to fill in a three-dimensional representation of an object (or maybe some land, not sure about the outdoor support), but those are like multiple thousand-dollar-per-month subscriptions + hardware as far as I can tell.
Anyway, interesting. I expect that over the next few years we’ll have push-button stacks based on ‘good enough’ open models, and those will iterate and go through cycles of being upsold / improved / etc. We are still a ways away from a trawl through an iPhone/gphoto library and a “hey, I made some environments for you!” type of feature. But not infinitely far away.
Use COLMAP to generate pose data via structure-from-motion; if you use Nerfstudio to make your splat (with the Splatfacto method), it includes a command that will do the COLMAP alignment for you. This definitely is a weak spot though, and a lot goes wrong in the alignment process unless you have a smooth walkthrough video of your subject with no other moving objects.
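If it helps, the scripted version of that workflow looks roughly like this (a sketch that assumes nerfstudio and COLMAP are installed; the video name and output paths are placeholders):

```python
# Rough sketch of the COLMAP -> Splatfacto workflow via Nerfstudio's CLI.
# Assumes nerfstudio and COLMAP are installed; "walkthrough.mp4" and the
# directories are placeholders.
import subprocess

# Extract frames and run COLMAP structure-from-motion to estimate camera poses
subprocess.run(
    ["ns-process-data", "video",
     "--data", "walkthrough.mp4",
     "--output-dir", "processed/"],
    check=True,
)

# Optimise a Gaussian splat scene from the posed images with the Splatfacto method
subprocess.run(
    ["ns-train", "splatfacto", "--data", "processed/"],
    check=True,
)
```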
On iPhone, Scaniverse (owned by Niantic) produces splats far more accurately than splatting from 2D video/images, because it uses LiDAR to gather the depth information needed for good alignment. I think even on older iPhones without LiDAR, it’s able to estimate depth if the phone has multiple camera lenses. Like ryandamm said above, the main issue seems to be low value/demand for novel technology like this. Most of the use cases I can think of (real estate? shopping?) are usually better served with 2D videos and imagery.
I think the barrier to commercialization is the lack of demonstrated economic value to having push button splats. There's no shortage of small teams wiring together open source splats / NeRF / whatever papers; there's a dearth of valuable, repeatable businesses that could make use of what those small teams are building.
Would it be cool to just have content in 3D? Undoubtedly. But figuring out a use case, that's where people need to be focusing. I think there are a lot of opportunities, but it's still early days -- and not just for the technology.
Yes - agreed. There’s a clear use case for indie content, but tooling around editing/modifying/color/lighting has to improve, and rendering engines or converters need to get better. FWIW it doesn’t seem like a dead-end tech to me though; more likely a gateway tech to cost improvements. We’ll see.
Every Gaussian splat repo I have looked at fails to mention how to use the pre-trained models to "simply" take MY images as input and output a GS. They all talk about evaluation, but the command-line interface requires the eval datasets as input.
Is training/fine-tuning on my data the only way to get the output?
Is there really such thing as a pre-trained model when it comes to Gaussian splatting?
I'm not familiar at all with the topic (nor have I read this particular paper), but I remember that the original 3DGS paper took pride in the fact that this was not “AI” or “deep learning”. There's still a gradient descent process to get the Gaussian splats from the data, but as I understood it, there is no “training on a large dataset, then inference”: building the GS from your data is the “training phase”, and then rendering it is the equivalent of inference.
Maybe I understood it all wrong though, or maybe new variants of Gaussian splatting use a deep learning network in addition to what was done in the original work, so I'll be happy to be corrected/clarified by someone with actual knowledge here.
Basically you train a model for each set of images. The model is a neural network able to render the final image. Different images will require different trained models. Initial Gaussian splatting models took hours to train; last year's models took minutes. I am not sure how long this one takes, but it should be between minutes and hours (and probably closer to minutes than hours).
No, what you're describing is NeRF, the predecessor technology.
The output of Gaussian Splat "training" is a set of 3d gaussians, which can be rendered very quickly. No ML involved at all (only optimisation)!
They usually require running COLMAP first (to get the relative location of the camera between different images), but NVIDIA's InstantSplat doesn't (it does, however, use an ML model instead!)
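To make "a set of 3D Gaussians" concrete, here's roughly what the per-scene parameter set looks like (a sketch with made-up sizes, not any particular repo's layout):

```python
# One scene's worth of Gaussians -- this parameter set *is* the output of
# "training"; no network weights are shared across scenes. Sizes are made up.
import torch

n = 100_000  # real scenes often use hundreds of thousands to millions of splats
params = {
    "positions":  torch.randn(n, 3, requires_grad=True),      # xyz centres
    "log_scales": torch.zeros(n, 3, requires_grad=True),      # ellipsoid axis lengths
    "rotations":  torch.randn(n, 4, requires_grad=True),      # orientation quaternions
    "opacities":  torch.zeros(n, 1, requires_grad=True),      # per-splat alpha (pre-sigmoid)
    "sh_coeffs":  torch.zeros(n, 16, 3, requires_grad=True),  # view-dependent colour (SH)
}
print(sum(p.numel() for p in params.values()), "parameters for this one scene")
```

The "training" is then just gradient descent (Adam) on those tensors through a differentiable rasteriser, driven by a photometric loss against the posed input photos.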
No, Gaussian splats are pretty poor for video games. There’s a significant amount of overdraw and they’re not art directable or dynamic.
Gaussian splats are much better suited for capturing things where you don’t have artists available and don’t have a ton of performance requirements with regards to frame time.
So things like capturing real estate or historical venues, etc.
Isn’t that a “for now” problem rather than something intractable, performance-wise? Presumably HW and SW algorithms will continue to improve. Art directability may be a problem, but it feels like Gaussian splats + genAI models could be a match made in heaven, with the genAI model generating the starting image and splats generating the 3D scene from it.
Sure, given an unlimited amount of time and resources, it’s possible that Gaussian splats could be performant. But that’s just too vague a discussion point to be meaningful.
It’s definitely not in the cards in the near term without a dramatic breakthrough. Splats have been a thing for decades so I’m not holding my breath.
I mean, here it is running at 22fps. In another 5 years it’s reasonable to conservatively expect hardware and software to be 3x as powerful, which gets you to a smooth 60fps.
Well my critique of your comment is just that it’s unbounded. Yes, eventually all compute will get better and we can use once slow technologies. But that’s not a very valuable discussion because nobody is saying it’ll never be useful, just that it isn’t for games today.
It also ignores that everything else will be faster then too, and that you need to target different baselines of hardware.
Either way, 5 years for a 3x improvement seems unrealistic. The last 4 years saw a little over a doubling of performance at the highest end, with a significant increase in power requirements as well, and we’re now hitting realistic power limits.
Taking the 2080 vs 4080 as their respective tiers:
- 153% performance increase
- 50% more power consumption
- 50% price increase
So yes, performance at the high end will increase, but it’s scaling pretty poorly with cost and power. And the lower end isn’t scaling as well.
On the lower/mid end (1060 Ti vs 2060 Super) we saw only a 53% increase in that same time period.
I guess to me that's still just a pessimistic take. Ray tracing was also extremely slow for a long time until Nvidia built dedicated HW to accelerate it. Is there reason to believe that splats are already so well served by generic GPU compute that dedicated HW wouldn't accelerate them in a meaningful way?
Here are splats from 2020 running at 50-60fps [1]. My overall point is that I don't think it's performance that's holding splats back in games, but tooling and whether they save meaningful costs elsewhere in the game development pipeline.
Again, I’m not saying it won’t be possible someday. Any number of things could happen, even though the trajectory doesn’t imply it will be in the next 5 years. All I’m saying is that the question is pointless without bounds.
Otherwise flying cars will also be possible.
Also, your splat demo is running in isolation. Any single system can run by itself at a good clip; that's not indicative of anything when running as part of a larger system. Again, the discussion of performance is pointless without bounds.
This is not true I believe. There are plenty of papers out there revolving around dynamic/animated splat-based models, some using generative models for that aspect too.
There are also some tools out there that let you touch up/rig splat models. Still not near what you can do with meshes but I think fundamentally it’s not impossible.
You can touch up a splat in the same way you can apply gross edits to an image (cropping, color corrections, etc.), but you can't easily make a change like "make this bicycle handlebar more rounded". Ergo it's not art directable.
With regards to dynamism, there are some papers, yes, but with heavy limitations. Rigging is doable but relighting is still hit and miss, while most complex rigs require a mesh underneath to drive the splat surface. There's also the issue of making sure the splats are tight to the surface boundary, which is difficult without significant other input.
Other dynamics like animation operate at a very gross level, but you can’t for example do a voronoi fracture for destruction along a surface easily. And again, even at a large scale motion, you still have the issue of splat isolation and fitting to contend with.
The neural motion papers you mention are interesting, but have a significant overhead currently outside of small use cases.
Meshes are much more straightforward, and with advancements in neural materials and micropolygons (Nanite etc.) it's really difficult to make a splat scene that isn't first represented as a mesh have the quality and performance needed. And if you're creating splats from a captured real-world scene, they need significant cleanup first.
The data is definitely an issue, but they do make for a fairly convenient alternative to something like Matterport, where you need to rent their cameras, etc.
Though I think Matterport will just start using them, since the other half of their product is the user experience on the web.
Will they though? I saw a SIGGRAPH demo of a Matterport-like apartment preview using Gaussian splatting. It downloaded 1.6 GB (!) for a single apartment. Checking out a current Matterport demo on their site for a similar-sized space, it was 60 MB, or about 26x smaller.
Tbh most splat data today is not optimally stored. There's a lot that could be done for streaming, data reduction and segmentation. So I think it's definitely both possible and easy to cut that data size in half, if not more.
They’ll likely never be smaller than a mesh and texture though, because the data frequency will be higher. A wall can be two triangles and a texture. The same representation as splats will have to be many hundreds of points, roughly on the order of the pixel count of the lowest resolvable version of that texture.
So I agree they’re far from optimal for data size. But they greatly reduce the complexity of data capture and representation.
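Quick back-of-envelope to illustrate the size gap (my own numbers and assumptions, not anything measured):

```python
# Back-of-envelope storage comparison for a flat wall. The per-splat layout and
# splat density are assumptions, not figures from the thread.
bytes_per_splat = 62 * 4                 # ~62 float32 params/splat in the common uncompressed PLY layout
splats_for_wall = 512 * 512              # assume roughly one splat per texel of a modest texture
splat_wall_mb = bytes_per_splat * splats_for_wall / 1e6

mesh_vertex_bytes = 4 * (3 + 2) * 4      # 2 triangles: 4 verts, position + UV, float32
texture_bytes = (512 * 512) // 2         # 512x512 texture, DXT1-compressed (0.5 B/px)
mesh_wall_mb = (mesh_vertex_bytes + texture_bytes) / 1e6

print(f"splat wall ≈ {splat_wall_mb:.0f} MB, mesh wall ≈ {mesh_wall_mb:.2f} MB")
# -> roughly 65 MB vs 0.13 MB for the same flat surface
```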
Have you watched the basketball games at the Olympics? Every once in a while, they showed a replay of a key point with an effect of the camera moving between two views in the middle of the shot.
It was likely not GS, since there were tons of artifacts that didn't look like the ones GS produces, but they could have used it for that kind of thing.
For instance, with some kind of 4D GS we could even remap the camera view entirely and have a virtual camera allowing us to see the shot from the eyes of Steph Curry, with Batum and Fournier double-teaming him.
Good question. One thing I know they are good for is 3D photos, because they solve a fundamental issue with the current tech: IPD.
The current tech (Apple Vision Pro included) uses two photos: one per eye. If the photos were taken from a distance that matches the distance between your eyes, then the effect is convincing. Otherwise, it looks a bit off.
The other problem is that a big part of the 3D perception comes from parallax: how the image changes with head motions (even small motions).
Techniques that are not limited to two fixed images, but instead allow us to create new views for small motions, are great for much more impressive 3D photos.
With more input photos you get a “walkable photo”: a photo that you can take a few steps in, say if you are wearing a VR headset.
I’m sure 3D Gaussian splatting is good for other things too, given the excitement around them. Backgrounds in movies maybe?
Basically when you don't want to spend time pre-processing, e.g. through traditional photogrammetry. So near-real-time events, or where there are huge amounts of point-cloud capture and comparatively little visualisation.
Edit: others are mentioning real estate; I'd think that will prefer some pre-processing, but YMMV.
First of all, most GS methods take posed images as input, so you need to run a traditional photogrammetry pipeline (COLMAP) anyway.
The purpose of GS is that the result is far beyond anything that traditional photogrammetry (dense mesh reconstruction) can manage, especially when it comes to “weird” stuff (semi-transparent objects).
Volumetric live action performance capture. Basically a video you can walk around in. Currently requires a large synchronized camera array. Plays back on most mobile devices. Several major industry efforts in this space ongoing.
Gaussian splatting transforms images into a point cloud. GPUs can render these points, but it is a very slow process. You need to transform the point cloud into meshes. So basically it is the initial process to capture environments before converting them to the 3D meshes that GPUs can use for anything you want. It is much cheaper to use pictures to get a 3D representation of an object or environment than buying professional equipment.
> Gaussian splatting transforms images into a point cloud.
Not exactly. The "splats" are both spread out in space (big ellipsoids), partially transparent (what you end up seeing is the composite of all the splats you can see in a given direction) AND view-dependent (they render differently depending on the direction you are looking).
Also - there's not a simple spatial relationship between splats and solid objects. The resulting surfaces are a kind of optical illusion based on all the splats you're seeing in a specific direction. (some methods have attempted to lock splats more closely to the surfaces they are meant to represent but I don't know what the tradeoffs are).
Generating a mesh from splats is possible but then you've thrown away everything that makes a splat special. You're back to shitty photogrammetry. All the clever stuff (which is a kind of radiance capture) is gone.
Splats are a lot faster to render than NeRFs, which is their appeal. But they're heavier than triangles due to having to be sorted every frame (because transparent objects don't composite correctly without depth sorting).
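To illustrate the sorting point, here's a toy per-pixel composite (this skips the actual 2D projection and falloff of each Gaussian and only shows why the blend order, and therefore a per-frame sort, matters):

```python
# Minimal sketch of why splats need a per-frame depth sort: view-space depth
# changes with the camera, and alpha blending is order-dependent. The alphas
# here stand in for each splat's effective opacity at this pixel.
import numpy as np

def composite_pixel(colors, alphas, depths):
    """Front-to-back 'over' compositing of the splats covering one pixel."""
    order = np.argsort(depths)            # nearest splat first; must be redone per frame
    out = np.zeros(3)
    transmittance = 1.0                   # how much light still passes through
    for i in order:
        out += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
        if transmittance < 1e-3:          # early exit once the pixel is effectively opaque
            break
    return out

# Three overlapping splats: red in front, green behind, blue furthest away
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
alphas = np.array([0.5, 0.5, 0.5])
depths = np.array([1.0, 2.0, 3.0])        # view-space depths, recomputed every frame
print(composite_pixel(colors, alphas, depths))   # red dominates: [0.5, 0.25, 0.125]
```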
Minor nit — in what way do splats render differently depending on direction of looking? To my mind these are probabilistic ellipsoids in 3D (or 4D for motion splats) space, and so while any novel view will see a slightly different shape, that’s an artifact of the view changing, not the splat. Do I understand it (or you) correctly?
Basically, for each Gaussian there is a set of spherical harmonic (SH) coefficients, and those are used to calculate what color should be rendered depending on the viewing angle of the camera. The SH coeffs are optimized through gradient descent just like the other parameters, including position and shape.
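A toy version of that evaluation, using only the degree-0 and degree-1 SH bands (real 3DGS usually goes up to degree 3, and these coefficient values are made up):

```python
# Toy evaluation of view-dependent colour from spherical harmonic coefficients,
# restricted to the degree-0 and degree-1 bands. Coefficient values are invented.
import numpy as np

C0 = 0.28209479177387814          # real SH basis constant for l=0
C1 = 0.4886025119029199           # real SH basis constant for l=1

def sh_color(sh, view_dir):
    """sh: (4, 3) coefficients (1 DC + 3 linear terms, RGB).
    view_dir: unit vector from the camera toward the Gaussian."""
    x, y, z = view_dir
    color = C0 * sh[0]                                   # view-independent base colour
    color += C1 * (-y * sh[1] + z * sh[2] - x * sh[3])   # linear (degree-1) variation
    return np.clip(color + 0.5, 0.0, 1.0)                # 3DGS-style offset into [0, 1]

sh = np.array([[1.0, 0.2, 0.2],     # DC term: mostly red
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0],
               [0.3, 0.3, 0.3]])    # brightens when viewed from the -x side

print(sh_color(sh, np.array([1.0, 0.0, 0.0])))   # viewed from +x
print(sh_color(sh, np.array([-1.0, 0.0, 0.0])))  # viewed from -x: noticeably brighter
```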
Could be very useful for prototyping camera moves and lighting for film / commercial shoots on location. You might not even need to send a scout, just get a few pictures and be able to thumbnail a whole scene.
I could also see a market for people who want to recreate virtual environments from old photos.
Also, load the model on a single-lens 360 camera and infer stereoscopic output.
Photography. A small cheap camera array could produce higher resolution, alternate angles, and arbitrary lens parameters that would otherwise require expensive or impossible lenses. Then you can render an array of angles for holographic displays.
One application I can think of is Google Street View. Gaussian splatting can potentially "smoothen" the transition between the images and make it look more realistic.
I started and ran a real estate photography platform from 2004-2018. We started r&d on this in ~2016 when consumer VR first came out. At the time we used photogrammetry and it was “dreadful” to try to capture due to mirrors, glass, etc.
So I have been following GS tech for a while. I’ve not yet seen anything (open source / papers) that quite gets there yet. I do think it will.
In my opinion, there are two useful things GS can bring to this industry.
The first is the ability to use photo capture to re-render a high-production-quality video, similar to what people do with Luma AI today. While this is a really cool capability, it’s also not really that hard to do anymore with drones and gimbals. So the experience of creating the same thing via GS has to be better and easier, and it’s not clear when that will happen, given how painful the capture side is. You really need good real-time capture feedback to make sure you have good coverage. Finding out there’s a hole once you’re off location is a deal breaker.
The second is to create VR-capable experiences. I think the first really useful thing for consumers will be being able to walk around in a small three- or four-foot area and get a stereo sense of what it’s like to be there. This is an amazing consumer experience. But the practicality of scaling this depends on VR hardware and adoption, and that hasn’t yet become commonplace enough to make consumer use “adjacent possible” for broad deployment.
I could see it being used on super high end to start out.
I still wonder this myself, but the most obvious area that comes to mind is real estate virtual tours. Once a splat can render in the browser at high fps, I see this replacing most other technologies currently being used.
The indoor example with the staircase and railing was really surprising - there's only one view of much of what's behind the doorframe and it still seems to reconstruct a pretty good 3d scene there.