I'm having a hard time finding a reference to the hardware the inference is run on. The paper mentions training was done on a single A100 GPU so I'm going to assume inference was run on that same platform. The 22fps result is somewhat meaningless without that information.
It does feel like we're getting closer and closer to being able to synthesize novel views in realtime from a small set of images at a framerate and quality high enough for use in AR, which is an interesting concept. I'd love to be able to 'walk around' in my photo library.
> I'd love to be able to 'walk around' in my photo library.
Yes this. I've been dreaming about this since I digitized my childhood photos a few years ago. There should be more than enough photos to reconstruct the entire apartment. Or my grandparents' house. Not sure though what happens if items and furniture move around between shots.
I haven't looked much into this yet and just assumed it will need a bit more time until there's a batteries-included solution I can just download and run without reading ten pages of instructions or buying a GPU cluster.
Once the Gaussian splats are computed (whether via ML or classical optimisation), they're very efficient to render (similar to the 3D meshes used in games). High fps isn't surprising.
Having said that (I have yet to read the paper), "efficiency" probably refers to the first part (calculating the Gaussians in the first place), not rendering.
They’re not “very efficient”. They have a significant amount of overdraw due to their transparency, and they're a lot less efficient if you're only considering a material-less surface representation.
They’re more efficient to capture, however. They’re also more consistent in their render time, but meshes will easily be faster in most scenes, though they scale worse with complexity.
The “efficiency” of splats is more about the material response and capturing complexity there than it is about the geometric representation.
You are correct. I was confusing this technique with Novel View Synthesis through diffusion (recent paper: https://arxiv.org/abs/2408.06157) where inference means generating frames rather than points.
The tech stack in the splat world is still really young. For instance, I was thinking to myself: “Cool, MVSplat is pretty fast. Maybe I’ll use it to get some renderings of a field by my house.”
As far as I can tell, I will need to offer a bunch of photographs with camera pose data added — okay, fair enough, the splat architecture exists to generate splats.
Now, what’s the best way to get camera pose data from arbitrary outdoor photos? … Cue a long wrangle through multiple papers. Maybe, as of today… FAR? (https://crockwell.github.io/far/). That claims up to 80% pose accuracy depending on source data.
I have no idea how MVSplat will deal with 80% accurate camera pose data… And I also don’t understand if I should use a pre-trained model from them or train my own or fine tune one of their models on my photos… This is sounding like a long project.
I don’t say this to complain, only to note where the edges are right now, and to think about the commercialization gap. There are iPhone apps that will get (shitty) splats together for you right now, and there are higher-end commercial projects like Skydio that will work with a drone to fill in a three-dimensional representation of an object (or maybe some land, not sure about the outdoor support), but those are like multiple thousand-dollar-per-month subscriptions + hardware as far as I can tell.
Anyway, interesting. I expect that over the next few years we’ll have push-button stacks based on ‘good enough’ open models, and those will iterate and go through cycles of being upsold / improved / etc. We are still a ways away from a trawl through an iPhone/gphoto library and a “hey, I made some environments for you!” type of feature. But not infinitely far away.
Use COLMAP to generate pose data via structure-from-motion; if you use Nerfstudio to make your splat (with the Splatfacto method), it includes a command that will do the COLMAP alignment for you. This definitely is a weak spot though, and a lot goes wrong in the alignment process unless you have a smooth walkthrough video of your subject with no other moving objects.
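If it helps, the scripted version of that workflow looks roughly like this (a sketch that assumes nerfstudio and COLMAP are installed; the video name and output paths are placeholders):

```python
# Rough sketch of the COLMAP -> Splatfacto workflow via Nerfstudio's CLI.
# Assumes nerfstudio and COLMAP are installed; "walkthrough.mp4" and the
# directories are placeholders.
import subprocess

# Extract frames and run COLMAP structure-from-motion to estimate camera poses
subprocess.run(
    ["ns-process-data", "video",
     "--data", "walkthrough.mp4",
     "--output-dir", "processed/"],
    check=True,
)

# Optimise a Gaussian splat scene from the posed images with the Splatfacto method
subprocess.run(
    ["ns-train", "splatfacto", "--data", "processed/"],
    check=True,
)
```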
On iPhone, Scaniverse (owned by Niantic) produces splats far more accurately than splatting from 2D video/images, because it uses LiDAR to gather the depth information needed for good alignment. I think even on older iPhones without LiDAR, it’s able to estimate depth if the phone has multiple camera lenses. Like ryandamm said above, the main issue seems to be low value/demand for novel technology like this. Most of the use cases I can think of (real estate? shopping?) are usually better served with 2D videos and imagery.
I think the barrier to commercialization is the lack of demonstrated economic value to having push button splats. There's no shortage of small teams wiring together open source splats / NeRF / whatever papers; there's a dearth of valuable, repeatable businesses that could make use of what those small teams are building.
Would it be cool to just have content in 3D? Undoubtedly. But figuring out a use case, that's where people need to be focusing. I think there are a lot of opportunities, but it's still early days -- and not just for the technology.
Yes - agreed. There’s a clear use case for indie content, but tooling around editing/modifying/color/lighting has to improve, and rendering engines or converters need to get better. FWIW it doesn’t seem like a dead-end tech to me though; more likely a gateway tech to cost improvements. We’ll see.
Every Gaussian splat repo I have looked at fails to mention how to use the pre-trained models to "simply" take MY images as input and output a GS. They all talk about evaluation, but the command-line interface requires the eval datasets as input.
Is training/fine-tuning on my data the only way to get the output?
Is there really such thing as a pre-trained model when it comes to Gaussian splatting?
I'm not familiar at all with the topic (nor have I read this particular paper), but I remember that the original 3DGS paper took pride in the fact that this was not “AI” or “deep learning”. There's still a gradient descent process to get the Gaussian splats from the data, but as I understood it, there is no “training on a large dataset, then inference”: building the GS from your data is the “training phase”, and then rendering it is the equivalent of inference.
Maybe I understood it all wrong though, or maybe new variants of Gaussian splatting use a deep learning network in addition to what was done in the original work, so I'll be happy to be corrected/clarified by someone with actual knowledge here.
Basically you train a model for each set of images. The model is a neural network able to render the final image. Different images will require different trained models. Initial Gaussian splatting models took hours to train; last year's models took minutes. I am not sure how long this one takes, but it should be between minutes and hours (and probably closer to minutes than hours).
No, what you're describing is NeRF, the predecessor technology.
The output of Gaussian Splat "training" is a set of 3d gaussians, which can be rendered very quickly. No ML involved at all (only optimisation)!
They usually require running COLMAP first (to get the relative location of the camera between different images), but NVIDIA's InstantSplat doesn't (it does, however, use an ML model instead!)
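To make "a set of 3D Gaussians" concrete, here's roughly what the per-scene parameter set looks like (a sketch with made-up sizes, not any particular repo's layout):

```python
# One scene's worth of Gaussians -- this parameter set *is* the output of
# "training"; no network weights are shared across scenes. Sizes are made up.
import torch

n = 100_000  # real scenes often use hundreds of thousands to millions of splats
params = {
    "positions":  torch.randn(n, 3, requires_grad=True),      # xyz centres
    "log_scales": torch.zeros(n, 3, requires_grad=True),      # ellipsoid axis lengths
    "rotations":  torch.randn(n, 4, requires_grad=True),      # orientation quaternions
    "opacities":  torch.zeros(n, 1, requires_grad=True),      # per-splat alpha (pre-sigmoid)
    "sh_coeffs":  torch.zeros(n, 16, 3, requires_grad=True),  # view-dependent colour (SH)
}
print(sum(p.numel() for p in params.values()), "parameters for this one scene")
```

The "training" is then just gradient descent (Adam) on those tensors through a differentiable rasteriser, driven by a photometric loss against the posed input photos.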
No, Gaussian splats are pretty poor for video games. There’s a significant amount of overdraw and they’re not art directable or dynamic.
Gaussian splats are much better suited for capturing things where you don’t have artists available and don’t have a ton of performance requirements with regards to frame time.
So things like capturing real estate or historical venues, etc.
Isn’t that a “for now” problem rather than something intractable, performance-wise? Presumably HW and SW algorithms will continue to improve. Art directability may be a problem, but it feels like Gaussian splats + genAI models could be a match made in heaven, with the genAI model generating the starting image and splats generating the 3D scene from it.
Sure, given an unlimited amount of time and resources, it’s possible that Gaussian splats could be performant. But that’s just too vague a discussion point to be meaningful.
It’s definitely not in the cards in the near term without a dramatic breakthrough. Splats have been a thing for decades so I’m not holding my breath.
I mean, here it is running at 22fps. In another 5 years it’s reasonable to conservatively expect hardware and software to be 3x as powerful, which gets you to a smooth 60fps.
Well my critique of your comment is just that it’s unbounded. Yes, eventually all compute will get better and we can use once slow technologies. But that’s not a very valuable discussion because nobody is saying it’ll never be useful, just that it isn’t for games today.
It also ignores that everything else will be faster then too, and that you need to target different baselines of hardware.
Either way, 5 years for a 3x improvement seems unrealistic. The last 4 years saw a little over a doubling of performance at the highest end, with a significant increase in power requirements as well, and we’re now hitting realistic power limits.
Taking the 2080 vs 4080 as their respective tiers:
- 153% performance increase
- 50% more power consumption
- 50% price increase
So yes, performance at the high end will increase, but it’s scaling pretty poorly with cost and power. And the lower end isn’t scaling as well.
On the lower/mid end (1060 Ti vs 2060 Super) we saw only a 53% increase in that same time period.
I guess to me that's still just a pessimistic take. Ray tracing was also extremely slow for a long time until Nvidia built dedicated HW to accelerate it. Is there reason to believe that splats are already so well served by generic GPU compute that dedicated HW wouldn't accelerate them in a meaningful way?
Here are splats from 2020 running at 50-60fps [1]. My overall point is that I don't think it's performance that's holding splats back in games, but tooling and whether they save meaningful costs elsewhere in the game development pipeline.
Again, I’m not saying it won’t be possible someday. Any number of things could happen, even though the trajectory doesn’t imply it will be in the next 5 years. All I’m saying is that the question is pointless without bounds.
Otherwise flying cars will also be possible.
Also, your splat demo is running in isolation. Any single system can run by itself at a good clip; that's not indicative of anything when running as part of a larger system. Again, the discussion of performance is pointless without bounds.
This is not true I believe. There are plenty of papers out there revolving around dynamic/animated splat-based models, some using generative models for that aspect too.
There are also some tools out there that let you touch up/rig splat models. Still not near what you can do with meshes but I think fundamentally it’s not impossible.
You can touch up a splat in the same way you can apply gross edits to an image (cropping, color corrections, etc.), but you can't easily make a change like "make this bicycle handlebar more rounded". Ergo it's not art directable.
With regards to dynamism, there are some papers, yes, but with heavy limitations. Rigging is doable but relighting is still hit and miss, while most complex rigs require a mesh underneath to drive the splat surface. There's also the issue of making sure the splats are tight to the surface boundary, which is difficult without significant other input.
Other dynamics like animation operate at a very gross level, but you can’t for example do a voronoi fracture for destruction along a surface easily. And again, even at a large scale motion, you still have the issue of splat isolation and fitting to contend with.
The neural motion papers you mention are interesting, but have a significant overhead currently outside of small use cases.
Meshes are much more straightforward, and with advancements in neural materials and micropolygons (Nanite etc.) it's really difficult to make a splat scene that isn't first represented as a mesh have the quality and performance needed. And if you're creating splats from a captured real-world scene, they need significant cleanup first.
The data is definitely an issue, but they do make for a fairly convenient alternative to something like Matterport, where you need to rent their cameras, etc.
Though I think Matterport will just start using them, since the other half of their product is the user experience on the web.
Will they though? I saw a SIGGRAPH demo of a Matterport-like apartment preview using Gaussian splatting. It downloaded 1.6 GB (!) for a single apartment. Checking out a current Matterport demo on their site for a similar-sized space, it was 60 MB, or about 26x smaller.
Tbh most splat data today is not optimally stored. There's a lot that could be done for streaming, data reduction and segmentation. So I think it's definitely both possible and easy to cut that data size in half, if not more.
They’ll likely never be smaller than a mesh and texture though, because the data frequency will be higher. A wall can be two triangles and a texture. The same representation as splats will have to be many hundreds of points, roughly on the order of the pixel count of the lowest resolvable version of that texture.
So I agree they’re far from optimal for data size. But they greatly reduce the complexity of data capture and representation.
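Quick back-of-envelope to illustrate the size gap (my own numbers and assumptions, not anything measured):

```python
# Back-of-envelope storage comparison for a flat wall. The per-splat layout and
# splat density are assumptions, not figures from the thread.
bytes_per_splat = 62 * 4                 # ~62 float32 params/splat in the common uncompressed PLY layout
splats_for_wall = 512 * 512              # assume roughly one splat per texel of a modest texture
splat_wall_mb = bytes_per_splat * splats_for_wall / 1e6

mesh_vertex_bytes = 4 * (3 + 2) * 4      # 2 triangles: 4 verts, position + UV, float32
texture_bytes = (512 * 512) // 2         # 512x512 texture, DXT1-compressed (0.5 B/px)
mesh_wall_mb = (mesh_vertex_bytes + texture_bytes) / 1e6

print(f"splat wall ≈ {splat_wall_mb:.0f} MB, mesh wall ≈ {mesh_wall_mb:.2f} MB")
# -> roughly 65 MB vs 0.13 MB for the same flat surface
```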
Have you watched the basketball games at the Olympics? Every once in a while, they showed a replay of a key point with an effect of the camera moving between two views in the middle of the shot.
It was likely not GS, since there were tons of artifacts that didn't look like the ones GS produces, but they could have used it for that kind of thing.
For instance, with some kind of 4D GS we could even remap the camera view entirely and have a virtual camera allowing us to see the shot from the eyes of Steph Curry, with Batum and Fournier double-teaming him.
Good question. One thing I know they are good for is 3D photos, because they solve a fundamental issue with the current tech: IPD.
The current tech (Apple Vision Pro included) uses two photos: one per eye. If the photos were taken from a distance that matches the distance between your eyes, then the effect is convincing. Otherwise, it looks a bit off.
The other problem is that a big part of the 3D perception comes from parallax: how the image changes with head motions (even small motions).
Techniques that are not limited to two fixed images, but instead allow us to create new views for small motions, are great for much more impressive 3D photos.
With more input photos you get a “walkable photo”: a photo that you can take a few steps in, say if you are wearing a VR headset.
I’m sure 3D Gaussian splatting is good for other things too, given the excitement around them. Backgrounds in movies maybe?
Basically when you don't want to spend time pre-processing, e.g. through traditional photogrammetry. So near-real-time events, or where there are huge amounts of point-cloud capture and comparatively little visualisation.
Edit: others are mentioning real estate; I'd think that will prefer some pre-processing, but YMMV.
First of all, most GS methods take posed images as input, so you need to run a traditional photogrammetry pipeline (COLMAP) anyway.
The purpose of GS is that the result is far beyond anything that traditional photogrammetry (dense mesh reconstruction) can manage, especially when it comes to “weird” stuff (semi-transparent objects).
Volumetric live action performance capture. Basically a video you can walk around in. Currently requires a large synchronized camera array. Plays back on most mobile devices. Several major industry efforts in this space ongoing.
Gaussian splatting transforms images into a point cloud. GPUs can render these points, but it is a very slow process. You need to transform the point cloud into meshes. So basically it is the initial process to capture environments before converting them to the 3D meshes that GPUs can use for anything you want. It is much cheaper to use pictures to get a 3D representation of an object or environment than buying professional equipment.
> Gaussian splatting transforms images into a point cloud.
Not exactly. The "splats" are both spread out in space (big ellipsoids), partially transparent (what you end up seeing is the composite of all the splats you can see in a given direction) AND view-dependent (they render differently depending on the direction you are looking).
Also - there's not a simple spatial relationship between splats and solid objects. The resulting surfaces are a kind of optical illusion based on all the splats you're seeing in a specific direction. (some methods have attempted to lock splats more closely to the surfaces they are meant to represent but I don't know what the tradeoffs are).
Generating a mesh from splats is possible but then you've thrown away everything that makes a splat special. You're back to shitty photogrammetry. All the clever stuff (which is a kind of radiance capture) is gone.
Splats are a lot faster to render than NeRFs, which is their appeal. But they're heavier than triangles due to having to be sorted every frame (because transparent objects don't composite correctly without depth sorting).
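To illustrate the sorting point, here's a toy per-pixel composite (this skips the actual 2D projection and falloff of each Gaussian and only shows why the blend order, and therefore a per-frame sort, matters):

```python
# Minimal sketch of why splats need a per-frame depth sort: view-space depth
# changes with the camera, and alpha blending is order-dependent. The alphas
# here stand in for each splat's effective opacity at this pixel.
import numpy as np

def composite_pixel(colors, alphas, depths):
    """Front-to-back 'over' compositing of the splats covering one pixel."""
    order = np.argsort(depths)            # nearest splat first; must be redone per frame
    out = np.zeros(3)
    transmittance = 1.0                   # how much light still passes through
    for i in order:
        out += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
        if transmittance < 1e-3:          # early exit once the pixel is effectively opaque
            break
    return out

# Three overlapping splats: red in front, green behind, blue furthest away
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
alphas = np.array([0.5, 0.5, 0.5])
depths = np.array([1.0, 2.0, 3.0])        # view-space depths, recomputed every frame
print(composite_pixel(colors, alphas, depths))   # red dominates: [0.5, 0.25, 0.125]
```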
Minor nit — in what way do splats render differently depending on direction of looking? To my mind these are probabilistic ellipsoids in 3D (or 4D for motion splats) space, and so while any novel view will see a slightly different shape, that’s an artifact of the view changing, not the splat. Do I understand it (or you) correctly?
Basically, for each Gaussian there is a set of spherical harmonic (SH) coefficients, and those are used to calculate what color should be rendered depending on the viewing angle of the camera. The SH coeffs are optimized through gradient descent just like the other parameters, including position and shape.
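A toy version of that evaluation, using only the degree-0 and degree-1 SH bands (real 3DGS usually goes up to degree 3, and these coefficient values are made up):

```python
# Toy evaluation of view-dependent colour from spherical harmonic coefficients,
# restricted to the degree-0 and degree-1 bands. Coefficient values are invented.
import numpy as np

C0 = 0.28209479177387814          # real SH basis constant for l=0
C1 = 0.4886025119029199           # real SH basis constant for l=1

def sh_color(sh, view_dir):
    """sh: (4, 3) coefficients (1 DC + 3 linear terms, RGB).
    view_dir: unit vector from the camera toward the Gaussian."""
    x, y, z = view_dir
    color = C0 * sh[0]                                   # view-independent base colour
    color += C1 * (-y * sh[1] + z * sh[2] - x * sh[3])   # linear (degree-1) variation
    return np.clip(color + 0.5, 0.0, 1.0)                # 3DGS-style offset into [0, 1]

sh = np.array([[1.0, 0.2, 0.2],     # DC term: mostly red
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0],
               [0.3, 0.3, 0.3]])    # brightens when viewed from the -x side

print(sh_color(sh, np.array([1.0, 0.0, 0.0])))   # viewed from +x
print(sh_color(sh, np.array([-1.0, 0.0, 0.0])))  # viewed from -x: noticeably brighter
```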
Could be very useful for prototyping camera moves and lighting for film / commercial shoots on location. You might not even need to send a scout, just get a few pictures and be able to thumbnail a whole scene.
I could also see a market for people who want to recreate virtual environments from old photos.
Also, load the model on a single-lens 360 camera and infer stereoscopic output.
Photography. A small cheap camera array could produce higher resolution, alternate angles, and arbitrary lens parameters that would otherwise require expensive or impossible lenses. Then you can render an array of angles for holographic displays.
One application I can think of is Google Street View. Gaussian splatting can potentially "smoothen" the transition between the images and make it look more realistic.
I started and ran a real estate photography platform from 2004-2018. We started r&d on this in ~2016 when consumer VR first came out. At the time we used photogrammetry and it was “dreadful” to try to capture due to mirrors, glass, etc.
So I have been following GS tech for a while. I’ve not yet seen anything (open source / papers) that quite gets there yet. I do think it will.
In my opinion, there are two useful things GS can bring to this industry.
The first is the ability to use photo capture to re-render a high-production-quality video, similar to what people do with Luma AI today. While this is a really cool capability, it’s also not really that hard to do anymore with drones and gimbals. So the experience of creating the same thing via GS has to be better and easier, and it’s not clear when that will happen, given how painful the capture side is. You really need good real-time capture feedback to make sure you have good coverage. Finding out there’s a hole once you’re off location is a deal breaker.
The second is to create VR-capable experiences. I think the first really useful thing for consumers will be being able to walk around in a small three- or four-foot area and get a stereo sense of what it’s like to be there. This is an amazing consumer experience. But the practicality of scaling this depends on VR hardware and adoption, and that hasn’t yet become commonplace enough to make consumer use “adjacent possible” for broad deployment.
I could see it being used on super high end to start out.
I still wonder this myself, but the most obvious area that comes to mind is real estate virtual tours. Once a splat can render in the browser at high fps, I see this replacing most other technologies currently being used.
The indoor example with the staircase and railing was really surprising - there's only one view of much of what's behind the doorframe and it still seems to reconstruct a pretty good 3d scene there.