
I just finished creating my first Gaussian splat of a macro: https://superspl.at/view?id=cf6ac78e

Isn't "fund people" just hiring without the extra steps?


If you're constraining them to work on specific things, then yes. Otherwise no :)


Funding means investing in their company. Hiring means paying them to work for your company.


Isn't that what VCs in general are doing? Hiring for more money, with more expected gains from you, with a different kind of legal arrangement, but still hiring nevertheless.


Funding people means you trust the people are so good they will push any idea to success. Funding an idea means you trust the idea is so good it will push any people to success.

Funding people means having a lot of trust in them. What's unsaid is if the investor believes coloring outside of the lines to make everyone more money is a breach of that trust, or just the normal cost/risk of business.


No, words have meaning.

If I order fast food, I'm not hiring the worker just because I pay them.

Hiring means one thing, investing means something else.


VCs can force the company to pivot or double down on things not (yet) working. You don't tell the fast food worker to build houses.


Wait. Were you investing in that burger?


> Unfortunately, Shader programs are currently restricted to the Logical model, which disallows all of this.

That is not entirely true: you can use physical pointers with the "Buffer device address" feature (https://docs.vulkan.org/samples/latest/samples/extensions/bu...). It started as an extension but is now part of core Vulkan, and it is widely available on most GPUs.

This only works for buffers though, not for images or local arrays.
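
A minimal C++ sketch of the host side (assuming the buffer was created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and its memory allocated with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT):

    #include <vulkan/vulkan.h>

    // Fetch the raw 64-bit GPU address of a buffer (core in Vulkan 1.2).
    VkDeviceAddress getBufferAddress(VkDevice device, VkBuffer buffer) {
        VkBufferDeviceAddressInfo info{};
        info.sType  = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
        info.buffer = buffer;
        return vkGetBufferDeviceAddress(device, &info);
    }

The returned address can then be handed to a shader (e.g. via a push constant) and dereferenced there with GL_EXT_buffer_reference.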


Not on mobile Android powered ones.


It should be; it is part of 1.2 (https://vulkan.gpuinfo.org/listfeaturescore12.php: the first entry, bufferDeviceAddress, supported by 97.89%).

Or did you mean some specific feature? I haven't used it on mobile.
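
For reference, a minimal sketch of querying what the driver reports:

    #include <vulkan/vulkan.h>

    // Check whether the driver advertises the core 1.2 feature.
    bool hasBufferDeviceAddress(VkPhysicalDevice gpu) {
        VkPhysicalDeviceVulkan12Features features12{};
        features12.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_2_FEATURES;
        VkPhysicalDeviceFeatures2 features2{};
        features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
        features2.pNext = &features12;
        vkGetPhysicalDeviceFeatures2(gpu, &features2);
        return features12.bufferDeviceAddress == VK_TRUE;
    }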


Supported as in it actually works, or supported as in it gets listed as something the driver knows about, but is full of issues when it gets used?

There is a reason there were talks at Vulkanised 2025 about improving the state of Vulkan affairs on Android.


How do you do the rendering? Is it sorted (radix-sorted?) instances? Do you amortize the sorting over a couple of frames, or use some bin sorting? Are you happy with the performance?


Yes, Spark does instanced rendering of quads, one covering each Gaussian splat. The sorting is done by 1) calculating sort distance for every splat on the GPU, 2) reading it back to the CPU as float16s, 3) doing a 1-pass bucket sort to get an ordering of all the splats from back to front.

On most newer devices the sorting can happen pretty much every frame with approx 1 frame latency, and runs in parallel on a Web Worker. So the sorting itself has minimal performance impact, and because of that Spark can do fully dynamic 3DGS where every splat can move independently each frame!

On some older Android devices it can be a few frames worth of latency, and in that case you could say it's amortized over a few frames. But since it all happens in parallel there's no real impact to the overall rendering performance. I expect for most devices the sorting in Spark is mostly a solved problem, especially with increasing memory bandwidth and shared CPU-GPU memory.
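
To illustrate step 3 (a sketch, not Spark's actual code): a 1-pass bucket/counting sort over the float16 keys, assuming non-negative distances so the raw float16 bit pattern is monotone and usable directly as a 16-bit key.

    #include <cstdint>
    #include <vector>

    // Returns splat indices ordered back to front (largest key first).
    std::vector<uint32_t> sortBackToFront(const std::vector<uint16_t>& keys) {
        std::vector<uint32_t> counts(1 << 16, 0);
        for (uint16_t k : keys) counts[k]++;            // histogram pass

        uint32_t offset = 0;                            // prefix sum, back to front
        for (int k = (1 << 16) - 1; k >= 0; --k) {
            uint32_t c = counts[k];
            counts[k] = offset;
            offset += c;
        }
        std::vector<uint32_t> order(keys.size());
        for (uint32_t i = 0; i < keys.size(); ++i)      // scatter pass
            order[counts[keys[i]]++] = i;
        return order;
    }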


If you say 1-pass bucket sorting... I assume you do sort the buckets as well?

I've implemented a radix sort on the GPU to sort the splats (every frame)... and I'm not quite happy with the performance yet. A radix sort (+ prefix scan) is quite involved, with lots of dedicated hierarchical compute shaders... I might have to get back to it and tune it.

I might switch to float16s as well, but I'm a bit hesitant, as 1 million+ splats may exceed the precision of halfs.
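
For what it's worth, here is the 2-pass idea sketched on the CPU (not my GPU code): radix sort a 16-bit key in two base-256 passes. A GPU version replaces the serial prefix sum with the hierarchical scans mentioned above.

    #include <cstdint>
    #include <vector>

    // One stable base-256 pass over the given byte of the key.
    void radixPass(std::vector<uint32_t>& idx, std::vector<uint32_t>& tmp,
                   const std::vector<uint16_t>& keys, int shift) {
        uint32_t counts[257] = {0};
        for (uint32_t i : idx) counts[((keys[i] >> shift) & 0xFF) + 1]++;
        for (int b = 0; b < 256; ++b) counts[b + 1] += counts[b];  // prefix sum
        for (uint32_t i : idx) tmp[counts[(keys[i] >> shift) & 0xFF]++] = i;
        idx.swap(tmp);
    }

    // idx must start as 0..n-1; afterwards it holds the sorted order.
    void radixSort16(std::vector<uint32_t>& idx, const std::vector<uint16_t>& keys) {
        std::vector<uint32_t> tmp(idx.size());
        radixPass(idx, tmp, keys, 0);  // least significant byte first
        radixPass(idx, tmp, keys, 8);  // most significant byte
    }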


We are purposefully trading off some sorting precision for speed with float16, and for scenes with large Z extents you'd probably get more Z-fighting, so I'm not sure I'd recommend it if your goal is max reconstruction accuracy! We'll likely add a 2-pass sort (i.e. a radix sort with a large base / number of buckets) in the future for higher precision, user selectable so you can decide what's more important for you.

But I will say that implementing a sort on the CPU is much simpler than on the GPU, so it opens up possibilities if you're willing to do a readback from GPU to CPU and tolerate at least 1 frame of latency (usually not perceivable).


You might want to consider using words (16-bit integers) instead of halfs. Then you can use all 65k values of precision in a range you choose (by remapping 32-bit floats to words), and potentially adjust that range every frame, or with a delay.
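
Something like this (a sketch; the per-frame min/max remap is just one option):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Remap float depths into the full 16-bit range using this frame's
    // min/max, so all 65536 values carry useful precision.
    // Assumes depth is non-empty.
    std::vector<uint16_t> quantizeDepths(const std::vector<float>& depth) {
        auto [lo, hi] = std::minmax_element(depth.begin(), depth.end());
        float scale = (*hi > *lo) ? 65535.0f / (*hi - *lo) : 0.0f;
        std::vector<uint16_t> keys(depth.size());
        for (size_t i = 0; i < depth.size(); ++i)
            keys[i] = (uint16_t)((depth[i] - *lo) * scale);
        return keys;
    }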


Yeah, you're right: using float16 gives us only 0x7C00 buckets of resolution. We could explicitly turn it into a log encoding and spread it over 2^16 buckets to get about 2x the range there! Other renderers do this dynamic per-frame range adjustment; we could do that too.


Pretty sure that would be illegal in Europe, if she were an AI.


Error diffusion dithering is kind of old-fashioned. It is a great algorithm when you only need to go through the image once, pixel by pixel. But it doesn't work well with today's hardware, especially GPUs. It would be fun to come up with new algorithms that parallelize better.
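
For illustration, a minimal 1-bit Floyd-Steinberg sketch; each pixel's quantization error feeds the pixels to its right and below, which is exactly the serial dependency that maps poorly to GPUs.

    #include <vector>

    // In-place 1-bit dither of a grayscale image with values in 0..255.
    void floydSteinberg(std::vector<float>& img, int w, int h) {
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                float old = img[y * w + x];
                float q = old < 128.0f ? 0.0f : 255.0f;
                img[y * w + x] = q;
                float err = old - q;  // diffuse to unvisited neighbors
                if (x + 1 < w) img[y * w + x + 1] += err * 7 / 16;
                if (y + 1 < h) {
                    if (x > 0)     img[(y + 1) * w + x - 1] += err * 3 / 16;
                                   img[(y + 1) * w + x]     += err * 5 / 16;
                    if (x + 1 < w) img[(y + 1) * w + x + 1] += err * 1 / 16;
                }
            }
        }
    }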


Deterministic random-value dithering, where the chance of a pixel getting a given quantized color is based on how close the true value is to that color?


Blue noise threshold map works really well on GPUs.
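
Something along these lines; a real implementation samples a precomputed tileable blue-noise texture, the 4x4 Bayer matrix here is just a runnable stand-in (and a per-pixel hash would give the deterministic random variant from the sibling comment).

    #include <cstdint>

    constexpr int N = 4;
    constexpr float kThreshold[N][N] = {  // threshold map, values in [0,1)
        { 0/16.f,  8/16.f,  2/16.f, 10/16.f},
        {12/16.f,  4/16.f, 14/16.f,  6/16.f},
        { 3/16.f, 11/16.f,  1/16.f,  9/16.f},
        {15/16.f,  7/16.f, 13/16.f,  5/16.f},
    };

    // One independent comparison per pixel: trivially parallel.
    uint8_t ditherPixel(float value /* 0..1 */, int x, int y) {
        return value > kThreshold[y % N][x % N] ? 1 : 0;
    }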



Collaborators have actually superoptimized some of the more complicated Highway ops on RISC-V, with interesting gains, but I think the approach would struggle with largish tasks/algorithms?


f or g may have side effects too, like writing to memory. Now a conditional has a different meaning.

You could also have some fun with cases where f and g return a boolean: thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
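
A tiny sketch of what that means:

    bool f();
    bool g();

    void shortCircuit() {
        // g() only runs (and only performs its side effects) when f()
        // returns true, so this is a branch in disguise...
        bool both = f() && g();
        (void)both;

        if (f()) {  // ...equivalent control flow, written out
            g();
        }
    }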


Side effects will be masked; the GPU is still executing exactly the same code for the entire workgroup.
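
A toy CPU model of that (lane count and predicate are made up for illustration):

    #include <array>

    constexpr int LANES = 8;

    // The whole group steps through both branches; a per-lane mask
    // decides whose side effects (here, the stores) actually land.
    void simtSelect(std::array<int, LANES>& out,
                    const std::array<bool, LANES>& cond) {
        for (int lane = 0; lane < LANES; ++lane)  // "then" side, masked
            if (cond[lane]) out[lane] = 1;
        for (int lane = 0; lane < LANES; ++lane)  // "else" side, masked
            if (!cond[lane]) out[lane] = 2;
    }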



YAGNI: You aren't gonna need it.

“Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away”

“Less is more”

“The value of a creative product doesn’t lie in how much is there, but in how much has been discarded.”

rttm: Reduce to the max

"Kill your darlings"

...


It feels redundant to agree with this comment. I will anyway.

"Any fool can make something complicated. It takes a genius to make it simple."

"I apologize for such a long letter - I didn't have time to write a short one."

--

As for the calculator, I think it points to a bigger problem. Customers need to know what a platform will charge, and they need a way to compare platforms in general. If the only way to truly know how much something will cost is to run their code on it, then maybe that's the thing someone needs to implement.

There are big issues with this in the most naive implementation, in that people can easily abuse the ability to run code. That suggests that perhaps we need a benchmark-only environment where the benchmarks themselves are the only thing allowed out of the environment. This may require a fair amount of engineering/standards effort but could be a game-changer in the space.

A framework for running this on many platforms to compare performance and pricing would let customers generate packages for vendors to compete over. Though I suppose it could also hide some devilish details, like step changes in rates.

This same framework would be useful for other things too, like testing how implementation changes would affect future bills, or how pricing changes between vendors might become more advantageous over time.

Of course, the sales folks might balk because they would rather have a conversation with everyone they can. Maybe I'm just advocating for a more advanced and complex calculator? ¯\_(ツ)_/¯

