Hacker News
Porting a renderer from C++ to CUDA - the speed gains and their cost (ntua.gr)
49 points by profquail on Jan 2, 2011 | 18 comments



I work at a gamedev studio, and we briefly tried CUDA for DXT compression. The speedup was real: the actual compression part was roughly 10x-20x faster.

But there were a couple of problems (solvable, but they might require a change of habits):

   - Once you have CUDA running, you can't Remote Desktop to the machine (you can do VNC). We are still on Vista (and we started trying this on XP). Things might be better with Windows 7.
     Alternative: VNC

   - IT requires every unattended machine to be logged off after 15 minutes. This means that CUDA might stop working for you (no video driver). The same sometimes happens if you press Ctrl+Alt+Del and bring up the Task Manager (that's inconsistent; it happens rarely).

   - The biggest problem was that we had 2-3 different machine models (DELL, then HP). We always use Intel with NVIDIA, but now and then each machine would convert the texture data to DXT a little differently. Visually this was no problem, but it was messing up the MD5 sums of the produced images, and locally people were getting different results than the ones stored on the cached asset server.
     Alternative: Keep the MD5 sums of the source images + arguments for encoding (though this would not detect cases where bad images were encoded).
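
As a rough illustration of that alternative, here is a minimal Python sketch that hashes the source image bytes together with the encoder arguments instead of the encoder output. The function name `asset_cache_key` and the argument names (`format`, `mipmaps`) are made up for the example and are not from any real asset pipeline.

```python
import hashlib
import json

def asset_cache_key(source_bytes: bytes, encoder_args: dict) -> str:
    """Return an MD5 hex digest over the source image bytes and the encoder arguments."""
    h = hashlib.md5()
    # Length prefix keeps the boundary between image bytes and arguments unambiguous.
    h.update(len(source_bytes).to_bytes(8, "little"))
    h.update(source_bytes)
    # Serialize the arguments with sorted keys so dict ordering does not change the hash.
    h.update(json.dumps(encoder_args, sort_keys=True).encode("utf-8"))
    return h.hexdigest()

source = b"\x00\x01\x02\x03" * 16  # stand-in for raw source image bytes
key_a = asset_cache_key(source, {"format": "DXT1", "mipmaps": True})
key_b = asset_cache_key(source, {"mipmaps": True, "format": "DXT1"})
key_c = asset_cache_key(source, {"format": "DXT5", "mipmaps": True})
```

Because the key depends only on the inputs, two machines whose encoders produce slightly different DXT bytes still agree on the key. As noted above, the price is that a badly encoded output goes undetected.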

We had one very smart guy working for 6 months on a new radiosity tool. He started with a CUDA solution, but moved back to a multithreaded/multiprocess solution with SSE code. For radiosity (precomputed lighting for the levels) data we did not keep MD5 sums, so it did not matter.

So for us what worked sanely was OpenMP: speeding up the NVIDIA compressor by processing blocks with a height of 4 and the width of the original image. Another smart graphics programmer came up with the idea.
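
The decomposition can be sketched roughly as follows. DXT encodes independent 4x4 pixel blocks, so a strip 4 rows high and as wide as the image can be handed to a worker without touching its neighbours. This Python sketch uses a thread pool as a stand-in for OpenMP, and `compress_strip` is a placeholder that just records the min and max of each 4x4 block, not a real DXT encoder. It assumes the image width and height are multiples of 4.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # DXT works on 4x4 pixel blocks

def compress_strip(strip):
    """Placeholder 'compressor': one (min, max) pair per 4x4 block in a 4-row strip."""
    width = len(strip[0])
    out = []
    for x in range(0, width, BLOCK):
        block = [strip[y][x + dx] for y in range(BLOCK) for dx in range(BLOCK)]
        out.append((min(block), max(block)))
    return out

def compress_image(image, workers=4):
    """Split the image into independent 4-row strips and process them in parallel."""
    strips = [image[y:y + BLOCK] for y in range(0, len(image), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in strip order, so the output layout is deterministic.
        return list(pool.map(compress_strip, strips))

# 8x8 grayscale test image: pixel value = y * 8 + x
image = [[y * 8 + x for x in range(8)] for y in range(8)]
result = compress_image(image)
```

In C or C++ the equivalent would be a loop over strips marked with `#pragma omp parallel for`. Python threads will not give a real speedup for this CPU-bound loop; the sketch only shows how the work is split.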

That being said, it's still exciting, but too hard for a lot of common tasks.


The Remote Desktop issue is actually fixed as of CUDA 3.2 if you have a Tesla card that is only doing compute (e.g., you're not running a Tesla C2050 as a display card).

full disclosure: I work on the CUDA driver stack and the dedicated compute driver for Vista/Win7 is my baby :)


Thank you! Thank you! We don't have Teslas, but I'm dreaming of buying one (or whatever else is the latest and greatest from NVIDIA).


Or as it was cleverly put before the author: "Primary rays cache; secondary rays thrash"

Some interesting things to note, at least from my point of view:

* GPUs should not be considered the next generation of CPUs at this stage.

* Trying to take advantage of CUDA/OpenCL not only requires redesigning your algorithm and altering your data structures, but also involves the folly of implementing in software what is already available in hardware.

* Thread shared memory on NVIDIA cards isn't exactly similar to a cache. There are also a lot of papers from the last two years that speak of nothing but altering algorithms to make better use of CUDA, because for anything but the simplest of raytracers on a few objects, the situation gets really bad.

Very nice post, though. Neatly organized and well written, I really enjoyed it.


> Very nice post, though. Neatly organized and well written, I really enjoyed it.

Blushing :-) Glad you enjoyed it, hope you have a chance to look at the GPL code, too.


I'll certainly try to find some time for it.


What stops you from using multiple passes to render the image? One for the primary rays, and then another pass for the secondary rays.


Nothing. In fact, everyone does this at least to first order; you need to do bundles of primary rays at a time to take advantage of their spatial (and thus, memory access pattern) coherence. The problem is that the secondary rays are not a uniform grid like the primary rays, and they could be pointing any which way, depending on scene geometry.

What you can do is attempt to group up a bunch of secondary rays that appear to be pointing roughly the same direction (e.g., if their primary rays all reflected off the same flat, specular object), and do them in a batch, exploiting their spatial coherence. Whether the process of finding coherent secondary rays is less costly than just processing secondary rays in the same order as their primary rays, again, depends on scene geometry.
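
A very rough sketch of that grouping step, in Python: bin the secondary rays by the octant their direction points into, then process each bin as a batch. Octant binning is just an assumed heuristic for "roughly the same direction" (a real scheme would likely look at origins too), and the names `direction_octant` and `batch_by_octant` are made up for this example.

```python
from collections import defaultdict

def direction_octant(d):
    """Map a direction (dx, dy, dz) to an octant index 0..7 from the sign of each component."""
    dx, dy, dz = d
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)

def batch_by_octant(rays):
    """Group rays, given as (origin, direction) pairs, by direction octant."""
    batches = defaultdict(list)
    for ray in rays:
        batches[direction_octant(ray[1])].append(ray)
    return dict(batches)

rays = [
    ((0, 0, 0), (1.0, 0.2, 0.1)),
    ((1, 0, 0), (-0.5, 0.3, 0.9)),
    ((2, 0, 0), (0.9, 0.1, 0.3)),
    ((3, 0, 0), (-0.4, 0.8, 0.2)),
]
batches = batch_by_octant(rays)
```

The binning pass is itself extra work over every ray, which is the cost being weighed against the coherence it buys.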


Finding the hit points of secondary rays will still have very bad locality of reference unless I've misunderstood what you're saying.


Simple: the secondary rays don't "bunch" up together - there's no advantage to doing them in a second pass, you won't suddenly get coherence that way.


Maybe you should take a look at this:

http://home.comcast.net/~tom_forsyth/larrabee/larrabee.html

"Rasterization on Larrabee and SIMD Programming With Larrabee: Michael Abrash and I doing our double-act. We both talk about the instruction sets, Michael talks about the hierarchical descent rasterisation algorithm, and I talk about how we do basic language structures such as conditionals and flow control with our 16-wide vector units. We were both absurdly proud to be able to finally talk about the architecture we'd worked on for so long - it's not every programmer that gets to design their own instruction set."


There are quite a few clever tricks to manage irregular problems. See poster: http://www.nvidia.com/content/GTC/posters/2010/A06-Task%20Ma...

Also a presentation on OptiX shows how to implement a ray tracer: http://www.nvidia.com/object/gtc2010-presentation-archive.ht...

Ray tracing can be done well on a GPU (see OptiX), but it is not trivial and needs some tricks (persistent threads, work donation, etc., which are not that common on CPUs).


The BRIGADE real-time path tracer developed by Jacco Bikker uses hybrid rendering utilizing both the CPU and the GPU as much as possible. http://www.youtube.com/watch?v=Jm6hz2-gxZ0


It's not quite clear to me why rasterising of all things is slow. I realise GPUs have a separate rasterisation unit, but other than that, the ALUs are designed for this type of workload. I haven't experimented with latter-era GPGPU APIs and languages, but random memory access in a basic rasteriser sounds suspicious. Bouncing rays? Sure, that'll destroy any locality of reference, but mapping triangles into screen space? No way.


Rasterization is slow when you have lots of small (often sub-pixel) triangles. Mapping a big triangle into screen space is fast, but mapping many triangles per pixel (and hopefully throwing out the 99+% of the geometry that won't show up at all) into screen space and blending them together in a convincing way takes a while.

Real-time graphics does the best it can with a few big triangles (by e.g. texturing and bump-mapping them) out of necessity, but people are moving towards cases that are harder on the rasterizer.


Rasterization works per triangle, which means that each CUDA thread is working on a triangle. That in turn means that, by definition, each thread reads/writes to a completely different place in the Z-buffer and reads from a completely different place in the shadow buffer. Hence the abysmal speed I experienced with my CUDA rasterizer, and why I switched to CUDA raycasting.
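
To make the access pattern concrete, here is a small serial Python sketch (not the CUDA code from the article) of a per-triangle rasterizer with a depth test. Each call stands in for one thread, and it returns the Z-buffer indices that triangle touched. The flat per-triangle depth and the function names are simplifying assumptions for the example.

```python
def edge(a, b, p):
    """Edge function: its sign says which side of the directed edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize_triangle(tri, depth, zbuf, width, height):
    """Depth-test one flat-depth 2D triangle into zbuf; return the buffer indices it touched."""
    a, b, c = tri
    # Clamp the triangle's bounding box to the screen.
    x0 = max(int(min(a[0], b[0], c[0])), 0)
    x1 = min(int(max(a[0], b[0], c[0])), width - 1)
    y0 = max(int(min(a[1], b[1], c[1])), 0)
    y1 = min(int(max(a[1], b[1], c[1])), height - 1)
    touched = []
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            # Accept either winding order: all edge values share a sign when p is inside.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                idx = y * width + x
                touched.append(idx)
                if depth < zbuf[idx]:  # keep the nearest depth
                    zbuf[idx] = depth
    return touched

W = H = 8
zbuf = [float("inf")] * (W * H)
t1 = rasterize_triangle(((0, 0), (4, 0), (0, 4)), 1.0, zbuf, W, H)
t2 = rasterize_triangle(((4, 4), (8, 4), (4, 8)), 2.0, zbuf, W, H)
t3 = rasterize_triangle(((0, 0), (4, 0), (0, 4)), 0.5, zbuf, W, H)
```

Which Z-buffer addresses a "thread" reads and writes depends entirely on where its triangle lands on screen, so neighbouring threads have no reason to touch neighbouring memory. A concurrent version would additionally need an atomic min on the depth value wherever triangles overlap.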


I've worked with CUDA, and his negative conclusions are a little too dramatic compared to how elegantly he used CUDA. (That said, the biggest cost of CUDA development is the year or so it takes to let it all sink in. ALL of it!)


Well it was a weekend project, so I am not sure it is exactly... elegant. But thank you, I accept the compliment :-)

As for my negative conclusions... I wouldn't describe them as "negative", just "objective". Some algorithms are easily adapted to CUDA, and you get an easy win of 10-40x. Others need a lot more work to offer speed gains, and finally, there are some that are simply hopeless (you have to redesign them from scratch).



