Show HN: A Simple GPU Hash Table (nosferalatu.com)
119 points by nosferalatu75 on March 11, 2020 | 17 comments



You might want to check out Bidirectional Linear Probing, which can handle load factor >90% with good performance. I implemented and benchmarked it here: https://github.com/senderista/hashtable-benchmarks. Storing integer keys as their hash values (using a reversible hash function) also helps performance. I implemented a dozen or so 32- and 64-bit reversible integer hash functions in the same repo.
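In case a concrete example helps: a reversible (bijective) 32-bit mixer can be as simple as the MurmurHash3 finalizer below. Each step (an xor-shift or a multiplication by an odd constant) is invertible mod 2^32, so the original key can always be recovered from the stored hash. This is just an illustrative sketch, not one of the specific functions from the repo above.

    __host__ __device__ unsigned int fmix32(unsigned int h)
    {
        // MurmurHash3 finalizer: every step is a bijection on 32-bit values,
        // so the hash can be stored in place of the key and inverted later.
        h ^= h >> 16;
        h *= 0x85ebca6bu;
        h ^= h >> 13;
        h *= 0xc2b2ae35u;
        h ^= h >> 16;
        return h;
    }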


That's great work. Thanks! Using a reversible hash function is a good idea.


FYI here's a concurrent version of BLP (I haven't implemented it but I stole their simplified insert algorithm): https://pdfs.semanticscholar.org/6d6c/ca94c57d408c0b1164d6ff.... Also see a compact version (concurrent Cleary hash table) by the same authors: https://www.researchgate.net/profile/Alfons_Laarman/publicat....


I wonder if this technique can be used to implement order independent transparency on GPU in a performant manner.

Really great algorithm regardless!

Explanatory Edit:

Visually correct alpha blending requires all blended pixels to be known at blend time, but the way real-time graphics pipelines work is too lossy to achieve that: intermediate pixels are lost at each fragment step.

In current game engines, when a semi-transparent object is rendered, it can only see the previously blended state of the pixel it is writing into. So instead of being able to do something like `blend(p1, p2, p3, p4, ...)`, you work around it with a blend model that looks like `blend(..., blend(p4, blend(p3, blend(p2, p1))))`. A common (partially working) solution uses a linked list of pixel colors encoded in a texture (src: NVidia R&D) to preserve those colors until the final image is rendered, but it is still costly in terms of computation and space. There are other solutions, but each has its own unavoidable drawbacks.

So I figured, as a newbie, that maybe this hash table could store the pixel location as the key and a linked list of colors as the value, to mitigate the computation cost of NVidia's algorithm. The space cost is still there, though. Or maybe I'm entirely wrong.
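For what it's worth, here is a rough sketch of the per-pixel linked-list insert being described (roughly the dense-head-buffer variant); the hash-table idea above would replace the heads[] array with a pixel-location -> head-index entry in the article's table. All names here (FragmentNode, insertFragment, the EMPTY sentinel) are made up for illustration, overflow checks are omitted, and the lists are assumed to be walked only in a later resolve pass after this kernel finishes.

    struct FragmentNode {
        unsigned int rgba;    // packed color of this semi-transparent fragment
        float        depth;
        unsigned int next;    // index of the previous head, or EMPTY
    };

    #define EMPTY 0xffffffffu

    __device__ void insertFragment(unsigned int* heads,      // one head index per pixel
                                   FragmentNode* pool,       // preallocated node pool
                                   unsigned int* poolCount,  // atomic allocation counter
                                   unsigned int  pixel,      // y * width + x
                                   unsigned int  rgba,
                                   float         depth)
    {
        unsigned int node = atomicAdd(poolCount, 1u);   // grab a fresh node from the pool
        pool[node].rgba  = rgba;
        pool[node].depth = depth;
        // Prepend: atomically swap in the new head and chain to whatever was there before.
        pool[node].next  = atomicExch(&heads[pixel], node);
    }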


Intel has a great set of papers; the idea is to store a few of the semi-transparent pixels in a texture, then overflow to a storage buffer as needed.

https://www.gdcvault.com/play/1014547/Adaptive-Order-Indepen...

Especially at around 15:06 they make a great argument that above a certain limit, the net effect of additional pixels is minimal (because they will be blended behind a significant fraction of the whole picture, so just mushing together a few semi-transparent pixels at the _back_ of the stack is reasonable).

The problem is getting each pixel's data as small as possible so it doesn't overflow the texture format limits of GPUs - mobile devices typically have a limit of 128 bits per pixel. Also, mobile devices don't really have enough GPU RAM to do this kind of order-independent transparency on the 4x anti-aliased framebuffer.

After solving the above problems you're ready to spill some pixels into the storage buffer (again, because most pixels on the screen don't have 8 semi-transparent fragments stacked on them, so spilling to overflow for the few that do may be faster than trying to keep it all in the texture). Now you're ready to do something like this technique.
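To make the space pressure concrete, here's a hedged sketch of one way a single semi-transparent fragment could be packed into 64 bits (RGBA8 color plus raw 32-bit depth), so two fragments fit in one 128-bit texel. The function name and layout are just illustrative; real code would clamp the color channels first.

    __device__ unsigned long long packFragment(float4 color, float depth)
    {
        // RGBA8 in the low 32 bits (channels assumed to already be in [0, 1]).
        unsigned int c = ((unsigned int)(color.x * 255.0f)      ) |
                         ((unsigned int)(color.y * 255.0f) <<  8) |
                         ((unsigned int)(color.z * 255.0f) << 16) |
                         ((unsigned int)(color.w * 255.0f) << 24);
        // Raw depth bits in the high 32 bits.
        return ((unsigned long long)__float_as_uint(depth) << 32) | c;
    }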


Awesome!

> The table uses power-of-two sizes instead of prime numbers because pow2/AND masking is a fast single instruction and the modulo operator is much slower.

I was under the impression that the idea of prime number hash table sizes being faster was shredded a couple decades ago? Maybe by Bob Jenkins?


The idea behind prime number sizes is that you can use a hash function with periodicity issues, which would normally cause collisions. E.g. in C++, it's totally standard-compliant for std::hash<size_t>::operator() to be the identity function. The prime number of buckets will tend to mitigate this.

These days, the preferred solution to this is usually to just use a better hash function without periodicity issues, so the bucketing no longer matters.
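A quick toy illustration of the difference (the sizes 1024 and 1031 are chosen arbitrarily): with an identity hash, keys that stride by the table size share their low bits, so the power-of-two mask piles them all into one bucket, while a prime modulus (or a proper mixing hash) spreads them out.

    #include <cstdio>

    int main() {
        const unsigned int pow2Size  = 1024;   // mask = size - 1
        const unsigned int primeSize = 1031;

        // Keys with a stride of 1024 have identical low 10 bits.
        for (unsigned int i = 0; i < 4; ++i) {
            unsigned int key = i * 1024 + 7;
            printf("key %u -> pow2 bucket %u, prime bucket %u\n",
                   key, key & (pow2Size - 1), key % primeSize);
        }
        // The pow2 bucket is 7 every time; the prime buckets all differ.
        return 0;
    }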


Exactly. Indeed - I found the link I was thinking of - Bob Jenkins wrote about that in his famous 1997 article on hashes in Dr. Dobb's. Ironically, I learned in college (after 1997) that hash table sizes are ideally prime, and only found this article later, so reading it was shocking at the time. I guess best-practice knowledge takes time to percolate.

“The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask. For example, if you need only 10 bits, do h = (h & hashmask(10));”

https://burtleburtle.net/bob/hash/doobs.html


Concerning: “...as the table is linearly searched, the slot index must be wrapped at each slot. The cost of a modulo operation at each slot quickly adds up.” As you linearly search, you don’t need to re-modulo; you just have to set the index to zero if it has reached the end of the table, so it’s a simple compare-and-conditionally-assign operation.
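A minimal sketch of what that looks like inside the probe loop (findSlot and the key layout are made up here, not the article's code); the branch typically compiles down to a compare plus conditional move rather than a divide.

    __device__ int findSlot(const unsigned int* keys, unsigned int capacity,
                            unsigned int startSlot, unsigned int key)
    {
        unsigned int slot = startSlot;
        for (unsigned int probes = 0; probes < capacity; ++probes) {
            if (keys[slot] == key)          // (a real lookup would also stop at an empty slot)
                return (int)slot;
            // Instead of slot = (slot + 1) % capacity:
            ++slot;
            if (slot == capacity)           // compare-and-conditionally-assign
                slot = 0;
        }
        return -1;   // key not found
    }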


A 32-bit key and 32-bit value are some serious constraints. The whole point of hash tables is usually that you have variable length keys or values.

If you have more than 32GB of RAM (which is not rare nowadays) you can directly map all 32-bit values into a 32GB chunk of memory and probably be faster than this GPU solution on regular CPUs.
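For the record, a direct-mapped table along those lines is about as simple as it gets: index a flat array by the 32-bit key itself. The names below are invented; with 4-byte values this particular sketch spends 16 GiB on the value array and reserves one sentinel value to mean "empty".

    #include <cstdint>
    #include <vector>

    static const uint32_t kEmpty = 0xffffffffu;   // assumes this value is never stored

    struct DirectMap {
        std::vector<uint32_t> values;
        DirectMap() : values((size_t)1 << 32, kEmpty) {}          // 2^32 slots, 16 GiB
        void insert(uint32_t key, uint32_t value) { values[key] = value; }
        uint32_t lookup(uint32_t key) const { return values[key]; }   // kEmpty if absent
    };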


Even so, GPU-based hash tables are useful in that they can be used to speed up various kinds of simulations for video games and animation:

https://wickedengine.net/2018/05/21/scalabe-gpu-fluid-simula...

https://developer.nvidia.com/gpugems/gpugems3/part-i-geometr...


How does the performance compare to other implementations, either on GPU or not?


My understanding is that PCIe devices incur very high latency for getting data to and from them, due to the PCIe device itself and setting up a DMA descriptor. Sadly, this article doesn't look into that.

Can someone comment on what sort of latency would be involved in something like this? For example, the latency to send data to the PCIe device, the latency to react to and manipulate the data in a warp (Nvidia?), and the latency to send the data back to the host system?
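Not an answer with actual numbers, but one way to get a feel for the round trip yourself is to time a tiny copy + kernel + copy with CUDA events, something like the sketch below (warm-up and averaging over many iterations are omitted, and the first call will include context-creation overhead).

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(unsigned int* p) { *p += 1; }

    int main() {
        unsigned int h = 0, *d;
        cudaMalloc(&d, sizeof(unsigned int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);    // host -> device over PCIe
        touch<<<1, 1>>>(d);                                      // minimal work on the GPU
        cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);    // device -> host over PCIe
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("round trip: %.3f ms (value = %u)\n", ms, h);

        cudaFree(d);
        return 0;
    }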


This is just mind-blowingly cool


That is a most awesome thing you did there. Thank you very much for sharing!


Isn't the global atomic op a bad idea?


It's not terrible but it might be possible to get better performance within a warp using some warp function.
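For example, a sketch of the standard warp-aggregated-atomics pattern (not something from the article): one leader lane per warp does a single global atomicAdd on behalf of all currently active lanes and shuffles the result back, so contention on the global counter drops by up to 32x.

    __device__ unsigned int warpAggregatedInc(unsigned int* counter)
    {
        unsigned int active = __activemask();          // lanes participating in this call
        int lane   = threadIdx.x & 31;                 // assumes 32-lane warps
        int leader = __ffs(active) - 1;                // lowest active lane leads

        unsigned int base = 0;
        if (lane == leader)
            base = atomicAdd(counter, __popc(active)); // one global atomic per warp
        base = __shfl_sync(active, base, leader);      // broadcast the leader's result

        // Each lane gets a distinct offset based on its rank among the active lanes.
        return base + __popc(active & ((1u << lane) - 1));
    }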





