
While this article starts to discuss the low-level details that underlie our code, I feel like it's incomplete.

* Latency hiding -- CPUs work extremely hard on latency hiding. A lot of those OOP indirect calls will be branch predicted by the CPU in the simple cases (ex: if one object type is more common than the others, then those indirect branches will be predicted correctly in most situations).

There are other bits of latency hiding: Hyperthreads, Out-of-order execution, pipelines.

* Cache: L1, L2, and L3 caches can greatly mitigate a lot of the problems. While CPUs today are far faster than DDR4 RAM, L1 cache scales perfectly with the CPU. If you can keep your code+data under 64kB, you'll always be executing in L1 cache and basically get the full benefits.

If 64kB is too restrictive, then you have 256kB or 512kB (L2 cache), or 8MB (L3 cache), both of which scale with the CPU.

------------

SIMD and Multicore goes backwards IMO. It turns out that the 1970s and 1980s were filled with very intelligent programmers working on extremely sophisticated supercomputers. I enjoy going back to the 1980s and reading Connection Machine articles on C-Star (a parallel language that's arguably the precursor to CUDA), or Star-Lisp.

--------

If you write multicore software, then SMT / Hyperthreads are very effective at latency hiding. Running two threads on one core (or 4-threads or 8-threads, in the case of some IBM processors) allows your cores to "find something else to do" during those long latency periods where your CPU is traversing memory and waiting for RAM to respond.

Working extremely hard to optimize just a single thread seems somewhat counterproductive. It's often easier to write "less optimized" multicore software than highly optimized single-core software.

And SMT / hyperthreads will converge those onto a single core ANYWAY. So might as well take advantage of that feature.




> * Cache: L1, L2, and L3 caches can greatly mitigate a lot of the problems. While CPUs today are far faster than DDR4 RAM, L1 cache scales perfectly with the CPU. If you can keep your code+data under 64kB, you'll always be executing in L1 cache and basically get the full benefits.

I think even now you're underselling this. Depending on the types of transforms you're doing, an object-oriented (or row-oriented) approach can be faster than a data-oriented (or column-oriented) one due to caching. With OOP approaches, the object you're operating on will basically always be in L1 cache. On the other hand, if your ant's name and age live in different columns and each column has 10000 entries, you'll have to load both columns into L1 cache separately.

Stuff like vectorization can further complicate things, but it isn't as simple as "data-oriented is more performant."



