Related to this is the concept of arrays of structures (AoS) vs. structures of arrays (SoA). Do you store all of the data that makes up one "object" in a single data structure, or do you store all of each "member" of each object in separate flat arrays?
If you need to use all of the "members" of an "object" at once, AoS is more efficient. But if you only need to use 1 or 2 members for a certain task, you will waste lots of memory bandwidth on data that isn't needed for the current operation. In this case, an SoA layout can improve your performance by an order of magnitude, by ensuring that only the data that is actually necessary for your task is loaded into cache.
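To make that concrete, here's a rough C++ sketch (the particle struct and its members are invented purely for illustration):

    #include <vector>

    // Array of Structures: each "particle" is one contiguous struct, so a loop
    // that only reads x still drags mass, charge, etc. through the cache.
    struct ParticleAoS {
        float x, y, z;
        float mass, charge, radius;   // 24 bytes per element
    };

    // Structure of Arrays: each member lives in its own flat array, so a loop
    // that only reads x streams 4 bytes per element instead of 24.
    struct ParticlesSoA {
        std::vector<float> x, y, z;
        std::vector<float> mass, charge, radius;
    };

    float sum_x(const ParticlesSoA& p) {
        float s = 0.0f;
        for (float xi : p.x) s += xi;   // touches only the x array
        return s;
    }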
The same principle applies to row vs. column-oriented databases, which obviously are orders of magnitude more sensitive to IO bandwidth usage.
One of the most damning criticisms of OOP, from this perspective, is that it encourages you to wrap up lots of potentially unnecessary data into monolithic "objects" without concern for memory layout or usage patterns, giving you AoS and bloated data structures.
Good points. However, object oriented programming doesn't necessarily mean arrays of objects. I'd disagree that it encourages it in any way.
I realize that's a common paradigm in Java and C++, but it's an architectural decision that you make, not a part of a programming paradigm. The array-of-structs prevalence really stems from C, not object-oriented approaches.
You can do both in an object-oriented manner. For example:
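Here's one rough C++ sketch (class and member names are invented): an object-style interface over data that's stored SoA underneath.

    #include <cstddef>
    #include <vector>

    // An object-oriented facade over an SoA layout: callers still talk about
    // "a particle", but the members live in separate flat arrays.
    class ParticleSystem {
    public:
        // A lightweight handle that behaves like an object.
        class ParticleRef {
        public:
            float& x()    { return sys_.x_[i_]; }
            float& mass() { return sys_.mass_[i_]; }
        private:
            friend class ParticleSystem;
            ParticleRef(ParticleSystem& s, std::size_t i) : sys_(s), i_(i) {}
            ParticleSystem& sys_;
            std::size_t i_;
        };

        ParticleRef operator[](std::size_t i) { return ParticleRef(*this, i); }

        void add(float x, float mass) {
            x_.push_back(x);
            mass_.push_back(mass);
        }

        // Bulk operations can stream a single member array.
        float total_mass() const {
            float s = 0.0f;
            for (float m : mass_) s += m;
            return s;
        }

    private:
        std::vector<float> x_, mass_;
    };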
I've always wished for a language that would just let me describe my data logically as a SQL-esque relation, while specifying separately its physical layout (incl. AoS vs. SoA). Tying logical flow to physical storage layout always seemed kinda clunky to me.
Layout of N-dimensional arrays becomes even more important when you start considering on-disk formats, where seek times are relatively slow.
People often talk about out-of-core calculations without fully considering how their data is laid out on disk. If your access pattern doesn't match the on-disk ordering of your large, N-dimensional array, you're in for a big surprise. Accessing a portion of an array that takes milliseconds in RAM can easily take hours if the array was naively written to disk in an ordering that doesn't match how you read it.
Of course, this is why chunking exists, but surprisingly few people are as familiar with it as they should be.
For example, at work I often need to use a specific legacy format that's basically a large 3D Fortran-ordered array stored on disk. I often get asked why I'm loading the entire thing into memory instead of taking an out-of-core approach. The problem is that our access pattern consists entirely of reading a single "trace" (z-direction) at a time. For a Fortran-ordered array stored on disk, that's the worst possible access pattern: consecutive elements of a trace are scattered far apart across the entire file. Loading a single trace takes far longer than loading the entire file into memory on most (non-SSD) systems.
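A quick back-of-the-envelope sketch of why that hurts (the array dimensions here are invented, not the real format's):

    #include <cstddef>
    #include <cstdio>

    int main() {
        // Hypothetical volume: nx * ny * nz float32 samples, Fortran (column-major)
        // order on disk, so the x index varies fastest and z varies slowest.
        const std::size_t nx = 2000, ny = 2000, nz = 1500;

        // Byte offset of element (i, j, k) in the Fortran-ordered file.
        auto offset = [&](std::size_t i, std::size_t j, std::size_t k) {
            return (i + nx * (j + ny * k)) * sizeof(float);
        };

        // Reading one trace (fixed i, j, all k) means consecutive samples are
        // nx * ny * 4 bytes apart, i.e. a seek across roughly 16 MB per sample.
        std::printf("stride between trace samples: %zu bytes\n",
                    offset(0, 0, 1) - offset(0, 0, 0));
        // That's ~1500 scattered reads per trace, versus one big sequential
        // read for the whole file, which is why loading everything wins.
    }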
I've recently been on a bit of an evangelical kick for trying to persuade more people to think about the implications of on-disk layout. (For example: http://geology.beer/2015/02/10/hdf-for-large-arrays/ While that's specifically about HDF5, most of it is explaining memory/disk layout and its implications.)
The linear storage of these arrays also helps with distributed systems and with reading blocks of bytes.
Lots of interesting little things in here to consider.
Without going off on another tangent, this also relates closely to BLAS. There is a lot more to this in practice. This is a wonderful introduction to the topic, though.
You can also use space-filling curves to produce nice cache behavior under certain assumptions, for instance if you only need to view a small, contiguous portion of a very large array at a time.
In case anyone is interested, Michael Bader has a book called "Space-Filling Curves: An Introduction with Applications in Scientific Computing" that talks about this:
Was hoping this touched on Z-ordering. If you have a processor that can efficiently interleave bits (e.g. combine 0110 and 1011 to get 01101101), you can lay out 2D and 4D arrays in Z-order, which will improve spatial locality (nearby elements in both the X and Y dimensions are also nearby in memory/cache/TLBs).
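For reference, here's a rough sketch of the usual software fallback for the interleave, using shift-and-mask (assuming 32-bit coordinates packed into a 64-bit index):

    #include <cstdint>

    // Spread the bits of x so they occupy the even bit positions.
    static std::uint64_t spread(std::uint64_t x) {
        x &= 0xFFFFFFFFULL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
        x = (x | (x << 2))  & 0x3333333333333333ULL;
        x = (x | (x << 1))  & 0x5555555555555555ULL;
        return x;
    }

    // Z-order (Morton) index for a 2D array: x in the even bits, y in the odd bits.
    std::uint64_t morton2d(std::uint32_t x, std::uint32_t y) {
        return spread(x) | (spread(y) << 1);
    }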
You can actually efficiently achieve this even on processors without an interleave instruction if you're only walking stepwise -- basically, mask out the odd (resp. even) bits separately, setting the even (resp. odd) bits to 1 before adding 1, or to 0 before subtracting 1, then recombine.
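In code, that stepwise walk looks roughly like this (a sketch, keeping x in the even bits and y in the odd bits as above; swap the masks to step in y):

    #include <cstdint>

    const std::uint64_t EVEN = 0x5555555555555555ULL;  // x bit positions
    const std::uint64_t ODD  = 0xAAAAAAAAAAAAAAAAULL;  // y bit positions

    // Step one element forward in x: fill the y positions with 1s so the carry
    // ripples straight through them, then put the original y bits back.
    std::uint64_t step_x(std::uint64_t m) {
        return (((m | ODD) + 1) & EVEN) | (m & ODD);
    }

    // Step one element backward in x: fill the y positions with 0s so the
    // borrow ripples through them instead.
    std::uint64_t step_x_back(std::uint64_t m) {
        return (((m & EVEN) - 1) & EVEN) | (m & ODD);
    }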