
This is a great post. I guess the idea here is that you can start bringing future objects up through the cache hierarchy while you are still doing the comparison on the current one. If that's the case, then I think the speedup from prefetching will depend on the speed of the comparison, which in turn will depend on the type of objects in the collection.
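For what it's worth, here is a rough pure-Python sketch of that pattern. Illustrative only: plain Python can't issue prefetch instructions, so the real work would happen in C with an intrinsic like __builtin_prefetch; the function name and the `distance` parameter are my inventions, not the post's actual API.

    def scan_with_prefetch(items, target, distance=3):
        # Touch the object `distance` slots ahead of the one being
        # compared, so its memory is on its way up the cache hierarchy
        # by the time the loop reaches it. In C this touch would be a
        # real prefetch intrinsic.
        n = len(items)
        for i in range(n):
            if i + distance < n:
                items[i + distance].__class__  # dereference = "prefetch"
            if items[i] == target:  # the actual (slow) comparison
                return i
        return -1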

Your post says you have a collection of 10MM elements, but it doesn't say what they are. Are they just ints or something else?

Thanks!

Oh, thanks for pointing out the oversight! They were floats generated with random.random() and then shuffled with random.shuffle(). Shuffling them removes any locality that might have arisen when the floats were being generated, even though the distribution remains uniform on (0, 1). (And indeed, empirically the code runs faster if you don't shuffle them.)
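In code, the setup was essentially this (a sketch, not the exact benchmark script):

    import random

    # 10 million uniform floats on (0, 1), shuffled so that list order
    # no longer matches heap allocation order.
    data = [random.random() for _ in range(10_000_000)]
    random.shuffle(data)
    data.sort()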

Yeah, I'm not sure how well this would work if you were sorting different objects. I would guess that it depends on two things: how large the objects are in memory, and how long it takes to compare them.


I was thinking about that when I read the article too. You've obviously done a lot of testing, but the runtime's allocator algorithm is going to have an effect on this: allocating N floats in a loop will perform differently than if the floats were allocated over time (with other allocations mixed in), or if the objects were larger. Especially if the comparison function requires accessing object members at other memory locations... The magic value 3 might have a lot to do with the number of objects that fit in the cache or a page.

Anyway, it's an interesting article and an interesting library.


The prefetch will always pull in the same amount of data, no matter how large the Python object in question is. That leads to two ways it can fall short: 1) you can miss if the cache line size is smaller than the Python object, and 2) you can miss if the Python object is not aligned to a cache line and the comparison touches the part that spills over into the next line.

If either of these conditions applies, there is a limit of about 2x on the speedup, and increasing the Python object size further reduces the effect of the prefetch, though it still helps.
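(For scale: a cache line is typically 64 bytes, and you can check the object size directly. A CPython float fits in a single line unless it straddles a boundary:)

    import sys

    print(sys.getsizeof(1.0))  # 24 on a typical 64-bit CPython build:
                               # 16-byte object header + 8-byte double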

Another thing, about the prefetch distance: I'd pick 5 based on his testing data. 1) It has no downside in the current tests. 2) Because the prefetch is constant-size, any prefetch that hurts when you pick a distance of 5 instead of 3 is unlikely to have a meaningful effect at any distance. 3) It leaves headroom if comparisons get faster, e.g. if a JIT on the Python side specialized the test. 4) It gives a nice margin of core execution speed relative to memory latency. 5) It handles page misses better.
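A crude way to sweep the distance empirically (a pure-Python stand-in only; the real measurement belongs at the C level where the actual prefetch lives, and touch_scan is my invention):

    import random, time

    def touch_scan(items, distance):
        # "Prefetch" by dereferencing the object `distance` slots ahead,
        # then do a comparison on the current one. Interpreter overhead
        # dominates here; this only shows the shape of the experiment.
        n = len(items)
        t0 = time.perf_counter()
        for i in range(n):
            if i + distance < n:
                items[i + distance].__class__
            items[i] < 0.5
        return time.perf_counter() - t0

    data = [random.random() for _ in range(10_000_000)]
    random.shuffle(data)
    for d in (0, 1, 3, 5, 8):
        print(d, touch_scan(data, d))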


The prefetch may take the same amount of time, but the number of objects it pulls into cache depends on their size and allocation pattern. If they are large or were allocated over time (i.e., from different memory blocks), fewer of them come along per cache line. It seems to me that the opposite case, densely packed objects, is what makes the explicit prefetch less effective: if the thing you are trying to prefetch was already pulled in because it was close enough to the last one, the prefetch is a no-op.
