Creating a single benchmark seems tricky, since Timsort’s key strength is that it performs particularly well on real-world data, where subsequences of the input are often already sorted (“runs”). On random data it should take O(n log n) time like any other comparison sort.

So the benchmarking challenge is coming up with real-world-like test data that feels representative without simply playing to Timsort’s strengths. Not everyone’s “real world” data looks the same.


I'm trying to recall what the expected level of clustering is even for properly random inputs.

A few years before Timsort, people were randomly shuffling the input to avoid the reverse-order corner case that kills so many sorting algorithms.

For truly random inputs, about half of the adjacent pairs should already be in order, and roughly one in six triples should form an ascending run of 3 (1/3!), no? And if memory serves, Timsort also has a fast path for strictly descending runs, where it amortizes some of the comparisons done in the scan phase by doing a blind reverse() call.
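
A minimal sketch of that run-detection idea in Python (not CPython’s actual count_run, just the shape of it; descending runs must be strictly descending so the blind reversal stays stable):

    def count_run(a, lo):
        """Length of the run starting at lo; reverses descending runs in place."""
        hi = lo + 1
        if hi == len(a):
            return 1
        if a[hi] < a[lo]:  # strictly descending run
            while hi + 1 < len(a) and a[hi + 1] < a[hi]:
                hi += 1
            a[lo:hi + 1] = a[lo:hi + 1][::-1]  # blind reverse, no extra compares
        else:  # ascending (non-decreasing) run
            while hi + 1 < len(a) and a[hi + 1] >= a[hi]:
                hi += 1
        return hi - lo + 1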


This kind of almost sorted data is easily synthesised with any degree of already-sorted-ness.
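
For instance, one possible synthesiser (a sketch, not any established API): start from a sorted list and apply random swaps, where disorder=0.0 gives fully sorted input and higher values approach random.

    import random

    def nearly_sorted(n, disorder, seed=None):
        """Sorted list of n ints perturbed by about n*disorder random swaps."""
        rng = random.Random(seed)
        a = list(range(n))
        for _ in range(int(n * disorder)):
            i, j = rng.randrange(n), rng.randrange(n)
            a[i], a[j] = a[j], a[i]
        return a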


Certainly! But then how do you decide what degree of already-sorted-ness fairly represents real-world data?


You don’t have to agree on any one degree. You can set up a giant matrix with input lengths on one axis and sortedness on the other.

Each cell can have a color indicating whether Timsort beats, say, some other mergesort hybrid: green shades mean Timsort is winning, red shades mean the contender is.

For each cell, do multiple runs with those sortedness/length parameters and take the average.
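
A rough sketch of that sweep in Python, reusing the nearly_sorted helper above (CPython’s built-in sorted() is Timsort; a plain top-down mergesort stands in for the hypothetical contender):

    import time

    def merge_sort(a):
        """Plain top-down mergesort, returning a new sorted list."""
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    def avg_time(fn, a, trials=5):
        """Average wall-clock time of fn over several runs on copies of a."""
        total = 0.0
        for _ in range(trials):
            data = list(a)
            t0 = time.perf_counter()
            fn(data)
            total += time.perf_counter() - t0
        return total / trials

    for n in (1_000, 10_000, 100_000):          # lengths on one axis
        for disorder in (0.0, 0.01, 0.1, 1.0):  # sortedness on the other
            a = nearly_sorted(n, disorder, seed=42)
            winner = "Timsort" if avg_time(sorted, a) < avg_time(merge_sort, a) else "contender"
            print(f"n={n:>7}  disorder={disorder:<4}  -> {winner}")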


By collecting a lot of real-world data?