For this problem, I wouldn't worry too much about register usage. The x86-64 ISA has 16 GP and 16 FP registers, but modern implementations use register renaming[1] to give many more under the hood. For example, Haswell has 168 GP and 168 FP entries in its register file.[2] This contest was purely about speed, so one should favor using as many registers as possible when making any time-memory tradeoffs.