You are right, 128 is not enough on Piledriver. Still,
./test $(( 512*1024+1024*0 ))
aligned: 0 sec, 134539 nsec
unaligned on aligned data: 0 sec, 101471 nsec
unaligned on one byte unaligned data: 0 sec, 190368 nsec
unaligned on three bytes unaligned data: 0 sec, 181823 nsec
aligned nontemporal: 0 sec, 359920 nsec
naive: 0 sec, 214007 nsec
c_is_faster_than_asm_a: 0 sec, 92437 nsec
c_is_faster_than_asm_u: 0 sec, 92643 nsec
c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
c_is_faster_than_asm_u+8: 0 sec, 155784 nsec

./test $(( 512*1024+1024*1 ))
aligned: 0 sec, 107036 nsec
unaligned on aligned data: 0 sec, 94861 nsec
unaligned on one byte unaligned data: 0 sec, 114444 nsec
unaligned on three bytes unaligned data: 0 sec, 115915 nsec
aligned nontemporal: 0 sec, 407951 nsec
naive: 0 sec, 219215 nsec
c_is_faster_than_asm_a: 0 sec, 82474 nsec
c_is_faster_than_asm_u: 0 sec, 82554 nsec
c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
c_is_faster_than_asm_u+8: 0 sec, 118952 nsec

./test $(( 512*1024+1024*4 ))
aligned: 0 sec, 107576 nsec
unaligned on aligned data: 0 sec, 94010 nsec
unaligned on one byte unaligned data: 0 sec, 140534 nsec
unaligned on three bytes unaligned data: 0 sec, 140517 nsec
aligned nontemporal: 0 sec, 467981 nsec
naive: 0 sec, 206891 nsec
c_is_faster_than_asm_a: 0 sec, 85294 nsec
c_is_faster_than_asm_u: 0 sec, 85174 nsec
c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
c_is_faster_than_asm_u+8: 0 sec, 118638 nsec

./test $(( 512*1024+1024*128 ))
aligned: 0 sec, 167906 nsec
unaligned on aligned data: 0 sec, 140650 nsec
unaligned on one byte unaligned data: 0 sec, 239271 nsec
unaligned on three bytes unaligned data: 0 sec, 251342 nsec
aligned nontemporal: 0 sec, 458850 nsec
naive: 0 sec, 364731 nsec
c_is_faster_than_asm_a: 0 sec, 125240 nsec
c_is_faster_than_asm_u: 0 sec, 118917 nsec
c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
c_is_faster_than_asm_u+8: 0 sec, 197842 nsec
4k is the stride of L1; in the 512*1024+1024*4 run your code slows down 1.5x. 128k is the stride of L2; in the 512*1024+1024*128 run both codes slow down further.
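
For anyone wondering where the 4k and 128k figures come from, here is a rough sketch of the arithmetic. It assumes the commonly quoted Piledriver cache geometry (16 KiB 4-way L1D, 2 MiB 16-way L2, 64-byte lines) and that the ./test argument ends up as the distance between the two buffers; neither is stated in this thread, and the helper names alias_stride and same_set are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by one way of a set-associative cache. Addresses that
 * differ by a multiple of this value index the same cache set. */
static size_t alias_stride(size_t cache_bytes, size_t ways)
{
    assert(ways != 0 && cache_bytes % ways == 0);
    return cache_bytes / ways;
}

/* Nonzero if offsets a and b fall into the same set of the given cache.
 * Simplified model: ignores virtual vs. physical indexing. */
static int same_set(size_t a, size_t b,
                    size_t cache_bytes, size_t ways, size_t line_bytes)
{
    size_t stride = alias_stride(cache_bytes, ways);
    return (a % stride) / line_bytes == (b % stride) / line_bytes;
}
```

Under those assumptions, alias_stride(16 * 1024, 4) gives 4096 and alias_stride(2048 * 1024, 16) gives 131072, so a distance of 512*1024+1024*4 collides in L1 sets and 512*1024+1024*128 collides in L2 sets as well, while 512*1024+1024*1 collides in neither.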