Crap, that was bad. Fixed. I also removed the insane unrolling; 2x is now sufficient.

You are right, 128 is not enough on Piledriver. Still,

  ./test $(( 512*1024+1024*0 ))
  aligned: 0 sec, 134539 nsec
  unaligned on aligned data: 0 sec, 101471 nsec
  unaligned on one byte unaligned data: 0 sec, 190368 nsec
  unaligned on three bytes unaligned data: 0 sec, 181823 nsec
  aligned nontemporal: 0 sec, 359920 nsec
  naive: 0 sec, 214007 nsec
  c_is_faster_than_asm_a:   0 sec, 92437 nsec
  c_is_faster_than_asm_u:   0 sec, 92643 nsec
  c_is_faster_than_asm_u+1: 0 sec, 156574 nsec
  c_is_faster_than_asm_u+3: 0 sec, 156359 nsec
  c_is_faster_than_asm_u+4: 0 sec, 154932 nsec
  c_is_faster_than_asm_u+8: 0 sec, 155784 nsec

  ./test $(( 512*1024+1024*1 ))
  aligned: 0 sec, 107036 nsec
  unaligned on aligned data: 0 sec, 94861 nsec
  unaligned on one byte unaligned data: 0 sec, 114444 nsec
  unaligned on three bytes unaligned data: 0 sec, 115915 nsec
  aligned nontemporal: 0 sec, 407951 nsec
  naive: 0 sec, 219215 nsec
  c_is_faster_than_asm_a:   0 sec, 82474 nsec
  c_is_faster_than_asm_u:   0 sec, 82554 nsec
  c_is_faster_than_asm_u+1: 0 sec, 112544 nsec
  c_is_faster_than_asm_u+3: 0 sec, 115159 nsec
  c_is_faster_than_asm_u+4: 0 sec, 198434 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118952 nsec

4k is the stride of L1; your code slows down 1.5x:

  ./test $(( 512*1024+1024*4 ))
  aligned: 0 sec, 107576 nsec
  unaligned on aligned data: 0 sec, 94010 nsec
  unaligned on one byte unaligned data: 0 sec, 140534 nsec
  unaligned on three bytes unaligned data: 0 sec, 140517 nsec
  aligned nontemporal: 0 sec, 467981 nsec
  naive: 0 sec, 206891 nsec
  c_is_faster_than_asm_a:   0 sec, 85294 nsec
  c_is_faster_than_asm_u:   0 sec, 85174 nsec
  c_is_faster_than_asm_u+1: 0 sec, 118674 nsec
  c_is_faster_than_asm_u+3: 0 sec, 118902 nsec
  c_is_faster_than_asm_u+4: 0 sec, 118370 nsec
  c_is_faster_than_asm_u+8: 0 sec, 118638 nsec
  
128k is the stride of L2; both versions slow down further:

  ./test $(( 512*1024+1024*128 ))
  aligned: 0 sec, 167906 nsec
  unaligned on aligned data: 0 sec, 140650 nsec
  unaligned on one byte unaligned data: 0 sec, 239271 nsec
  unaligned on three bytes unaligned data: 0 sec, 251342 nsec
  aligned nontemporal: 0 sec, 458850 nsec
  naive: 0 sec, 364731 nsec
  c_is_faster_than_asm_a:   0 sec, 125240 nsec
  c_is_faster_than_asm_u:   0 sec, 118917 nsec
  c_is_faster_than_asm_u+1: 0 sec, 197348 nsec
  c_is_faster_than_asm_u+3: 0 sec, 196755 nsec
  c_is_faster_than_asm_u+4: 0 sec, 199757 nsec
  c_is_faster_than_asm_u+8: 0 sec, 197842 nsec
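
For anyone following along, here is a rough sketch of the arithmetic behind the four arguments above. It only checks whether each value is a whole multiple of the 4k and 128k strides quoted in this comment. The `aliasing` helper and the two variable names are made up for illustration, and it assumes the argument to ./test is a byte count (what the test program does with it is not shown here).

```shell
#!/bin/sh
# Strides as quoted in the comment: 4k for L1, 128k for L2.
# (Assumption: these are byte counts, i.e. 4 KiB and 128 KiB.)
L1_STRIDE=$(( 4 * 1024 ))
L2_STRIDE=$(( 128 * 1024 ))

# aliasing N: print N and whether it is a whole multiple of each stride.
# Hypothetical helper, not part of the benchmark.
aliasing() {
  n=$1
  l1=no
  l2=no
  if [ $(( n % L1_STRIDE )) -eq 0 ]; then l1=yes; fi
  if [ $(( n % L2_STRIDE )) -eq 0 ]; then l2=yes; fi
  echo "$n L1=$l1 L2=$l2"
}

# The same four arguments passed to ./test above.
for k in 0 1 4 128; do
  aliasing $(( 512*1024 + 1024*k ))
done
```

Note that the 512*1024 baseline is itself a multiple of both strides, so only the +1k run avoids both. That seems consistent with the first run's unaligned timings being slower than the +1k ones, but that reading is a guess, not something measured separately.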


