I tried to test this with nanoGPT in an afternoon, since the code change is pretty minimal (see the sketch below). It's hard to get conclusive results at that scale though - to say anything with confidence you'd need to run multiple training runs, figure out whether the 'outliers' mentioned only appear above a certain scale, find good quantization benchmarks that work on models small enough to iterate on quickly ... It's doable, but it's enough work that putting the idea out there and hoping others with more time+compute will try it out seems a valid strategy to me :) More generally though, I definitely agree that most proposed 'improvements' to transformers turn out not to work in practice.
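
For anyone curious, here's roughly what the change looks like - a rough sketch, assuming the proposal under discussion is the add-one-to-the-softmax-denominator variant (that's my reading; the helper name is made up):

    import torch

    def softmax_one(x, dim=-1):
        # Softmax with an implicit extra zero logit:
        #   exp(x_i) / (1 + sum_j exp(x_j))
        # The +1 in the denominator lets a head put near-zero total weight
        # on the sequence instead of being forced to distribute all of its
        # probability mass across tokens.
        # Computed stably by subtracting the max logit, with the implicit
        # zero logit included in that max via clamp(min=0).
        m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

In nanoGPT's CausalSelfAttention.forward, the manual attention path has the line att = F.softmax(att, dim=-1); swapping that for att = softmax_one(att, dim=-1) (and disabling the flash path, which hardcodes standard softmax) is more or less the whole change.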


