Do outlier features emerge in sub-100M-parameter models? I haven't seen any research discuss it below the ~110M scale (BERT-base). At that scale, training a model takes ~4 days on an 8xA100 node.
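
For anyone wanting to poke at this, a quick probe is to hook the layer outputs of a small pretrained model and look for hidden dimensions with unusually large magnitudes. A minimal sketch - the model choice and the |activation| > 6 threshold (borrowed from the LLM.int8() paper) are my own assumptions, not anything established for this scale:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # BERT-mini (~11M params) as an arbitrary sub-100M example
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("google/bert_uncased_L-4_H-256_A-4").eval()

    outlier_dims = set()

    def record_outliers(module, inputs, output):
        # output: (batch, seq, hidden) activations after each layer's FFN block
        max_per_dim = output.detach().abs().amax(dim=(0, 1))
        outlier_dims.update(torch.nonzero(max_per_dim > 6.0).flatten().tolist())

    for layer in model.encoder.layer:
        layer.output.register_forward_hook(record_outliers)

    text = "Outlier features are hidden dimensions with unusually large activations."
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))

    print("dims with |activation| > 6:", sorted(outlier_dims))

Running this over a decent chunk of text (not just one sentence) would give a rough picture of whether any dimensions consistently blow up.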



That is a fair question, and in addition I'm unsure whether a simple metric like perplexity would even pick it up.

However, I do think that if perplexity degraded less under quantization with this modified softmax, that would be an exciting finding and enough to indicate that further experiments are worth doing.
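
For reference, here is a minimal sketch of the kind of softmax modification I have in mind - adding 1 to the denominator so a head can put (near) zero total weight on its values instead of being forced to attend somewhere. Whether this is exactly the variant intended in the post is my assumption:

    import torch

    def softmax_one(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
        # exp(x_i) / (1 + sum_j exp(x_j)), computed in a numerically stable way.
        # Unlike regular softmax, the outputs can sum to less than 1, so a head
        # can effectively "opt out" rather than produce huge compensating values.
        m = scores.amax(dim=dim, keepdim=True).clamp(min=0)
        exp = torch.exp(scores - m)
        return exp / (exp.sum(dim=dim, keepdim=True) + torch.exp(-m))

    # A head with uniformly low scores now outputs near-zero attention weights:
    print(softmax_one(torch.full((1, 4), -10.0)))             # all close to 0
    print(torch.softmax(torch.full((1, 4), -10.0), dim=-1))   # forced to 0.25 each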

But you are right - if perplexity doesn't show an improvement, that doesn't necessarily rule out that the modification is helping.

Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT-base uncased (109M params) and OPT-125M and are able to show the effects using perplexity.
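
A rough sketch of that kind of before/after perplexity check on OPT-125M, using naive dynamic int8 quantization of the linear layers as a stand-in (the paper's actual post-training quantization setup is more involved, and the text sample here is just a placeholder):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-125m"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    def perplexity(m, text):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = m(ids, labels=ids).loss  # mean cross-entropy over tokens
        return torch.exp(loss).item()

    sample = "The quick brown fox jumps over the lazy dog. " * 50

    print("fp32 perplexity:", perplexity(model, sample))

    # Swap nn.Linear layers for int8 dynamically quantized versions (CPU inference)
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    print("int8 perplexity:", perplexity(quantized, sample))

The interesting comparison would then be the size of the fp32-to-int8 gap with and without the modified softmax, rather than the absolute numbers.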

I hadn't read the paper when I suggested the same approach, so I guess that is good validation that it is worth trying.

Edit2: Actually, they also test on a 22M-parameter ViT, which would be even quicker to try, I think.



