
The proposed replacement definitely makes more sense (and I've always found the absence of a "failed query" puzzling in standard attention), but in deep learning, things that make more sense don't always get better results. So I'm curious whether this has been tried and carefully evaluated.



It would be an amusing find if "Black Swan mega-activations" actually, if unintentionally, made the model smarter...



