The Transformer was specifically conceived to take advantage of pre-existing massively parallel hardware, so it's a bit backwards to say it "won the hardware lottery". Where the Transformer did "win the lottery" is that the key-value form of self-attention (invented by Noam Shazeer), which was needed to make parallel processing work, seems to have accidentally unlocked capabilities like "induction heads" that make this type of architecture extremely well suited to language prediction.
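To make the parallelism point concrete, here is a minimal sketch (not the original paper's code; the function name, shapes, and weight arguments are just illustrative) of key-value self-attention. The thing to notice is that every position's output comes from a few dense matmuls over the whole sequence at once, with no step-by-step recurrence, which is exactly what parallel hardware is good at:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Illustrative single-head self-attention.
    x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q = x @ w_q  # queries for every position, one matmul
    k = x @ w_k  # keys for every position
    v = x @ w_v  # values for every position
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)  # all pairwise query-key scores at once
    # causal mask: each position attends only to itself and earlier positions
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v  # weighted sum of values, again one matmul

# toy usage
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
print(out.shape)  # (8, 4)
```

Contrast this with an RNN, where position t cannot be computed until position t-1 is done, so the sequence dimension cannot be parallelized the same way.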
Given limits on clock speed, massive parallelism is always going to be the way to approach brain-like levels of parallel computation, so any model architecture aspiring to human-level AGI needs to be able to take advantage of that.
You are correct, of course, but I meant "hardware lottery" in the sense of dedicated-silicon companies like Etched and MatX that have now emerged to make chips that only run Transformers (not exactly true for MatX, but I'm simplifying; it would be cool if MatX ran other architectures, but it's not a priority).