
Is there any indication that people had figured out that simpler activation functions like ReLU were worth bothering with?



Oh, forgot about that one. Wikipedia says ReLU was used in NNs in 1969 but not widely until 2011. Idk if anyone has ever trained a transformer with sigmoid activations, but I don’t immediately see why it wouldn’t work?
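
As far as I can tell nothing stops you mechanically. In recent PyTorch versions (1.9+, if I remember right) the activation argument of nn.TransformerEncoderLayer accepts a callable, so the swap is a one-liner. Rough sketch, not claiming it trains well at scale:

    # Sketch: assumes a PyTorch version where `activation` accepts a callable.
    import torch
    import torch.nn as nn

    sigmoid_layer = nn.TransformerEncoderLayer(
        d_model=64, nhead=4, dim_feedforward=256,
        activation=torch.sigmoid,  # instead of the usual "relu" / "gelu"
    )

    x = torch.randn(10, 32, 64)   # (seq_len, batch, d_model)
    out = sigmoid_layer(x)
    print(out.shape)              # torch.Size([10, 32, 64])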


I remember some experiments applying modern-day training techniques and data to old-style networks, e.g. with sigmoid activations.

That eventually worked, and worked quite well, but it took far more compute and training data than anyone back in the olden days would have thought feasible.

The two main problems with sigmoid activations compared to ReLU are: (a) they are harder to compute (both the value itself and the gradient), and (b) they suffer from vanishing gradients, especially in deeper networks.
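
To make (b) concrete, here's a crude back-of-the-envelope in NumPy. It ignores the weight matrices entirely and only compounds the activations' local derivatives, so it's an illustration rather than a real gradient computation: the sigmoid's derivative tops out at 0.25, so stacking 20 layers attenuates the signal by many orders of magnitude, while ReLU passes active units through with derivative exactly 1.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # peaks at 0.25, decays fast in the tails

    def relu_grad(x):
        return (x > 0).astype(float)  # exactly 1 wherever the unit is active

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)      # "typical" pre-activations
    depth = 20

    # Per-layer attenuation from the activation alone, compounded over depth
    # (weights ignored, so this is only a rough picture of the effect).
    print("sigmoid, per layer:", sigmoid_grad(x).mean())           # roughly 0.2
    print("sigmoid, 20 layers:", sigmoid_grad(x).mean() ** depth)  # effectively zero
    print("relu on active paths, 20 layers:", 1.0 ** depth)        # 1.0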



