The whole AI trend is in part due to what's now possible on GPU supercomputers with gigabytes of RAM, backed by petabytes of data. Some of the algorithms date back to before the AI winter; it's just that we can now do the same thing with far more data and far faster hardware.
All of the main algorithms do (multi-layer perceptrons and stochastic gradient descent are from the 50s and 60s!). Basically the only thing that changed is that we decided to multiply some of the outputs of the multi-layer perceptrons by each other and run that through a softmax (attention) before passing them back into more layers. Almost all of the other stuff is just gravy to make it converge faster (and run faster on modern hardware).
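For what it's worth, here's a minimal sketch of that "multiply outputs by each other and softmax" step (scaled dot-product attention) in plain NumPy. The shapes and names are illustrative, not taken from any particular library or model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices produced by earlier (perceptron-style) layers
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise dot products, scaled
    weights = softmax(scores, axis=-1)       # softmax over each row
    return weights @ V                       # weighted mix of the V rows

# Example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8), fed back into more layers
```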
Oh, forgot about that one. Wikipedia says ReLU was used in NNs in 1969 but not widely until 2011. Idk if anyone has ever trained a transformer with sigmoid activations, but I don’t immediately see why it wouldn’t work?
I remember some experiments using modern-day training and data on old-style networks, e.g. with sigmoid activations.
That worked eventually and worked quite well, but took way more compute and training data than anyone back in the olden days would have thought feasible.
The two main problems with sigmoid activation compared to ReLU are: (a) harder to compute (both the value itself and the gradient), and (b) vanishing gradients, especially in deeper networks.
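A quick sketch of point (b), with made-up but representative numbers: the sigmoid derivative is at most 0.25, so backpropagating through many sigmoid layers multiplies a lot of numbers ≤ 0.25 together and the gradient shrinks fast, while ReLU's derivative is 1 on the active side and passes the signal through unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # peaks at 0.25 when x == 0

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)  # 1 where active, 0 otherwise

depth = 20
print(sigmoid_grad(0.0) ** depth)  # ~9e-13: gradient has effectively vanished
print(relu_grad(1.0) ** depth)     # 1.0: gradient passes through unchanged
```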