A lot of why’s just don’t make sense to me at low level. It just feels like we need to address the issues in some way, so we make up something and brute force it with gradient descent with large data and enough computational power. It is unknown whether each design choice is a good idea, it will work anyway