Skipping some detail: the model applies many high-dimensional functions to the input, and we don't know why those functions solve the problem.
Reducing the high-dimensional weights to human-readable values is non-trivial, and multiple neurons interact in unpredictable ways.
Interpretability research has produced many useful results and pretty visualizations[1][2], and there are many efforts to understand Transformers[3][4], but we're far from being able to completely explain the large models currently in use.
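To make the first point concrete, here is a minimal sketch (a toy NumPy example with made-up dimensions, not any real model or interpretability method): even a tiny two-layer network is a composition of high-dimensional maps, and a naive low-dimensional summary of its weights throws away almost all of the structure.

```python
# Toy sketch only: made-up dimensions, random weights, no real model.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 512, 2048, 512   # illustrative sizes

W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)

def model(x):
    # Two "high-dimensional functions" applied in sequence: the intermediate
    # activation h lives in a 2048-dimensional space no human can eyeball.
    h = np.maximum(0, W1 @ x)            # ReLU nonlinearity
    return W2 @ h

x = rng.normal(size=d_in)
y = model(x)
print(y.shape)                           # (512,) -- still not human-readable

# Naive "dimension reduction": keep only the top 2 singular directions of W1.
# The variance captured shows how little a 2-number-per-neuron summary
# retains, which is one reason such reductions are non-trivial.
U, S, Vt = np.linalg.svd(W1, full_matrices=False)
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(f"variance captured by 2 components: {explained:.1%}")
```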
[1] - https://distill.pub/2018/building-blocks/
[2] - https://distill.pub/2019/activation-atlas/
[3] - https://transformer-circuits.pub/
[4] - https://arxiv.org/pdf/2407.02646