It’s interesting. It looks like if you’re trying to improve the model’s accuracy/perplexity and using fp32, it doesn’t make a difference, but if you want to quantize the model or make it compressible, a modified softmax makes a huge difference (this is what I understand from the Qualcomm paper). Different goals, different findings?