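It is pretty easy to avoid NaNs when working with softmax; you certainly don't need any epsilons. Just subtract the largest value from every input before exponentiating. The result is mathematically unchanged, every exponent is at most 0 so nothing can overflow, and the denominator is at least 1, so you will have no rounding problems or catastrophic cancellation.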
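To make the max-subtraction trick concrete, here is a minimal sketch in plain Python. The function names `softmax` and `naive_softmax` are just illustrative, and it assumes a non-empty list of finite floats. Note that the standard-library `math.exp` raises `OverflowError` on large inputs, whereas array libraries typically return `inf` there, which is where the NaNs come from.

```python
import math

def softmax(xs):
    """Softmax of a non-empty list of finite floats, shifted by the max."""
    m = max(xs)
    # Every argument x - m is <= 0, so exp() stays in [0, 1] and cannot overflow.
    exps = [math.exp(x - m) for x in xs]
    # The largest input contributes exp(0) = 1, so total >= 1 (never zero).
    total = sum(exps)
    return [e / total for e in exps]

def naive_softmax(xs):
    """Unshifted version, for comparison only; breaks on large inputs."""
    exps = [math.exp(x) for x in xs]  # math.exp(1000.0) raises OverflowError
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1000.0, 1000.0]))
print(softmax([1.0, 2.0, 3.0]))
```

With inputs like `[1000.0, 1000.0]` the shifted version returns equal probabilities, while the naive one fails before it ever reaches the division. No epsilon is needed because the denominator can never be smaller than 1.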
Clearly softmax can't be too bad if it is used so extensively in the most powerful models.