It kind of does though, because it means you can never trust the output to be correct. The possibility of error is a much bigger deal than the output happening to be correct in any specific case.
You can never trust the outputs of humans to be correct either, but we find ways of verifying and correcting their mistakes. The same extra layer is needed for LLMs.
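To make that concrete, here is a minimal sketch of what such a layer can look like: treat the model's output as untrusted input and validate it before acting on it, the same way you would sanity-check a human's data entry. The JSON format, field names, and catalogue values here are made-up examples, not anyone's actual API.

```python
import json

def parse_order(llm_output: str) -> dict:
    """Accept the model's answer only if it is well-formed and in range."""
    try:
        order = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc

    # Domain checks: the same kind of sanity-checking you'd apply to human input.
    if not isinstance(order.get("quantity"), int) or not (1 <= order["quantity"] <= 100):
        raise ValueError(f"Implausible quantity: {order.get('quantity')!r}")
    if order.get("sku") not in {"A-100", "B-200"}:  # assumed example catalogue
        raise ValueError(f"Unknown SKU: {order.get('sku')!r}")
    return order

# A malformed or out-of-range answer gets rejected instead of trusted.
print(parse_order('{"sku": "A-100", "quantity": 3}'))
```

The point isn't this particular check; it's that the model sits inside a pipeline where its output is verified rather than taken at face value.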
If you want a more scientific answer, there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic