To be fair, the claim wasn't that it always produces the wrong answer, just that there exist circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.
It kind of does, though, because it means you can never trust the output to be correct. The possibility of error matters far more than correctness in any specific case.
You can never trust the outputs of humans to be correct either, but we find ways of verifying and correcting mistakes. The same extra layer is needed for LLMs.
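A minimal sketch of that extra layer, assuming a hypothetical `ask_llm` call (the real verification step depends entirely on the task): treat the model as an untrusted oracle and only accept an answer that passes an independent, deterministic check.

```python
from typing import Callable

def ask_llm(prompt: str) -> list[int]:
    """Hypothetical model call; stands in for any LLM API."""
    # Stubbed for the sketch: pretend the model returned a sorted list.
    return [1, 2, 3, 5, 8]

def verified(prompt: str, check: Callable[[list[int]], bool],
             retries: int = 3) -> list[int]:
    """Accept a model answer only if an independent check passes."""
    for _ in range(retries):
        answer = ask_llm(prompt)
        if check(answer):
            return answer
    raise ValueError("no verified answer within retry budget")

data = [5, 3, 8, 1, 2]
result = verified(
    f"Sort this list ascending: {data}",
    # Ground truth we can compute without trusting the model.
    check=lambda ans: ans == sorted(data),
)
print(result)  # [1, 2, 3, 5, 8]
```

The point isn't this toy check; it's the shape: wherever verification is cheaper than generation (sorting, type checking, running tests, cross-referencing sources), the untrusted output becomes usable.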