Beyond this, an LLM can easily become confused even when it outputs JSON that conforms to a valid schema. For instance, we've had mixed results trying to get an LLM to report structured discrepancies between two multi-paragraph pieces of text, each of which might use flowery language that "reminds" the LLM of marketing copy in its training set. The LLM often gets as confused as a human would if they were quickly skimming the text and losing track of which text they're thinking about, or of whether they're inventing details from memory that match the tone of what they're reading. These are very reasonable mistakes to make, and multiple passes can mitigate them, but I wouldn't describe the outputs as highly reliable!
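As a rough illustration of what one extra pass might look like, here is a minimal sketch of a grounding check run after the LLM has returned its discrepancies. It assumes, purely for illustration, that the model was asked to return a JSON list of objects with `quote_a` and `quote_b` fields holding verbatim quotes from each text; those field names and the `check_discrepancies` helper are hypothetical, not part of any particular library. The check only catches quotes attributed to the wrong text or invented outright, so it narrows the problem rather than solving it.

```python
import json


def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't cause a false rejection.
    return " ".join(s.lower().split())


def grounded(quote: str, text: str) -> bool:
    # A quote counts as grounded only if it is non-empty and appears
    # (after normalization) in the text it claims to come from.
    return bool(quote.strip()) and normalize(quote) in normalize(text)


def check_discrepancies(raw_json: str, text_a: str, text_b: str):
    # Hypothetical schema: a JSON list of {"quote_a": ..., "quote_b": ...}.
    # Items whose quotes can't be found in the right text are set aside
    # so they can be dropped or sent back to the model in another pass.
    kept, rejected = [], []
    for item in json.loads(raw_json):
        ok = grounded(item.get("quote_a", ""), text_a) and grounded(
            item.get("quote_b", ""), text_b
        )
        (kept if ok else rejected).append(item)
    return kept, rejected
```

Rejected items are the interesting ones: they tend to be cases where the model mixed up the two texts or filled in a detail that merely fits the tone. Exact substring matching is deliberately strict here, and a real pipeline might need something more forgiving.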