> You can tell it to output in a JSON structure (or some other format) of your choice and it will, with high reliability.
I mean, this is provably false. Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway. By "forcing" I mean either (1) multi-shot prompting: "no, not like that," if the output isn't valid-ish JSON; or (2) literally stripping out—or rejecting—illegal tokens (which is what llama.cpp does[1][2]). And even with all of that, you still won't really have a production-ready pipeline in the general case.
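Concretely, approach (1) ends up looking something like the loop below (with `call_llm` as a hypothetical stand-in for whatever chat-completion client you're using), and even when it converges you've only got syntactically valid JSON, not necessarily the content you wanted:

```python
import json

def call_llm(messages):
    """Hypothetical stand-in for whatever chat-completion client you're using."""
    raise NotImplementedError

def get_json(prompt, max_retries=3):
    """Approach (1): reprompt until the reply parses as JSON, or give up."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        reply = call_llm(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # "No, not like that" -- feed the parse error back and retry.
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user",
                 "content": f"That was not valid JSON ({err}). Reply with only the corrected JSON."},
            ]
    raise ValueError("model never produced parseable JSON")
```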
Beyond this, an LLM can easily become confused even if outputting JSON with a valid schema. For instance, we've had mixed results trying to get an LLM to report structured discrepancies between two multi-paragraph pieces of text, each of which might be using flowery language that "reminds" the LLM of marketing language in its training set. The LLM often gets as confused as a human would, if the human were quickly skimming the text and forgetting which text they're thinking about - or whether they're inventing details from memory that are in line with the tone of the language they're reading. These are very reasonable mistakes to make, and there are ways to mitigate the difficulties with multiple passes, but I wouldn't describe the outputs as highly reliable!
I would have agreed with you six months ago, but the latest models - Claude 3, GPT-4o, maybe Llama 3 as well - are much more proficient at outputting JSON correctly.
Seems logical that they will always implement specialized pathways for the most critical and demanding user base. At some point they might even do it all by hand and we wouldn’t know /s
Yes, I'm using them quite extensively in my day-to-day work for extracting numerical data from unstructured documents. I've been manually verifying the JSON structure and the numerical outputs, and it's highly accurate for the corpus I'm processing.
FWIW I'm using GPT-4o, not Llama; I've tried Llama for local tasks and found it pretty lacking in comparison to GPT.
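A rough sketch of what that verification amounts to: checking that every extracted number actually appears in the source document (the names here are made up, and it assumes the extraction returns a flat JSON object of numeric fields):

```python
import json
import re

def numbers_in(text):
    """Every numeric literal that appears in the text, as floats."""
    return {float(m.replace(",", "")) for m in re.findall(r"-?\d[\d,]*\.?\d*", text)}

def suspect_values(document, llm_json):
    """Return extracted numeric fields that never appear verbatim in the source document."""
    extracted = json.loads(llm_json)  # raises if the structure itself is broken
    source_numbers = numbers_in(document)
    return {k: v for k, v in extracted.items()
            if isinstance(v, (int, float)) and float(v) not in source_numbers}
```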
Your comment has an unnecessarily negative tone that doesn't do this tech justice. These approaches are totally valid and can get you great results. An LLM is just one component in a pipeline; I've deployed many of these in production without a hiccup.
Guidance (the industry term for "constraining" the model output) only ensures the output follows a particular grammar. If you need the JSON to fit a particular schema or format, you can always validate it; if validation fails, pass the JSON and the validation errors back to the LLM for it to correct.
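A minimal sketch of that validate-and-correct step, using the `jsonschema` package (the schema fields here are just placeholders):

```python
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

SCHEMA = {  # placeholder schema; substitute whatever shape you actually need
    "type": "object",
    "required": ["invoice_id", "total"],
    "properties": {"invoice_id": {"type": "string"}, "total": {"type": "number"}},
}

def validate_or_feedback(reply):
    """Return (data, None) if the reply fits the schema, else (None, correction_prompt)."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as err:
        return None, f"The output was not valid JSON ({err}). Reply with corrected JSON only."
    errors = [e.message for e in Draft202012Validator(SCHEMA).iter_errors(data)]
    if errors:
        return None, ("The JSON did not match the schema: " + "; ".join(errors) +
                      ". Reply with corrected JSON only.")
    return data, None
```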
> Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway.
Yeah, it's worked about fifty thousand times for me without issues over the past few months, across several NLP production pipelines.
[1] https://github.com/ggerganov/llama.cpp/issues/1300
[2] this is cutely called "constraining" a decoder; what it actually does is correct a very clear stochastic deficiency in LLMs
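A toy sketch of what that constraining amounts to: reject any candidate token that can't extend a valid JSON prefix. Here `model_step` is a hypothetical stand-in for the decoder, and the prefix test is deliberately crude; real implementations (llama.cpp's GBNF grammars, for instance) track grammar state instead of brute-forcing completions.

```python
import json

def is_json_prefix(text):
    """Crude test: can the text be completed into valid JSON with one of a few suffixes?
    Real grammar-constrained decoders track parser state instead of guessing suffixes."""
    for suffix in ("", "}", "0}", '": 0}', "]", "0]", '"', '"}'):
        try:
            json.loads(text + suffix)
            return True
        except json.JSONDecodeError:
            pass
    return False

def is_complete_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def constrained_decode(model_step, max_tokens=64):
    """model_step(text) -> candidate next tokens, best first (hypothetical decoder API).
    Candidates that would break the JSON are simply thrown away."""
    out = ""
    for _ in range(max_tokens):
        for tok in model_step(out):
            if is_json_prefix(out + tok):
                out += tok
                break
        else:
            break  # no legal candidate left
        if is_complete_json(out):
            break
    return out

# Toy demo: the "model" keeps proposing prose first, legal JSON tokens second.
if __name__ == "__main__":
    canned = iter([["Sure! ", '{"'], ["total", "total"], ['": ', '": '], ["42", "42"], ["}", "}"]])
    print(constrained_decode(lambda _: next(canned)))  # prints {"total": 42}
```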