I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.
I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and then become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
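A rough sketch of that retry loop, assuming a hypothetical call_llm(prompt) helper that returns the raw model text:

    import json

    def call_step(prompt: str, call_llm, max_retries: int = 3) -> dict:
        """Run one pipeline step and insist on parseable JSON, retrying otherwise."""
        for attempt in range(max_retries):
            raw = call_llm(prompt)  # call_llm is a hypothetical wrapper around your LLM client
            try:
                return json.loads(raw)  # this dict becomes the input for the next step
            except json.JSONDecodeError:
                continue  # malformed output, just try again
        raise ValueError(f"no valid JSON after {max_retries} attempts")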
Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.
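With the Anthropic Python SDK that looks roughly like this (the tool name and schema are made up, and the model ID is a placeholder; check the docs for exact parameters):

    import anthropic

    client = anthropic.Anthropic()

    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "tags"],
    }

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        tools=[{
            "name": "record_result",
            "description": "Record the extracted result.",
            "input_schema": schema,
        }],
        # Forcing this tool means the reply is always a tool_use block matching the schema.
        tool_choice={"type": "tool", "name": "record_result"},
        messages=[{"role": "user", "content": "Summarize this article: ..."}],
    )

    result = next(block.input for block in response.content if block.type == "tool_use")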
We've been running structured outputs via Claude on Bedrock in production for a year now and it works great. Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response. GG
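The prefill trick is just ending the message list with a partial assistant turn; a minimal sketch (not the exact Bedrock payload, and the model ID is a placeholder):

    import json
    import anthropic

    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        system="Respond only with JSON matching this schema: ...",
        messages=[
            {"role": "user", "content": "Extract the key facts from: ..."},
            # Prefilled assistant turn: the model continues from the opening brace.
            {"role": "assistant", "content": "{"},
        ],
    )

    # Re-attach the injected '{' before parsing; sometimes trailing chatter needs trimming too.
    data = json.loads("{" + response.content[0].text)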
Nice to see them support it officially. OpenAI has supported this officially for a while, but, at least historically, I was unable to use it because it adds deterministic validation that errors on certain standard JSON Schema elements we used. The lack of "official" support is the feature that pushed us to use Claude in the first place.
It's unclear to me that we will need "modes" for these features.
Another example: I used to think that I couldn't live without Claude Code "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.
Before Claude Code shipped with plan mode, the workflow for using most coding agents was to have it create a `PLAN.md` and update/execute that plan. Planning mode was just a first class version of what users were already doing.
> Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response
I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.
The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].
Claude Code keeps coming out with a lot of really nice tools that others haven't started to emulate from what I've seen.
My favorite one is going through the plan interactively. It turns it into a multiple choice / option TUI, and the last choice is always to reprompt that section of the plan.
I had to switch back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.
To trigger the interactive mode, do something like:
Plan a fix for:
<Problem statement>
Please walk me through any options or questions you might have interactively.
I don't think the tool input schema thing does that inference-time trick. I think it just dumps the JSON schema into the context, and tells the model to conform to that schema.
Same, but it’s a PITA when you also want to support tool calling at the same time.
Had to do a double call: call and check if it will use tools. If not, call again and force the use of the (now injected) return schema tool.
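Roughly this shape, where real_tools and return_schema_tool are placeholders for your actual tool definitions:

    def call_with_tools_or_schema(client, messages, real_tools, return_schema_tool):
        # First call: let the model decide whether it wants to use a tool.
        first = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model ID
            max_tokens=1024,
            tools=real_tools,
            messages=messages,
        )
        if any(block.type == "tool_use" for block in first.content):
            return first  # hand off to the normal tool-handling loop

        # No tool call: call again and force the injected return-schema tool.
        return client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model ID
            max_tokens=1024,
            tools=real_tools + [return_schema_tool],
            tool_choice={"type": "tool", "name": return_schema_tool["name"]},
            messages=messages,
        )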
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not that easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.
In Python, they're very easy to use. Define your schema with Pydantic and pass the class to your client calls. There are some details to know (eg field order can affect performance), but it's very easy overall. Other languages probably have something similar.
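For example, with the OpenAI Python SDK it looks roughly like this (the schema fields are made up, and the exact parse helper can vary by SDK version):

    from pydantic import BaseModel
    from openai import OpenAI

    class Ticket(BaseModel):
        # Field order can matter; putting summary-style fields first sometimes helps.
        summary: str
        priority: int
        tags: list[str]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model ID
        messages=[{"role": "user", "content": "Triage this bug report: ..."}],
        response_format=Ticket,
    )
    ticket = completion.choices[0].message.parsed  # a validated Ticket instance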
You could get this working very consistently with GPT-4 in mid 2023. The version before June, iirc. No JSON output, no tool calling fine tuning... just half a page of instructions and some string matching code. (Built a little AI code editing tool along these lines.)
With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Having used structured outputs pretty extensively for a while now, my impression is that the newer models take less of a quality hit while conforming to a specific schema. Just giving instructions and output examples totally worked, but it came at a considerable cost in output quality. My impression is that this effect has diminished over time as models have been more explicitly trained to produce structured output.
I have had fairly bad luck specifying the JSON Schema for my structured outputs with Gemini. It seems like describing the schema with natural language descriptions works much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?
Constrained generation makes models somewhat less intelligent. Although it shouldn't be an issue in thinking mode, since it can prepare an unconstrained response and then fix it up.
Not true and citation needed. Whatever you cite there are competing papers claiming that structured and constrained generation does zero harm to output diversity/creativity (within a schema).
That is clearly not possible. Imagine if you asked a model yes/no questions with a schema that didn't contain "yes".
In general you can break any model by using a sampler that chooses bad enough tokens sometimes. I don't think it's well studied how well different models respond to this.
I mean that's too reductionist if you're being exact and not a worry if you're not.
Even asking for JSON (without constrained sampling) sometimes degrades output, but also even the name and order of keys can affect performance or even act as structured thinking.
At the end of the day current models have enough problems with generalization that they should establish a baseline and move from there.
The way you get structured output with Claude prior to this is via tool use.
IMO this was the more elegant design if you think about it: tool calling is really just structured output and structured output is tool calling. The "do not provide multiple ways of doing the same thing" philosophy.
pgvectorscale is not available in RDS, so this wasn't a great solution for us! But it does likely solve many of the problems with vanilla pgvector (what this post was about).
Yeah, PlanetScale loves to flex how fast they are, but the main reason they're fast is that they run with a full abstraction layer less than any other cloud provider, and that does in fact have trade-offs.
What is wrong with running without lots of abstractions? We are clear about the downsides. The results are clear, you can see the customers love it. We run insane amounts of state safely on ephemeral compute. It's a flex. All I've seen from Timescale people is qqing. Write some code or be quiet.
I'm not criticizing your engineering approach at all. Running everything in one box has its merits, as your benchmarks show, but it's also just not apples to apples; there are other trade-offs, and I'm just appreciating that the community calls that out.
Also, hey, this is HN, not Twitter; I think we can be a bit more civilized. Not a good look, IMO, for a CEO to get that upset over a harmless comment.
I have a RAG setup that doesn't work on documents but on other data points that we use for generation (the original data is call recordings, but it is heavily processed down to just a few text chunks).
Instead of a reranker model we do vector search and then simply ask GPT-5 in an extra call which of the results is the most relevant to the input question. Is there an advantage to actual reranker models rather than using a generic LLM?
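A rough sketch of that extra rerank call (prompt and parsing kept deliberately naive):

    from openai import OpenAI

    client = OpenAI()

    def pick_most_relevant(question: str, chunks: list[str]) -> str:
        # Ask the model which retrieved chunk best answers the question.
        numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=[{
                "role": "user",
                "content": f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
                           "Reply with only the number of the most relevant candidate.",
            }],
        )
        return chunks[int(resp.choices[0].message.content.strip())]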
I think you should do both in parallel, rather than sequentially. The main reason is that vector scoring could cut off something that an LLM would score as relevant.
Yup, and it doesn't bloat the context unnecessarily. The agent can call --help when it needs it. Just imagine a kubectl MCP with all the commands as individual tools; it doesn't make any sense whatsoever.
And, this is why I usually use simple system prompts/direct chat for "heavy" problems/development that require reasoning. The context bloat is getting pretty nutty, and is definitely detrimental to performance.
that's not all there is to it, but I think that "the rest of it" is just additional fine tuning.
Benchmarks are good fixed targets for fine tuning, and I think that Sonnet gets significantly more fine tuning than Opus. Sonnet has more users, which is a strategic reason to focus on it, and it's less expensive to fine tune, if API costs of the two models are an indicator.
Types become very useful when the code base reaches a certain level of sophistication and complexity. It makes sense that for a little script they provide little benefit, but once you are working on a code base with 5+ engineers and no longer understand every part of it, having stricter guarantees and defined interfaces is very, very helpful. Both for communicating with other devs and for simply eradicating a good chunk of the errors that happen when interfaces are unclear.