I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.
I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and then become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
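A rough sketch of that retry loop, assuming a hypothetical call_llm(prompt) helper that returns the raw model text:

    import json

    def call_step(prompt: str, call_llm, max_retries: int = 3) -> dict:
        """Run one pipeline step and insist on parseable JSON, retrying otherwise."""
        for attempt in range(max_retries):
            raw = call_llm(prompt)  # call_llm is a hypothetical wrapper around your LLM client
            try:
                return json.loads(raw)  # this dict becomes the input for the next step
            except json.JSONDecodeError:
                continue  # malformed output, just try again
        raise ValueError(f"no valid JSON after {max_retries} attempts")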
Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.
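With the Anthropic Python SDK that looks roughly like this (the tool name and schema are made up, and the model ID is a placeholder; check the docs for exact parameters):

    import anthropic

    client = anthropic.Anthropic()

    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "tags"],
    }

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        tools=[{
            "name": "record_result",
            "description": "Record the extracted result.",
            "input_schema": schema,
        }],
        # Forcing this tool means the reply is always a tool_use block matching the schema.
        tool_choice={"type": "tool", "name": "record_result"},
        messages=[{"role": "user", "content": "Summarize this article: ..."}],
    )

    result = next(block.input for block in response.content if block.type == "tool_use")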
We've been running structured outputs via Claude on Bedrock in production for a year now and it works great. Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response. GG
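The prefill trick is just ending the message list with a partial assistant turn; a minimal sketch (not the exact Bedrock payload, and the model ID is a placeholder):

    import json
    import anthropic

    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        system="Respond only with JSON matching this schema: ...",
        messages=[
            {"role": "user", "content": "Extract the key facts from: ..."},
            # Prefilled assistant turn: the model continues from the opening brace.
            {"role": "assistant", "content": "{"},
        ],
    )

    # Re-attach the injected '{' before parsing; sometimes trailing chatter needs trimming too.
    data = json.loads("{" + response.content[0].text)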
Nice to see them support it officially. OpenAI has supported this officially for a while, but, at least historically, I was unable to use it because it adds deterministic validation that errors on certain standard JSON Schema elements we used. The lack of "official" support is the feature that pushed us to use Claude in the first place.
It's unclear to me that we will need "modes" for these features.
Another example: I used to think that I couldn't live without Claude Code "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.
Before Claude Code shipped with plan mode, the workflow for using most coding agents was to have it create a `PLAN.md` and update/execute that plan. Planning mode was just a first class version of what users were already doing.
> Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response
I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.
The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].
Claude Code keeps coming out with a lot of really nice tools that others haven't started to emulate from what I've seen.
My favorite one is going through the plan interactively. It turns it into a multiple choice / option TUI, and the last choice is always to reprompt that section of the plan.
I had to switch back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.
To trigger the interactive mode, do something like:
Plan a fix for:
<Problem statement>
Please walk me through any options or questions you might have interactively.
I don't think the tool input schema thing does that inference-time trick. I think it just dumps the JSON schema into the context, and tells the model to conform to that schema.
Same, but it’s a PITA when you also want to support tool calling at the same time.
Had to do a double call: call and check if it will use tools. If not, call again and force the use of the (now injected) return schema tool.
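Roughly this shape, where real_tools and return_schema_tool are placeholders for your actual tool definitions:

    def call_with_tools_or_schema(client, messages, real_tools, return_schema_tool):
        # First call: let the model decide whether it wants to use a tool.
        first = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model ID
            max_tokens=1024,
            tools=real_tools,
            messages=messages,
        )
        if any(block.type == "tool_use" for block in first.content):
            return first  # hand off to the normal tool-handling loop

        # No tool call: call again and force the injected return-schema tool.
        return client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model ID
            max_tokens=1024,
            tools=real_tools + [return_schema_tool],
            tool_choice={"type": "tool", "name": return_schema_tool["name"]},
            messages=messages,
        )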
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not that easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.
In Python, they're very easy to use. Define your schema with Pydantic and pass the class to your client calls. There are some details to know (eg field order can affect performance), but it's very easy overall. Other languages probably have something similar.
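For example, with the OpenAI Python SDK it looks roughly like this (the schema fields are made up, and the exact parse helper can vary by SDK version):

    from pydantic import BaseModel
    from openai import OpenAI

    class Ticket(BaseModel):
        # Field order can matter; putting summary-style fields first sometimes helps.
        summary: str
        priority: int
        tags: list[str]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model ID
        messages=[{"role": "user", "content": "Triage this bug report: ..."}],
        response_format=Ticket,
    )
    ticket = completion.choices[0].message.parsed  # a validated Ticket instance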
You could get this working very consistently with GPT-4 in mid 2023. The version before June, iirc. No JSON output, no tool calling fine tuning... just half a page of instructions and some string matching code. (Built a little AI code editing tool along these lines.)
With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Having used structured outputs pretty extensively for a while now, my impression is that the newer models take less of a quality hit while conforming to a specific schema. Just giving instructions and output examples totally worked, but it came at a considerable cost in output quality. My impression is that this effect has diminished over time as models have been more explicitly trained to produce structured output.
I have had fairly bad luck specifying the JSON Schema for my structured outputs with Gemini. It seems like describing the schema with natural language descriptions works much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?
Constrained generation makes models somewhat less intelligent. Although it shouldn't be an issue in thinking mode, since it can prepare an unconstrained response and then fix it up.
Not true and citation needed. Whatever you cite there are competing papers claiming that structured and constrained generation does zero harm to output diversity/creativity (within a schema).
That is clearly not possible. Imagine if you asked a model yes/no questions with a schema that didn't contain "yes".
In general you can break any model by using a sampler that chooses bad enough tokens sometimes. I don't think it's well studied how well different models respond to this.
I mean that's too reductionist if you're being exact and not a worry if you're not.
Even asking for JSON (without constrained sampling) sometimes degrades output, but also even the name and order of keys can affect performance or even act as structured thinking.
At the end of the day current models have enough problems with generalization that they should establish a baseline and move from there.
The way you get structured output with Claude prior to this is via tool use.
IMO this was the more elegant design if you think about it: tool calling is really just structured output and structured output is tool calling. The "do not provide multiple ways of doing the same thing" philosophy.
pgvectorscale is not available in RDS, so this wasn't a great solution for us! But it does likely solve many of the problems with vanilla pgvector (what this post was about).
Yeah, PlanetScale loves to flex how fast they are, but the main reason they're fast is that they run with a full abstraction layer less than any other cloud provider, and that does in fact have trade-offs.
What is wrong with running without lots of abstractions? We are clear about the downsides. The results are clear, you can see the customers love it. We run insane amounts of state safely on ephemeral compute. It's a flex. All I've seen from Timescale people is qqing. Write some code or be quiet.
I'm not criticizing your engineering approach at all. Running everything in one box has its merits, as your benchmarks show, but it's also just not apples to apples; there are other trade-offs, and I'm just appreciating that the community calls that out.
Also, hey, this is HN, not Twitter; I think we can be a bit more civilized. Not a good look, IMO, for a CEO to get that upset over a harmless comment.
I have a RAG setup that doesn't work on documents but on other data points that we use for generation (the original data is call recordings, but it is heavily processed down to just a few text chunks).
Instead of a reranker model we do vector search and then simply ask GPT-5 in an extra call which of the results is the most relevant to the input question. Is there an advantage to actual reranker models rather than using a generic LLM?
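A rough sketch of that extra rerank call (prompt and parsing kept deliberately naive):

    from openai import OpenAI

    client = OpenAI()

    def pick_most_relevant(question: str, chunks: list[str]) -> str:
        # Ask the model which retrieved chunk best answers the question.
        numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=[{
                "role": "user",
                "content": f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
                           "Reply with only the number of the most relevant candidate.",
            }],
        )
        return chunks[int(resp.choices[0].message.content.strip())]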
I think you should do both in parallel, rather than sequentially. The main reason is that vector scoring could cut off something that an LLM would score as relevant.
Yup, and it doesn't bloat the context unnecessarily. The agent can call --help when it needs it. Just imagine a kubectl MCP with all the commands as individual tools; it doesn't make any sense whatsoever.
And, this is why I usually use simple system prompts/direct chat for "heavy" problems/development that require reasoning. The context bloat is getting pretty nutty, and is definitely detrimental to performance.
that's not all there is to it, but I think that "the rest of it" is just additional fine tuning.
Benchmarks are good fixed targets for fine tuning, and I think that Sonnet gets significantly more fine tuning than Opus. Sonnet has more users, which is a strategic reason to focus on it, and it's less expensive to fine tune, if API costs of the two models are an indicator.
Types become very useful when the code base reaches a certain level of sophistication and complexity. It makes sense that for a little script they provide little benefit, but once you are working on a code base with 5+ engineers and no longer understand every part of it, having stricter guarantees and defined interfaces is very, very helpful. Both for communicating with other devs and for simply eradicating a good chunk of the errors that happen when interfaces are unclear.