
(copied from the o3 + o4-mini thread)

The big step function here seems to be RL on tool calling.

Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" use cases well (agent in a loop, not in an agentic workflow scaffold[0]).

OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

My hunch is that these new models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7.

tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions," which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.
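
To make the "pure agent" loop concrete, here's a rough sketch of the shape I mean (illustrative Python, not any particular vendor's SDK; `llm` and the tools are placeholders):

    # Pure agent loop: the model decides when to call tools and when it's done.
    def agent_loop(task, llm, tools):
        messages = [{"role": "user", "content": task}]
        while True:
            reply = llm(messages, tools=tools)   # model "thinks", may request a tool
            messages.append(reply)
            if not reply.get("tool_calls"):
                return reply["content"]          # no tool requested -> it's finished
            for call in reply["tool_calls"]:
                result = tools[call["name"]](**call["args"])  # touch the outside world
                messages.append({"role": "tool",
                                 "name": call["name"],
                                 "content": str(result)})

An agentic workflow scaffold is the inverse: the developer hard-codes the sequence of steps and the model only fills in each one.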

[0]https://www.anthropic.com/engineering/building-effective-age...

[1]https://github.com/1rgs/claude-code-proxy

[2]https://openai.com/index/openai-codex/


This article starts to touch on an important distinction between software and products, but seems to miss the larger picture.

With the cost of developing software dropping to near zero, there's a whole class of product features that may not be relevant for most users.

> You haven’t considered encoding, internationalization, concurrency, authentication, telemetry, billing, branding, mobile devices, deployment.

Software, up until now, was only really viable if you could get millions (or billions) of users.

Development costs are high, and as a product developer you're forced to make software that covers as many edge cases as possible, so that it can meet the needs of the broadest possible audience.

I like to think of this type of software as "average" -- average in the sense that it's not tailored to one specific user or group of users, but necessarily accommodates a larger, more amalgamated "ideal" user.

We've all seen software we love get worse over time, as it tries to expand its scope to more users.

Salesforce may be the quintessential example of this, with each org only using a small fraction of its total capabilities.

Vibe coding thus enables users to create programs (rather than products) that do exactly what they want, how they want, when they want. Users are no longer forced into the confines of the average user, and can instead tailor software directly to their needs.

Is that software a product? Maybe one day. For now, it's a solution.


One of the surprising benefits of raising a toddler is gaining the ability to instantly tell when another adult has fallen into a "toddler-like" state (myself included!).

Before having kids, I would try to explain someone's behavior in logical terms.

Toddlers, however, are mostly driven by their current physical needs (hungry/sleepy) and whatever they're currently doing (autonomy).

We've found the most success in avoiding all boolean questions. Do you want to read a book? (asked while they're playing with trains before bedtime) Obviously not!

Do you want to read this book or that book? Oh... a decision!

It's striking how well tactics like these work outside the realm of toddlers.


We had a VP make a similar observation during an all-hands. At the following all-hands, he had to apologize because people felt insulted by being compared to kids. The irony of the situation was not lost on some of us.

The illusion of choice is extremely effective on the C-suite as well. I recommend it for engineers trying to push changes up the corporate ladder. Give them three options: the one nobody should ever do, the compromise solution, and the "whale" option. Just like product pricing.

For very young toddlers, distraction is also extremely effective, but it stops working at some point. Not sure how effective it is on the C-suite; someone will have to do some testing.


Never present an option you wouldn't want to live with. The internet teaches us that what "nobody should ever do" isn't obvious to everyone all the time.

Humans have been programming other humans since the beginning of our time (and arguably other species too[0]).

The irony that the affiliate link was for a book about this exact topic is just fantastic.

LLMs are truly memetic machines, the best we’ve created so far.

What’s the difference between a bot and a human who parrots other humans?

Is agency+novelty the new version of the Turing test?

[0] https://www.scientificamerican.com/article/bonobo-calls-are-...


> LLMs are truly memetic machines, the best we’ve created so far.

If I had to guess, I’d say it’s more likely this post is just recycled from some past post that did well, rather than written from scratch by an LLM.

The rest of the account’s posts are just recycled old posts that did well. Simplest explanation is that this one is too.


Agreed, this has been happening since long before LLMs existed.

I did a quick search for the affiliate tag noted in the blog post, and found another Redditor complaining about that same affiliate tag, but from three other accounts[0].

My fascination is that, with LLMs, the line between obvious bot regurgitation and seemingly human posts is now much thinner.

The fact that the Reddit post was an affiliate link for Edward Bernays' Propaganda is just the cherry on top; in this case it's like selling ice to Eskimos.

[0]https://www.reddit.com/r/TheseFuckingAccounts/comments/1giyl...


> Humans have been programming other humans since the beginning of our time (and arguably other species too[0]).

There's the oft-repeated idea that social interaction (including the ability to reason about what information another being possesses, to empathize with their viewpoint, anticipate their reactions, and use all of this to manipulate their next set of actions) is perhaps the main driver of the intelligence explosion in humans, birds, and other noticeably more intelligent animals.

I find that interesting because if you look at the SciFi golden age notion of what the Intelligent Machines era would be like, Asimov-style, you usually get depictions of coolly calculating, maximally logical machine beings. Yet what we've actually been able to create is mushy, vibe-y text generators that excel at producing manipulative slop. Maybe that's not a coincidence, but an echo of the general thrust of higher intelligence.


I put your prompt into the vibe coding tool I'm working on (shameless plug).

The first version[0] looked good, but when I inspected it I found that it just picked an I Ching prediction at random on the back-end, instead of actually flipping coins.

I updated the prompt to the following:

> Create app where you input a question, then flip coins to generate an I Ching prediction (client-side). First show the standard I Ching prediction and its hexagram, and then use AI to generate a fortune analysis based on the prediction and your initial question.

And the resulting UI was much more laborious[1] :shrug:
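
"Actually flipping coins" here means the standard three-coins-per-line method, something like (illustrative Python):

    import random

    def cast_line():
        # Three coins: heads = 3, tails = 2; the total (6-9) picks the line type.
        total = sum(random.choice((2, 3)) for _ in range(3))
        return {6: "old yin", 7: "young yang", 8: "young yin", 9: "old yang"}[total]

    # A hexagram is six lines, cast from the bottom up.
    hexagram = [cast_line() for _ in range(6)]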

[0]https://iching.magicloops.app

[1]https://iching2.magicloops.app


This is awesome.

My wife built a matching game[0] for our toddler, because she doesn't like the flashy content and addictive nature of the games on the app store (and neither do I!)

It's very simple, but it's exactly what she envisioned.

Disclaimer: I work on the vibe code tool she built this with.

[0]https://toddler-matching-game.magicloops.app


> I suspect that machines to be programmed in our native tongues —be it Dutch, English, American, French, German, or Swahili— are as damned difficult to make as they would be to use.

Seeing as transformers are relatively simple to implement…

It stands to reason he was, in some sense, right. LLMs are damn easy to use.


Unless you include "getting correct results" within the scope of "use".

Glad you finally found Claude Code useful, Tom ;)

On a more serious note: I've found that for debugging difficult issues, o1 Pro is in a league of its own.

Claude Code's eagerness to do work will often fix things given enough time, especially for self-contained pieces of software, but I still find myself going to o1 Pro more often than I'd expect.

A coworker and I did a comparison the other day, where we fired up o1 Pro and Claude Code with the same refactor. o1 Pro one-shotted it, while Claude Code took a few iterations.

Interestingly enough, the _thinking_ time of o1 Pro led us to just commit the Claude Code changes, as both finished in around the same time (1 min 37s vs. 2+ minutes); however, we did end up using some feedback from o1 to fix an issue Claude hadn't caught. YMMV


Do you mean you want to process multiple files with a single LLM call or process multiple files using the same prompt across multiple LLM calls?

(I would recommend the latter)
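
Something like this rough sketch, i.e. one call per file with the same prompt (OpenAI Python SDK shown; the prompt, model, and paths are placeholders):

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    PROMPT = "Extract the one piece of information you need from this file."  # placeholder

    for path in Path("docs").glob("*.txt"):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model works here
            messages=[{"role": "user", "content": f"{PROMPT}\n\n{path.read_text()}"}],
        )
        print(path.name, "->", resp.choices[0].message.content)

Keeping one file per call also keeps each request small and avoids the model mixing up information across files.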


Multiple files with a single LLM call.

I have a prompt that works for a single file in Copilot, but it's slower than doing it by hand: opening the file, finding one specific piece of information, re-saving it manually, running a .bat file to rename it with more of the information, and then filling out the last two bits when entering things.

