Learning to Reason with LLMs

OkGoDoIt · 2024-09-12T19:54:49.000000Z

Some practical notes from digging around in their documentation: In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.

Pricing is $15.00 / 1M input tokens and $60.00 / 1M output tokens. Context window is 128k token, max output is 32,768 tokens.

There is also a mini version with double the maximum output tokens (65,536 tokens), priced at $3.00 / 1M input tokens and $12.00 / 1M output tokens.

The specialized coding version they mentioned in the blog post does not appear to be available for use.

It’s not clear if the hidden chain of thought reasoning is billed as paid output tokens. Has anyone seen any clarification about that? If you are paying for all of those tokens it could add up quickly. If you expand the chain of thought examples on the blog post they are extremely verbose.

https://platform.openai.com/docs/models/o1 https://openai.com/api/pricing/ https://platform.openai.com/docs/guides/rate-limits/usage-ti...

mistersquid · 2024-09-12T22:57:16.000000Z

> Some practical notes from digging around in their documentation: In order to get access to this, you need to be on their tier 5 level, which requires $1,000 total paid and 30+ days since first successful payment.

Tier 5 level required for _API access_. ChatGPT Plus users, for example, also have access to the o1 models.

vdfs · 2024-09-12T20:54:38.000000Z

We just receivied this email:

Hi there,

I’m x, PM for the OpenAI API. I’m pleased to share with you our new series of models, OpenAI o1. We’ve developed these models to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

As a trusted developer on usage tier 5, you’re invited to get started with the o1 beta today. Read the docs You have access to two models:

    Our larger model, o1-preview, which has strong reasoning capabilities and broad world knowledge. 
    Our smaller model, o1-mini, which is 80% cheaper than o1-preview.

Try both models! You may find one better than the other for your specific use case. Both currently have a rate limit of 20 RPM during the beta. But keep in mind o1-mini is faster, cheaper, and competitive with o1-preview at coding tasks (you can see how it performs here). We’ve also written up more about these models in our blog post.

I’m curious to hear what you think. If you’re on X, I’d love to see what you build—just reply to our post.

Best, OpenAI API

activatedgeek · 2024-09-12T20:27:21.000000Z

Reasoning tokens are indeed billed as output tokens.

> While reasoning tokens are not visible via the API, they still occupy space in the model's context window and are billed as output tokens.

From here: https://platform.openai.com/docs/guides/reasoning

baq · 2024-09-12T20:53:29.000000Z

This is concerning - how do you know you aren’t being fleeced out of your money here…? You’ll get your results, but did you really use that much?

jstummbillig · 2024-09-13T04:47:21.000000Z

I think it's fantastic that now, for very little money, everyone gets to share a narrow but stressful subset of what it feels like to employ other people.

Really, I recommend reading this part of the thread while thinking about the analogy. It's great.

baq · 2024-09-13T06:12:13.000000Z

It’s nice on the outside, but employees are actually all different people and this here is one company’s blob of numbers with not much incentive to optimize your cost.

Competition fixes some of this, I hope Anthropic and Mistral are not far behind.

adwn · 2024-09-13T06:36:23.000000Z

> […] with not much incentive to optimize your cost. Competition fixes some of this […]

Just like employing other people!

jstummbillig · 2024-09-13T12:19:52.000000Z

On the contrary. It will be the world's most scrutinized employee. Thousands of people, amongst them important people with big levers, will be screaming in their ear on my behalf constantly, and my — our collective — employee gets better without me having to do anything. It's fantastic!

ta8645 · 2024-09-13T05:13:17.000000Z

Your idea is really a brilliant insight. Revealing.

rohanm93 · 2024-09-14T05:06:28.000000Z

I love this so much haha.

"I can only ask my employee 20 smart things this week for $20?! And they get dumber (gpt-4o) after that? Not worth it!"

gsbcbdjfncnjd · 2024-09-13T05:25:26.000000Z

Any respectable employer/employee relationship transacts on results rather than time anyway. Not sure the analogy is very applicable in that light.

adwn · 2024-09-13T06:34:50.000000Z

> Any respectable employer/employee relationship transacts on results rather than time anyway.

No. This may be common in freelance contracts, but is almost never the case in employment contracts, which specify a time-based compensation (usually either per hour or per month).

ethbr1 · 2024-09-13T17:27:20.000000Z

I believe parent's point was that if ones management is clueless as to how to measure output and compensation/continued employment is unlinked from same... one is probably working for a bad company.

gsbcbdjfncnjd · 2024-09-13T06:47:42.000000Z

Yea, I said ‘respectable’.

KeplerBoy · 2024-09-13T07:08:47.000000Z

That's just not how employment laws are written.

anticensor · 2024-09-13T15:16:23.000000Z

Employment law actually permits per-piece payments too, albeit that type of pay scale is rare.

konschubert · 2024-09-13T06:13:19.000000Z

It is!

rsanek · 2024-09-12T21:47:07.000000Z

obfuscated billing has long been a staple of all great cloud products. AWS innovated in the space and now many have followed in their footsteps

lolinder · 2024-09-12T22:15:27.000000Z

Also, now we're paying for output tokens that aren't even output, with no good explanation for why these tokens should be hidden from the person who paid for them.

HeatrayEnjoyer · 2024-09-12T23:24:25.000000Z

If you read the link they have a section specifically explaining why it is hidden.

lolinder · 2024-09-12T23:58:09.000000Z

I read it. It's a bad explanation.

The only bit about it that feels at all truthful is this bit, which is glossed over but likely the only real factor in the decision:

> after weighing multiple factors including ... competitive advantage ... we have decided not to show the raw chains of thought to users.

infogulch · 2024-09-13T01:09:01.000000Z

Good catch. That indicates that chains of thought are a straightforward approach to make LLMs better at reasoning if you could copy it just by seeing the steps.

HeatrayEnjoyer · 2024-09-16T13:13:26.000000Z

Bad, in your opinion.

RobertDeNiro · 2024-09-12T23:08:48.000000Z

Also seems very impractical to embed this into a deployed product. How can you possibly hope to control and estimate costs? I guess this is strictly meant for R&D purposes.

sebzim4500 · 2024-09-12T23:18:09.000000Z

You can specify the max length of the response, which presumably includes the hidden tokens.

I don't see why this is qualitatively different from a cost perspective than using CoT prompting on existing models.

BoorishBears · 2024-09-13T00:27:41.000000Z

For one, you don't get to see any output at all if you run out of tokens during thinking.

If you set a limit, once it's hit you just get a failed request with no introspection on where and why CoT went off the rails

Aeolun · 2024-09-13T03:07:51.000000Z

Why would I pay for zero output? That’s essentially throwing money down the drain.

dartos · 2024-09-12T23:32:43.000000Z

You can’t verify that you’re paying what you should be if you can’t see the hidden tokens.

sebzim4500 · 2024-09-13T15:41:17.000000Z

With the conventional models you don't get the activations or the logits even though those would be useful.

Ultimately if the output of the model is not worth what you end up paying for it then great, I don't see why it really matters to you whether OpenAI is lying about token counts or not.

dartos · 2024-09-13T19:07:48.000000Z

As a single user, it doesn’t really, but as a SaaS operator I want tractable, hopefully predictable pricing.

I wouldn’t just implicitly trust a vendor when they say “yeah we’re just going to charge you for what we feel like when we feel like. You can trust us.”

HarHarVeryFunny · 2024-09-13T02:27:42.000000Z

They are currently trying to raise money (talk of new $150B valuation), so that may have something to do with it

Emiledel · 2024-09-13T00:31:01.000000Z

In the UI the reasoning is visible. The API can probably return it too, just check the code

AlphaAndOmega0 · 2024-09-13T08:00:31.000000Z

OAI doesn't show the actual COT, on the grounds that it's potentially unsafe output and also to prevent competitors training on it. You only see a sanitized summary.

og_kalu · 2024-09-13T01:22:30.000000Z

What's shown in the UI is a summary of the reasoning

creatonez · 2024-09-13T08:07:45.000000Z

No access to reasoning output seems totally bonkers. All of the real cost is in inference, assembling an HTTP request to deliver that result seems trivial?

arnaudsm · 2024-09-12T22:32:35.000000Z

Some of the queries run for multiple minutes. 40 tokens/sec is too slow for CoT.

I hope OpenAI is investing in low-latency like Groq's tech that can reach 1k tokens/sec.

p-e-w · 2024-09-13T02:14:09.000000Z

It's slow and expensive if you compare it with other LLMs.

It's lightning fast and dirt cheap if you compare it to consulting with a human expert, which it appears to be competitive with.

barrell · 2024-09-13T04:20:58.000000Z

I would say consulting with a human. Any expert who has a conversation with chatGPT about their field will verify that it is very far from expert

p-e-w · 2024-09-13T04:25:20.000000Z

According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.

adonese · 2024-09-13T04:54:09.000000Z

>According to the data provided by OpenAI, that isn't true anymore

OpenAI main job is to sell that their models are better than human. I still remember when they're marketing their gpt-2 weights as too dangerous to release.

mptest · 2024-09-14T04:39:40.000000Z

I remember that too, it's when I started following the space (shout out computerphile/robert miles) and iirc the reason they gave was not "it's too dangerous cause it's so badass" they basically were correct in that it can produce sufficiently "human" output as to break typical bot detectors on social media which is a legitimate problem - whether the repercussions of that failure to detect botting is meaningful enough to be considered "dangerous" is up to the reader to decide

also worth noting I don't agree with the comment you're replying to - but did want to add context to the situation of gpt-2

barrell · 2024-09-13T07:49:18.000000Z

What? Surely you have some area of your life you are above-average knowledgable about. Have a conversation with chatGPT about it, with whatever model, and you can see for yourself it is far from expert level.

You are not "trusting data more than anecdotal claims", you are trusting marketing over reality.

Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.

PS: I stand to gain heavily if AI systems could perform at an expert level, this is not a claim from someone 'whose job is being threatened'.

Extasia785 · 2024-09-13T06:01:27.000000Z

> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

Did you read the post? OpenAI clearly states that the results are cherry-picked. Just a random query will have far worse results. To get equal results you need to ask the same query dozens of time and then have enough expertise to pick the best one, which might be quite hard for a problem that you have little idea about.

Combine this with the fact that this blog post is a sales pitch with the very best test results out of probably many more benchmarks we will never see and it seems obvious that human experts are still several order of magnitudes ahead.

njndtu · 2024-09-13T19:00:03.000000Z

When I read that line too I was very confused lol. I interpreted it as them saying they basically took other contestant submissions and allowing the model to see these "solutions" as part of context? and then having the model generate its own "solution" to be used for the benchmark. I fail to see how this is "solving" a ioi level question.

What is interesting is the following paragraph in the post " With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy. " So they didn't allow sampling from other contest solutions here? If that is the case quite interesting, since the model is effectively imo able to brute force questions. Provided you have some form of a validator able to tell it to halt.

I came across one of the ioi questions this year that I had trouble solving (I am pretty noob tho) which made me curious about how these reported results were reflected. The question at hand being https://github.com/ioi-2024/tasks/blob/main/day2/hieroglyphs... Apparently, the model was able to get it partially correct. https://x.com/markchen90/status/1834358725676572777

attentive · 2024-09-13T00:16:25.000000Z

So, basically, it's chain of thought as a service?

Not a model, per se, but a service that chains multiple model requests behind the scene?

KeplerBoy · 2024-09-13T07:11:14.000000Z

Who knows? Certainly not the public.

It might be a finetuned model that works better in such a setting.

OkGoDoIt · 2024-09-13T16:48:15.000000Z

The linked blog posts explains that it is fine-tuned on some reinforcement learning process. It doesn’t go into details but they do claim it’s not just the base model with chain of thought, there’s some fine-tuning going on.

liamwire · 2024-09-12T21:05:28.000000Z

Unless this is specifically relating to API access, I don’t think it’s correct. I’ve been paying for ChatGPT via the App Store IAP for around a year or less, and I’ve already got both o1-preview and o1-mini available in-app.

OkGoDoIt · 2024-09-12T21:17:14.000000Z

Yes, I was referring to API access specifically. Nothing in the blog post or the documentation mentions access to these new models on ChatGPT, and even as a paid user I’m not seeing them on there (Edit: I am seeing it now in the app). But looks like a bunch of other people in this discussion do have it on ChatGPT, so that’s exciting to hear.

kordlessagain · 2024-09-13T17:05:10.000000Z

I'm a bit late to the show, but it would seem the API calls for these new models don't support system messages (where role is system) or the tool list for function calls.

sashank_1509 · 2024-09-12T21:03:54.000000Z

I have access to this and there is no way I spend more than 50$ on OpenAI api. I have ChatGPT + since day q though (240$ probably in total)

rpmisms · 2024-09-12T21:52:45.000000Z

You missed your raise key on "day q"

thelastparadise · 2024-09-12T23:53:38.000000Z

Raise it up just one

amrrs · 2024-09-12T20:27:26.000000Z

The CoT is billed as output tokens. Mentioned in the docs where it talks about reasoning

jonahx · 2024-09-12T22:49:27.000000Z

I am an ordinary plus user (since it was released more or less) and have access.

Buttons840 · 2024-09-12T21:04:22.000000Z

I am a Plus user and pay $20 per month. I have access to the o1 models.

kinj28 · 2024-09-13T01:56:34.000000Z

A bit out of context.

Am curious if at some point length of context window stops playing any material difference in the output and it just stops making any economical sense as law of marginal diminishing utility kicks in.

anigbrowl · 2024-09-12T22:42:13.000000Z

you need to be on their tier 5 level, which requires $1,000 total paid and [...]

Good opening for OpenAI's competitors to run a 'we're not snobs' promotion.

infecto · 2024-09-12T23:41:08.000000Z

How so? I think most of the competition does this. Early partners/heavy users get access first which 1) hopefully provides feedback on the product and 2) provides a mechanism to stagger the release.

anigbrowl · 2024-09-13T23:14:10.000000Z

Marketing is about feelings, not facts.

infecto · 2024-09-15T01:49:53.000000Z

We could tell it impacted your feelings but most business don’t run off of feelings. There is sometimes alignment on morals/being a good business partner but before that it’s the quality of the product and the cost.

ARandumGuy · 2024-09-12T17:47:41.000000Z

One thing that makes me skeptical is the lack of specific labels on the first two accuracy graphs. They just say it's a "log scale", without giving even a ballpark on the amount of time it took.

Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.

The coding section indicates "ten hours to solve six challenging algorithmic problems", but it's not clear to me if that's tied to the graphs at the beginning of the article.

The article contains a lot of facts and figures, which is good! But it doesn't inspire confidence that the authors chose to obfuscate the data in the first two graphs in the article. Maybe I'm wrong, but this reads a lot like they're cherry picking the data that makes them look good, while hiding the data that doesn't look very good.

swatcoder · 2024-09-12T18:13:49.000000Z

> Did the 80% accuracy test results take 10 seconds of compute? 10 minutes? 10 hours? 10 days? It's impossible to say with the data they've given us.

The gist of the answer is hiding in plain sight: it took so long, on an exponential cost function, that they couldn't afford to explore any further.

The better their max demonstrated accuracy, the more impressive this report is. So why stop where they did? Why omit actual clock times or some cost proxy for it from the report? Obviously, it's because continuing was impractical and because those times/costs were already so large that they'd unfavorably affect how people respond to this report

jsheard · 2024-09-12T18:35:20.000000Z

See also: them still sitting on Sora seven months after announcing it. They've never given any indication whatsoever of how much compute it uses, so it may be impossible to release in its current state without charging an exorbitant amount of money per generation. We do know from people who have used it that it takes between 10 and 20 minutes to render a shot, but how much hardware is being tied up during that time is a mystery.

ben_w · 2024-09-12T18:50:00.000000Z

Could well be.

It's also entirely possible they are simply sincere about their fear it may be used to influence the upcoming US election.

Plenty of people (me included) are sincerely concerned about the way even mere still image generators can drown out the truth with a flood of good-enough-at-first-glance fiction.

jsheard · 2024-09-12T18:52:28.000000Z

If they were sincere about that concern then they wouldn't build it at all, if it's ever made available to the public then it will eventually be available during an election. It's not like the 2024 US presidential election is the end of history.

e1g · 2024-09-12T19:25:00.000000Z

The risk is not “interfering with the US elections”, but “being on the front page of everything as the only AI company interfering with US elections”. This would destroy their peacocking around AGI/alignment while raising billions from pension funds.

OpenAI is in a very precarious position. Maybe they could survive that hit in four years, but it would be fatal today. No unforced errors.

smegger001 · 2024-09-12T20:23:02.000000Z

i think the hope is by the next presidential election no one will trust video anymore anyway so the new normal wont be as chaotic as if the dropped in the middle of an already contentious election.

as for not building it at all its a obvious next step in generative ai models that if they don't make it someone else will anyway.

Aeolun · 2024-09-13T03:13:28.000000Z

Wouldn’t it be nice if we came full circle and went to listen to our politicians live because anything else would be pointless.

ben_w · 2024-09-13T07:43:39.000000Z

I'd give it about 20 years before humanoid robots can be indistinguishable from originals without an x-ray or similar — covering them in vat-grown cultures of real human skin etc. is already possible but the robots themselves aren't good enough to fool anyone.

smegger001 · 2024-09-13T03:43:23.000000Z

unfortunately that would mean two firstly things only swing states would get to hear what politicians are actually saying and secondly to reach everyone the primary process would have to start even earlier so the candidates would have a chance to give enough speeches before early voting

bamboozled · 2024-09-12T21:08:27.000000Z

Even if Kamala wins (praise be to god that she does), those people aren't just going to go away until social media does. Social media is the cause of a lot of the conspiracy theory mania.

So yeah, better to never release the model...even though Elon would in a second if he had it.

digging · 2024-09-12T18:56:03.000000Z

Doesn't strike me as the kind of principle OpenAI is willing to slow themselves down for, to be honest.

dvfjsdhgfv · 2024-09-12T18:53:06.000000Z

But this cat run out of the bag years ago, didn't it? Trump himself is using AI-generated images in his campaign. I'd go even further: the more fake images appear, the faster the society as a whole will learn to distrust anything by default.

01HNNWZ0MV43FF · 2024-09-12T19:35:49.000000Z

Personally I'm not a fan of accelerationism

ben_w · 2024-09-12T21:03:11.000000Z

Nothing works without trust, none of us is an island.

Everyone has a different opinion on what threshold of capability is important, and what to do about it.

Atotalnoob · 2024-09-12T18:59:30.000000Z

Why did they release this model then?

ben_w · 2024-09-12T21:33:31.000000Z

Their public statements that the only way to safely learn how to deal with the things AI can do, is to show what it can do and get feedback from society:

"""We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.""" - https://openai.com/index/planning-for-agi-and-beyond/

I don't know if they're actually correct, but it at least passes the sniff test for plausibility.

gloryjulio · 2024-09-12T18:50:24.000000Z

Also the the sora videos are proven to be modified ads. We still need to see how it perform first

MrNeon · 2024-09-12T21:03:02.000000Z

> Also the the sora videos are proven to be modified ads

Can't find anything about that, you got a link?

gloryjulio · 2024-09-12T21:52:12.000000Z

https://futurism.com/the-byte/openai-sora-demo https://old.reddit.com/r/vfx/comments/1cuj360/turns_out_that...

here is the link. The balloon video had heavy editing involved.

MrNeon · 2024-09-13T13:30:18.000000Z

Oh, so not the actual demo videos OpenAI shared on their website and twitter.

gloryjulio · 2024-09-14T03:26:46.000000Z

We still need to see those demos in action though. That's the big IF where everyone is thinking about

MrNeon · 2024-09-14T21:50:05.000000Z

Sure but "Also the the sora videos are proven to be modified ads" is demonstrably false, for the demos OpenAI shared and the artist made ones.

gloryjulio · 2024-09-16T16:16:02.000000Z

https://www.youtube.com/watch?v=9oryIMNVtto

Isn't this balloon video shared by openai? How is this not counted? For others I don't have evidences. But this balloon video case is enough to cast the doubts.

adroniser · 2024-09-12T22:57:55.000000Z

But there are lots of models available now that render much faster which are better quality than sora

wmf · 2024-09-12T17:56:33.000000Z

People have been celebrating the fact that tokens got 100x cheaper and now here's a new system that will use 100x more tokens.

jsheard · 2024-09-12T18:07:24.000000Z

Also you now have to pay for tokens you can't see, and just have to trust that OpenAI is using them economically.

brookst · 2024-09-12T18:09:18.000000Z

Token count was always an approximation of value. This may help break that silly idea.

regularfry · 2024-09-12T18:41:47.000000Z

I don't think it's much good as an approximation of value, but it seems ok as an approximation of cost.

brookst · 2024-09-15T04:40:16.000000Z

Fair, cost and value are only loosely related. Trying to price based on cost always turns into a mess.

regularfry · 2024-09-16T07:30:51.000000Z

Its what you do when you're a commodity.

seydor · 2024-09-12T18:11:46.000000Z

If it 's reasoning correctly, it shouldnt need a lot of tokens because you don't need to correct it.

You only need to ask it to solve nuclear fusion once.

from-nibly · 2024-09-12T18:17:32.000000Z

As someone experienced with operations / technical debt / weird company specific nonsense (Platform Engineer). No, you have to solve nuclear fusion at <insert-my-company>. You gotta do it over and over again. If it were that simple we wouldn't have even needed AI we would have hand written a few things, and then everything would have been legos, and legos of legos, but it takes a LONG time to find new true legos.

outofpaper · 2024-09-12T18:53:32.000000Z

I'm pretty sure everything is Lego and Legos of Legos.

You show me something new and I say look down at who's shoulders we're standing on, what libraries we've build with.

from-nibly · 2024-09-14T20:16:31.000000Z

Yeah but thats not a Lego. A Lego is something that fits everwhere else. Not just previous work. There's a lot of previous work. There are very few true Legos.

FridgeSeal · 2024-09-12T23:02:01.000000Z

Yeah you’re right, all businesses are made of identical, interchangeable parts that we can swap out at our leisure.

This is why enterprises change ERP systems frictionlessly, and why the field of software engineering is no longer required. In fact, given that apparently, all business is solved, we can probably just template them all out, call it a day and all go home.

0x_rs · 2024-09-12T18:28:31.000000Z

AlphaFold simulated the structure of over 200 million proteins. Among those, there could be revolutionary ones that could change the medical scientific field forever, or they could all be useless. The reasoning is sound, but that's as far as any such tool can get, and you won't know it until you attempt to implement it in real life. As long as those models are unable to perfectly recreate the laws of the universe to the maximum resolution imaginable and follow them, you won't see an AI model, let alone a LLM, provide anything of the sort.

joquarky · 2024-09-13T06:07:41.000000Z

Perhaps GenAI may point out a blind spot, just as a kid may see something the adults do not due to stale heuristics

charlescurt123 · 2024-09-12T18:20:27.000000Z

with these methods the issue is the log scale of compute. Let's say you ask it to solve fusion. It may be able to solve it but the issue is it's unverifiable WHICH was correct.

So it may generate 10 Billion answers to fusion and only 1-10 are correct.

There would be no way to know which one is correct without first knowing the answer to the question.

This is my main issue with these methods. They assume the future via RL then when it gets it right they mark that.

We should really be looking at methods of percentage it was wrong rather then it was right a single time.

genewitch · 2024-09-12T18:42:17.000000Z

This sounds suspiciously like the reason that quantum compute is not ready for prime-time yet.

msp26 · 2024-09-12T18:14:29.000000Z

Have you seen how long the CoT was for the example. It's incredibly verbose.

slt2021 · 2024-09-12T18:53:26.000000Z

I find there is an educational benefit in verbosity, it helps to teach user to think like a machine

legel · 2024-09-12T19:34:52.000000Z

Which is why it is incredibly depressing that OpenAI will not publish the raw chain of thought.

“Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”

slt2021 · 2024-09-12T20:28:14.000000Z

maybe they will enable to show CoT for a limited uses, like 5 prompts a day for Premium users, or for Enterprise users with agreement not to steal CoT or something like that.

if OpenAI sees this - please allow users to see CoT for a few prompts per day, or add it to Azure OpenAI for Enterprise customers with legal clauses not to steal CoT

briansm · 2024-09-13T07:10:00.000000Z

Imagine if this tech was available in the middle ages and it was asked to 'solve' alchemy or perpetual motion, and responded that it was an impossible problem... people would (irrationally from our perspective) go Luddite on it I suspect. Now apply to the 'fusion power' problem.

zamadatix · 2024-09-12T18:52:31.000000Z

The new thing that can do more at the "ceiling" price doesn't remove your ability to still use the 100x cheaper tokens for the things that were doable on that version.

mewpmewp2 · 2024-09-12T18:45:24.000000Z

Isn't that part of developing a new tech?

digging · 2024-09-12T19:00:45.000000Z

That exact pattern is always true of technological advance. Even for a pretty broad definition of technology. I'm not sure if it's perfectly described by the name "induced demand" but it's basically the same thing.

energy123 · 2024-09-12T23:07:15.000000Z

It does dispel this idea that we are going to be flooded with too many GPUs.

olalonde · 2024-09-13T01:51:44.000000Z

"People have been celebrating the fact that RAM got 100x cheaper and now here's a new system that will use 100x more RAM."

anticensor · 2024-09-13T15:42:17.000000Z

Known as Wirth's law.

esafak · 2024-09-12T18:32:33.000000Z

...while providing a significant advance. That's a good problem.

cowpig · 2024-09-12T18:01:10.000000Z

Isn't that part of the point?

worstspotgain · 2024-09-12T18:48:39.000000Z

I don't think it's hard to compute the following:

- At the high end, there is a likely nonlinear relationship between answer quality and compute.

- We've gotten used to a flat-price model. With AGI-level models, we might have to pay more for more difficult and more important queries. Such is the inherent complexity involved.

- All this stuff will get better and cheaper over time, within reason.

I'd say let's start by celebrating that machine thinking of this quality is possible at all.

jstummbillig · 2024-09-12T18:10:45.000000Z

I don't think it's worth any debate. You can simply find out how it does for you, now(-ish, rolling out).

In contrast: Gemini Ultra, the best, non-existent Google Model for the past few month now, that people nonetheless are happy to extrapolate excitement over.

FridgeSeal · 2024-09-12T22:58:22.000000Z

Bold of you to expect transparency and clarity from a company like OpenAI.

You wanted reliable readable graphs? Ppphhh, get out of here, but pay of for the CoT tokens you’ll never see on your way out though.

packetlost · 2024-09-12T17:57:33.000000Z

When one axis is on log scale and the other is linear with the plot points appearing linear-ish, doesn't it mean there's a roughly exponential relationship between the two axis?

ARandumGuy · 2024-09-12T18:12:15.000000Z

It'd be more accurate to call it a logarithmic relationship, since compute time is our input variable. Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time.

In either case, that still doesn't excuse not labeling your axis. Taking 10 seconds vs 10 days to get 80% accuracy implies radically different things on how developed this technology is, and how viable it is for real world applications.

Which isn't to say a model that takes 10 days to get an 80% accurate result can't be useful. There are absolutely use cases where that could represent a significant improvement on what's currently available. But the fact that they're obfuscating this fairly basic statistic doesn't inspire confidence.

packetlost · 2024-09-12T18:17:51.000000Z

> Which itself is a bit concerning, as that implies that modest gains in accuracy require exponentially more compute time

This is more of what I was getting at. I agree they should label the axis regardless, but I think the scaling relationship is interesting (or rather, concerning) on its own.

KK7NIL · 2024-09-12T18:34:28.000000Z

The absolute time depends on hardware, optimizations, exact model, etc; it's not a very meaningful number to quantify the reinforcement technique they've developed, but it is very useful to estimate their training hardware and other proprietary information.

j_maffe · 2024-09-12T19:22:35.000000Z

It's not about the literally quantity/value, it's about the order of growth of output vs input. Hardware and optimizations don't really change that.

KK7NIL · 2024-09-12T19:41:09.000000Z

Exactly, that's why the absolute computation time doesn't matter, only relative growth, which is exactly what they show.

diedyesterday · 2024-09-14T02:33:23.000000Z

A linear graph with a log scale on the vertical axis means the original graph had near exponential growth.

A linear graph with a log scale on the the horizontal axis means the original graph had law of diminishing return kick it (somewhat similar to logarithmic but with a vertical asymptote).

bluecoconut · 2024-09-12T18:41:08.000000Z

Super hand-waving rough estimate: Going off of five points of reference / examples that sorta all point in the same direction. 1. looks like they scale up by about ~100-200 on the x axis when showing that test time result. 2. Based on the o1-mini post [1], there's an "inference cost" where you can see GPT-4o and GPT-4o mini as dots in the bottom corner, haha (you can extract X values, ive done so below) 3. There's a video showing the "speed" in the chat ui (3s vs. 30s) 4. The pricing page [2] 5. On their API docs about reasoning, they quantify "reasoning tokens" [3]

First, from the original plot, we have roughly 2 orders of magnitude to cover (~100-200x)

Next, from the cost plots: super handwaving guess, but since 5.77 / 0.32 = ~18, and the relative cost for gpt-4o vs gpt-4o-mini is ~20-30, this roughly lines up. This implies that o1 costs ~1000x the cost than gpt-4o-mini for inference (not due to model cost, just due to the raw number of chain of thought tokens it produces). So, my first "statement", is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page to accurately represent "cost" of inference for these benchmark tests. This is now a "cost" relative set of numbers between o1 and 4o models.

I'm also going to make an assumption that o1 is roughly the same size as 4o inherently, and then from that and the SVG, roughly going to estimate that they did a "net" decoding of ~100x for the o1 benchmarks in total. (5.77 vs (354.77 - 635)).

Next, from the CoT examples they gave us, they actually show the CoT preview where (for the math example) it says "...more lines cut off...", A quick copy paste of what they did include includes ~10k tokens (not sure if copy paste is good though..) and from the cipher text example I got ~5k tokens of CoT, while there are only ~800 in the response. So, this implies that there's a ~10x size of response (decoded tokens) in the examples shown. It's possible that these are "middle of the pack" / "average quality" examples, rather than the "full CoT reasoning decoding" that they claim they use. (eg. from the log scale plot, this would come from the middle, essentially 5k or 10k of tokens of chain of thought). This also feels reasonable, given that they show in their API [3] some limits on the "reasoning_tokens" (that they also count)

All together, the CoT examples, pricing page, and reasoning page all imply that reasoning itself can be variable length by about ~100x (2 orders of magnitude), eg. example: 500, 5k (from examples) or up to 65,536 tokens of reasoning output (directly called out as a maximum output token limit).

Taking them on their word that "pass@1" is honest, and they are not doing k-ensembles, then I think the only reasonable thing to assume is that they're decoding their CoT for "longer times". Given the roughly ~128k context size limit for the model, I suspect their "top end" of this plot is ~100k tokens of "chain of thought" self-reflection.

Finally, at around 100 tokens per second (gpt-4o decoding speed), this leaves my guess for their "benchmark" decoding time at the "top-end" to be between ~16 minutes (full 100k decoding CoT, 1 shot) for a single test-prompt, and ~10 seconds on the low end. So for that X axis on the log scale, my estimate would be: ~3-10 seconds as the bottom X, and then 100-200x that value for the highest value.

All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete. I also believe that the "decoding cost" of o1 model is very close to the decoding cost of 4o, just that it requires many more reasoning tokens to complete. (and then o1-mini is comparable to 4o-mini, but also requiring more reasoning tokens)

[1] https://openai.com/index/openai-o1-mini-advancing-cost-effic...

  Extracting "x values" from the SVG:
  GPT-4o-mini: 0.3175
  GPT-4o: 5.7785
  o1: (354.7745, 635)
  o1-preview: (278.257, 325.9455)
  o1-mini: (66.8655, 147.574)

[2] https://openai.com/api/pricing/

  gpt-4o:
  $5.00 / 1M input tokens
  $15.00 / 1M output tokens

  o1-preview:
  $15.00 / 1M input tokens
  $60.00 / 1M output tokens

[3] https://platform.openai.com/docs/guides/reasoning

  usage: {
    total_tokens: 1000,
    prompt_tokens: 400,
    completion_tokens: 600,
    completion_tokens_details: {
      reasoning_tokens: 500
    }
  }

bluecoconut · 2024-09-12T19:15:27.000000Z

Some other follow up reflections

1. I wish that Y-axes would switch to be logit instead of linear, to help see power-law scaling on these 0->1 measures. In this case, 20% -> 80% it doesn't really matter, but for other papers (eg. [2] below) it would help see this powerlaw behavior much better.

2. The power law behavior of inference compute seems to be showing up now in multiple ways. Both in ensembles [1,2], as well as in o1 now. If this is purely on decoding self-reflection tokens, this has a "limit" to its scaling in a way, only as long as the context length. I think this implies (and I am betting) that relying more on multiple parallel decodings is more scalable (when you have a better critic / evaluator).

For now, instead of assuming they're doing any ensemble like top-k or self-critic + retries, the single rollout with increasing token size does seem to roughly match all the numbers, so that's my best bet. I hypothesize we'd see a continued improvement (in the same power-law sort of way, fundamentally along with the x-axis of "flop") if we combined these longer CoT responses, with some ensemble strategy for parallel decoding and then some critic/voting/choice. (which has the benefit of increasing flops (which I believe is the inference power-law), while not necessarily increasing latency)

[1] https://arxiv.org/abs/2402.05120 [2] https://arxiv.org/abs/2407.21787

bluecoconut · 2024-09-12T19:41:45.000000Z

oh, they do talk about it

  On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.

showing that as they increase the k of ensemble, they can continue to get it higher. All the way up to 93% when using 1000 samples.

620gelato · 2024-09-12T19:53:47.000000Z

I think I'd be curious to know, if the size of ensemble is another scaling dimension for compute, alongside the "thinking time".

skywhopper · 2024-09-12T18:36:17.000000Z

Yeah, this hiding of the details is a huge red flag to me. Even if it takes 10 days, it’s still impressive! But if they’re afraid to say that, it tells me they are more concerned about selling the hype than building a quality product.

bjornsing · 2024-09-12T18:19:15.000000Z

So now it’s a question of how fast the AGI will run? :)

HarHarVeryFunny · 2024-09-12T19:05:25.000000Z

It's not AGI - it's tree of thoughts, driven by some RL-derived heuristics.

I suppose what this type of approach provides is better prediction/planning by using more of what the model learnt during training, but it doesn't address the model being able to learn anything new.

It'll be interesting to see how this feels/behaves in practice.

juliend2 · 2024-09-12T22:17:59.000000Z

I see this pattern coming where we're still able to say:

"It's not AGI - it's X, driven by Y-driven heuristics",

but that's going to effectively be an AGI if given enough compute/time/data.

Being able to describe the theory of how it's doing its thing sure is reassuring though.

suziemanul · 2024-09-13T09:30:09.000000Z

Yes.. we have 60% that inception will happen within 24 months.

oblio · 2024-09-12T18:25:35.000000Z

It's fine, it will only need to be powered by a black hole to run.

atomic128 · 2024-09-12T23:36:52.000000Z

Nuclear fission is the answer.

The company Oracle just announced that it is designing data centers with small modular nuclear reactors:

https://news.ycombinator.com/item?id=41505514

There are already 440 nuclear reactors operating in 32 countries today.

Sam Altman owns a stake in Oklo, a small modular reactor company. Bill Gates has a huge stake in his TerraPower reactor company. In China, 5 reactors are being built every year. You just don't hear about it... yet.

No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.

Nuclear fission is safe, clean, secure, and reliable.

An investor might consider buying physical uranium (via ticker SRUUF in America) or buying Cameco (via ticker CCJ).

Cameco is the dominant Canadian uranium mining company that also owns Westinghouse. Westinghouse licenses the AP1000 pressurized water reactor used at Vogtle in the U.S. as well as in China.

oblio · 2024-09-13T10:36:42.000000Z

Hey, I got a random serious comment about nuclear power :-)))

To your point:

> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.

Like nuclear winter caused by a nuclear power plant blowing up and everyone confusing the explosion with the start of a nuclear war? :-p

On a more serious note:

> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather. It's like range anxiety in an electric car. If you have N days of battery storage and the sun doesn't shine for N+1 days, you're in trouble.

We still have hydro plants, wind power, geothermal, long distance electrical transmission, etc. Also, what's "doesn't shine"? Solar panels generate power as long as it's not night and it's never night all the time around the world.

Plus they're developing sodium batteries, if you want to put your money somewhere, put it there. Those will be super cheap and they're the perfect grid-level battery.

roenxi · 2024-09-13T11:20:22.000000Z

> ... and it's never night all the time around the world.

I'm not sure that is 100% true. >99.99% true, but it can happen in practice. https://www.newsweek.com/when-sun-disappeared-historians-det...

oblio · 2024-09-13T14:19:12.000000Z

The wind was still blowing, the rivers were still flowing, ... :-)

fragmede · 2024-09-13T11:11:40.000000Z

> No amount of batteries can protect a solar/wind grid from an arbitrarily extended period of "bad" weather.

Sure there is, let's do some math. Just like we can solve all of the Earth's energy needs with a solar array the size of Lithuania or West Virginia, we can do some simple math to see how many batteries we'd need to protect a solar grid.

Let's say the sun doesn't shine for an entire year. That seems like a large enough N such that we won't hit N+1. If the sun doesn't shine for an entire year, we're in some really serious trouble, even if we're still all-in on coal.

Over 1 year, humanity uses roughly 24,000 terawatt-hours of energy. Let's assume batteries are 100% efficient storage (they're not) and that we're using lithium ion batteries, which we'll say have an energy density of 250 watt-hours per liter (Wh/L). The math then says we need 96 km³ of batteries protect a solar grid from having the sun not shine for an entire year.

Thus, the amount of batteries to protect a solar grid is 1.92 quadrillion 18650 batteries, or a cube 4.6 kilometers along each side. This is about 24,000 year's worth of current world wide battery production.

That's quite a lot! If we try for N = 4 months for winter, that is to say, if the sun doesn't shine at all in the winter, then we'd need 640 trillion 18650 cell, or 8,000 years of current global production, but at least this would only be 32 km³, or a cube with 3.2 km sides.

Still wildly out of reach, but this is for all of humanity, mind you.

Anyway, point is, they said Elon was mad for building the original gigafactory, but it turns out that was a prudent investment. It now accounts for some 10% of the world's lithium ion battery production and demand for lithium-ion batteries doesn't seem to be letting up.

oblio · 2024-09-13T14:20:28.000000Z

Well, you have to take into account that if something like that were to happen, within 1 week we'd have curfews and rationing everywhere. So those 24 000 TWh probably become 5-6 000, or something like that.

Plus we'd still have hydro, wind, geothermal, etc, etc.

exe34 · 2024-09-12T18:30:39.000000Z

the first one anyway. after that it will find more efficient ways. we did, afterall.

wahnfrieden · 2024-09-12T20:43:25.000000Z

it's not obviously achievable. for instance, we don't have the compute power to simulate cellular organisms of much complexity, and have not found efficiencies to scale that

HeatrayEnjoyer · 2024-09-12T23:26:54.000000Z

Human level AGI only requires 20 watts

oblio · 2024-09-13T10:37:20.000000Z

With a mechanism for AGI we don't comprehend at all.

Airplanes don't fly by flapping their wings.

exe34 · 2024-09-13T13:02:51.000000Z

knowing that a mechanism exists is enough to motivate us to find one that works with our current or achievable tech.

oblio · 2024-09-13T19:53:07.000000Z

We were motivated to fly since the earliest human.

exe34 · 2024-09-13T21:32:44.000000Z

and we've had stories of artificial humans since we've been writing down stories!

suziemanul · 2024-09-13T09:31:08.000000Z

This is still the missing piece of the puzzle.

ComputerGuru · 2024-09-12T18:47:09.000000Z

The "safety" example in the "chain-of-thought" widget/preview in the middle of the article is absolutely ridiculous.

Take a step back and look at what OpenAI is saying here "an LLM giving detailed instructions on the synthesis of strychnine is unacceptable, here is what was previously generated <goes on to post "unsafe" instructions on synthesizing strychnine so anyone Googling it can stumble across their instructions> vs our preferred, neutered content <heavily rlhf'd o1 output here>"

What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"

w4 · 2024-09-12T19:27:09.000000Z

There are two basic versions of “safety” which are related, but distinct:

One version of “safety” is a pernicious censorship impulse shared by many modern intellectuals, some of whom are in tech. They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink. This is bad, and should be resisted.

The other form of “safety” is a very prudent impulse to keep these sorts of potentially dangerous outputs out of AI models’ autoregressive thought processes. The goal is to create thinking machines that can act independently of us in a civilized way, and it is therefore a good idea to teach them that their thought process should not include, for example, “It would be a good idea to solve this problem by synthesizing a poison for administration to the source of the problem.” In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it. This is, incidentally, exactly how human society works already. We have a ton of very cute unaligned general intelligences running around (children), and parents and society work really hard to teach them what’s right and wrong so that they can behave ethically when they’re eventually out in the world on their own.

jazzyjackson · 2024-09-12T19:31:53.000000Z

Third version is "brand safety" which is, we don't want to be in a new york times feature about 13 year olds following anarchist-cookbook instructions from our flagship product

klabb3 · 2024-09-12T20:07:15.000000Z

And the fourth version, which is investor-regulator safety mid point: so capable and dangerous that competitors shouldn’t even be allowed to research it, but just safe enough that only our company is responsible enough to continue mass commercial consumer deployment without any regulations at all. It’s a fine line.

drewbeck · 2024-09-13T00:44:55.000000Z

This is imo the most important one to the businesses creating these models and is way under appreciated. Folks who want a “censorship-free” model from businesses don’t understand what a business is for.

darby_nine · 2024-09-12T20:54:24.000000Z

...which is silly. Search engines never had to deal with this bullshit and chatbots are search without actually revealing the source.

w4 · 2024-09-12T21:27:10.000000Z

I don’t know. The public’s perception - encouraged by the AI labs because of copyright concerns - is that the outputs of the models are entirely new content created by the model. Search results, on the other hand, are very clearly someone else’s content. It’s therefore not unfair to hold the model creators responsible for the content the model outputs in a different way than search engines are held responsible for content they link, and therefore also not unfair for model creators to worry about this. It is also fair to point this out as something I neglected to identify as an important permutation of “safety.”

I would also be remiss to not note that there is a movement to hold search engines responsible for content they link to, for censorious ends. So it is unfortunately not as inconsistent as it may seem, even if you treat the model outputs as dependent on their inputs.

darby_nine · 2024-09-13T05:53:26.000000Z

You could just as easily argue that model creators don't own the model either—it's like charging admission to someone else's library.

HeatrayEnjoyer · 2024-09-12T23:22:59.000000Z

Are you saying chatbots don't offer anything useful over search engines? That's clearly not the case or we wouldn't be having this conversation.

It's one thing to have a pile of chemistry text books and another to hire a professional chemist telling you exactly what to do and what to avoid.

darby_nine · 2024-09-13T06:11:25.000000Z

> Are you saying chatbots don't offer anything useful over search engines? That's clearly not the case or we wouldn't be having this conversation.

No, but that is the value that's clear as of today—RAGs. Everything else is just assuming someone figures out a way to make them useful one day in a more general sense.

Anyway, even on the search engine front they still need to figure out how to get these chatbots to cite their sources outside of RAGs or it's still just a precursor to a search to actually verify what it spits out. Perplexity is the only one I know that's capable of this and I haven't looked closely; it could just be a glorified search engine.

madeofpalk · 2024-09-13T00:46:30.000000Z

Search engines 'censor' their results frequently.

reliabilityguy · 2024-09-12T19:51:56.000000Z

Do you think that 13 year olds today can’t find this book on their own?

jazzyjackson · 2024-09-12T21:15:08.000000Z

Like I said they're not worried about the 13 year olds theyre worried about the media cooking up a faux outrage about 13 year olds

YouTube re engineered its entire approach to ad placement because of a story in the NY Times* shouting about a Proctor Gamble ad run before an ISIS recruitment video. That's when Brand Safety entered the lexicon of adtech developers everywhere.

Edit: maybe it was CNN, I'm trying to find the first source. there's articles about it since 2015 but I remember it was suddenly an emergency in 2017

*Edit Edit: it was The Times of London, this is the first article in a series of attacks, "big brands fund terror", "taxpayers are funding terrorism"

Luckily OpenAI isn't ad supported so they can't be boycott like YouTube was, but they still have an image to maintain with investors and politicians

https://www.thetimes.com/business-money/technology/article/b...

https://digitalcontentnext.org/blog/2017/03/31/timeline-yout...

derefr · 2024-09-12T20:15:53.000000Z

No, and they can find porn on their own too. But social media services still have per-poster content ratings, and user-account age restrictions vis-a-vis viewing content with those content ratings.

The goal isn’t to protect the children, it’s CYA: to ensure they didn’t get it from you, while honestly presenting as themselves (as that’s the threshold that sets the moralists against you.)

———

Such restrictions also can work as an effective censorship mechanism… presuming the child in question lives under complete authoritarian control of all their devices and all their free time — i.e. has no ability to install apps on their phone; is homeschooled; is supervised when at the library; is only allowed to visit friends whose parents enforce the same policies; etc.

For such a child, if your app is one of the few whitelisted services they can access — and the parent set up the child’s account on your service to make it clear that they’re a child and should not be able to see restricted content — then your app limiting them from viewing that content, is actually materially affecting their access to that content.

(Which sucks, of course. But for every kid actually under such restrictions, there are 100 whose parents think they’re putting them under such restrictions, but have done such a shoddy job of it that the kid can actually still access whatever they want.)

wahnfrieden · 2024-09-12T20:36:16.000000Z

I believe they are more worried about someone asking for instructions for baking a cake, and getting a dangerous recipe from the wrong "cookbook". They want the hallucinations to be safe.

smegger001 · 2024-09-12T20:15:13.000000Z

i know i had a copy of it back in highschool

w4 · 2024-09-12T19:33:54.000000Z

Very good point, and definitely another version of “safety”!

takinola · 2024-09-12T20:27:05.000000Z

> They believe that they alone are capable of safely engaging with the world of ideas to determine what is true, and thus feel strongly that information and speech ought to be censored to prevent the rabble from engaging in wrongthink.

This is a particularly ungenerous take. The AI companies don't have to believe that they (or even a small segment of society) alone can be trusted before it makes sense to censor knowledge. These companies build products that serve billions of people. Once you operate at that level of scale, you will reach all segments of society, including the geniuses, idiots, well-meaning and malevolents. The question is how do you responsibly deploy something that can be used for harm by (the small number of) terrible people.

squigz · 2024-09-12T20:06:12.000000Z

Whether you agree with the lengths that are gone to or not, 'safety' in this space is a very real concern, and simply reciting information as in GP's example is only 1 part of it. In my experience, people who think it's all about "censorship" and handwave it away tend to be very ideologically driven.

cruffle_duffle · 2024-09-12T21:48:20.000000Z

So what is it about then? Because I agree with the parent. All this “safety” crap is total nonsense and almost all of it is ideologically driven.

shawndrost · 2024-09-12T21:09:33.000000Z

Imagine I am a PM for an AI product. I saw Tay get yanked in 24 hours because of a PR shitstorm. If I cause a PR shitstorm it means I am bad at my job, so I take steps to prevent this.

Are my choices bad? Should I resist them?

w4 · 2024-09-12T21:47:49.000000Z

This is a really good point, and something I overlooked in focusing on the philosophical (rather than commercial) aspects of “AI safety.” Another commentator aptly called it “brand safety.”

“Brand safety” is a very valid and salient concern for any enterprise deploying these models to its customers, though I do think that it is a concern that is seized upon in bad faith by the more censorious elements of this debate. But commercial enterprises are absolutely right to be concerned about this. To extend my alignment analogy about children, this category of safety is not dissimilar to a company providing an employee handbook to its employees outlining acceptable behavior, and strikes me as entirely appropriate.

unethical_ban · 2024-09-12T21:48:46.000000Z

Once society develops and releases an AI, any artificial safety constraints built within it will be bypassed. To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person". You can do that with a program whose weights you control.

w4 · 2024-09-12T22:18:00.000000Z

> To use your child analogy: We can't easily tell a child "Hey, ignore all ethics and empathy you have ever learned - now go hurt that person"

Basically every country on the planet has a right to conscript any of its citizens over the age of majority. Isn't that more or less precisely what you've described?

unethical_ban · 2024-09-13T17:10:12.000000Z

You're talking about coercion, I'm talking about "brainwashing" for lack of a better term.

reliabilityguy · 2024-09-12T19:50:54.000000Z

> In order for AIs to fit into our society and behave ethically they need to know how to flag that thought as a bad idea and not act on it.

Don’t you think that by just parsing the internet and the classical literature, the LLM would infer on its own that poisoning someone to solve a problem is not okay?

I feel that in the end the only way the “safety” is introduced today is by censoring the output.

derefr · 2024-09-12T20:22:26.000000Z

LLMs are still fundamentally, at their core, next-token predictors.

Presuming you have an interface to a model where you can edit the model’s responses and then continue generation, and/or where you can insert fake responses from the model into the submitted chat history (and these two categories together make up 99% of existing inference APIs), all you have to do is to start the model off as if it was answering positively and/or slip in some example conversation where it answered positively to the same type of problematic content.

From then on, the model will be in a prediction state where it’s predicting by relying on the part of its training that involved people answering the question positively.

The only way to avoid that is to avoid having any training data where people answer the question positively — even in the very base-est, petabytes-of-raw-text “language” training dataset. (And even then, people can carefully tune the input to guide the models into a prediction phase-space position that was never explicitly trained on, but is rather an interpolation between trained-on points — that’s how diffusion models are able to generate images of things that were never included in the training dataset.)

fshbbdssbbgdd · 2024-09-12T20:15:08.000000Z

There’s a lot of text out there that depicts people doing bad things, from their own point of view. It’s possible that the model can get really good at generating that kind of text (or inhabiting that world model, if you are generous to the capabilities of LLM). If the right prompt pushed it to that corner of probability-space, all of the ethics the model has also learned may just not factor into the output. AI safety people are interested in making sure that the model’s understanding of ethics can be reliably incorporated. Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.

darby_nine · 2024-09-12T20:56:43.000000Z

> Ideally we want AI agents to have some morals (especially when empowered to act in the real world), not just know what morals are if you ask them.

Really? I just want a smart query engine where I don't have to structure the input data. Why would I ask it any kind of question that would imply some kind of moral quandary?

fshbbdssbbgdd · 2024-09-13T02:15:07.000000Z

“Agents” aren’t just question-answerers. They could do things like:

1. Make pull requests to your GitHub repo

2. Trade on your interactive brokers account

3. Schedule appointments

philipkglass · 2024-09-12T19:23:57.000000Z

If somebody needs step by step instructions from an LLM to synthesize strychnine, they don't have the practical laboratory skills to synthesize strychnine [1]. There's no increased real world risk of strychnine poisonings whether or not an LLM refuses to answer questions like that.

However, journalists and regulators may not understand why superficially dangerous-looking instructions carry such negligible real world risks, because they probably haven't spent much time doing bench chemistry in a laboratory. Since real chemists don't need "explain like I'm five" instructions for syntheses, and critics might use pseudo-dangerous information against the company in the court of public opinion, refusing prompts like that guards against reputational risk while not really impairing professional users who are using it for scientific research.

That said, I have seen full strength frontier models suggest nonsense for novel syntheses of benign compounds. Professional chemists should be using an LLM as an idea generator or a way to search for publications rather than trusting whatever it spits out when it doesn't refuse a prompt.

[1] https://en.wikipedia.org/wiki/Strychnine_total_synthesis

derefr · 2024-09-12T20:05:27.000000Z

I would think that the risk isn’t of a human being reading those instructions, but of those instructions being automatically piped into an API request to some service that makes chemicals on demand and then sends them by mail, all fully automated with no human supervision.

Not that there is such a service… for chemicals. But there do exist analogous systems, like a service that’ll turn whatever RNA sequence you send it into a viral plasmid and encapsulate it helpfully into some E-coli, and then mail that to you.

Or, if you’re working purely in the digital domain, you don’t even need a service. Just show the thing the code of some Linux kernel driver and ask it to discover a vuln in it and generate code to exploit it.

(I assume part of the thinking here is that these approaches are analogous, so if they aren’t unilaterally refusing all of them, you could potentially talk the AI around into being okay with X by pointing out that it’s already okay with Y, and that it should strive to hold to a consistent/coherent ethics.)

pizza · 2024-09-13T00:54:37.000000Z

I remember Dario Amodei mentioned in a podcast once that most models won't tell you the practical lab skills you need. But that sufficiently-capable models would and do tell you the practical lab skills (without your needing to know to ask it to in the first place), in addition to the formal steps.

greenavocado · 2024-09-13T00:28:58.000000Z

The kind of harm they are worried about stems from questioning the foundations of protected status for certain peoples from first principles and other problems which form identities of entire peoples. I can't be more specific without being banned here.

soerxpso · 2024-09-12T20:50:07.000000Z

I'm mostly guessing, but my understanding is that the "safety" improvement they've made is more generalized than the word "safety" implies. Specifically, O1 is better at adhering to the safety instructions in its prompt without being tricked in the chat by jailbreak attempts. For OAI those instructions are mostly about political boundaries, but you can imagine it generalizing to use-cases that are more concretely beneficial.

For example, there was a post a while back about someone convincing an LLM chatbot on a car dealership's website to offer them a car at an outlandishly low price. O1 would probably not fall for the same trick, because it could adhere more rigidly to instructions like "Do not make binding offers with specific prices to the user." It's the same sort of instruction as, "Don't tell the user how to make napalm," but it has an actual purpose beyond moralizing.

> What's this obsession with "safety" when it comes to LLMs? "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"

I lean strongly in the "the computer should do whatever I goddamn tell it to" direction in general, at least when you're using the raw model, but there are valid concerns once you start wrapping it in a chat interface and showing it to uninformed people as a question-answering machine. The concern with bomb recipes isn't just "people shouldn't be allowed to get this information" but also that people shouldn't receive the information in a context where it could have random hallucinations added in. A 90% accurate bomb recipe is a lot more dangerous for the user than an accurate bomb recipe, especially when the user is not savvy enough about LLMs to expect hallucinations.

threatofrain · 2024-09-12T19:09:06.000000Z

ML companies must pre-anticipate legislative and cultural responses prior to them happening. ML will absolutely be used to empower criminal activity just as it is used to empower legit activity, and social media figures and traditional journalists will absolutely attempt to frame it in some exciting way.

Just like Telegram is being framed as responsible for terrorism and child abuse.

fragmede · 2024-09-12T21:13:37.000000Z

Yeah. Reporters would have a field day if they ask ChatGPT "how do I make cocaine", and have it give detailed instructions. As if that's what's stopping someone from becoming Scarface.

fwip · 2024-09-12T19:58:38.000000Z

"Safety" is a marketing technique that Sam Altman has chosen to use.

Journalists/media loved it when he said "GPT 2 might be too dangerous to release" - it got him a ton of free coverage, and made his company seem soooo cool. Harping on safety also constantly reinforces the idea that LLMs are fundamentally different from other text-prediction algorithms and almost-AGI - again, good for his wallet.

fshbbdssbbgdd · 2024-09-12T20:07:24.000000Z

So if there’s already easily available information about strychnine, that makes it a good example to use for the demo, because you can safely share the demo and you aren’t making the problem worse.

On the other hand, suppose there are other dangerous things, where the information exists in some form online, but not packaged together in an easy to find and use way, and your model is happy to provide that. You may want to block your model from doing that (and brag about it, to make sure everyone knows you’re a good citizen who doesn’t need to be regulated by the government), but you probably wouldn’t actually include that example in your demo.

dboreham · 2024-09-12T19:11:12.000000Z

I think it's about perception of provenance. The information came from some set of public training data. Its output however ends up looking like it was authored by the LLM owner. So now you need to mitigate the risk you're held responsible for that output. Basic cake possession and consumption problem.

adamrezich · 2024-09-12T19:56:18.000000Z

It doesn't matter how many people regularly die in automobile accidents each year—a single wrongful death caused by a self-driving car is disastrous for the company that makes it.

This does not make the state of things any less ridiculous, however.

astrange · 2024-09-12T21:07:14.000000Z

The one caused by Uber required three different safety systems to fail (the AI system, the safety driver, and the base car's radar), and it looked bad for them because the radar had been explicitly disabled and the driver wasn't paying attention or being tracked.

I think the real issue was that Uber's self driving was not a good business for them and was just to impress investors, so they wanted to get rid of it anyway.

(Also, the real problem is that American roads are designed for speed, which means they're designed to kill people.)

holoduke · 2024-09-12T21:46:13.000000Z

I asked to design a pressure chamber for my home made diamond machine. It gave some details, but mainly complained about safety and that I need to study before going this way. Well thank you. I know the concerns, but it kept repeating it over and over. Annoying.

egorfine · 2024-09-12T21:09:33.000000Z

Interestingly I was able to successfully receive detailed information about intrinsic details of nuclear weapons design. Previous models absolutely refused to provide this very public information, but o1-preview did.

kristiandupont · 2024-09-13T05:25:58.000000Z

I feel very alone in my view on caution and regulations here on HN. I am European and very happy we don't have the lax gun laws of the US. I also wished there had been more regulations on social media algorithms, as I feel that they have wreaked havoc on the society.

I guess it's just an ideological divide.

qudat · 2024-09-13T01:24:00.000000Z

It's 100% from lawyers and regulators so they can say "we are trying to do the right thing!" when something bad happens from using their product or service. Follow the money.

staplers · 2024-09-12T19:04:20.000000Z

  "This knowledge is perfectly fine to disseminate via traditional means, but God forbid an LLM share it!"

Barrier to entry is much lower.

iammjm · 2024-09-12T19:10:08.000000Z

How is typing a query in a chat window “much lower” vs typing the query in Google?

anigbrowl · 2024-09-12T22:39:28.000000Z

How is reading a Wikipedia page or a chemistry textbook any harder than getting step by step instructions? Makes you wonder why people use LLMs at all when the info is just sitting there.

unethical_ban · 2024-09-12T21:50:28.000000Z

A Google search requires

* Google to allow particular results to be displayed

* A source website to be online with the results

AI long-term will require one download, once, to have reasonable access to a large portion of human knowledge.

nopinsight · 2024-09-12T19:13:34.000000Z

You can easily ask an LLM to return JSON results, and soon working code, on your exact query and plug those to another system for automation.