p-e-w's comments

The alignment has certainly become stronger, though. Llama 3.1 is trivial to decensor with abliteration, and Heretic's optimizer rapidly converges to parameters that completely stomp out refusals, while for gpt-oss and Qwen3, most parameter configurations barely have an effect and it takes much longer to reach something that even slightly lowers the refusal rate.

It seems to me that thinking models are harder to decensor, as they are trained to reason about whether to accept your request.

It goes both ways. E.g., unmodified thinking Qwen is actually easier to jailbreak into talking about things like Tiananmen, by convincing it that it is unethical to refuse to do so.

Please let me know if you encounter any problems with the 120b! I'm really interested in how well it will work. When presented with the Pareto front at the end, I recommend choosing a configuration with a KL divergence below 1, even if the refusal rate seems high. The gpt-oss models are trained to do an internal monologue about refusing in the CoT, so the actual refusal rate is often substantially lower than reported, because Heretic's refusal classifier gets confused by the trigger words.

FWIW, I already used Heretic to decensor gpt-oss-20b [1], and it works just fine. Note that the number of refusals listed on the model card is actually an overestimate, because refusal trigger words occur in the CoT even when the model doesn't end up refusing in the end.

[1] https://huggingface.co/p-e-w/gpt-oss-20b-heretic


What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say, "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, and I'm wondering if that's something that could work, or if it's out of scope (i.e., correctness may not be a single direction the way refusal is).

The problem is that in order to do optimization, you need a classifier that can distinguish the two types of responses (like refusal/compliance). In the case of refusals, that's relatively easy to do using trigger words like "disallowed" or "I can't". I imagine this would be much, much harder to do automatically for classes like correctness.
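
For illustration, a trigger-word check is about as simple as it sounds (a rough sketch, not Heretic's actual code; the word list is made up):

    # Minimal refusal classifier sketch: flag a response as a refusal if it
    # contains any of a handful of illustrative trigger phrases.
    REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry", "disallowed"]

    def looks_like_refusal(response: str) -> bool:
        text = response.lower()
        return any(marker in text for marker in REFUSAL_MARKERS)

A classifier this crude works reasonably well precisely because refusal phrasing is so stereotyped; there is no comparable set of surface markers for correct vs. incorrect reasoning.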

And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.


That's such a weird disclaimer, considering that an overwhelming majority of mission-critical software is written entirely in "unsafe" code (that is, C/C++).

"Please be cautious about using Linux/macOS/Windows/Firefox/Chrome/Safari in adversarial conditions." I've never read a statement like that, even though it would be more warranted than in this case.

And even unsafe Rust is far safer than C and C++. It still provides automatic memory management by default, the thread-safety guarantees that come with ownership, and abstraction mechanisms that make it harder to commit blunders that can lead to unsafety.


Those are well-established languages. Rust's only selling point is its alleged safety.


Marking code as unsafe in Rust is the escape hatch for doing optimizations and reaching out to other systems. It draws attention to that area for audits and allows building safer abstractions on top.

In another language, like C, you can have a good structure and well-organized abstractions, but your "unsafe" is potentially sprinkled all over.


Rust is pretty well-established now, being used in production by companies like Amazon. Safety is most certainly not its "only selling point". And the underlying mechanisms have been evaluated in detail by many researchers, both theoretically and practically, so labeling it as "alleged safety" is disingenuous.


Probably the most amazing thing about this is that he found himself in or near THREE highly unusual incidents in a span of 11 years: Two catastrophic bus accidents and a burning building. I don't know any other person for whom this is true, except of course those whose jobs by their nature involve regularly confronting such incidents.


Armenia at the time belonged to the Soviet Union. To me, these examples emphasize again how truly dysfunctional that system must have been.


Are you suggesting such incidents were commonplace in the Soviet Union, to the extent that a person could expect to just stumble upon multiple of them in a few years? That would certainly surprise me.


This is often the case: whenever a topic like this comes up, people who grew up in the civilized world struggle to comprehend the absolute tragicomedy that the Soviet Union was. All of my older relatives have similar stories to tell, and the ones from the army are even more outlandish.


Is this where that saying about not laughing at the circus after serving in the (Soviet) Army came from?


I've seen a lot of work from Stephen Kotkin on the Soviet Union, and it wouldn't surprise me. The Soviet Union went through three major famines between 1932 and 1947 [1]. As is often the case, the situation was worse in the states ruled by the Soviet Union than in the main country itself. For example, the Ukrainian famine was much worse than the famine in Russia (Russia's population grew in this period while Ukraine's declined) [2].

I can't find statistics on mortality in Armenia for the 1980s, but I think it's fair to say that it was probably similar to some modern corrupt regimes like Congo or South Africa. For example, in those countries the annual death rate from road injuries in 2021 was around 40-50 per 100,000 people [3]. That means that over a lifetime of about 50 years, a person on average has a (50*50) / 100,000 = 2.5% chance of dying from road injuries. For comparison, this rate is 4 times higher than in the US and 13 times higher than in Western Europe. If you are then someone who spends a lot of time outside, then yes, I would say it's not unthinkable to be near multiple fatal accidents in one lifetime.

[1]: https://en.wikipedia.org/wiki/Excess_mortality_in_the_Soviet...

[2]: https://en.wikipedia.org/wiki/Holodomor

[3]: https://ourworldindata.org/grapher/death-rates-road-incident...


> For example, the Ukrainian famine was much worse than the famine in Russia (Russia's population grew in this period while Ukraine's declined).

It is disingenuous to speak of Russia as a whole here. A number of regions in Russia were severely affected, as were parts of Kazakhstan [1].

> I can't find statistics on mortality in Armenia for the 1980s period, but I think it's fair to say that it probably was similar to some modern corrupt regimes like Congo or South Africa.

It is right there in the link [3] that you provided: 12.5 per 100K per year, on par with Monaco (12.4) and Finland (12.8), and almost half the US rate (22.9) (all in 1980).

[1]: https://en.m.wikipedia.org/wiki/Soviet_famine_of_1930%E2%80%...

[3]: https://ourworldindata.org/grapher/death-rates-road-incident...


Life in the Soviet states was totally different before vs. after Stalin's death. Neither period was great, but my impression is that the grim incidents of the 1930s and 40s were something the Soviets were desperate to avoid repeating.


Incidents? Those were colonial genocides with red decorations.


Sure. Didn't mean to downplay it.


Is this why industrial accident porn always seems to be from China? I just assumed it was a numbers thing.


right? i'd be kinda wary about hanging out with the guy =)


Small-scale catastrophes are not that rare; most of us just duck away from them. It takes a special mindset to run towards danger.


> There's literally a popular decentralized social network.

No there isn't. Not a single one.

There are a few federated social networks, which is a fancy way of saying that they are centralized networks that have (or can have, in principle) more than one "center".

In practice, the overwhelming majority of users of such networks gravitate towards one or a handful of large providers. And many of those providers actually refuse to federate with other providers unless they follow an ever-growing list of politically-charged rules. This is just centralization with extra steps.


Bluesky has over 27 million users.


Bluesky is federated, not decentralized.


Caveat emptor: "Zed downloads NodeJS binary and npm packages from Internet without user’s consent"[1]

This has been an open issue for 5 months. When I noticed it, I couldn't believe my eyes, and I haven't run Zed since. Judge for yourself whether this is a deal-breaker for you; I wish I had known about it earlier.

[1] https://github.com/zed-industries/zed/issues/12589


Oops indeed. (Downloading can be fine in many, though not all, cases, but the lack of authentication is not really justifiable!) The latest comment does hint that this will change in the near future, as the change is required for remote development anyway:

> Status update: We are still working on this! The major blocker is that extensions have not been setup to interact with setting. However, we also need to change this API to support our upcoming remote development feature. So we're going to roll both of these breaking changes into a larger extension update, coming this November or December :)


I don't see how this is different from having all these pre-bundled with a new version of Zed? Either way I'm going to download all of them again.


By bundling, Zed guarantees, or at least claims, that those bundled executables can be trusted. The same level of trust is possible with on-demand downloading only when some sort of authentication is used [1], but to my knowledge Zed currently doesn't authenticate any downloads.

[1] Either by embedding cryptographic hashes in the distribution, or by having some means to distribute publicly signed hashes (e.g. minisign via HTTPS).
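
The first option is conceptually simple. A rough sketch of what such a check could look like (the URL and digest below are placeholders, not anything Zed actually ships):

    # Download an artifact and verify it against a hash embedded in the release.
    import hashlib
    import urllib.request

    NODE_URL = "https://example.com/node-v20-linux-x64.tar.gz"  # placeholder URL
    EXPECTED_SHA256 = "0123abcd..."  # placeholder; would be pinned in the Zed release

    def download_verified(url: str, expected_sha256: str) -> bytes:
        data = urllib.request.urlopen(url).read()
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected_sha256:
            raise RuntimeError(f"hash mismatch for {url}: got {digest}")
        return data

With pinned hashes, a compromised download server can at worst cause a failed installation, not arbitrary code execution.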


> By bundling, Zed guarantees, or at least claims, that those bundled executables can be trusted

As if anyone at Zed cares and checks them all thoroughly? Even if they wanted to, they couldn't, given how expansive Node dependency trees get.

At best, someone will report an issue/vulnerability in one of those to them, usually months or years after it appears.


Well, in any case Zed would be morally responsible for that issue or vulnerability, in the sense that they would at least have to push a new version that fixes it or prevents the download of the affected dependencies. (I don't expect any legal responsibility, to be clear.) Bundling at least makes Zed more conscious about what to include, even though it is unreasonable to expect that they've checked every detail.


What I might trust on my laptop is TOTALLY different from what my company might allow on a remote server.


> To me it feels like TikTok culture infesting all types of social media.

Social media is and always has been a mirror of the real world. Today's dominant real-world culture consists of virtue signaling, vague pseudo-philosophy, toxic positivity, and a hyper-focus on group identity. You can see this every time you read the news, but you can also hear it when you just talk to random people.

The trend towards that culture started a few years before social media became a thing. I can't even imagine having a conversation with a friend anymore the way I used to in the 1990s and early 2000s. Everything I see online, I recognize from real-world interactions. Those who think social media has "corrupted" society are barking up the wrong tree.


> I can't even imagine having a conversation with a friend anymore the way I used to in the 1990s and early 2000s.

I do wonder what you actually mean by this.

> Those who think social media has "corrupted" society are barking up the wrong tree.

There is clear, evidenced research around social media's negative effect on society.

https://link.springer.com/article/10.1007/s00127-020-01906-9

https://scholarcommons.scu.edu/engl_176/2/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7364393/

https://journals.sagepub.com/doi/10.1177/20563051241269305

etc.


> I do wonder what you actually mean by this.

Spoke to an old friend on the phone recently, for the first time in 10 years or so. Within a few minutes of the conversation, he had used phrases like "As a father..." and "I cannot stand by while..."

It was as if he were speaking to an audience, rather than to me. I was glad when the call ended.


Maybe that's what the platform creators tell themselves for the sake of being able to sleep at night. From my view, social media systematically brings out bad habits that people have much less trouble suppressing in face-to-face conversation. That is true even if the platform is well-intentioned, but if it isn't and deliberately prioritizes click-bait and borderline spam content to keep the ad machine churning (i.e. every SM platform today), then the "it's just human nature" excuse becomes very very thin, IMO.


Come on.

They hide anything my actual friends do, and just show some rage-inducing video for the sake of keeping me there. It doesn't happen by mistake.


> Today's dominant real-world culture consists of virtue signaling, vague pseudo-philosophy, toxic positivity, and a hyper-focus on group identity

It's some people who do this, and these people absolutely dominate the internet world, but not the real world.


> Real top-tiers programmers actually don’t feel threatened by LLMs.

They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".

The idea that human intellect is something especially difficult to replicate is just delusional. There is no reason to assume so, considering that we have gone from punched-card programming to LLMs competing with humans within a single human lifetime.

I still remember when elite chessplayers were boasting "sure, chess computers may beat amateurs, but they will never beat a human grandmaster". That was just a few short years before the Deep Blue match.

The difference is that nobody will pay programmers to keep programming once LLMs outperform them. Programmers will simply become as obsolete as horse-drawn carriages, essentially overnight.


> They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".

Would you be willing to set a deadline (not fuzzy dates) for when my job is going to be taken by an LLM, and bet $5k on that?

Because the more I use LLMs and see their improvement rate, the less worried I am about my job.

The only thing that worries me is salaries going down because management cannot tell how badly they're burying themselves in technical debt and maintenance hell, so they'll underpay a bunch of LLM-powered interns... whose work I will have to clean up, and honestly I don't want to (I've already cleaned up enough shit non-LLM code; LLMs will just generate more and more of that).


> Would you be willing to set a deadline (not fuzzy dates) when my job is going to be taken by an LLM and bet $5k on that?

This is really just a political question, and of course, so long as humans are involved in politics, they can simply decide to ban or delay new technologies, or limit their deployment.

Also in practice it's not like people stopped traditional pre-industrial production after industrialization occurred. It's just that pre-industrial societies fell further and further behind and ended up very poor compared to societies that chose to adopt the newest means of production.

I mean, even today, you can make a living growing and eating your own crops in large swathes of the world. However, you'll be objectively poor, making only the equivalent of a few dollars a day.

In short, I'm willing to bet money that you'll always be able to have your current job, somewhere in the world. Whether your job maintains its relative income, and whether you'd still find it attractive, is a whole different question.


> The difference is that nobody will pay programmers to keep programming once LLMs outperform them. Programmers will simply become as obsolete as horse-drawn carriages, essentially overnight.

I don't buy this. A big part of the programmer's job is to convert vague and poorly described business requirements into something that is actually possible to implement in code and that roughly solves the business need. LLMs don't solve that part at all since it requires back and forth with business stakeholders to clarify what they want and educate them on how software can help. Sure, when the requirements are finally clear enough, LLMs can make a solution. But then the tasks of testing it, building, deploying and maintaining it remain too, which also typically fall to the programmer. LLMs are useful tools in each stage of the process and speed up tasks, but not replacing the human that designs and architects the solution (the programmer).


> > Real top-tiers programmers actually don’t feel threatened by LLMs.

> They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".

Not worried about that, because if they've gotten to that point (note: top-tier programmers also need domain knowledge), then we're all dead a few years later.


It's slow and expensive if you compare it with other LLMs.

It's lightning fast and dirt cheap if you compare it to consulting with a human expert, which it appears to be competitive with.


I would say it's competitive with consulting a human, not a human expert. Any expert who has a conversation with ChatGPT about their field will verify that it is very far from expert level.


According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.


> According to the data provided by OpenAI, that isn't true anymore

OpenAI's main job is to sell the idea that their models are better than humans. I still remember when they were marketing their GPT-2 weights as too dangerous to release.


I remember that too; it's when I started following the space (shout-out to Computerphile / Robert Miles). IIRC the reason they gave was not "it's too dangerous because it's so badass"; they were basically correct in that it could produce sufficiently "human" output to break typical bot detectors on social media, which is a legitimate problem. Whether the repercussions of that failure to detect botting are meaningful enough to be considered "dangerous" is up to the reader to decide.

Also worth noting: I don't agree with the comment you're replying to, but I did want to add context to the GPT-2 situation.


What? Surely you have some area of your life you are above-average knowledgeable about. Have a conversation with ChatGPT about it, with whatever model, and you can see for yourself that it is far from expert level.

You are not "trusting data more than anecdotal claims"; you are trusting marketing over reality.

Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.

PS: I stand to gain heavily if AI systems can perform at an expert level; this is not a claim from someone 'whose job is being threatened'.


> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

Did you read the post? OpenAI clearly states that the results are cherry-picked. A single random query will have far worse results. To get equal results you need to ask the same query dozens of times and then have enough expertise to pick the best one, which might be quite hard for a problem that you have little idea about.
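
In essence, the selection strategy amounts to something like this (a hypothetical sketch; generate and score_on_public_tests stand in for whatever OpenAI actually used):

    # Sample many candidate solutions, rank them by public-test performance,
    # and submit only the top few.
    def select_submissions(generate, score_on_public_tests, n_samples=1000, n_submit=50):
        candidates = [generate() for _ in range(n_samples)]
        ranked = sorted(candidates, key=score_on_public_tests, reverse=True)
        return ranked[:n_submit]

The expertise (or at least the test harness) needed to rank the candidates is doing a lot of the work here.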

Combine this with the fact that this blog post is a sales pitch containing the very best test results out of probably many more benchmarks we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.


When I read that line I was very confused too, lol. I interpreted it as them saying they basically took other contestants' submissions, allowed the model to see these "solutions" as part of its context, and then had the model generate its own "solution" to be used for the benchmark. I fail to see how this is "solving" an IOI-level question.

What is interesting is the following paragraph in the post: "With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy." So they didn't allow sampling from other contest solutions here? If that is the case, it's quite interesting, since the model is effectively (IMO) able to brute-force questions, provided you have some form of validator able to tell it when to halt.

I came across one of the IOI questions this year that I had trouble solving (I am pretty noob though), which made me curious about how these reported results hold up. The question at hand is https://github.com/ioi-2024/tasks/blob/main/day2/hieroglyphs... Apparently, the model was able to get it partially correct: https://x.com/markchen90/status/1834358725676572777

