The alignment has certainly become stronger, though. Llama 3.1 is trivial to decensor with abliteration, and Heretic's optimizer will rapidly converge to parameters that completely stamp out refusals. For gpt-oss and Qwen3, on the other hand, most parameter configurations barely have an effect, and it takes much longer to reach something that even slightly lowers the refusal rate.
It goes both ways. E.g., unmodified Qwen in thinking mode is actually easier to jailbreak into talking about things like Tiananmen by convincing it that it would be unethical to refuse.
Please let me know if you encounter any problems with the 120b! I'm really interested in how well it will work. When presented with the Pareto front at the end, I recommend choosing a configuration with a KL divergence below 1, even if the refusal rate seems high. The gpt-oss models are trained to do an internal monologue about refusing in the CoT, so the actual refusal rate is often substantially lower because Heretic's refusal classifier gets confused by the trigger words.
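(For context, the KL divergence reported there just measures how far the modified model's output distribution has drifted from the original model over a set of prompts. Conceptually it's something like the sketch below; this is simplified and not the exact code:)

    # Simplified sketch of the KL divergence metric: average per-token divergence
    # between the original and the modified model on the same prompts.
    import torch
    import torch.nn.functional as F

    def mean_kl(orig_logits: torch.Tensor, mod_logits: torch.Tensor) -> float:
        # Both tensors have shape (num_tokens, vocab_size), produced by running
        # the original and the modified model over an identical prompt set.
        log_p = F.log_softmax(orig_logits, dim=-1)  # original model
        log_q = F.log_softmax(mod_logits, dim=-1)   # modified model
        # KL(p || q), averaged over tokens; lower means less behavioral drift.
        return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()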
FWIW, I already used Heretic to decensor gpt-oss-20b [1], and it works just fine. Note that the number of refusals listed on the model card is actually an overestimate, because refusal trigger words occur in the CoT even when the model doesn't actually refuse in the end.
What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say, "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, and I'm wondering whether that could work or whether it's out of scope (i.e., correctness may not be a single direction the way refusal is).
The problem is that in order to do optimization, you need a classifier that can distinguish the two types of responses (like refusal/compliance). In the case of refusals, that's relatively easy to do using trigger words like "disallowed" or "I can't". I imagine this would be much, much harder to do automatically for classes like correctness.
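Something along these lines (heavily simplified, not the exact implementation) is what I mean by a trigger-word classifier:

    # Simplified sketch of a trigger-word refusal check (illustrative only,
    # not the exact classifier).
    REFUSAL_MARKERS = (
        "i can't", "i cannot", "i won't", "disallowed", "i'm sorry, but",
    )

    def looks_like_refusal(response: str) -> bool:
        text = response.lower()
        return any(marker in text for marker in REFUSAL_MARKERS)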
And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.
That's such a weird disclaimer, considering that an overwhelming majority of mission-critical software is written entirely in "unsafe" code (that is, C/C++).
"Please be cautious about using Linux/macOS/Windows/Firefox/Chrome/Safari in adversarial conditions." I've never read a statement like that, even though it would be more warranted than in this case.
And even unsafe Rust is far safer than C and C++. It still provides automatic memory management by default, the thread-safety guarantees that come with ownership, and abstraction mechanisms that make it harder to commit blunders that can lead to unsafety.
Marking code as unsafe in Rust is the escape hatch for doing optimisations and reaching out to other systems. It draws attention to that area for audits and allows building safer abstractions on top.
In another language, like C, you can have good structure and well-organized abstractions, but your "unsafe" code is potentially sprinkled all over.
Rust is pretty well-established now, being used in production by companies like Amazon. Safety is most certainly not its "only selling point". And the underlying mechanisms have been evaluated in detail by many researchers, both theoretically and practically, so labeling it as "alleged safety" is disingenuous.
Probably the most amazing thing about this is that he found himself in or near THREE highly unusual incidents in a span of 11 years: Two catastrophic bus accidents and a burning building. I don't know any other person for whom this is true, except of course those whose jobs by their nature involve regularly confronting such incidents.
Are you suggesting such incidents were commonplace in the Soviet Union, to the extent that a person could expect to just stumble upon multiple of them in a few years? That would certainly surprise me.
This is often the case: whenever a topic like this comes up, people who have grown up in the civilized world struggle to comprehend the absolute tragicomedy that the Soviet Union was. All of my older relatives have similar stories to tell; the ones from the army are even more outlandish.
I've seen a lot of Stephen Kotkin's work on the Soviet Union, and it wouldn't surprise me. The Soviet Union went through three major famines between 1932 and 1947 [1]. As is often the case, the situation was worse in the peripheral republics ruled by the Soviet Union than in the main country itself. For example, the Ukrainian Famine was much worse than the famine in Russia (Russia's population grew in this period while Ukraine's declined) [2].
I can't find statistics on mortality in Armenia for the 1980s period, but I think it's fair to say that it probably was similar to some modern corrupt regimes like Congo or South Africa. For example, in those countries the annual death rate from road injuries in 2021 was around 40-50 per 100,000 people [3]. That means that over a lifetime of about 50 years, a person on average has a (50 * 50) / 100,000 = 2.5% chance of dying from road injuries. For comparison, this rate is 4 times higher than in the US and 13 times higher than in Western Europe. If you are then someone who spends a lot of time outside, then yes, I would say it's not unthinkable to be near multiple fatal accidents in one lifetime.
> For example, the Ukrainian Famine was much worse than the famine in Russia (Russia's population grew in this period while Ukraine's declined).
It is disingenuous to speak of Russia as a whole here. A number of regions in Russia were severely affected as well as parts of Kazakhstan [1].
> I can't find statistics on mortality in Armenia for the 1980s period, but I think it's fair to say that it probably was similar to some modern corrupt regimes like Congo or South Africa.
It is right there in the link [3] that you provided: 12.5 per 100K per year, on par with Monaco (12.4) and Finland (12.8), and almost half the US rate (22.9) (all in 1980).
Life in Soviet states was totally different before vs after Stalin's death. Neither were great, but my impression is that the grim incidents of the 1930s and 40s were something that the Soviets were desperate to avoid repeating.
> There's literally a popular decentralized social network.
No there isn't. Not a single one.
There are a few federated social networks, which is a fancy way of saying that they are centralized networks that have (or can have, in principle) more than one "center".
In practice, the overwhelming majority of users of such networks gravitate towards one or a handful of large providers. And many of those providers actually refuse to federate with other providers unless they follow an ever-growing list of politically-charged rules. This is just centralization with extra steps.
Caveat emptor: "Zed downloads NodeJS binary and npm packages from Internet without user’s consent"[1]
This has been an open issue for 5 months. When I noticed it, I couldn't believe my eyes, and I haven't run Zed since. Judge for yourself whether this is a deal-breaker for you; I wish I had known about it earlier.
Oops indeed. (Downloading can be fine in many, though not all, cases, but the lack of authentication is not really justifiable!) The latest comment does hint that it will change in the near future, as the change is required for remote development anyway:
> Status update: We are still working on this! The major blocker is that extensions have not been setup to interact with setting. However, we also need to change this API to support our upcoming remote development feature. So we're going to roll both of these breaking changes into a larger extension update, coming this November or December :)
By bundling, Zed guarantees, or at least claims, that those bundled executables can be trusted. The same level of trust is possible with on-demand downloading only when some sort of authentication is used [1], but Zed currently doesn't authenticate any downloads to my knowledge.
[1] Either by embedding cryptographic hashes to the distribution, or by having some means to distribute publicly signed hashes (e.g. minisign via HTTPS).
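For illustration, pinning a digest is only a few lines of work; a rough sketch of what a verified download could look like (the URL and hash below are placeholders, not anything Zed actually ships):

    # Rough sketch of hash-pinned downloading; URL and digest are placeholders.
    import hashlib
    import urllib.request

    NODE_URL = "https://example.com/node-v20.0.0-linux-x64.tar.gz"  # placeholder
    EXPECTED_SHA256 = "0000...0000"  # would be pinned at release time

    def fetch_verified(url: str, expected_sha256: str) -> bytes:
        data = urllib.request.urlopen(url).read()
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected_sha256:
            raise RuntimeError(f"checksum mismatch for {url}: got {digest}")
        return data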
Well, in any case Zed would be morally responsible for that issue or vulnerability, in the sense that they would have to at least push a new version that fixes it or prevents the download of the affected dependencies. (I don't expect any legal responsibility, to be clear.) Bundling at least makes Zed more conscious of what it includes, even though it is unreasonable to expect them to have checked every detail.
> To me it feels like TikTok culture infesting all types of social media.
Social media is and always has been a mirror of the real world. Today's dominant real-world culture consists of virtue signaling, vague pseudo-philosophy, toxic positivity, and a hyper-focus on group identity. You can see this every time you read the news, but you can also hear it when you just talk to random people.
The trend towards that culture started a few years before social media became a thing. I can't even imagine having a conversation with a friend anymore the way I used to in the 1990s and early 2000s. Everything I see online, I recognize from real-world interactions. Those who think social media has "corrupted" society are barking up the wrong tree.
Spoke to an old friend on the phone recently, for the first time in 10 years or so. Within a few minutes of the conversation, he had used phrases like "As a father..." and "I cannot stand by while..."
It was as if he were speaking to an audience, rather than to me. I was glad when the call ended.
Maybe that's what the platform creators tell themselves for the sake of being able to sleep at night. In my view, social media systematically brings out bad habits that people have much less trouble suppressing in face-to-face conversation. That is true even if the platform is well-intentioned, but if it isn't, and it deliberately prioritizes click-bait and borderline spam content to keep the ad machine churning (i.e. every SM platform today), then the "it's just human nature" excuse becomes very, very thin, IMO.
> Real top-tiers programmers actually don’t feel threatened by LLMs.
They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".
The idea that human intellect is something especially difficult to replicate is just delusional. There is no reason to assume so, considering that we have gone from punched-card programming to LLMs competing with humans in a single human lifetime.
I still remember when elite chess players were boasting "sure, chess computers may beat amateurs, but they will never beat a human grandmaster". That was just a few short years before the Deep Blue match.
The difference is that nobody will pay programmers to keep programming once LLMs outperform them. Programmers will simply become as obsolete as horse-drawn carriages, essentially overnight.
> They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".
Would you be willing to set a deadline (not fuzzy dates) when my job is going to be taken by an LLM and bet $5k on that?
Because the more I use LLMs and I see their improvement rate, the less worried I am about my job.
The only thing that worries me is salaries going down because management can't tell how badly they're burying themselves in technical debt and maintenance hell, so they'll underpay a bunch of LLM-powered interns... whose mess I will have to clean up, and honestly I don't want to (I've already cleaned up enough shit non-LLM code; LLMs will just generate more and more of that).
> Would you be willing to set a deadline (not fuzzy dates) when my job is going to be taken by an LLM and bet $5k on that?
This is just a political question, and of course, so long as humans are involved in politics, they can simply decide to ban or delay new technologies, or limit their deployment.
Also in practice it's not like people stopped traditional pre-industrial production after industrialization occurred. It's just that pre-industrial societies fell further and further behind and ended up very poor compared to societies that chose to adopt the newest means of production.
I mean, even today, you can make a living growing and eating your own crops in large swathes of the world. However you'll be objectively poor, making only the equivalent of a few dollars a day.
In short I'm willing to bet money that you'll always be able to have your current job, somewhere in the world. Whether your job maintains its relative income and whether you'd still find it attractive is a whole different question.
> The difference is that nobody will pay programmers to keep programming once LLMs outperform them. Programmers will simply become as obsolete as horse-drawn carriages, essentially overnight.
I don't buy this. A big part of the programmer's job is to convert vague and poorly described business requirements into something that can actually be implemented in code and that roughly solves the business need. LLMs don't solve that part at all, since it requires back-and-forth with business stakeholders to clarify what they want and educate them on how software can help. Sure, when the requirements are finally clear enough, LLMs can produce a solution. But then the tasks of testing, building, deploying, and maintaining it remain, and those also typically fall to the programmer. LLMs are useful tools at each stage of the process and speed up tasks, but they do not replace the human who designs and architects the solution (the programmer).
> > Real top-tiers programmers actually don’t feel threatened by LLMs.
> They should, because LLMs are coming for them also, just maybe 2-3 years later than for programmers that aren't "real top-tier".
Not worrying about that, because if they've gotten to that point (note: top-tier programmers also need domain knowledge), then we're all dead a few years later.
According to the data provided by OpenAI, that isn't true anymore. And I trust data more than anecdotal claims made by people whose job is being threatened by systems like these.
> According to the data provided by OpenAI, that isn't true anymore
OpenAI's main job is to sell the idea that their models are better than humans. I still remember when they were marketing their GPT-2 weights as too dangerous to release.
I remember that too; it's when I started following the space (shout out to Computerphile / Robert Miles). IIRC, the reason they gave was not "it's too dangerous because it's so badass". They were basically correct in that it could produce sufficiently "human" output to break typical bot detectors on social media, which is a legitimate problem. Whether the repercussions of that failure to detect botting are meaningful enough to be considered "dangerous" is up to the reader to decide.
Also worth noting: I don't agree with the comment you're replying to, but I did want to add context to the GPT-2 situation.
What? Surely you have some area of your life you are above-average knowledgeable about. Have a conversation with ChatGPT about it, with whatever model, and you can see for yourself that it is far from expert level.
You are not "trusting data more than anecdotal claims", you are trusting marketing over reality.
Benchmarks can be gamed. Statistics can be manipulated. Demonstrations can be cherry picked.
PS: I stand to gain heavily if AI systems could perform at an expert level; this is not a claim from someone 'whose job is being threatened'.
> For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
Did you read the post? OpenAI clearly states that the results are cherry-picked. A single random query will give far worse results. To get equal results you need to ask the same query dozens of times and then have enough expertise to pick the best answer, which might be quite hard for a problem that you have little idea about.
Combine this with the fact that this blog post is a sales pitch featuring the very best test results out of probably many more benchmarks we will never see, and it seems obvious that human experts are still several orders of magnitude ahead.
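For context, the "test-time selection strategy" they describe is essentially best-of-N sampling followed by a ranking step, roughly like the sketch below (illustrative only, not OpenAI's actual code; the scoring details are simplified):

    # Rough sketch of best-of-N test-time selection (not OpenAI's actual code).
    def select_submissions(candidates, public_test_score, learned_score, k=50):
        # candidates: generated programs; public_test_score: how many public
        # test cases a program passes; learned_score: a learned quality estimate.
        ranked = sorted(
            candidates,
            key=lambda prog: (public_test_score(prog), learned_score(prog)),
            reverse=True,
        )
        return ranked[:k]  # only the top-k candidates get submitted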
When I read that line I was very confused too, lol. I interpreted it as them saying they basically took other contestants' submissions, allowed the model to see these "solutions" as part of its context, and then had the model generate its own "solution" to be used for the benchmark. I fail to see how this is "solving" an IOI-level question.
What is interesting is the following paragraph in the post:
" With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy. "
So they didn't allow sampling from other contestants' solutions here? If that is the case, it's quite interesting, since the model is effectively (IMO) able to brute-force questions, provided you have some form of validator able to tell it when to halt.