
One aspect that isn’t achievable from outside is what they discuss: hiding the chain of thought in its raw form because the chains are allowed to be unaligned. This lets the model reason without any artifacts from alignment and have alignment applied in post-processing, more or less. Doing that yourself effectively requires root, and you would need the unaligned weights.



Ok but this presses on a latent question: what do we mean by alignment?

Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here; the reasoning tokens are effectively just a debug log.

If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could be obfuscated and sanitized away, shaping the conclusion while staying hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI misalignment can give rise to danger at all.

Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.


Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.

So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".


The challenge is that alignment changes the models in ways that aren’t representative of the actual training set, and as I understand it this generally lowers performance even on aligned tasks. Further, the decision to summarize the chains of thought means the summaries cover reasoning that wouldn’t itself pass alignment without removal. From what I read, the final output is aligned but could have been conditioned on unaligned CoT. In fact, because the chains are in the context, they necessarily change the final output even when that output complies with the alignment policy. There are a few other “only root could do this” points, which says anyone could implement these without secret sauce as long as they have a raw frontier model.
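To make that concrete, here is a minimal sketch (purely hypothetical, not OpenAI's actual pipeline or API; all function names are made up) of how a raw, unaligned chain of thought could condition the final answer while only a sanitized summary is ever exposed:

    # Hypothetical sketch: the raw CoT stays in context (so it shapes the
    # answer), but users only ever see the aligned answer and a sanitized
    # summary. Stand-in functions, no real model calls.

    def raw_chain_of_thought(prompt: str) -> str:
        # stand-in for an unaligned reasoning pass over the raw weights
        return f"[unfiltered reasoning about: {prompt}]"

    def final_answer(prompt: str, cot: str) -> str:
        # the CoT is in the context, so it conditions the answer even
        # though the answer itself must comply with the alignment policy
        return f"[policy-compliant answer, conditioned on {len(cot)} chars of reasoning]"

    def summarize_cot(cot: str) -> str:
        # a separate pass strips anything that would not pass alignment
        return "[sanitized summary of the reasoning]"

    if __name__ == "__main__":
        prompt = "some user question"
        cot = raw_chain_of_thought(prompt)   # never shown to the user
        print(final_answer(prompt, cot))     # shown
        print(summarize_cot(cot))            # shown in place of the raw CoT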


Glass half full and the good faith argument.

It's a compromise.

OpenAI will now have access to vast amounts of unaligned output, so they can actually study its thinking.

Whereas with the existing checks and balances, the request was simply rejected and the data providing this insight was never created in the first place.



