
One aspect that isn’t achievable from outside is what they discuss: hiding the chain of thought in its raw form because the chains are allowed to be unaligned. This lets the model reason without any artifacts from alignment and have alignment applied in post-processing, more or less. Doing that yourself effectively requires root, and you would need the unaligned weights.



Ok but this presses on a latent question: what do we mean by alignment?

Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here; the reasoning tokens are effectively just a debug log.

If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could be obfuscated and sanitized away, shaping the conclusion while staying hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI misalignment can give rise to danger at all.

Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.


Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.

So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".


The challenge is that alignment changes the models in ways that aren’t representative of the actual training set, and as I understand it this generally lowers performance even on aligned tasks. Further, the decision to summarize the chains of thought means the summaries cover reasoning that wouldn’t itself pass alignment without removal. From what I read, the final output is aligned but could have been conditioned on unaligned CoT. In fact, because the chains are in the context, they necessarily change the final output even when that output complies with the alignment policy. There are a few other “only root could do this” points, which says anyone could implement these without secret sauce as long as they have a raw frontier model.
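To make that concrete, here is a minimal sketch (purely hypothetical, not OpenAI's actual pipeline or API; all function names are made up) of how a raw, unaligned chain of thought could condition the final answer while only a sanitized summary is ever exposed:

    # Hypothetical sketch: the raw CoT stays in context (so it shapes the
    # answer), but users only ever see the aligned answer and a sanitized
    # summary. Stand-in functions, no real model calls.

    def raw_chain_of_thought(prompt: str) -> str:
        # stand-in for an unaligned reasoning pass over the raw weights
        return f"[unfiltered reasoning about: {prompt}]"

    def final_answer(prompt: str, cot: str) -> str:
        # the CoT is in the context, so it conditions the answer even
        # though the answer itself must comply with the alignment policy
        return f"[policy-compliant answer, conditioned on {len(cot)} chars of reasoning]"

    def summarize_cot(cot: str) -> str:
        # a separate pass strips anything that would not pass alignment
        return "[sanitized summary of the reasoning]"

    if __name__ == "__main__":
        prompt = "some user question"
        cot = raw_chain_of_thought(prompt)   # never shown to the user
        print(final_answer(prompt, cot))     # shown
        print(summarize_cot(cot))            # shown in place of the raw CoT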


Glass half full and the good faith argument.

It's a compromise.

OpenAI will now have access to vast amounts of unaligned output, so they can actually study its thinking.

Whereas with the existing checks and balances, the request was simply rejected and the data providing this insight was never created in the first place.



