One aspect that’s not achievable is what they describe: hiding the chain of thought in its raw form, because the chains are allowed to be unaligned. This lets the model reason without any artifacts from alignment, which is then applied in post-processing, more or less. Reproducing that requires effectively root access, and you would need the unaligned weights.
Ok but this presses on a latent question: what do we mean by alignment?
Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here; the reasoning tokens are effectively just a debug log.
If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could be obfuscated and sanitized, participating in the conclusion while staying hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that misalignment can give rise to danger at all.
Personally I think that in practice alignment has come to mean just sanitization, and it's a fig-leaf excuse for the real reason they are hiding the reasoning tokens: competitive advantage.
Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.
So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".
The challenge is that alignment ends up changing the models in ways that aren’t representative of the actual training set, and as I understand it this generally lowers performance even on aligned tasks. Further, the decision to summarize the chains of thought covers answers that wouldn’t pass alignment themselves without removal. From what I read, the final output is aligned but could have considered an unaligned CoT. In fact, because the reasoning tokens are in the context, they necessarily influence the final output even when that output complies with the alignment rules. There are a few other “only root could do this” points, which is to say anyone could implement these without secret sauce as long as they have a raw frontier model.
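A minimal sketch of the flow being described, assuming a hypothetical generate() function as a stand-in for a raw (pre-alignment) model call; this is not OpenAI's actual implementation, just an illustration of how an unaligned CoT can sit in the context of an aligned final answer while the user only ever sees a summary:

    # Hypothetical sketch only: generate() is a stand-in for a raw,
    # pre-alignment model call, not any real API.
    def generate(prompt: str) -> str:
        """Stub that would call the raw model; returns placeholder text here."""
        return f"<completion for: {prompt[:40]}...>"

    def answer(user_prompt: str) -> dict:
        # 1. Reasoning pass: the chain of thought is produced without the
        #    alignment constraints applied to user-facing text.
        cot = generate(f"Think step by step (hidden): {user_prompt}")

        # 2. The final answer is conditioned on that chain of thought, so even
        #    an "unaligned" CoT shapes the output the user receives.
        final = generate(
            f"{user_prompt}\n\n[hidden reasoning]\n{cot}\n\n"
            "Final, policy-compliant answer:"
        )

        # 3. The user is shown only a sanitized summary of the reasoning,
        #    never the raw chain of thought.
        summary = generate(
            f"Summarize this reasoning, omitting anything non-compliant:\n{cot}"
        )

        return {"answer": final, "cot_summary": summary}  # raw cot never returned

    print(answer("Why is the sky blue?"))

The point the sketch makes is the one above: step 2 keeps the raw CoT in the context, so it shapes the final answer even though it is filtered out of what the user sees in step 3.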