I wonder if a language model can be treated as a policy over token-level actions instead of full response actions. Then each response from the language model is a full rollout of the policy. In math and coding, the reward for the response can be evaluated. This is not how DeepSeek works now, right? It treats full responses from the language model as the action if I understand correctly.