Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
The paper they cite, "What matters in on-policy RL", finds that it makes little difference on their suite of test problems, and normalizing by the minibatch mean doesn't seem theoretically motivated if the goal is convergence to the optimal policy?
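In case it helps to pin down what I mean: I'm reading "minibatch advantage normalization" as standardizing the advantages to zero mean and unit std within each minibatch before the policy update, roughly like the sketch below (the function name and eps constant are mine, just for illustration):

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize advantages across the current minibatch (zero mean, unit std)."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# Example: per-sample advantages for one minibatch.
adv = torch.tensor([0.7, -1.2, 0.3, 2.1, -0.5])
print(normalize_advantages(adv))
```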
1. Removed KL Divergence
2. Normalize by total length (Dr. GRPO style; see the sketch after this list)
3. Minibatch normalization for advantages
4. Relaxing trust region
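For what it's worth, here is a minimal sketch of how I read items 1, 2, and 4 in a PPO/GRPO-style token-level loss. The clip values, shapes, and names are my own assumptions for illustration, not the authors' implementation:

```python
import torch

def clipped_policy_loss(logp_new, logp_old, advantages, mask,
                        clip_low=0.2, clip_high=0.3):
    """Token-level clipped surrogate loss, normalized by total token count.

    advantages: one scalar advantage per sequence, shape [B]
    logp_new / logp_old / mask: per-token tensors, shape [B, T]
    clip_high > clip_low gives an asymmetric (looser upward) clip range,
    which is one possible reading of "relaxing the trust region" (item 4).
    Note there is no KL penalty term anywhere (item 1).
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio per token
    adv = advantages.unsqueeze(-1)                         # broadcast sequence advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * adv
    per_token = -torch.min(unclipped, clipped) * mask      # mask out padding tokens
    # Divide by the total number of tokens in the minibatch rather than by
    # per-sequence length -- my reading of the "Dr. GRPO style" item 2.
    return per_token.sum() / mask.sum()

# Toy usage with random numbers, just to show the expected shapes.
B, T = 4, 8
logp_new = torch.randn(B, T)
logp_old = torch.randn(B, T)
advantages = torch.randn(B)
mask = torch.ones(B, T)
print(clipped_policy_loss(logp_new, logp_old, advantages, mask))
```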