At temperature 0 they are effectively producing the token that maximizes a weighted sum of base LM probability and model reward.
At temperature 0 they are effectively producing the token that maximizes a weighted sum of base LM probability and model reward.