Learning is always exploring a search space - literally deciding which choice would be most likely to get to an answer, and, if you already have a set of answers, which alterations would make for a better one.
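To put "learning as search" in concrete terms, here's a toy sketch - everything in it (the target, the score function, the mutate step) is made up purely for illustration - where "a better answer" is just whatever scores higher, and learning is keeping the alterations that improve the score:

```python
import random

random.seed(0)
TARGET = "hello world"                      # made-up stand-in for "a good answer"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def score(candidate):
    """Made-up scoring rule: how many characters already match the target."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate):
    """One 'alteration': change a single character at random."""
    i = random.randrange(len(candidate))
    return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

# Greedy search over the space of strings: keep any alteration that scores better.
best = "x" * len(TARGET)
while score(best) < len(TARGET):
    candidate = mutate(best)
    if score(candidate) > score(best):
        best = candidate

print(best)  # ends at "hello world"
```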
Like, I don't know what you think is wrong about that? The human/expert feedback is there to provide scores that we don't know how to fully codify yet - it's effectively an acknowledgment that we can't yet write down the "I know it when I see it" rule. And based on those scores, the model is updated and new outputs can be scored.
Direct human feedback - from an actual human - is the gold standard here, since it is an actual human who will be evaluating how well they like your deployed model.
Note that using codified HF (as is in fact already done - the actual human feedback is first used to train a proxy reward model) doesn't change things here - tuning the model to maximize this metric of human usability IS the goal. The idea of using RL is to do the search at training time rather than at inference time, where it would be massively more expensive. You can think of all the multi-token model outputs being evaluated by the reward model as branches of the search tree, and the goal of RL is to influence earlier token selection so that it leads towards these preferred outputs.
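To illustrate that last point - this is a toy sketch, not how any real RLHF pipeline is built, and reward_model here is a made-up stand-in for the learned proxy - a REINFORCE-style update over a tiny vocabulary, where the scorer only ever sees complete multi-token outputs but the update still shapes the earlier token choices:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, LR = 5, 3, 0.5

# Stand-in "policy": independent per-position logits over a tiny vocabulary.
logits = np.zeros((LENGTH, VOCAB))

def reward_model(seq):
    """Made-up proxy reward: 1.0 if the token ids are strictly increasing."""
    return float(all(a < b for a, b in zip(seq, seq[1:])))

def sample_sequence():
    seq, dists = [], []
    for t in range(LENGTH):
        p = np.exp(logits[t] - logits[t].max())
        p /= p.sum()
        tok = int(rng.choice(VOCAB, p=p))
        seq.append(tok)
        dists.append(p)
    return seq, dists

baseline = 0.0
for _ in range(2000):
    seq, dists = sample_sequence()
    r = reward_model(seq)                   # score the *whole* multi-token output
    advantage = r - baseline
    baseline = 0.99 * baseline + 0.01 * r   # running baseline to reduce variance
    for t, (tok, p) in enumerate(zip(seq, dists)):
        grad = -p.copy()
        grad[tok] += 1.0                    # d log pi / d logits = onehot(tok) - p
        logits[t] += LR * advantage * grad  # nudge earlier choices toward preferred outputs

print(sample_sequence()[0])                 # typically an increasing sequence, e.g. [0, 2, 4]
```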
This is no different from any other ML situation, though? Famously, people found out that Amazon's hands-free checkout thing was being offloaded to people in the cases where the system couldn't give a high-confidence answer. I would be shocked if those judgements were not then labeled and used for automated training later.
And I should say that I said "codified", but I don't mean just code - labeled training samples are fine here. It doesn't change the fact that finding a model that will give good answers is ultimately something that can be conceptualized as a search.
You are also blurring the scoring/reinforcement that happens at inference time with the work that is done at training time? The idea of using RL at training time is not just that the search would be expensive at inference time - the goal is to find the policies that are best to use at inference time.
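For contrast, here's a rough sketch of what doing that search at inference time would look like - best-of-n sampling against the reward model (all the names here are hypothetical) - versus having the preference already baked into the policy by training:

```python
import random

random.seed(0)

def reward_model(text):
    """Made-up stand-in for a learned preference/reward model."""
    return -abs(len(text) - 40)             # pretend humans prefer ~40-character answers

def generate(prompt):
    """Made-up stand-in for sampling one completion from the base policy."""
    words = [random.choice(["foo", "bar", "baz", "qux"]) for _ in range(random.randint(1, 12))]
    return prompt + " " + " ".join(words)

def best_of_n(prompt, n=16):
    # Inference-time search: pay for n generations *per query*, then keep the one
    # the reward model likes best.  RL at training time moves that cost into
    # training, so a single sample from the tuned policy is already the preferred kind.
    return max((generate(prompt) for _ in range(n)), key=reward_model)

print(best_of_n("Q: what does the model prefer?"))
```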
What is not accurate in that description?