My observation is that the models are better at evaluating than they are generating; this is the technique used in the o1 models. They use unaligned hidden tokens as "thinking" steps that include evaluation of previous attempts.
I thought that was a good approach to vetting bad ideas.
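To make the "checking is easier than producing" point concrete, here is a toy sketch. It is not how o1 is implemented (those details are not public); it only assumes a weak generator that guesses and a cheap evaluator that can reliably reject bad guesses. The names `propose`, `check`, and `propose_and_check` are hypothetical and exist only for this illustration.

```python
import random

def propose(rng, n):
    # Hypothetical weak "generator": guesses a candidate factor of n at random.
    return rng.randint(2, n - 1)

def check(candidate, n):
    # Hypothetical "evaluator": verifying a factor is cheap and exact.
    return n % candidate == 0

def propose_and_check(n, attempts=1000, seed=0):
    # Generate candidates, evaluate each one, and remember rejected attempts.
    rng = random.Random(seed)
    rejected = []
    for _ in range(attempts):
        candidate = propose(rng, n)
        if check(candidate, n):
            return candidate, rejected
        rejected.append(candidate)
    return None, rejected  # no acceptable candidate within the budget

answer, rejected = propose_and_check(91)
print(answer, len(rejected))
```

The generator here is deliberately bad; the loop only works because the evaluator can be trusted to vet each attempt.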
> My observation is that the [o1-like] models are better at evaluating than they are generating
This is very good (it is a very good thing that you see the out-loud reasoning working well as judgement), but at this stage we face an architectural problem. The "model, exemplary" entities will judge iteratively, and in doing so both (a) bring the world model closer to truthfulness and completeness and (b) refine their judgement abilities and general intellectual proficiency. That (in a way) requires updating the main body of knowledge (including "functioning", i.e. proficiency over the better processes). The current architectures I know of are static... Instead, we want them to learn: to understand (not memorize), e.g., that Copernicus is better than Ptolemy, and to use the intellectual keys gained in subsequent relevant processes.
The main body of knowledge - notions, judgements and abilities - should be affected in a permanent way, to make it grow (like natural minds can).