The training samples are at best self-referential, or else refer to the unspoken expertise of whoever the sample came from (something the LLM is not privy to, since it has its own, different, aggregate body of knowledge).
For the model to predict "I don't know" as the continuation of (i.e. answer to) the input, that would have to be the statistically most likely response given the training samples, but as we've noted, the samples refer to their originator's knowledge, not to the aggregate knowledge of the training set/model.
Let's also note that LLMs deal in word statistics, not facts, and therefore "learning" something from one training sample does not trump a bunch of other samples professing ignorance about it - statistically a profession of ignorance is the best prediction.
If you wanted to change this, and have the LLM predict not only based on the individual training samples, but also sometimes based on an "introspective" assessment of its own knowledge (derived from the entire training set), then you would have to train it to do this, perhaps as a post-training step. But, think through in detail what it would take to do this ... How would you identify those cases where the model would have hallucinated a response and should be trained to output "I don't know" instead, and how would you identify those cases where a (statistically correct) prediction of ignorance should be trained to be overridden with a factual answer that was present in the training set?
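To make the identification problem concrete, here is one way it is sometimes approached - and note that everything named below (the sample_model helper, the exact-match comparison, the agreement threshold) is an assumption for illustration, not a specific recipe: sample the model several times on questions whose reference answers you already hold, and let the agreement between its own samples and the reference decide whether the training target should be the factual answer or a refusal.

```python
from typing import Callable, List, Tuple

def build_refusal_dataset(
    qa_pairs: List[Tuple[str, str]],                 # (question, reference_answer) pairs you trust
    sample_model: Callable[[str, int], List[str]],   # hypothetical helper: returns n sampled answers
    n_samples: int = 8,
    agreement_threshold: float = 0.75,
) -> List[Tuple[str, str]]:
    """Split known QA pairs into 'answer' vs 'I don't know' training targets,
    based on whether the model's own samples consistently match the reference."""
    training_targets = []
    for question, reference in qa_pairs:
        samples = sample_model(question, n_samples)
        # Fraction of samples agreeing with the reference (exact match here;
        # a real pipeline would need a softer semantic comparison).
        hits = sum(1 for s in samples if s.strip().lower() == reference.strip().lower())
        if hits / max(len(samples), 1) >= agreement_threshold:
            # The model evidently "knows" this: train it to give the factual answer,
            # overriding any statistically learned profession of ignorance.
            training_targets.append((question, reference))
        else:
            # The model would likely hallucinate here: train it to refuse instead.
            training_targets.append((question, "I don't know."))
    return training_targets

# Hypothetical usage:
#   dataset = build_refusal_dataset(qa_pairs, sample_model=my_sampler)
#   ...then fine-tune on `dataset` so refusals track the model's own knowledge.
```

The catch, of course, is that this only works for questions where you already have reference answers, and the comparison and threshold are crude proxies for "the model actually knows this" - which is exactly the difficulty the paragraph above is pointing at.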
It's really a very fundamental problem. Prediction is the basis of intelligence, but LLMs are predicting the wrong thing - word statistics. What you need for animal/human intelligence is to have the model predict facts/reality instead - as determined by continual learning and the feedback received from reality.