That still wouldn't help here. We don't want the prediction confidence that the sequence of words you produced might appear in a valid, human-written English sentence; we want the prediction confidence that the sentence is factually accurate. These models aren't given that kind of data to train on, and I'm not sure how they even could be. There are oodles and oodles of human-generated text out there, but little in the way of verification of how much of it is true, to say nothing of categories of language, like the imperative and the artistic, that have no truth value at all.
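To make the distinction concrete, here's a deliberately tiny sketch of a likelihood-based objective: a toy bigram model scores how plausible a word sequence is given its training text, and nothing more. The corpus, sentences, and scoring function are all made up for illustration; the point is only that a false-but-familiar sentence can out-score a true-but-unseen one.

```python
from collections import Counter

# Toy "training corpus". The false claim appears verbatim in it;
# the true claim does not. (Hypothetical data, purely illustrative.)
corpus = [
    "the moon is made of cheese",
    "the moon is made of cheese they say",
    "the moon is bright tonight",
]

tokens = [w for s in corpus for w in s.split()]
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def plausibility(sentence):
    """Product of add-one-smoothed bigram probabilities.
    Higher means 'looks more like the training text' --
    it says nothing about whether the sentence is true."""
    words = sentence.split()
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))
    return p

false_but_fluent = "the moon is made of cheese"
true_but_unseen = "the moon is made of rock"

# The factually wrong sentence scores higher, because the objective
# rewards resemblance to the corpus, not accuracy.
print(plausibility(false_but_fluent) > plausibility(true_but_unseen))
```

Real language models are vastly more sophisticated than this, but the training signal is of the same kind: likelihood of the text, with no truth label anywhere in the loss.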