I'm not sure that's a metric you can rely on. LLMs are very sensitive to where items sit in the context, paying extra attention to the beginning and end of a list.
See the listwise approach in "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting": https://arxiv.org/abs/2306.17563
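One quick way to see the position effect for yourself is to score the same items under a few random orderings and watch the numbers move. A minimal, untested sketch; it assumes the OpenAI Python client, and the model name, query, and JSON reply format are all placeholders:

```python
import json
import random
from collections import defaultdict

from openai import OpenAI

client = OpenAI()
items = ["item A", "item B", "item C", "item D"]

scores = defaultdict(list)
for _ in range(5):
    # Present the same items in a fresh random order each call.
    shuffled = random.sample(items, len(items))
    prompt = (
        "Rate each item's relevance to <your query> from 0.0 to 1.0. "
        "Reply with only a JSON object mapping item text to score.\n"
        + "\n".join(f"- {it}" for it in shuffled)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; a real version would
    # validate or retry on parse failures.
    for item, score in json.loads(resp.choices[0].message.content).items():
        scores[item].append(score)

# If an item's score swings with its position in the list, the
# ordering is telling you about the prompt, not about the item.
for item, vals in scores.items():
    print(item, min(vals), max(vals))
```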
Why shouldn't you ask for uncertainty?
I love asking for scores / probabilities (usually giving a range, like 0.0 to 1.0) whenever I ask for a list; it makes the output much more usable. A rough sketch of what I mean is below.
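Something like this, to give a concrete idea; the query, output format, and threshold here are made up for illustration:

```python
# Ask for a score next to each list entry so the output can be
# filtered or sorted downstream instead of eyeballed.
prompt = """List papers relevant to "pairwise ranking prompting".
For each, append a relevance score between 0.0 and 1.0, formatted as:
- <title> | score: <0.0-1.0>
Only include entries you can actually support; don't pad the list."""

# The scores turn the model's list into something machine-usable, e.g.:
def keep_confident(lines: list[str], threshold: float = 0.7) -> list[str]:
    kept = []
    for line in lines:
        if "| score:" in line:
            score = float(line.rsplit("| score:", 1)[1])
            if score >= threshold:
                kept.append(line)
    return kept
```

The scores are only self-reported confidence, not calibrated probabilities, but even as rough signals they beat an unranked list.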