I'm not sure that is a metric you can rely on. LLMs are very sensitive to where items sit in a list within the context, paying extra attention to the beginning and the end of the list.

See the discussion of the listwise approach in "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting": https://arxiv.org/abs/2306.17563
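One way to reduce that position sensitivity is to compare items two at a time and ask about each pair in both orders, so neither item always benefits from being shown first. Below is a rough sketch of that idea, not the paper's exact implementation. The `compare` callable is hypothetical: it stands in for an LLM call and is assumed to return "A" if the first passage is preferred and "B" otherwise. The `toy_compare` function is only a word-overlap stub so the example runs without a model.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator signature: compare(query, first, second) returns
# "A" if the first passage is preferred, "B" if the second one is.
Compare = Callable[[str, str, str], str]


def pairwise_scores(query: str, docs: List[str], compare: Compare) -> List[float]:
    """Score each document by comparing every pair in both orders."""
    scores = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        first = compare(query, docs[i], docs[j])   # docs[i] shown first
        second = compare(query, docs[j], docs[i])  # docs[j] shown first
        i_wins = int(first == "A") + int(second == "B")
        if i_wins == 2:
            scores[i] += 1.0      # both orders agree on docs[i]
        elif i_wins == 0:
            scores[j] += 1.0      # both orders agree on docs[j]
        else:
            scores[i] += 0.5      # the orders disagree: split the point
            scores[j] += 0.5
    return scores


def rank(query: str, docs: List[str], compare: Compare) -> List[int]:
    """Return document indices sorted by score, best first."""
    scores = pairwise_scores(query, docs, compare)
    return sorted(range(len(docs)), key=lambda k: scores[k], reverse=True)


def toy_compare(query: str, first: str, second: str) -> str:
    """Stand-in for an LLM call: word overlap, with ties going to the
    first passage to imitate a position bias."""
    q = set(query.lower().split())
    s1 = len(q & set(first.lower().split()))
    s2 = len(q & set(second.lower().split()))
    return "A" if s1 >= s2 else "B"


if __name__ == "__main__":
    docs = ["cats purr", "dogs bark loudly", "cats and dogs"]
    print(pairwise_scores("cats dogs", docs, toy_compare))
    print(rank("cats dogs", docs, toy_compare))
```

With the biased stub, the two tied passages each "win" when shown first, so the disagreement turns into a split point instead of a spurious preference. The cost is two calls per pair, which grows quadratically with the number of items.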