What is your goal? If d1, d2, d3, etc. make up the dataset over which you're trying to optimize, then the goal is to find some best-performing d_i. In that case you're not evaluating; you're optimizing. Your acquisition function even says so: https://rentruewang.github.io/bocoel/research/
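To make the optimizing-vs-evaluating point concrete, here is a minimal sketch of what an acquisition function does. The UCB form and all names here are my illustration, not bocoel's actual code:

```python
import numpy as np

def ucb_acquisition(mean, std, kappa=2.0):
    """Upper confidence bound: favors points with a high predicted score
    (exploitation) plus high uncertainty (exploration)."""
    return mean + kappa * std

# Hypothetical surrogate-model posterior over candidate samples d_1..d_4.
mean = np.array([0.60, 0.72, 0.55, 0.81])  # predicted LLM score per d_i
std = np.array([0.05, 0.20, 0.01, 0.03])   # predictive uncertainty per d_i

# Optimization: pick the single d_i that maximizes the acquisition value,
# i.e. chase the best-performing sample rather than characterize all of them.
best_i = int(np.argmax(ucb_acquisition(mean, std)))
print(f"next sample to query: d_{best_i + 1}")
```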
And in general, if you have an LLM that performs really well on one d_i, then who cares? The goal in LLM evaluation is to find a good-performing LLM overall.
Finally, it feels like your Abstract and other snippets were written by an LLM.
I disagree that the goal in evaluation "is to find a good-performing LLM overall". The goal in evaluation is to understand the performance of an LLM (on average). This approach is actually more about finding "areas" where the LLM behaves well and where it does not (via the Gaussian process approximation). That is indeed an important problem to look at. Often you just run an LLM evaluation on thousands of samples, some of them similar, and you don't learn anything new from the sample "what time is it, please" over "what time is it".
If instead you can reduce the number of samples to look at and automatically find "clusters" and their performance, you get a win. It won't be the average performance number, but it will hopefully give you an understanding of which things work how well in the LLM. Roughly what I mean is sketched below.
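Here is a toy sketch with made-up data; the scikit-learn GP surrogate and all names are my stand-ins, not necessarily what the paper uses:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy stand-ins: 2-D "embeddings" of prompts and an LLM score per evaluated prompt.
emb_evaluated = rng.normal(size=(50, 2))  # embeddings of prompts we actually ran
scores = np.tanh(emb_evaluated[:, 0])     # fake per-prompt performance

# Fit a GP: performance as a smooth function over embedding space.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(emb_evaluated, scores)

# Predict performance for prompts we never ran. Near-duplicates like
# "what time is it" vs "what time is it, please" land close in embedding
# space, so the GP answers them from neighbors instead of new LLM calls.
emb_unseen = rng.normal(size=(200, 2))
pred_mean, pred_std = gp.predict(emb_unseen, return_std=True)

# "Clusters" of weakness: regions where predicted performance is low
# and the GP is reasonably confident about it.
weak = emb_unseen[(pred_mean < 0.0) & (pred_std < 0.3)]
print(f"{len(weak)} unseen prompts fall in confidently weak regions")
```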
The main drawback here (as far as I can tell after this short glimpse at it) is the embedding itself. This only works well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that even small changes in the embedding space can result in vastly different results.
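One way to sanity-check that assumption on a concrete dataset (again a sketch I'm adding, not something from the paper): correlate pairwise embedding distances with pairwise score gaps.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Toy data: embeddings and a per-sample score (replace with real ones).
emb = rng.normal(size=(100, 8))
score = rng.uniform(size=100)

# Pairwise distances in embedding space vs. pairwise score differences.
emb_dist = pdist(emb)               # condensed pairwise distance vector
score_diff = pdist(score[:, None])  # |score_i - score_j| for each pair

r, p = pearsonr(emb_dist, score_diff)
print(f"embedding distance vs score gap: r={r:.3f} (p={p:.3g})")
# If r is near zero, nearby embeddings don't imply similar performance,
# and the GP surrogate's smoothness assumption is in trouble.
```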
Good luck.