Recently I've been working on making LLM evaluations faster by using Bayesian optimization to select a sensible subset of the evaluation corpus.
Bayesian optimization is used because it handles the exploration/exploitation trade-off well for expensive black-box functions (here, the LLM).
I would love to hear your thoughts and suggestions on this!
This is a cool idea -- is this an inner-loop process (i.e. after each LLM evaluation, the output is considered to choose the next sample) or a pre-loop process (get a subset of samples before tests are run)?
AFAICT, this is a more advanced way of using embeddings (which can encode the "vibes similarity", not an official term, of prompts) to determine where you get the most "bang for your buck" in terms of testing.
For instance, if there are three conversations that you can use to test if your AI is working correctly:
(1) HUMAN: "Please say hello"
AI: "Hello!"
(2) HUMAN: "Please say goodbye"
AI: "Goodbye!"
(3) HUMAN: "What is 2 + 2?"
AI: "4!"
Let's say you can only pick two conversations to evaluate how good your AI is. Would you pick 1 & 2? Probably not. You'd pick 1 & 3, or 2 & 3.
Because embeddings let us determine how similar in vibes things are, we have a tool for automatically searching our dataset for things with very different vibes, meaning each evaluation run is more likely to return new information about how well the model is doing.
My question to the OP was mostly about whether this "vibe-differentiated dataset" is constructed before the evaluation run or populated gradually, based on each individual test-case result.
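To make that concrete, here's a rough sketch (my illustration, not the OP's code; the sentence-transformers package and model name are just assumptions) of greedily picking a maximally vibe-diverse subset from embeddings:

```python
# Illustration only: farthest-point (max-min) selection over sentence
# embeddings, so each picked prompt is as dissimilar as possible from
# the ones already chosen.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

prompts = ["Please say hello", "Please say goodbye", "What is 2 + 2?"]

model = SentenceTransformer("all-MiniLM-L6-v2")          # any embedder works
emb = model.encode(prompts, normalize_embeddings=True)   # unit-length vectors

def diverse_subset(emb: np.ndarray, k: int) -> list[int]:
    """Greedy max-min cosine-distance selection."""
    chosen = [0]                                  # start from an arbitrary point
    while len(chosen) < k:
        sim = emb @ emb[chosen].T                 # cosine similarity to the chosen set
        closeness = sim.max(axis=1)               # how close each point is to that set
        closeness[chosen] = np.inf                # never re-pick a chosen point
        chosen.append(int(np.argmin(closeness)))  # take the farthest remaining point
    return chosen

print(diverse_subset(emb, k=2))  # likely [0, 2]: hello plus the math prompt, skipping the similar goodbye
```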
That's probably the intent, but I don't know if this actually achieves it (I have another comment about the use of bayesopt here). But even if it did, bayesopt operates sequentially (it's a Sequential Model-Based Optimizer, or SMBO), so the trajectory of queries would differ across the LLMs being evaluated. Unless there is something to correct this cascading bias, I don't know if you could use this to compare LLMs, or to obtain a score comparable to standard reported numbers.
On a different note, if all we want is a diverse set of representative samples (based on embeddings), there are algorithms like DivRank that do that quite well.
What is your goal? If d1, d2, d3, etc. is the dataset over which you're optimizing, then the goal is to find some best-performing d_i. In that case, you're not evaluating, you're optimizing. Your acquisition function even says so: https://rentruewang.github.io/bocoel/research/
And in general, if you have an LLM that performs really well on one d_i, then who cares. The goal in LLM evaluation is to find an LLM that performs well overall.
Finally, your abstract and other snippets read as if an LLM wrote them.
I disagree that the goal in "evaluation is to find a good performing LLM overall". The goal in evaluation is to understand the performance of an LLM (on average). This approach is really about finding "areas" where the LLM behaves well and where it does not (via the Gaussian process approximation). That is indeed an important problem to look at. Often you just run an LLM evaluation on thousands of samples, some of them similar, and you don't learn anything new from the sample "what time is it, please" over "what time is it".
If instead you can reduce the number of samples to look at and automatically find "clusters" and their performance, you get a win. It won't be the "average performance number", but it will (hopefully) give you an understanding of which things work how well in the LLM.
The main drawback of this (as far as I can tell after a short glimpse at it) is the embedding itself. This only works well if distance in the embedding space really correlates with performance. However, we know from adversarial attacks that small changes in the embedding space can already result in vastly different results.
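To illustrate what I mean by the Gaussian process picture, here's a toy sketch (not the project's code; the embeddings and scores are fabricated, and it only works to the extent that embedding distance tracks performance):

```python
# Toy sketch: fit a GP over embedding space using scores from the few
# samples actually evaluated, then predict (with uncertainty) everywhere
# else to expose "areas" where the model is likely weak or strong.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 8))        # stand-in prompt embeddings

def hidden_score(x):
    # Pretend the model is weak wherever the first embedding dim is low.
    return 1.0 / (1.0 + np.exp(-x[:, 0]))

evaluated = rng.choice(500, size=40, replace=False)   # the samples we paid to run
scores = hidden_score(embeddings[evaluated])

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(embeddings[evaluated], scores)

mean, std = gp.predict(embeddings, return_std=True)
weak_region = np.argsort(mean)[:5]     # prompts predicted to be handled worst
print(weak_region, std[weak_region])   # high std = the GP is still unsure there
```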
Same. "Evaluate" and "corpus" need to be defined. I don't think OP intended this to be clickbait but without clarification it sounds like they're claiming 10x faster inference, which I'm pretty sure it's not.
Hi, OP here. It's not 10 times faster inference, but faster evaluation. You use evaluation on a dataset to check whether your model is performing well. This takes a lot of time (it might take longer than training if you are just finetuning a pre-trained model on a small dataset)!
So the pipeline goes training -> evaluation -> deployment (inference).
Multiple-choice tests, LM eval (e.g. have GPT-4 rate an answer, or use M-of-N GPT-4 ratings as pass/fail), perplexity (i.e. how accurately it can reproduce a corpus it was trained on).
Lots of ways to evaluate without humans. Most (nearly all) LLM benchmarks are fully automated, without any humans involved.
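For example, a perplexity check is fully automated end to end; minimal sketch (gpt2 is just a stand-in for whatever model you're evaluating):

```python
# Sketch: perplexity of a causal LM on a piece of held-out text.
# Requires the transformers and torch packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    out = model(**inputs, labels=inputs["input_ids"])

print(f"perplexity: {torch.exp(out.loss).item():.2f}")  # lower is better
```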
The "eval" phase is done after a model is trained to assess its performance on whatever tasks you wanted it to do. I think this is basically saying, "don't evaluate on the entire corpus, find a smart subset."
Hi, OP here. So you evaluate LLMs on corpora to measure their performance, right? Bayesian optimization is here to select points (in the latent space) and tell the LLM where to evaluate next. To be precise, entropy search is used here (coupled with some latent space reduction techniques like N-sphere representation and embedding whitening). Hope that makes sense!
Perhaps I should clarify it in the project README. It's the phase where you evaluate how well your model is performing. So the pipeline goes training -> evaluation -> deployment (inference), corresponding to the dataset splits in supervised learning: training (training set) -> evaluation (validation set) -> deployment (test set).
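Roughly, the whitening and N-sphere parts look something like this (a simplified sketch of the general technique, not the exact code in the repo):

```python
# Simplified sketch: PCA-whiten embeddings (decorrelate dimensions,
# equalize variance, drop to k dims), then project onto the unit sphere.
import numpy as np

def whiten(embeddings: np.ndarray, k: int = 32) -> np.ndarray:
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the top-k principal directions and rescale each to unit variance.
    return (centered @ vt[:k].T) / (s[:k] / np.sqrt(len(embeddings) - 1))

def to_sphere(x: np.ndarray) -> np.ndarray:
    # Normalize rows so every point lies on the unit N-sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

raw = np.random.default_rng(0).normal(size=(1000, 384))  # fake sentence embeddings
latent = to_sphere(whiten(raw, k=32))
print(latent.shape)  # (1000, 32), each row unit length
```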
This, exactly - what is meant by evaluate in this context? Is this more efficient inference using approximation, so you can create novel generations, or is it some test of model attributes?
What the OP is doing here is completely opaque to the rest of us.
Evaluation refers to the phase after training where you check how well the training went.
Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at evaluation. Evaluation can be slow (it might even be slower than training if you're finetuning on a small domain-specific subset)!
I know what evaluation is, and inference, and training. Deployment means to deploy - to put a model in production. It does not mean inference. Inference means to input a prompt into a model and get the next token, or tokens as the case may be. Training and inference are closely related, since during training, inference is run and the error given by the difference between the prediction and target is backpropagated, etc.
Evaluation is running inference over a suite of tests and comparing the outcomes to some target ideal. An evaluation on the MMLU dataset lets you run inference on zero and few shot prompts to test the knowledge and function acquisition of your model, for example.
So is your code using Bayesian Optimization to select a subset of a corpus, like a small chunk of the MMLU dataset, that is representative of the whole, so you can test on that subset instead of the whole thing?
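For concreteness, the kind of eval loop I mean, restricted to whatever subset gets selected, is roughly this (ask_llm and the data layout are hypothetical):

```python
# Toy multiple-choice eval: run inference only on the selected subset
# and compare against gold answers.
from typing import Callable, Sequence

def evaluate_subset(
    ask_llm: Callable[[str], str],     # hypothetical: returns the model's letter choice
    questions: Sequence[dict],         # each item: {"prompt": str, "answer": "A".."D"}
    subset_indices: Sequence[int],     # e.g. the indices Bayesian optimization picked
) -> float:
    correct = 0
    for i in subset_indices:
        item = questions[i]
        prediction = ask_llm(item["prompt"]).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(subset_indices)

# Usage sketch: accuracy = evaluate_subset(my_model, mmlu_items, [3, 17, 42])
```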
This is becoming so common in AI discussions. Everyone with a real use case is opaque, or just flat out doesn't talk. The ones who are talking have toy use cases. I think it's because it's so hard to build a moat, and techniques are one of the ways to build one.
Hi, OP here. I'd have to disagree somewhat. You raised some interesting points, but I don't think something qualifies as a *moat* if it can be overcome just by sharing the use cases. For example, we all know Google's use case is search, but no one has built a search engine as good as theirs. Their moat is in their technology and brand recognition.
Not to disagree with your argument as a whole, but Google's moat hasn't been technological for years; it comes instead from their ability to be the default search engine everywhere they can, including paying Apple billions for that position.
"Evaluation" has a pretty standard meaning in the LLM community the same way that "unit test" does in software. Evaluations are suites of challenges presented to an LLM to evaluate how well it does as a form of bench-marking.
Nobody would chime in on an article on "faster unit testing in software with..." and complain that it's not clear because "is it a history unit? a science unit? what kind of tests are those students taking!?", so I find it odd that on HN people often complain about something similar for a very popular niche in this community.
If you're interested in LLMs, the term "evaluation" should be very familiar, and if you're not interested in LLMs then this post likely isn't for you.
There's lots to evaluate. If you're evaluating model quality, there are many benchmarks all trying to measure different things: accuracy in translation, common sense reasoning, how well it stays on topic, whether it can regurgitate a reference in the prompt text, how biased the output is along societal dimensions, other safety measures, etc. I'm in the field but not an LLM researcher per se, so perhaps this is more meaningful to others, but given the post it seems useful to answer my question, which was: what _exactly_ is being evaluated?
In particular this is only working off the encoded sentences so it seems to me that things that involve attention etc aren’t being evaluated here.
Unit testing isn't an overloaded term. Evaluation by itself is overloaded, though "LLM evaluation" disambiguates it. I first parsed the title as 'faster inference' rather than 'faster evaluation' even being aware of what LLM evaluation is, because that's a probable path given 'show' 'faster' and 'LLM' in the context window.
That misreading could also suggest some interesting research directions. Bayesian optimization to choose some parameters which guide which subset of the neurons to include in the inference calculation? Why not.
Hi, OP here, sorry for the late reply. I am not actually "evaluating" everything, but rather using the "side effects" of Bayesian optimization, which allows zooming in and out on regions of the latent space. Since embedders are so fast compared to LLMs, this saves time by sparing the LLM from evaluating similar queries. Hope that makes sense!
I looked through the github.io documentation and skimmed the code and research article draft. Correct me if I am wrong. What I think you are doing (at a high level) is: you create a corpus of QA tasks, embeddings, and similarity metrics. Then you somehow use NLP scoring and Bayesian optimization to find a subset of the corpus that best matches a particular evaluation task. Then you can just evaluate the LLM on this subset rather than the entire corpus, which is much faster.
I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, as well as explaining your method in specific, precise language in the README and other documentation (preferably in the README). You should make it clear that you are using GLUE and Big-Bench for the evaluation (as well as any other evaluation benchmarks you are using). You should also be explicit about which LLMs and embedding models you have tested and what datasets you used to train and evaluate on. You must also add graphs and tables showing your method's speed and evaluation performance compared to the SOTA.

I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time viewers). However, the descriptions of the classes are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names to get an idea of what Score was trying to do. That's true for pretty much all the classes. Also, you need to explain what the factory classes are and how they differ from the model classes, e.g. why does the bocoel.models.adaptors class require a score and a corpus (from the overview), but factories.adaptor require "GLUE", lm, and choices (looking at the code from examples/getting_started/__main__.py)? That said, I do like the fact that you have an example (although I haven't tried running it).
Thanks for the feedback! The reason the "code" part is more complete than the "research" part is that I originally planned for this to just be a hobby project, and only much later decided to try to be serious and make it a research work.
Not trying to make excuses though. Your points are very valid and I will take them into account!
OP here, I came up with this idea because I was chatting with a friend about how to make LLM evaluations fast (which is so painfully slow on large datasets) and realized that somehow no one had tried it. So I decided to give it a go!
I designed two modes in the project: exploration mode and exploitation mode.
Exploration mode uses entropy search to explore the latent space (used for evaluating the LLM on the selected corpus), and exploitation mode is used to figure out how well or badly the model is performing on which regions of the selected corpus.
For accurate evaluations, exploration is used. However, I'm also working on a visualization so that users can see how well the model performs in which region (courtesy of the Gaussian process models built into Bayesian optimization), and that is where exploitation mode comes in handy.
Sorry for the slightly messy explanation. Hope it clarifies things!
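If it helps, here's a toy sketch of the two modes (not the actual project code; real entropy search is more involved than the max-variance stand-in used here, and the embeddings and scores are faked):

```python
# Toy BO loop over fake prompt embeddings. "explore" queries the most
# uncertain region; "exploit" queries the region predicted to score worst.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def next_index(gp, candidates, observed, mode):
    mean, std = gp.predict(candidates, return_std=True)
    if mode == "explore":
        std[observed] = -np.inf          # never re-query an evaluated prompt
        return int(np.argmax(std))       # highest posterior uncertainty
    mean[observed] = np.inf
    return int(np.argmin(mean))          # lowest predicted score (weakest region)

rng = np.random.default_rng(1)
candidates = rng.normal(size=(200, 4))               # stand-in prompt embeddings
hidden_score = 1 / (1 + np.exp(-candidates[:, 0]))   # unknown per-prompt performance

observed = [0, 1, 2]                                 # pretend a few prompts are evaluated
scores = [hidden_score[i] for i in observed]
for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(candidates[observed], scores)
    i = next_index(gp, candidates, observed, mode="explore")
    observed.append(i)
    scores.append(hidden_score[i])                   # the only expensive "LLM call" per step

print(sorted(observed))
```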
I don't entirely understand what two models mean here, because typically the search strategy (or acquisition function) in bayesopt - which in your case seems to be some form of entropy search (ES) - decides the explore-vs-exploit tradeoff for itself (possibly with some additional hyperparams ofc). For ex., ES would do this one way, Expected Improvement (EI) would do it differently, etc. - all this in the service of the bayesopt objective you want to maximize (or minimize).
Assuming that you mean this objective when you mention exploitation, which here is based on the model performing well, wouldn't it just pick queries that the model can (or is likely to) answer correctly? This would be a very optimistic evaluation of the LLM.
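(For reference, by EI I mean the usual closed-form acquisition over the GP posterior; rough sketch for a maximization objective:)

```python
# Expected Improvement for maximization, given the GP posterior mean/std
# at candidate points and the best observed value so far.
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far, xi=0.01):
    # EI(x) = (mu - f_best - xi) * Phi(z) + sigma * phi(z),  z = (mu - f_best - xi) / sigma
    std = np.maximum(std, 1e-12)              # guard against zero variance
    improvement = mean - best_so_far - xi
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)

# The candidate with the largest EI is queried next; xi nudges explore vs. exploit.
```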
Hi, OP here. I would say not really, because the goals are different. Although both use retrieval techniques, RAG wants to augment your query with factual information, whereas here we retrieve in order to evaluate on as few queries as possible (with performance guaranteed by Bayesian optimization).