Same. "Evaluate" and "corpus" need to be defined. I don't think OP intended this to be clickbait but without clarification it sounds like they're claiming 10x faster inference, which I'm pretty sure it's not.
Hi, OP here. It's not 10 times faster inference, but faster evaluation. You use evaluation on a dataset to check if your model is performing well. This takes a lot of time (might be more than training if you are just finetuning a pre-trained model on a small dataset)!
So the pipeline goes training -> evaluation -> deployment (inference).
Multiple-choice tests, LLM-as-judge evals (e.g. have GPT-4 rate an answer, or use M-of-N GPT-4 ratings as pass/fail), perplexity (i.e. how well the model predicts the text of a given corpus).
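For the perplexity one, here's a minimal sketch with Hugging Face transformers (the model name and text are just placeholders, not tied to any particular benchmark):

```python
# Rough sketch: perplexity of a causal LM on a small text sample.
# Assumes the `transformers` and `torch` packages; "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy loss
    # over the sequence; perplexity is just exp(loss).
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```

Lower perplexity means the model assigns higher probability to the reference text, and no human needs to look at anything.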
There are lots of ways to evaluate without humans. Most (nearly all) LLM benchmarks are fully automated.
The "eval" phase is done after a model is trained to assess its performance on whatever tasks you wanted it to do. I think this is basically saying, "don't evaluate on the entire corpus, find a smart subset."
Hi, OP here. So you evaluate an LLM on a corpus to measure its performance, right? Bayesian optimization is used here to select points (in the latent space), i.e. to decide where to evaluate the LLM next. To be precise, entropy search is used, coupled with some latent-space reduction techniques like an N-sphere representation and embedding whitening. Hope that makes sense!
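Not the library's actual code, but here's a toy sketch of the general flavor: fit a surrogate over example embeddings and pick the next example to score where the surrogate is most uncertain. This stand-in uses a scikit-learn GP with plain uncertainty sampling instead of the entropy search and latent-space tricks mentioned above, and every name and number in it is made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder "latent space": embeddings of 1,000 eval examples.
embeddings = rng.normal(size=(1000, 16))

def run_llm_eval(idx):
    # Hypothetical stand-in for actually scoring example `idx` with the LLM.
    return float(np.tanh(embeddings[idx, :3].sum()) + 0.05 * rng.normal())

# Score a small random seed set first.
evaluated = list(rng.choice(len(embeddings), size=20, replace=False))
scores = [run_llm_eval(i) for i in evaluated]

for _ in range(50):  # evaluate 50 more examples, chosen adaptively
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
    gp.fit(embeddings[evaluated], scores)
    mean, std = gp.predict(embeddings, return_std=True)
    std[evaluated] = -np.inf      # don't re-pick already-evaluated examples
    nxt = int(np.argmax(std))     # most uncertain example
    evaluated.append(nxt)
    scores.append(run_llm_eval(nxt))

# Estimate overall performance from the surrogate instead of the full corpus.
print("Estimated mean score:", gp.predict(embeddings).mean())
```

The point is that you only call the expensive LLM scoring function 70 times instead of 1,000, and let the surrogate fill in the rest.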
Perhaps I should clarify it in the project README. Evaluation is the phase where you check how well your model is performing. So the pipeline goes training -> evaluation -> deployment (inference), which maps onto the usual supervised-learning splits: training set -> validation set -> test set.
I know what an LLM is and I know very well what Bayesian optimization is. But I don't understand what this library is trying to do.
I am guessing it's trying to test the model's ability to generate correct and relevant responses to a given input.
But who is the judge?