
I like the support for Vector DBs and Llama 2. I'm curious what influences shaped PromptTools, and how it differs from other tools in this space. For context, we've also released a prompt engineering IDE, ChainForge, which is open-source and has many of the features here, such as querying multiple models at once, prompt templating, evaluating responses with Python/JS code and LLM scorers, plotting responses, etc. (https://github.com/ianarawjo/ChainForge and a playground at http://chainforge.ai).

One big problem we're seeing in this space is over-trust in LLM scorers as 'evaluators'. I've personally seen that minor tweaks to a scoring prompt can sometimes produce vastly different evaluation 'results'. Given recent debacles (https://news.ycombinator.com/item?id=36370685), I'm wondering how we can design LLMOps tools for evaluation that both support the use of LLMs as scorers and caution users about their results. Are you thinking similarly about this question, or have you seen usability testing that points to over-trust in 'auto-evaluators' as an emerging problem?




Great question, and ChainForge looks interesting!

We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it.
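
To give a sense of the semantic-similarity check, here's a rough sketch of the idea (sentence-transformers is used purely for illustration; this isn't our exact implementation, just the concept):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustration only: compare a model's response against an expected answer
    # via cosine similarity of their embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def similarity_score(response: str, expected: str) -> float:
        emb = model.encode([response, expected])
        return float(np.dot(emb[0], emb[1]) /
                     (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))

    # Higher means closer to the expected answer; a threshold turns this into pass/fail.
    print(similarity_score("The capital of France is Paris.",
                           "Paris is the capital of France."))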

Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building.


Thanks for the clarification! Yes, I see now that the auto-evals here are more AI-agent-ish than a one-shot approach. They still have the trust issue, though.

For suggestions, one thing I'm curious about is how we can offer out-of-the-box benchmark datasets and do so responsibly. ChainForge supports most OpenAI evals, but in adding this we realized the quality of OpenAI Evals is really _sketchy_... duplicate data, questionable metrics, etc. OpenAI has shown that trusting the community to make benchmarks is perhaps not a good idea; we should instead make it easier for scientists/engineers to upload their benchmarks, and easier for others to run them. That's one thought, anyway.


One approach we've been working on is having multiple LLMs score each other. Here is the design with an example of how that works: https://github.com/HashemAlsaket/prompttools/pull/1

In short: the LLMs score each other's responses, we keep the top 50%, and we repeat until a single top response remains.


What does the 'top 50%' of responses mean here, though? You'd need a ground truth for how 'good' each response was to calculate that, and if you had ground truth, there'd be no need to use an LLM evaluator to begin with.

If you mean trusting the LLMs' scores to pick the 'top' 50% of the responses they grade, that doesn't get around the issue of over-trusting the LLMs' scores.


For now, the design is basic:

User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"

LLM rates responses of all other LLMs

All other LLMs do the same

Then we take the average score for each response. The LLMs that produced the top 50% of responses respond again, and we repeat until one response with the highest score remains.
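
In code, the loop looks roughly like this (a sketch only; generate_response and rate_response are placeholders standing in for the real model calls):

    import random

    def generate_response(model, prompt):
        # Placeholder: call the actual model here.
        return f"{model} answers: {prompt}"

    def rate_response(judge, prompt, response):
        # Placeholder: ask `judge` to "Rate this response ... on a scale of 1-10".
        return random.randint(1, 10)

    def peer_review_tournament(models, prompt):
        # Every model answers the prompt.
        candidates = {m: generate_response(m, prompt) for m in models}
        while len(candidates) > 1:
            avg_scores = {}
            for author, response in candidates.items():
                # Each response is rated by every *other* model.
                scores = [rate_response(judge, prompt, response)
                          for judge in candidates if judge != author]
                avg_scores[author] = sum(scores) / len(scores)
            # Authors of the top 50% of responses survive and answer again.
            ranked = sorted(avg_scores, key=avg_scores.get, reverse=True)
            survivors = ranked[:max(1, len(ranked) // 2)]
            if len(survivors) == 1:
                # Winner: the highest-scoring response from this round.
                return survivors[0], candidates[survivors[0]]
            candidates = {m: generate_response(m, prompt) for m in survivors}
        # Degenerate case: only one model to begin with.
        return next(iter(candidates.items()))

    print(peer_review_tournament(["model-a", "model-b", "model-c", "model-d"],
                                 "Explain vector databases."))

So the 'top 50%' cut comes entirely from the models' own averaged ratings; there's no external ground truth in the loop.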



