Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs

fatso784 · on Aug 1, 2023

I like the support for Vector DBs and LLaMa-2. I'm curious as to whether and what influences compelled PromptTools, and how it differs from other tools in this space. For context, we've also released a prompt engineering IDE, ChainForge, which is open-source and has many of the features here, such as querying multiple models at once, prompt templating, evaluating responses with Python/JS code and LLM scorers, plotting responses, etc (https://github.com/ianarawjo/ChainForge and a playground at http://chainforge.ai).

One big problem we're seeing in this space is over-trust in LLM scorers as 'evaluators'. I've personally seen that minor tweaks to a scoring prompt can sometimes result in vastly different evaluation 'results.' Given recent debacles (https://news.ycombinator.com/item?id=36370685), I'm wondering how we can design LLMOps tools for evaluation which both support the use of LLMs as scorers, but also caution users about their results. Are you thinking similarly about this question, or seen usability testing which points to over-trust in 'auto-evaluators' as an emerging problem?

krawfy · on Aug 1, 2023

Great question, chainforge looks interesting!

We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it.

Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building.

fatso784 · on Aug 1, 2023

Thanks for the clarification! Yes, I see now that auto-evals here is more AI agent-ish, than a one-shot approach. Still has the trust issue.

For suggestions, one thing I'm curious about is how we can have out-of-the-box benchmark datasets and do this responsibly. ChainForge supports most OpenAI evals, but from adding this we realized the quality of OpenAI Evals is really _sketchy_... duplicate data, questionable metrics, etc. OpenAI has shown that trusting the community to make benchmarks is perhaps not a good idea; we should instead make it easier for scientists/engineers to upload their benchmarks and make it easier for others to run them. That's one thought, anyway.

hashemalsaket · on Aug 1, 2023

One approach we've been working on is having multiple LLMs score each other. Here is the design with an example of how that works: https://github.com/HashemAlsaket/prompttools/pull/1

In short: Pick top 50% responses, LLMs score each other, repeat until top response remains

fatso784 · on Aug 1, 2023

What does 'top 50%' responses mean here, though? You'd need to have a ground truth of how 'good' each score was to calculate that --and if you had ground truth, no need to use an LLM evaluator to begin with.

If you mean trusting the LLM scores to pick the 50% 'top' responses they grade, this doesn't get around the issue of overly trusting the LLM's scores.

hashemalsaket · on Aug 1, 2023

For now, the design is basic:

User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"

LLM rates responses of all other LLMs

All other LLMs do the same

Then we take the average score of each response. The LLMs that produced the top 50% of responses will respond again until one response with the highest score remains.

robszumski · on Aug 2, 2023

I'll put in a friendly request for a Dockerfile in the repo.

I've been trying out AI tools as test cases for our supply chain security platform and had to cobble a Dockerfile together to get this running easily. Really cool tool overall!

Across 200+ transitive dependencies in prompttool, risk prioritization can remove 97% of security investigation in my quick test, and I most of these came from a thick base image. I'd love one curated from y'all.

catlover76 · on Aug 1, 2023

Super cool, the need for tooling like this is something one realizes pretty quickly when starting to build apps that leverage LLMs.

krawfy · on Aug 1, 2023

Glad you think so, we agree! If you end up trying it out, we'd love to hear what you think, and what other features you'd like to see.

esafak · on Aug 1, 2023

I'd like to see support for qdrant.

krawfy · on Aug 1, 2023

We've actually been in contact with the qdrant team about adding it to our roadmap! Andre (CEO) was asking for an integration. If you want to work on the PR, we'd be happy to work with you and get that merged in

kacperlukawski · on Aug 2, 2023

Qdrant here! We're already working on that :D

politelemon · on Aug 1, 2023

Similar tool I was about to look at: https://github.com/promptfoo/promptfoo

I've seen this in both tools but I wasn't able to understand: In the screenshot with feedback, I see thumbs up and thumbs down options. Where do those values go, what's the purpose? Does it get preserved across runs? It's just not clicking in my head.

krawfy · on Aug 1, 2023

For now, we just aggregate those across the models / prompts / templates you're evaluating so that you can get an aggregate score. You can export to CSV, JSON, MongoDB, or Markdown files, and we're working on more persistence features so that you can get a history of which models / prompts / templates you gave the best scores to, and keep track of your manual evaluations over time.

neelm · on Aug 1, 2023

Something like this is going to be needed to evaluate models effectively. Evaluation should be integrated into automated pipelines/workflows that can scale across models and datasets.

krawfy · on Aug 1, 2023

Thanks Neel! We totally agree that automated evals will become an essential part of production LLM systems.

mmaia · on Aug 1, 2023

I like that it's not limited to single prompts and allows to have chat messages. It would be great if `OpenAIChatExperiment` could also handle OpenAI's function calling.

krawfy · on Aug 1, 2023

Good catch! We're looking to add function calling support very soon, and have an open issue for it on our GitHub. If you want to raise a PR and add it, we'll help you land it and get it merged

tikkun · on Aug 1, 2023

This looks great, thanks

See also this related tool: https://news.ycombinator.com/item?id=36907074

krawfy · on Aug 1, 2023

Awesome! Let us know if there's anything from that tool that you think we should add to PromptTools

8awake · on Aug 1, 2023

Great work! We will make use of that with https://www.formula8.ai

nivekt · on Aug 1, 2023

Thank you! If you have any feedback or feature requests, don't hesitate to reach out.

pk19238 · on Aug 1, 2023

This is super cool man!

nivekt · on Aug 1, 2023

Thanks! We would appreciate any feedback or feature requests!