For now, the design is basic:

User to LLM: "Rate this response to the following prompt on a scale of 1-10, where 1 is a poor response and 10 is a great response: [response]"

The LLM rates the responses of all the other LLMs.

All the other LLMs do the same.

Then we take the average score of each response. The LLMs that produced the top 50% of responses respond again, and the rating repeats until only the single highest-scoring response remains.
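The elimination loop described above could be sketched roughly as follows. This is only an illustration under several assumptions: `generate` and `rate` are hypothetical callables standing in for real LLM calls, "top 50%" is rounded down, ties keep their original order, and the round that would leave a single survivor ends the tournament instead of asking that model to respond again.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical call signatures; neither is part of the original design.
Generate = Callable[[str, str], str]      # (model, prompt) -> response text
Rate = Callable[[str, str, str], float]   # (judge, prompt, response) -> score 1-10


def average_scores(responses: Dict[str, str], prompt: str, rate: Rate) -> Dict[str, float]:
    """Average the scores each response gets from every model except its author."""
    return {
        author: mean(
            rate(judge, prompt, text) for judge in responses if judge != author
        )
        for author, text in responses.items()
    }


def run_tournament(
    models: List[str], prompt: str, generate: Generate, rate: Rate
) -> Tuple[str, str]:
    """Repeat respond-and-rate rounds, keeping the top half, until one response wins."""
    survivors = list(models)
    if len(survivors) < 2:
        raise ValueError("peer rating needs at least two models")
    while True:
        responses = {m: generate(m, prompt) for m in survivors}
        scores = average_scores(responses, prompt, rate)
        ranked = sorted(survivors, key=lambda m: scores[m], reverse=True)
        keep = len(ranked) // 2  # assumption: "top 50%" rounds down
        if keep <= 1:
            # Only one response would survive, so it is the winner.
            winner = ranked[0]
            return winner, responses[winner]
        survivors = ranked[:keep]
```

In practice, `rate` would send the rating prompt quoted above to the judging model and parse a number from its reply; that parsing step is left out here because the design does not specify it.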