How successful was this effort? How did you know how successful it was?

jarulraj · on Oct 18, 2023

Great question! We iterated on the prompt for several days and manually verified the results for ~100 users.

The results were pretty good: https://gist.github.com/gaurav274/506337fa51f4df192de78d1280...

Another interesting aspect was the money spent on LLMs. We could have directly used GPT-4 to generate the "golden" table; however, it's a bit expensive — costing $60 to process the information of 1000 users. To maintain accuracy while reducing costs significantly, we set up an LLM model cascade in the EvaDB query, running GPT-3.5 before GPT-4, leading to a 11x cost reduction ($5.5).

Query 1: https://github.com/pchunduri6/stargazers-reloaded/blob/228e8...

Query 2: https://github.com/pchunduri6/stargazers-reloaded/blob/228e8...

PaulHoule · on Oct 18, 2023

Did you compile accuracy, F1 numbers, or anything like that? Do you have quantitative comparisons of results you got w/ different models?

jarulraj · on Oct 18, 2023

As we do not have ground truth, we only qualitatively checked for accuracy -- no quantitative metrics. We did find a significant drop in accuracy with GPT 3.5 as opposed to GPT 4.

Are you measuring accuracy with data wrangling prompts? Would love to learn more about that.

PaulHoule · on Oct 18, 2023

Everything I do now is classification and AUC-ROC is my metric. For your problem my first thought is an up-down accuracy metric, but the tricky problem you might have is "do you accept both 'United States' and 'USA' as a correct answer?" and the trouble dealing with that is one reason I stick to classification problems.

I'm skeptical of any claim that "A works better than B" without some numbers to back it up.