
How successful was this effort? How did you know how successful it was?



Great question! We iterated on the prompt for several days and manually verified the results for ~100 users.

The results were pretty good: https://gist.github.com/gaurav274/506337fa51f4df192de78d1280...

Another interesting aspect was the money spent on LLMs. We could have directly used GPT-4 to generate the "golden" table; however, it's a bit expensive, costing about $60 to process the information of 1000 users. To maintain accuracy while cutting costs, we set up an LLM cascade in the EvaDB query that runs GPT-3.5 before GPT-4, giving an 11x cost reduction (about $5.50).

Query 1: https://github.com/pchunduri6/stargazers-reloaded/blob/228e8...

Query 2: https://github.com/pchunduri6/stargazers-reloaded/blob/228e8...
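To make the cascade concrete, here's a rough sketch of the idea in plain Python (this is not the actual EvaDB query linked above; the model names, the standard OpenAI Python client, and the looks_reliable acceptance check are just placeholders for illustration):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""

    def looks_reliable(answer: str) -> bool:
        # Placeholder acceptance check: non-empty and not an explicit refusal.
        a = answer.strip().lower()
        return bool(a) and "i don't know" not in a

    def cascade(prompt: str) -> str:
        cheap = ask("gpt-3.5-turbo", prompt)
        if looks_reliable(cheap):
            return cheap              # most rows stop here, which is where the savings come from
        return ask("gpt-4", prompt)   # escalate only the hard cases

The acceptance check is the main knob: a looser check saves more money but lets more GPT-3.5 mistakes through. The linked queries express this same routing inside EvaDB.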


Did you compile accuracy, F1 numbers, or anything like that? Do you have quantitative comparisons of results you got w/ different models?


Since we don't have ground truth, we only checked accuracy qualitatively; no quantitative metrics. We did see a significant drop in accuracy with GPT-3.5 compared to GPT-4.

Are you measuring accuracy on data-wrangling prompts? Would love to learn more about that.


Everything I do now is classification, and AUC-ROC is my metric. For your problem, my first thought is a simple up/down accuracy metric, but the tricky part is deciding whether to accept both 'United States' and 'USA' as a correct answer. The trouble of dealing with that kind of normalization is one reason I stick to classification problems.
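For what it's worth, the usual workaround is to normalize free-text answers to canonical labels before scoring. A small illustrative sketch in Python (the alias table here is made up and obviously not exhaustive):

    # Map known aliases to a canonical label so "USA" and "United States"
    # count as the same answer when computing exact-match accuracy.
    ALIASES = {
        "usa": "united states",
        "u.s.": "united states",
        "united states of america": "united states",
    }

    def normalize(answer: str) -> str:
        a = answer.strip().lower()
        return ALIASES.get(a, a)

    def exact_match_accuracy(predictions, labels):
        pairs = list(zip(map(normalize, predictions), map(normalize, labels)))
        return sum(p == l for p, l in pairs) / len(pairs)

    print(exact_match_accuracy(["USA", "Canada"], ["United States", "Germany"]))  # 0.5

That only pushes the problem into maintaining the alias table, which is exactly the hassle being described.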

I'm skeptical of any claim that "A works better than B" without some numbers to back it up.



