
> I am developing an evaluation suite so I can keep watching the progress in a systematic way..

Sounds like something that should be published on GitHub.




Open benchmarks are vulnerable to saturation. I think benchmarks should have an embargo period, during which only 3% of the question-answer pairs are released, with an explicit warning not to use them once 3 months have passed since release.
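
To make that concrete, here is a rough sketch of how such a split could be done. Everything in it is my own assumption rather than an existing tool: the names `is_public`, `split`, and `public_split_is_stale`, the salt string, hashing the question ID to pick the public subset, and treating "3 months" as 90 days.

```python
import hashlib
from datetime import date, timedelta

PUBLIC_FRACTION = 0.03            # the 3% released during the embargo
USE_WINDOW = timedelta(days=90)   # "3 months", approximated as 90 days (assumption)


def is_public(question_id: str, salt: str = "benchmark-v1") -> bool:
    """Deterministically place roughly 3% of items in the public split.

    Hashing the ID (with a hypothetical salt) keeps the assignment stable
    across runs without storing a separate list.
    """
    digest = hashlib.sha256(f"{salt}:{question_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < PUBLIC_FRACTION


def split(items):
    """Return (public, private) lists of question-answer items."""
    public = [q for q in items if is_public(q["id"])]
    private = [q for q in items if not is_public(q["id"])]
    return public, private


def public_split_is_stale(released_on: date, today: date) -> bool:
    """True once the public sample is past the suggested use window."""
    return today - released_on > USE_WINDOW
```

The public fraction is only approximately 3% because it depends on how the hashes fall; an exact count would need a seeded shuffle instead.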



