Hacker News
atleastoptimal
30 days ago
on:
Measuring AI Ability to Complete Long Tasks
They should publish 95% and 99% versions of the graphs; otherwise it's hard to tell whether the failure cases will remain in the elusive category of "stuff humans can do easily but LLMs trip up on despite scaling."