They should do a 95% and 99% version of the graphs, otherwise it's hard to ascer...

		atleastoptimal 30 days ago \| parent \| context \| favorite \| on: Measuring AI Ability to Complete Long Tasks They should do a 95% and 99% version of the graphs, otherwise it's hard to ascertain whether the failure cases will remain in the elusive "stuff humans can do easily but LLM's trip up despite scaling"