The failure modes from 2023 are identical to those today. I agree with the now-deleted post that there has been essentially no progress. Benchmark scores (if you think they are a relevant proxy for anything) have obviously increased, but that is a jump from, say, 50% to 90% (and probably a less drastic one than that), not the jump from 99% to 99.999% you'd need for real assurance that a widely used system won't make mistakes.
Like in 2023, everything is still a demo; there's nothing that could be considered reliable.