
There are plenty of results supporting my assertion, but the tests must be carefully designed. Of course, LLMs are not databases that store exact answers, so it's not enough to ask one something it hasn't seen if it has seen something similar (as is likely the case with your programming language).

One benchmark that I track closely is ConceptARC, which aims to test generalization and abstraction capabilities.

Here is a very recent result that uses the benchmark: https://arxiv.org/abs/2311.09247. Humans correctly solved 91% of the problems, GPT-4 solved 33%, and GPT-4V did considerably worse than GPT-4.
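To give a sense of why these tasks are hard for a text-only model: ARC-style problems are small colored grids, which get serialized into text before being shown to the LLM. Below is a minimal sketch of that kind of serialization; the grid values, the rule (fill the grid with the marked color), and the prompt wording are all illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of an ARC-style grid task serialized as text.
# The grids and prompt format here are made up for illustration; they are
# not taken from the ConceptARC paper.

def grid_to_text(grid):
    """Render a 2D grid of color indices as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# One demonstration pair: a single colored cell, and the whole grid
# filled with that color (a toy transformation rule).
example_input = [
    [0, 0, 0],
    [0, 3, 0],
    [0, 0, 0],
]
example_output = [
    [3, 3, 3],
    [3, 3, 3],
    [3, 3, 3],
]

prompt = (
    "Infer the transformation rule from the example.\n"
    "Input:\n" + grid_to_text(example_input) + "\n"
    "Output:\n" + grid_to_text(example_output)
)
print(prompt)
```

A human sees the same task as a picture; the model sees only this flattened digit soup, which is part of why the text framing of the prompt matters so much.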




I wouldn't be surprised if GPT-4 isn't very good at visual patterns, given that it's trained on text.

Look at the actual prompt in figure 2. I doubt humans would get a 91% score on that.




