
Well, when training an AI we make sure to include enough training data for it to pass some evaluation, but then we test it on things that were not in the training data to make sure it isn't "overfitted".
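
For what it's worth, here is a minimal sketch of that held-out evaluation, using scikit-learn with a placeholder dataset and classifier (nothing LLM-specific, just the general idea):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_digits(return_X_y=True)
    # Keep 20% of the data out of training entirely.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf = DecisionTreeClassifier().fit(X_train, y_train)

    # A large gap between these two numbers is the classic sign of overfitting.
    print("train accuracy:   ", clf.score(X_train, y_train))
    print("held-out accuracy:", clf.score(X_test, y_test))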



I don't think the parent meant training specifically on that test data, but rather on that kind of task. Think more "an LLM trained only on images of art" doing badly on this test, but "an LLM trained on images of art and IQ puzzles" doing better on it.

It's kind of like asking what even is IQ if you can learn how to solve Mensa puzzles and improve your score. Does it mean you're more intelligent?


Oh, I see.

One guess about GPT-4 is that it's an amalgamation of models. You would have various models trained on more specific domains, like the ones you mentioned; you ask them all to start answering the query, pick whichever one produces the better result, and present that to the user.
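
A hypothetical sketch of that "ask them all, keep the best" idea; the stand-in "models" (plain functions) and the scoring function are made up for illustration:

    from typing import Callable, Dict

    def best_of_specialists(query: str,
                            specialists: Dict[str, Callable[[str], str]],
                            score: Callable[[str, str], float]) -> str:
        # Ask every specialist for an answer...
        candidates = {name: generate(query) for name, generate in specialists.items()}
        # ...then keep whichever answer the quality score (a reward model,
        # a heuristic, human preference data, ...) likes best.
        best = max(candidates, key=lambda name: score(query, candidates[name]))
        return candidates[best]

    answer = best_of_specialists(
        "What is 2 + 2?",
        {"math": lambda q: "4", "art": lambda q: "a nice painting"},
        score=lambda q, a: 1.0 if a.isdigit() else 0.0,
    )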

Alternatively, you can have a decider model that knows what kind of query goes to which specialised model, and have that act as a "hidden layer", whether at the application level or inside the "neuron layers".
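
An equally hypothetical sketch of that decider/router alternative; the keyword-based classify() is just a stand-in for what would really be a trained model:

    from typing import Callable, Dict

    # Stand-in specialists; in practice these would be separate models.
    SPECIALISTS: Dict[str, Callable[[str], str]] = {
        "code": lambda q: "a code-focused answer",
        "math": lambda q: "a step-by-step derivation",
        "general": lambda q: "a general chatty answer",
    }

    def classify(query: str) -> str:
        # Placeholder for the decider model.
        if "def " in query or "bug" in query:
            return "code"
        if any(ch.isdigit() for ch in query):
            return "math"
        return "general"

    def route_and_answer(query: str) -> str:
        label = classify(query)
        specialist = SPECIALISTS.get(label, SPECIALISTS["general"])
        return specialist(query)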


If you've checked out the docs for the Assistants API, you can intuit that there is a higher-level system that decides which subsystems to use to respond. The assistant determines whether to use a tool (vision, code interpreter, search, retrieval), as well as which code or text language model to use to generate the response.
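
To be clear, this is not OpenAI's actual implementation; it's just a toy illustration of a top-level planner choosing a tool before the reply is written, with every name here (choose_tool, the tool functions, the final formatting) invented for the example:

    from typing import Callable, Dict, Optional

    # Stand-in tools; the real subsystems are obviously far more involved.
    TOOLS: Dict[str, Callable[[str], str]] = {
        "code_interpreter": lambda q: "stdout from running the generated code",
        "retrieval": lambda q: "snippets pulled from uploaded files",
    }

    def choose_tool(query: str) -> Optional[str]:
        # Placeholder for the higher-level decision system.
        if "run" in query or "compute" in query:
            return "code_interpreter"
        if "document" in query or "file" in query:
            return "retrieval"
        return None

    def respond(query: str) -> str:
        tool = choose_tool(query)
        context = TOOLS[tool](query) if tool else ""
        # The tool output is folded back into the prompt for whichever
        # language model writes the user-facing reply.
        return f"answer to {query!r} using context: {context!r}"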


Your last statement reminds me of the folktale of John Henry: generalized capability versus specialization.

Under a narrow range of tests, specialization is pretty much guaranteed to win. In every case I know of, specialization comes at a cost to general capability. It's like the old quip, "pick any two: fast, cheap, good". The more you pull at one category, the more the costs rise in the others.


That's the idea, yes. However, no one but OpenAI knows exactly what ChatGPT was trained on. In fact, the dataset it was trained on is so vast that they probably can't tell either whether it contains any given question. IIRC, last week I saw a study where GPT-4 could solve some LeetCode problems given only the problem number, with no description. A clear example of overfitting.


There was more information provided. And it's possibly not even overfitting. See https://news.ycombinator.com/item?id=38205153


These tests should be conducted on new questions. And if we as humans no longer have the ability to create original questions, then maybe we should just retire.



