I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:
"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."
... because on GPT 3.5 the answer begins like the below and then gets worse:
"Let's break down the process step by step:
Initially, you have 50 red balls and 50 blue balls in the container.
1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."
If I were hiring interns this dumb, I'd be in trouble.
EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.
This is such a flawed puzzle. And GPT 4 answers it correctly. It is a long answer, but the last sentence is "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."
The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.
> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.
If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, identifying whether or not there exists a closed-form solution, and even knowing these terms are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they apply those skills regularly, which is a function of specific occupations.
GPT 4 goes into detail about one example scenario, which most humans wouldn't do, but its answer is technically correct since it says it depends on the order.
Its answer isn't correct; this isn't a possible ending scenario:
- *Ending Scenario:*
- Red Balls (RB): 0 (all have been drawn)
- Blue Balls (BB): 50 (none have been drawn)
- White Balls (WB): 0 (since no blue balls were drawn, no white balls were added)
- Total Balls: 50
> but its answer is technically correct since it says it depends on the order.
It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong.
I asked GPT4 and it gave a similar response. So then I asked my wife, and she said, "do you want more white balls at the end or not?" And I realized that as a CS or math question we assume the draw is random. Other people assume that you're picking which ball to draw.
So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."
I asked for a detailed probabilistic analysis and it gave a very simplified one. And then it basically said that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
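For what it's worth, the Monte Carlo approach really is only a few lines. Here's a rough sketch in Python (my own, not the code ChatGPT produced, and assuming the draws are uniformly random):

    import random

    def simulate(draws=100, red=50, blue=50):
        balls = ["R"] * red + ["B"] * blue
        for _ in range(draws):
            ball = balls.pop(random.randrange(len(balls)))
            if ball == "B":
                balls.extend(["W", "W"])  # drew a blue: put two white balls back
        return balls.count("R"), balls.count("B"), balls.count("W")

    runs = [simulate() for _ in range(100_000)]
    for i, color in enumerate(["red", "blue", "white"]):
        print(color, "left on average:", sum(run[i] for run in runs) / len(runs))

Every run ends in a different state, so the averages are about the best single-number summary you can give.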
I don't understand the question. Surely the answer depends on which order you withdraw the balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected number of balls of each color at the end of the process?
Seems like quite a difficult question to compute exactly.
I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a Monte Carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
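For the record, the exact calculation is doable with a small dynamic program over container states. A sketch of what that might look like in Python (my own sketch, not something GPT produced, and again assuming uniformly random draws):

    from collections import defaultdict

    def exact_distribution(draws=100, red=50, blue=50):
        # Maps a container state (red, blue, white) to its probability.
        dist = {(red, blue, 0): 1.0}
        for _ in range(draws):
            nxt = defaultdict(float)
            for (r, b, w), p in dist.items():
                total = r + b + w
                if total == 0:  # can't actually happen with these parameters; keep the mass anyway
                    nxt[(r, b, w)] += p
                    continue
                if r:
                    nxt[(r - 1, b, w)] += p * r / total      # drew a red
                if b:
                    nxt[(r, b - 1, w + 2)] += p * b / total  # drew a blue, two whites go back in
                if w:
                    nxt[(r, b, w - 1)] += p * w / total      # drew a white
            dist = dict(nxt)
        return dist

    dist = exact_distribution()
    for name, idx in [("red", 0), ("blue", 1), ("white", 2)]:
        print("expected", name, "left:", sum(state[idx] * p for state, p in dist.items()))

The state space stays tiny because the remaining white count is pinned down by how many reds and blues have been drawn so far, so this finishes in well under a second.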
>I don't understand the question. Surely the answer depends on which order you withdraw the balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected number of balls of each color at the end of the process?
These are very good questions that anyone with the ability to reason would ask if given this problem.
You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.
Instead, collaborate with it, while giving it the appropriate tools to help you.
I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.
It can reason better than most humans put into the same situation.
This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)
With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!
Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.
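To make the "distribution, not a single number" point concrete without Mathematica, tallying end states from a quick simulation shows the spread directly. A rough Python sketch (mine, not the Mathematica script, and assuming uniformly random draws):

    from collections import Counter
    import random

    def simulate(draws=100, red=50, blue=50):
        balls = ["R"] * red + ["B"] * blue
        for _ in range(draws):
            ball = balls.pop(random.randrange(len(balls)))
            if ball == "B":
                balls.extend(["W", "W"])  # blue drawn: two whites go back in
        return balls.count("R"), balls.count("B"), balls.count("W")

    # Tally how often each (red, blue, white) end state occurs.
    trials = 50_000
    outcomes = Counter(simulate() for _ in range(trials))
    for state, count in outcomes.most_common(10):
        print(state, round(count / trials, 4))

The printout makes it obvious that you get a spread of end states rather than one answer.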
A human could get a valid end state most of the time, whereas GPT-4 seems to mess up more often than it gets it right, based on the examples posted here. So to me it seems like GPT-4 is worse than humans.
GPT-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.
I disagree. Don't assume "most humans" are anything like Silicon Valley startup developers. Most developers out there in the wild would definitely struggle to solve problems like this.
For example, a common criticism of AI-generated code is the risk of introducing vulnerabilities.
I just sat in a meeting for an hour, literally begging several developers to stop writing code vulnerable to SQL injection! They just couldn't understand what I was even talking about. They kept trying to use various ineffective hacky workarounds ("silver bullets") because they just didn't grok the problem.
>It can reason better than most humans put into the same situation.
On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.
The equivalent of what current-gen LLMs do is an oral examination. Picture standing in the middle of a room surrounded by subject matter experts grilling you for your knowledge of various random topics. You have no tools, no calculator, no pencil and paper.
You’re asked a question and you just have to spit out the answer. No option to backtrack, experiment, or self-correct.
“Translate this to Hebrew”.
“Is this a valid criticism of this passage from a Platonic perspective?”
“Explain counterfactual determinism in Quantum Mechanics.”
“What is the cube root of 74732?”
You would fail all of these. The AI gets 3 of 4 correct.
Tell me, who’s smarter?
Is it you because of your preconceptions, or because of real superiority?
These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.
There's a couple of scenarios, which depend on when you run out of blue balls:
1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.
2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.
2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.
Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.
> 2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.
... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.
Just because it fails one test in a particular way doesn’t mean it lacks reasoning entirely. It clearly does have reasoning, based on all the benchmarks it passes.
You are really trying to make it not have reasoning for your own benefit
> You are really trying to make it not have reasoning for your own benefit
This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had an actual ability to reason, as opposed to just generating a continuation of your prompt, the quality of the prompt wouldn't matter as much.
GPT 4 still does a lot of dumb stuff on this question; you see several people post outright wrong answers and say "Look how gpt-4 solved it!". That happens quite a lot in these discussions, so it seems like the magic to get gpt-4 to work is that you just don't check its answers properly.
I've had to work with imperfect machines a lot in my recent past. Just because it sometimes breaks doesn't mean it's useless. But you do have to keep your eyes on the ball!
I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way
There's a reason you see people walking around in hard hats and steel toed boots in some companies. It's not because everything works perfectly all the time!
Seems like it reasons its way to this answer at the end to me:
Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?
Took a bit of massaging, and I enabled the Data Analysis plugin, which lets it write Python code and run it. It looks like the simulation code is correct though.
"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."
... because on GPT 3.5 the answer begins like the below and then gets worse:
"Let's break down the process step by step:
Initially, you have 50 red balls and 50 blue balls in the container.
1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."
If I was hiring interns this dumb, I'd be in trouble.
EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.