
I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:

"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."

... because on GPT 3.5 the answer begins like the below and then gets worse:

"Let's break down the process step by step:

Initially, you have 50 red balls and 50 blue balls in the container.

1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."

If I were hiring interns this dumb, I'd be in trouble.

EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.



This is such a flawed puzzle. And GPT-4 answers it correctly. It is a long answer, but the last sentence is: "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."


The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.


> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.

If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, identifying whether a closed-form solution exists, and even knowing these terms are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they apply those skills regularly, which is a function of specific occupations.


https://pastebin.com/r9bNi8GD

GPT-4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer, as it said the result depends on the order.


Its answer isn't correct; this isn't a possible ending scenario:

*Ending Scenario:*
- Red Balls (RB): 0 (all have been drawn)
- Blue Balls (BB): 50 (none have been drawn)
- White Balls (WB): 0 (since no blue balls were drawn, no white balls were added)
- Total Balls: 50


> but it is technically correct answer as it said it depends on the order.

It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong:

> This is one possible scenario.


But the reasoning is total garbage, right?

It says the number of blue balls drawn is x and the number of red balls drawn is y, and then asserts x + y = 100, which is wrong, since white balls drawn also count toward the 100 draws.

Then it proceeds to "solve" an equation which reduces to x = x to conclude x = 0.

It then uses that to "prove" that y = 100, which is a problem as there are only 50 red balls in the container and nothing causes any more to be added.

It's like "mistakes bad students make in Algebra 1".


I asked GPT-4 and it gave a similar response. So then I asked my wife, and she said, "Do you want more white balls at the end or not?" And I realized that, as a CS or math question, we assume the draw is random; other people assume you're picking which ball to draw.

So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."

I asked for a detailed probabilistic analysis and it gave a very simplified analysis, and then basically said that a Monte Carlo approach would be easier. That actually sounds more like most people I know than most people I know. :-)
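
Here's roughly what that suggestion amounts to: a minimal Monte Carlo sketch in Python. The 50/50 starting counts and 100 draws come from the puzzle, the uniformly random draw is the clarification I gave it, and the rest (names, run count) is just my own illustration:

    import random
    from collections import Counter

    def simulate_once(red=50, blue=50, draws=100):
        # One random run: each draw is uniform over the balls currently in
        # the container; drawing a blue ball adds two white balls back.
        balls = {"red": red, "blue": blue, "white": 0}
        for _ in range(draws):
            total = sum(balls.values())
            if total == 0:
                break  # container emptied early (safety check)
            pick = random.randrange(total)
            if pick < balls["red"]:
                colour = "red"
            elif pick < balls["red"] + balls["blue"]:
                colour = "blue"
            else:
                colour = "white"
            balls[colour] -= 1
            if colour == "blue":
                balls["white"] += 2
        return (balls["red"], balls["blue"], balls["white"])

    # Estimate the distribution of (red, blue, white) end states.
    runs = 100_000
    results = Counter(simulate_once() for _ in range(runs))
    for state, count in results.most_common(5):
        print(state, count / runs)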


I don't understand the question. Surely the answer depends on which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected number of balls of each color at the end of the process?

Seems like quite a difficult question to compute exactly.

I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a Monte Carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.
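
For what it's worth, the exact answer is tractable without Monte Carlo: a small dynamic program over the container state (red, blue, white remaining) gives the full distribution. A minimal Python sketch, assuming uniformly random draws as above:

    from collections import defaultdict

    def exact_distribution(red=50, blue=50, draws=100):
        # dist maps (red, blue, white) remaining -> probability.
        dist = {(red, blue, 0): 1.0}
        for _ in range(draws):
            nxt = defaultdict(float)
            for (r, b, w), p in dist.items():
                total = r + b + w
                if total == 0:  # empty container: nothing left to draw
                    nxt[(r, b, w)] += p
                    continue
                if r:
                    nxt[(r - 1, b, w)] += p * r / total
                if b:
                    nxt[(r, b - 1, w + 2)] += p * b / total  # blue adds 2 white
                if w:
                    nxt[(r, b, w - 1)] += p * w / total
            dist = dict(nxt)
        return dist

    dist = exact_distribution()
    for i, colour in enumerate(("red", "blue", "white")):
        print(colour, "expected left:", sum(s[i] * p for s, p in dist.items()))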


>I don't understand the question. Surely the answer depends on which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected number of balls of each color at the end of the process?

These are very good questions that anyone with the ability to reason would ask if given this problem.


"You're holding it wrong."

You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of a room with no tools and getting grilled in an oral examination of their knowledge.

Instead, collaborate with it, while giving it the appropriate tools to help you.

I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.

I even got it to save a video of the plot rotating: https://streamable.com/2aphbz


AI can reason! Just not reasonably!


It can reason better than most humans put into the same situation.

This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)

With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!

Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.


A human could get a valid end state most of the time; based on the examples posted here, GPT-4 seems to mess up more often than it gets it right. So to me it seems like GPT-4 is worse than humans.

GPT-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.


> valid end state most of the time

I disagree. Don't assume "most humans" are anything like Silicon Valley startup developers. Most developers out there in the wild would definitely struggle to solve problems like this.

For example, a common criticism of AI-generated code is the risk of introducing vulnerabilities.

I just sat in a meeting for an hour, literally begging several developers to stop writing code vulnerable to SQL injection! They just couldn't understand what I was even talking about. They kept trying to use various ineffective hacky workarounds ("silver bullets") because they just didn't grok the problem.
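
For the record, the non-hacky fix is just parameterized queries: the database driver binds the value instead of splicing it into the SQL string. A minimal sketch with Python's sqlite3, using a made-up table and query purely for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

    user_input = "alice' OR '1'='1"  # attacker-controlled string

    # Vulnerable pattern (don't do this): the input becomes part of the SQL.
    #   query = f"SELECT email FROM users WHERE name = '{user_input}'"

    # Safe pattern: the value is bound as a parameter, never parsed as SQL.
    rows = conn.execute(
        "SELECT email FROM users WHERE name = ?", (user_input,)
    ).fetchall()
    print(rows)  # [] -- the malicious string matches no name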

I've found GPT 4 outperforms median humans.


>It can reason better than most humans put into the same situation.

On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.


The equivalent of what current-gen LLMs do is an oral examination. Picture standing in the middle of a room surrounded by subject matter experts grilling you for your knowledge of various random topics. You have no tools, no calculator, no pencil and paper.

You’re asked a question and you just have to spit out the answer. No option to backtrack, experiment, or self correct.

“Translate this to Hebrew”.

“Is this a valid criticism of this passage from a Platonic perspective?”

“Explain counterfactual determinism in Quantum Mechanics.”

“What is the cube root of 74732?”

You would fail all of these. The AI gets 3 of 4 correct.

Tell me who’s smarter?

You, because of your preconceptions? Or the AI, because of real superiority?

Your model for human intelligence is probably more like this scene: https://youtu.be/KvMxLpce3Xw?si=Suy0Cj_pL0vru5Uj

The reality is the opposite. The AI could answer questions in this scenario but no nonfictional human could.


It's just a completely baseless comparison the way you are going about it, and you are mistaking intelligence for the recitation of facts.

>“Is this a valid criticism of this passage from a Platonic perspective?”

I haven't seen AI answering questions like this correctly at all.


This is what I got on a basically brand new OpenAI account: https://chat.openai.com/share/5199c972-478d-406f-9092-061a6b...

All told, I'd say it's a decent answer.

Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...

These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.


GPT-4:

[...] the composition of what's left in the container.

There's a couple of scenarios, which depend on when you run out of blue balls:

1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.

2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.

Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.


Yeah, that's still pretty much nonsense, isn't it?

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.


Just because it fails one test in a particular way doesn’t mean it lacks reasoning entirely. It clearly does have reasoning, based on all the benchmarks it passes.

You are really trying to make it not have reasoning for your own benefit.


> You are really trying to make it not have reasoning for your own benefit

This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had an actual ability to reason, as opposed to just generating a continuation of your prompt, the quality of the prompt wouldn't matter as much.


Then why does it do so well on all the reasoning benchmarks?


GPT 3.5 is VERY dumb when compared to GPT 4. Like, the difference is massive.


GPT-4 still does a lot of dumb stuff on this question; you can see several people post an outright wrong answer and say "Look how GPT-4 solved it!" That happens quite a lot in these discussions, so it seems like the magic to getting GPT-4 to work is that you just don't check its answers properly.


It's still a tool after all.

I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!


> It's still a tool after all.

I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way.


A smart hammer that sometimes unavoidably hits your thumb. How smart!


There's a reason you see people walking around in hard hats and steel toed boots in some companies. It's not because everything works perfectly all the time!


Yeah, but let's not pretend regular hammers don't exist and probably already did the job fine, and more safely.


An argument as old as rocks! ;-)

https://www.youtube.com/watch?v=nyu4u3VZYaQ


I ran this through GPT-4 Advanced Data Analytics version: https://chat.openai.com/share/b84feb03-22ed-4231-be41-cdb725...

Seems to me like it reasons its way to this answer at the end: "Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?"


https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-2843e6...

Took a bit of massaging, and I enabled the Data Analysis plugin, which lets it write Python code and run it. It looks like the simulation code is correct, though.


>Let's assume you draw x blue balls in 100 draws. Then you would have drawn 100−x red balls.

Uhm.


I came at it from a different angle. The simulation code in my case had a bug which I needed to point out. Then it got a similar final answer.



