GPT-4 performs significantly worse on coding problems not in its training data (twitter.com/chhillee)
344 points by atleastoptimal on March 24, 2023 | 282 comments



This has been my experience. I’m really impressed by how well GPT-4 seems to be able to interpolate between problems heavily represented in the training data to create what feels like novelty, e.g. creating a combination of Pong and Conway’s Game of Life, but it doesn’t seem to be good at extrapolation.

The type of work I do is highly niche. I’ve recently been working on a specific problem for which there are probably at most a hundred implementations running on production systems, all of them highly proprietary. I would be surprised if there were any implementations in GPT’s training set. With that said, this problem is not actually that complicated. A rudimentary implementation can be done in ~100 lines of code.

I asked GPT-4 to write me an implementation. It knew a decent amount about the problem (probably from Wikipedia). If it was actually capable of something close to reasoning it should have been able to write an implementation, but when it actually started writing code it was reluctant to write more than a skeleton. When I pushed it to implement specific details it completely fell apart and started hallucinating. When I gave it specific information about what it was doing wrong it acknowledged that it made a mistake and simply gave me a new equally wrong hallucination.

The experience calmed my existential fears about my job being taken by AI.


This exact scenario is what I described to a friend of mine who is an AI researcher.

He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

My opinion was similar to yours. I felt that the hallucinating the AI does falls short of true extrapolative thought.

I said this because humans don’t truly have access to infinite knowledge, and even when they do, they can’t process all of it. Adding endless information for the AI to feed on doesn’t seem like the solution to figuring out true intelligence. It’s just more of the same hallucinating.

Yet despite lacking knowledge, we humans still come up with consistently original thoughts and expressions of our intelligence daily. With limited information, our minds create new representations of understanding. This seems to be impossible for ChatGPT.

I could be completely wrong, but that discussion solidified for me that my role as a dev still has at least a couple more decades of shelf life left.

It’s nice to hear that others are reaching similar conclusions.


Current LLMs decode in a greedy manner, token by token. In some cases this is good enough - namely for continuous tasks, but in other cases the end result means the model has to backtrack and try another approach, or edit the response. This doesn't work well with the way we are using LLMs now, but could be fixed. Then you'd get a model that can do discontinuous tasks as well.

>> Write a response that includes the number of words in your response.

> This response contains exactly sixteen words, including the number of words in the sentence itself.

It contains 15 words.

The model would have to plan everything before outputting the first token if it were to solve the task correctly. Works if you follow up with "Explicitly count the words", let it reply, then "Rewrite the answer".
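
The count-then-rewrite follow-up described above can be sketched outside the model. This is a minimal sketch, assuming whitespace-separated words and a hypothetical sentence template with an `{n}` placeholder; it says nothing about how an LLM decodes internally, it only shows the check-and-revise loop that single-pass generation lacks.

```python
from typing import Optional

def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

def self_consistent_response(template: str, max_tries: int = 10) -> Optional[str]:
    """Fill {n} with a claimed word count, check it, and retry with the measured count.

    `template` is a hypothetical sentence template containing a single {n} placeholder.
    """
    claimed = 1
    for _ in range(max_tries):
        candidate = template.format(n=claimed)
        actual = count_words(candidate)
        if actual == claimed:
            return candidate
        claimed = actual  # revise: rewrite using the count that was just measured
    return None

quoted = ("This response contains exactly sixteen words, including the "
          "number of words in the sentence itself.")
print(count_words(quoted))
print(self_consistent_response(
    "This response contains exactly {n} words, including the number itself."))
```

The first attempt claims the wrong count, the check catches it, and the second attempt is consistent, which mirrors the "Explicitly count the words" then "Rewrite the answer" prompts.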


> but could be fixed

How? The problem has been known for a while; for example, this article [0] mentions it (as Chain of Thought reasoning). You could think that just having a scratchpad of tokens is enough - you can arguably plan, backtrack and rewrite there [1], right? But this doesn't really work, at least not yet - maybe because it wasn't trained for that - and maybe ChatGPT's massive logs (probably available only to OpenAI) can help. But the Microsoft report [2] suggests we need a different architecture and/or algorithms? They mention lack of planning and retrospective thinking as a huge problem for GPT-4. Maybe you know some articles on ideas for how to fix this? Backtracking and trying again seem to be linked to human thought - and could very well give us AGI.

[0] https://arxiv.org/abs/2201.11903

[1] https://www.reddit.com/r/ChatGPT/comments/120fi8e/chatgpt_4_...

[2] https://arxiv.org/abs/2303.12712


You may be shocked to hear this, but Dijkstra’s shortest path algorithm is the technical answer to this question. We just don’t use it because it’s expensive.


Language chains or tool use where it can also call on itself to solve subproblems. If you don't have to do just one round of LLM interaction you can do complex stuff.


Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed. So I expect autoregressive generation to remain the default until the recomputation effort can be significantly reduced or hardware advances make the cost bearable.


>> Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed.

I can't imagine how training on masked tokens can "easily" solve backtracking, even in theory. Do you have some literature I could read on this?


Discrete diffusion with rewriting can work well. It feels loosely similar to backtracking, if you assume n_steps large enough - need to be able to rewrite any non-provided position though I think (not all setups do this). Downside is the noise in discrete diffusion (in simplest case randomizing over all vocabulary space) is pretty harsh and makes things very difficult practically. Don't have an exact reference on the relationship, but feels similar to backtracking type mechanics in my experience. I found things tend to "lock in" quickly once a good path is found, which feels a lot like pathfinding to me.

Some early personal experiments with adding "prefix-style" context by a cross-attention (in the vein of PerceiverAR) seemed like it really helped things along, which would kind of point to search-like behavior as well.

Probably the closest theory I can think of is orderless NADE, which builds on the "all orders" training of https://arxiv.org/abs/1310.1757 , which in my opinion closely relates to BERT and all kinds of other masked language work. There's a lot of other NAR language work I'm skipping here that may be more relevant...

On discrete diffusion:

Continuous diffusion for categorical data shows some promise "walking the boundary" between discrete and continuous diffusion https://arxiv.org/abs/2211.15089 , personally like this direction a lot.

If you have a pre-made embedding space, SSD-LM is a straightforward method https://arxiv.org/abs/2210.17432

SUNDAE worked well for translation https://arxiv.org/abs/2112.06749 and many other tasks.

My own contribution, SUNMASK, worked reasonably well for symbolic music/small datasets (https://openreview.net/forum?id=GIZlheqznkT), but really struggled with anything text or moderately large vocabulary, maybe due to training/compute/arch issues. Personally think large vocabulary discrete diffusion (thinking of the huge vocabs in modern universal LM work) will continue to be a challenge.

Decoding strategies:

As a general aside, I still don't understand how many of the large generative tools aren't exposing more decoding strategies, or hooks to implement them. Beam search with stochastic/diverse group objectives, per-step temperature/top-k/top-p, hooks for things like COLD decoding https://arxiv.org/abs/2202.11705, minimum Bayes risk https://medium.com/mlearning-ai/mbr-decoding-get-better-resu..., check/correct systems during decode based on simple domain rules and previous outputs, etc.

These kinds of decoding tools have always been a huge boost to model performance for me, and having access to add in these hooks to "big API models" would be really nice... though I guess you would need to limit/lock compute use since a full backtracking search would pretty swiftly crash most systems. Maybe the new "plugins" access from OpenAI will allow some of this.
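
To make the per-step temperature/top-k/top-p idea concrete, here is a minimal sketch in plain Python. It assumes the model exposes a dict of raw logits per token at each step (the function names and the toy logits are illustrative, not any vendor's API). Because the filter is called once per step, each parameter can be changed from one step to the next.

```python
import math
import random

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: prob} after temperature scaling, then top-k, then top-p filtering."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:  # keep the smallest prefix with mass >= top_p
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}

def sample(dist, rng=random):
    """Draw one token from a {token: prob} distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # hypothetical values
print(filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

A hook like this is cheap compared with a full backtracking search, since it only reshapes one distribution per generated token.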


Backtracking is easily solved with a shortest path algorithm. I don’t see any need for masking if you are simply maximizing likelihood of the entire sequence.
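
One way to read this claim: treat each prefix as a graph node and each token as an edge with cost -log p, so the cheapest path to an end-of-sequence token is the most likely whole sequence. A minimal sketch, assuming a hypothetical toy model given as a lookup table (`TOY_MODEL`, `toy_next_probs` and the probabilities are made up for illustration):

```python
import heapq
import math

EOS = "<eos>"

# Hypothetical toy model: next-token probabilities for each prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def toy_next_probs(prefix):
    """Every two-token prefix ends the sequence in this toy model."""
    if len(prefix) == 2:
        return {EOS: 1.0}
    return TOY_MODEL[prefix]

def greedy_sequence(next_probs):
    """Pick the single most likely token at every step."""
    prefix, prob = (), 1.0
    while not prefix or prefix[-1] != EOS:
        dist = next_probs(prefix)
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return list(prefix), prob

def most_likely_sequence(next_probs):
    """Uniform-cost (Dijkstra-style) search over prefixes, edge cost = -log p."""
    frontier = [(0.0, ())]  # (cost so far, prefix)
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == EOS:
            return list(prefix), math.exp(-cost)  # first finished pop is optimal
        for token, p in next_probs(prefix).items():
            if p > 0.0:
                heapq.heappush(frontier, (cost - math.log(p), prefix + (token,)))
    return None

print(greedy_sequence(toy_next_probs))
print(most_likely_sequence(toy_next_probs))
```

Greedy commits to "A" and ends with probability 0.33, while the search finds "B z" at 0.36. The catch is the frontier: with a real vocabulary it can grow exponentially with sequence length, which is the expense mentioned elsewhere in the thread.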


I don't think humans can do this either. What's the problem with producing a result and then fixing it? It's exactly how we do it.


> This exact scenario is what I described to a friend of mine who is an AI researcher. He was convinced that if we trained the AI on enough data, GPT-x would become sentient. My opinion was similar to yours. I felt like the hallucinating the AI does was insufficient in performing true extrapolating thought.

It turns out it isn’t just AIs that hallucinate; AI researchers do as well.


"researcher".


> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Is there enough data?

As I understand it, the latest large language models are trained on almost every piece of available text. GPT-4 is multimodal in part because there isn't an easy way to increase its dataset with more text. In the meantime, text is already quite information dense.

I'm not sure that future models will be able to train on an order of magnitude more information, even if the size of their training sets has a few more zeroes added to the end.


what about all the content not yet in text form (e.g. YouTube videos)?


The threshold for sentience is continually falling.

So he might be right but due to time and not due to improved performance.

I believe in the UK all vertebrates are considered sentient (by law, not science). That includes goldfish.

And good luck even getting a goldfish to reverse a linked list. Even after 1000 implementations are provided.



I don't think that when people commonly discuss sentience they mean to include goldfish. I don't think the legal definition (which probably exists due to external legal implications) has any bearing on the intellectual debate of AI sentience.


Sentience is just the capacity to experience feelings and sensations. Goldfish can do it and AI can’t (so far).


If I were talking about sentience I would definitely be including goldfish. What about them is so different to us that we would have sentience while they would not?


> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Not saying your friend is right or wrong, but imagine if civilization gives more information, in real time, to an AI system through sensors: will it be at least as sentient as the civilization? Seems like a sci-fi story, a competitor to G-d.


Isaac Asimov wrote a story along those lines, “The Last Question”, which he described as “by far my favorite story of all those I have written.” Full text here:

https://xpressenglish.com/our-stories/the-last-question/


Some versions of divinity (both from real-world beliefs and sci-fi/fantasy) have it being essentially a gestalt of either all the souls that have ever died, or all those alive now—a kind of "oversoul" or collective consciousness.

While that's an interesting thought experiment, I don't think it can meaningfully apply to any kind of AI we have the capability to make today, even if we could hook it up directly to all our knowledge. Information alone can't make something sentient; it requires a sufficiently complex and sophisticated information processing system, one that can reason about its knowledge and itself.


I’m not at all an expert on the topic, but from what I gathered LLMs are fundamentally limited in the kind of problems they can approximate. They can approximate any integrable function quite well, but we can only come up with limits on a case-by-case basis for non-integrable ones, and I believe most interesting problems are of this latter kind.

Correct me if I’m wrong, but doesn’t it mean that they can’t recursively “think”, on a fundamental basis? And sure I know that you can pass “show your thinking” to GPT, but that’s not general recursion, just “hard-coded to N iterations” basically, isn’t it? And thus no matter how much hardware we throw at it, it won’t be able to surpass this fundamental limit (and without proof, I firmly believe that for a GAI we do need the ability to basically follow through a train of thought)


How is it "hard-coded to N iterations"? We don't instruct the model how many lines of working it should show.

Obviously there is a limit to how much it can fit in the context, but that seems to be rising fast (went from 4k to 32k in not that long)


It fundamentally can’t recurse into a thought process. Let’s say I give you a symbol table where each symbol means something and ask you to “evaluate” this list of symbols. You can do that just fine, but even in theory not even GPT-10384 will be able to do that without changing the whole underlying model itself.


I don't understand the task. What does evaluating the list of symbols mean?

Do you mean you define a programming language/bytecode and then feed it into the model?

Here's an example where GPT-4 did this perfectly for a very simple language. This was my first attempt; I did not have to do any trial and error.

https://pastebin.com/4YA5wpie


Could you try writing even in this simple language a longer program? Just simply increase the input to 20x or something around that. I’m interested in whether it will break and if it does, at what length.


Interesting, it screwed up at step 160. I think it probably ran out of context, if I explicitly told it to output each step in a more compact way it might do better. Or if I had access to the 32k context length it would probably get 4x further.

Actually it might be worth trying to get it to output the original instructions again every 100 steps, so that the instructions are always available in the context. The ChatGPT UI still wouldn't let you output that much at once but the API would.
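
A reference evaluator makes it easy to find the step at which a model's trace first goes wrong. A minimal sketch, assuming a hypothetical three-instruction register language (`set`, `add`, `jnz`); the language from the pastebin is not reproduced here, and `claimed` stands in for a trace parsed from the model's output.

```python
def run(program, max_steps=10_000):
    """Execute a toy register program and return one (pc, registers) snapshot per step."""
    regs, pc, trace = {}, 0, []
    while pc < len(program) and len(trace) < max_steps:
        op, reg, arg = program[pc]
        if op == "set":      # set reg to a constant
            regs[reg] = arg
            pc += 1
        elif op == "add":    # add a constant to reg
            regs[reg] = regs.get(reg, 0) + arg
            pc += 1
        elif op == "jnz":    # jump to instruction `arg` if reg is non-zero
            pc = arg if regs.get(reg, 0) != 0 else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op}")
        trace.append((pc, dict(regs)))
    return trace

def first_divergence(expected, claimed):
    """Index of the first step where two traces differ, or None if they match."""
    for i, (want, got) in enumerate(zip(expected, claimed)):
        if want != got:
            return i
    if len(expected) != len(claimed):
        return min(len(expected), len(claimed))
    return None

# Hypothetical program: add 2 to acc three times.
program = [
    ("set", "n", 3),
    ("set", "acc", 0),
    ("add", "acc", 2),
    ("add", "n", -1),
    ("jnz", "n", 2),
]
expected = run(program)
print(len(expected), expected[-1])
```

Scaling the loop counter up gives the "20x longer" input without writing a longer program, and `first_divergence` reports the breaking point directly.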


If they aren't already, AIs will be posting content on social media apps. These apps measure the amount of attention you pay to each thing presented to you. If it's more than a picture or a video, but something interactive, then it could also learn how we interact with things in more complex ways. It also gets feedback from us through the comments section. Like biological mutations, AIs will learn which of its (at first) random novel creations we find utility in. It will then better learn what drives us and will learn to create and extrapolate at a much faster pace than us.


> If they aren't already, AIs will be posting content on social media apps.

No, people will be posting content on social media apps that they asked LLMs to write.

It may be done through a script, or API calls, but it's 100% at the instigation, direct or indirect, of a human.

LLMs have no ability to decide independently to post to social media, even if you do write code to give them the technical capability to make such posts.


With the new ChatGPT Plugins, it seems they may actually be able to make POST requests to social media APIs soon. It is likely that an LLM could have "I should post a tweet about this" in its training data.

Granted... currently it is likely humans that have written the code that the new Plugins are allowed to call -- but they have given ChatGPT the ability to execute rudimentary Python scripts and even ffmpeg so I think it is only a matter of time before one outputs a Tweet written by its own code.


> It is likely that an LLM could have "I should post a tweet about this" in its training data.

That only matters if a human has explicitly hooked it up so that when ChatGPT encounters that set of tokens, it executes the "post to Twitter" scripts.

ChatGPT doesn't comprehend the text it's producing, so without humans making specific links between particular bundles of text and the relevant plugin scripts, it will never "decide" to use them.


At a high level, all that would have to happen is a person gives GPT, or something like it, access to a social media page and tells it to post to it with the objective of getting the highest level of interaction and followers.


...which in no way grants GPT sapience, nor would it prove that it has it.

The human is still providing the capability to post, the timing script to trigger posting, and the specific heuristic to be used in determining how to choose what to post.


> Yet despite lacking knowledge, us humans still come up with consistently original thoughts and expressions of our intelligence daily.

I think there is some sampling bias in your observation ;-)


More data will only mean more inference. But at some unexpected moment, the newly created "senseBERT" breaks the barrier between intelligence and consciousness.


> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

It sounds like he doesn't even understand the basics of what GPT is, or what sentience is. GPT is an impressive manipulator/predictor of language, but we have evidence from all sorts of directions that there's more to sentience or consciousness than that.


I would like to propose a thought experiment concerning the realm of knowledge acquisition. Given that the scope of human imagination is inherently limited, it is inevitable that certain information will remain beyond our grasp; these are the so-called "known unknowns." In the event that an individual generates a piece of knowledge from this inaccessible domain, how might it manifest in our perception? It is likely that such knowledge would appear incomprehensible to us. Consequently, it is worth considering the possibility that the GPT model is not, in fact, experiencing hallucinations; rather, our human understanding is simply insufficient to fully grasp its output.


Yeah. Maybe when a baby says "gabadigoibygee", he is using an extremely efficient language that is too sophisticated for our adult brains to comprehend.

Yeah, maybe.


> In the event that an individual generates a piece of knowledge from this inaccessible domain, how might it manifest in our perception? It is likely that such knowledge would appear incomprehensible to us.

If what a person says cannot be comprehended by any other person, we usually have a special term for it.


But the hallucinated code doesn’t work.


This is ridiculously “meta”, but I’ve said the same thing, at some point GPT-x will be useless as it will be beyond our comprehension, that’s if it’s actually “smart”.

My honest opinion is the hallucinations are just gibberish, but are they useful gibberish? Maybe we’re saying the same thing ?


> GPT-x will be useless as it will be beyond our comprehension, that’s if it’s actually “smart”.

Things don’t have to be comprehensible before they’re useful. But they have to work to be useful.


Not hard to check whether code compiles or runs.
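
For Python output, both checks fit in a few lines. A minimal sketch using only the standard library (the helper names `compiles` and `runs` are illustrative); note that generated code should only be executed in a sandbox, and that passing both checks says nothing about whether the code does what was asked.

```python
import subprocess
import sys

def compiles(source: str) -> bool:
    """True if the source parses and byte-compiles; nothing is executed."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def runs(source: str, timeout: float = 5.0) -> bool:
    """True if the source exits with status 0 in a fresh interpreter.

    Assumption: the code is safe to execute; untrusted output needs a sandbox.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

print(compiles("def f(:"), compiles("undefined_name"), runs("undefined_name"))
```

The last line shows the gap between the two checks: a reference to a name that does not exist compiles fine and only fails when run, which is the usual shape of a hallucinated method call.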


> The experience calmed my existential fears about my job being taken by AI.

The issue is that among all the 100k+ software engineers, many don't really do anything novel. How many startups are employing dozens of engineers to create online accessible CRUDs to replace a spreadsheet?

In the company I work for I'd say we have about 15 developers or about 3 teams doing interesting work, and everyone else builds integrations, CRUDs, moves a button there and back in "an experiment", adds a new upsell, etc. All these last parts could be done by a PM or good UX person alone, given good enough tools.

The other parts I'm not worried about either.


For the type of engineers you describe the hard part I think is communication with other devs, communication with product owners, understanding the problem, suggesting different ways of solving the problem, figuring out which department personnel (outside other devs) to talk to about a little detail that you don't have... it's not writing the code which is hard, at least in my experience


Yes. I won't be worried until the day Joe CEO can write a prompt like "build me an app that lets me know where my employees are at all times," and GPT responds with a list of questions about how Joe imagines this being physically implemented, and then calls up the legal department to clear its methods.


I think this is closer than you expect


The question is... writing the code is a very small part of the job.

Figuring out what code to write is one of the big parts.

Fixing it when it breaks in many creative ways is the other big part.

How good is ChatGPT at fixing bugs? Security bugs or otherwise?


Sure but the other parts you don't need an engineering degree for, the other parts amount to design / product work, not engineering.


1. You don't need an engineering degree for software development in many, many cases. So I don't understand your argument.

2. Engineers design stuff :-) I'm not sure what you mean with "product work". Also, engineers debug and fix stuff :-)


Product work = figuring out what to build


Product work is a fractal and you don't want "product people" designing things past the 2nd or 3rd fractal step, in my experience.


I had a similar experience. I wanted it to write code to draw arcs on a world map, with different bends rather than going on a straight bearing. I did all the tricks, told it to explain its chain of thought, gave it a list of APIs to use (with d3-geo), simplified and simplified and spent a couple hours trying to reframe it.

It just spit out garbage. Because (afaict) there aren't really examples of that specific thing on the Internet. And it's just been weirdly bad at all the cartography-related programming problems I've thrown at it, in general.

And yeah, I'm much less worried about it replacing me now. It's just not.. lucid, yet.


GPT-4 is reasonably good at D3 and drawing arcs on a projection (e.g. orthographic) is not that unique; you’ll find examples of it on Observable. However, I wonder if you broke down the problem into a small enough task. It performs best if you provide a clear but brief problem description with a code snippet that already kind of does what you want (e.g. using straight lines) and then just ask it to modify your code to calculate arcs instead. The combination of clear description + code, I found, decreases the likelihood of it getting confused about what you’re asking and hallucinating. If you give it a very long-winded request with no code as a basis for it, then good luck.


I did try the code snippet technique, but unfortunately it got it wrong. For example, I gave it code that drew arcs but didn't follow the shortest great-circle distance, and it gave me several plausible-looking approaches that were completely wrong (e.g. telling ctx.arc to draw counterclockwise, which does the wrong thing because it needs to use projections instead.)

I eventually just asked it to compute coordinates to a point c perpendicular to the midpoint on the great arc between a and b, such that the angle between ab and ac is alpha. I tried for hours, asking it to work out equations and name the mathematical identities it used etc. but it was all gibberish.
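
For reference, one reading of that last request has a closed form. A minimal sketch, assuming a unit sphere, points given as (latitude, longitude) in degrees, and c lying on the great circle that crosses arc ab at right angles at its midpoint m. The triangle a-m-c then has a right angle at m, and Napier's rule gives tan(mc) = sin(am) * tan(alpha). The function names are illustrative and this is not d3-geo code.

```python
import math

def to_vec(lat_deg, lon_deg):
    """(lat, lon) in degrees -> unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

def to_latlon(v):
    """Unit vector -> (lat, lon) in degrees."""
    x, y, z = v
    return (math.degrees(math.asin(max(-1.0, min(1.0, z)))), math.degrees(math.atan2(y, x)))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(vi / norm for vi in v)

def bend_point(a, b, alpha_deg):
    """Point c on the perpendicular bisector of arc ab with angle(ab, ac) = alpha at a.

    Assumes a and b are neither identical nor antipodal, and |alpha| < 90 degrees.
    Positive alpha bends toward the pole a x b.
    """
    va, vb = to_vec(*a), to_vec(*b)
    m = normalize(tuple(x + y for x, y in zip(va, vb)))   # midpoint of the arc
    n = normalize(cross(va, vb))                          # pole of the great circle ab
    half = 0.5 * math.acos(max(-1.0, min(1.0, dot(va, vb))))
    theta = math.atan(math.sin(half) * math.tan(math.radians(alpha_deg)))
    c = tuple(math.cos(theta) * mi + math.sin(theta) * ni for mi, ni in zip(m, n))
    return to_latlon(c)

def angle_at(a, b, c):
    """Angle in degrees at a between the great-circle arcs a->b and a->c."""
    va, vb, vc = to_vec(*a), to_vec(*b), to_vec(*c)
    tb = normalize(tuple(x - dot(va, vb) * y for x, y in zip(vb, va)))
    tc = normalize(tuple(x - dot(va, vc) * y for x, y in zip(vc, va)))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot(tb, tc)))))

paris, new_york = (48.8566, 2.3522), (40.7128, -74.0060)
c = bend_point(paris, new_york, 20.0)
print(c, angle_at(paris, new_york, c))
```

`angle_at` is only there as a numerical check on the identity: it recomputes the angle from tangent vectors without using Napier's rule. A bent arc could then be drawn through a, c and b, though how to interpolate it for a given projection is a separate choice.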


So the closer you come to writing the code for it the better it does


I imagine that creative approaches to spatial problem solving would be one of the harder areas for it - not just because there are by definition fewer public examples of one-off or original solutions, but also because one has to visualize things in space before figuring out how to code it. These bots don't have a concept of space. I'm thinking of DALL-E (et al.) having problems with "an X above Y, behind Z".


GPT-4 has its hands tied behind its back. It does not have active learning and it does not have a robust system of memory or a reward/punishment mechanism. We are only now starting to see work on this front [1].

It might not know more than you about your niche. I don't. I would search and I would try to reason, but if I was forced to give a token by token output that is answering the question as truthfully as possible, I might have started saying bullshit as well.

I don't think that the fact that gpt doesn't know things or does some things wrong is sufficient to save dev work from automation.

[1]: https://github.com/noahshinn024/reflexion-human-eval


> The experience calmed my existential fears about my job being taken by AI.

Same for me. I haven't tried GPT-4 yet, and not on code from work anyway, but GPT-3 seems borderline useless at this point. The hallucinations are quite significant. Also, I tried to produce advice for Agile development with references and, as stated in other articles, the links were either 404s or even completely unrelated articles.

Still I'm taking this seriously. Just consider the leaps that happened with AlphaGo/AlphaZero or autonomous driving, which were considered unthinkable in the respective domains before.


Even if AI only takes over “easy” programming jobs, it might still create a huge downward pressure on compensation.

After all, just look at manufacturing. Compared to 1970 we produce 5x the real output but employ only 50% the people. The same will likely happen to fields like programming as AI improves.


For the crap devs maybe, but high skill devs and architects will be able to charge more than ever to oversee all of this «productivity» from the AIs.


I asked it to write a trivial C#/.NET example of two actors where one sends a ping message and the other responds with pong. It couldn't get the setup stage right, called several methods that don't exist, and had a cyclic dependency between actors that would probably take some work to resolve.

Even after several iterations of giving it error messages and writing explanations of what's not working, it didn't even get past the first issue. Sometimes it would agree that it needs to fix something, but would then print back code with exactly the same problem.


Yes, exactly this.

I wrote some questions in the specialist legal field of someone in my household, then started to get into more specialist questions, and then specifically asked about a paper that she wrote innovating a new technique in the field.

The general question answers were very impressive to the attny. The specialist questions started turning up errors and getting concepts backwards - bad answers.

When I got to summarizing the paper with the new technique, it could not have been more wrong. It got the entire concept backwards and wrong, barfing generic and wrong phrases, and completely ignored the long list of citations.

Worse yet, to the point of hilariously bad, when asked for the author, date, and employer of the paper, it was entirely hallucinating. Literally, the line under the title was the date, and after that was "Author: [name], [employer]". It just randomly put up dates and names (or combinations of real names) of mostly real authors and law firms in the region. Even when the errors were pointed out, it would apologize, and then confidently spout a new error. Eventually it got the date correct, and that stuck, but even when prompted with "Look at where it says 'Author: [fname]' and tell me the full name and employer", it would hallucinate a last name and employer. Always with the complete confidence of a drunken bullshit artist.

Similar for my field of expertise.

So, yes, for anything real, we really need to keep it in the middle-of-the-road zone of maximum training. Otherwise, it will provide BS (of course if it is BS we want, it'll produce it on an industrial scale!).


Yeah, in that sense I think one of the next logical steps will be providing on-demand lightweight learning/finetuning of LLM versions/forks (maybe as LoRAs?) as an API and integrated UX based on user chat feedback, while abstracting away all the technical hyperparameter and deployment details involved in a DIY setup. With a lucrative price tag of course.


> but it doesn’t seem to be good at extrapolation.

This is true to varying degrees for every statistical model ever.


Yeah, that’s basically my point. The hype on HN/Twitter/etc. forgets this.


What would you be able to write with similar requests, if you'd only ever be allowed to use Notepad, and never run compiler/linter/tests, and not allowed to use Internet?


Given I don't have petabytes of information accessible for instant retrieval (including perfect copies of my language of choice's entire API) I don't think that's comparable. I wouldn't need the entire Internet if I'd memorized a large portion of it.


GPTs don't have access to petabytes of information, that's the point. Only to some internal representation.


Unlike current LLMs, your typical competent programmer would not hallucinate.


Quants’ jobs are safe because if it’s public there’s no edge


I feel vindicated reading this. Yesterday in a separate thread I claimed that it was wrong on 80% of the coding problems I gave it, and received the response from multiple readers that I was probably phrasing my questions poorly.

I started to believe them, too. Unfortunately, my brain is structured in such a way that a unanimous verdict from a few strangers is enough to make me think I’m probably the one who’s wrong. I need to make note of these events as a way to remind myself that this isn’t always the case.


I think part of the issue is that your mileage will vary greatly depending on what your problem domain and language of choice are. People working with languages and problems ChatGPT works well in have a hard time believing the hard fails in other domains and vice versa. I wrote a Python script the other day to delete some old Xcode devices lower than a certain iOS version, complete with options, with just a few back-and-forths with ChatGPT. My knowledge of Python is extremely basic and the code just worked out of the box. Then yesterday I asked for the code to tell if a device is LiDAR-enabled in Objective-C and it failed to give me compilable code 4 times in a row until I finally gave up and went back to the docs. The correct answer is one line. I for one am pretty excited about this: things that a lot of people have done before should be easy, which leaves more brain space for the tough stuff.


What was the answer, if you don't mind? ChatGPT (3.5) says to use isCollaborationEnabled on the ARWorldTrackingConfiguration class which doesn't seem quite right based on the docs.

I wonder if this is a GI/GO problem. Apple's poor documentation of features being the garbage in.


Sure, this was what I ended up using:

https://developer.apple.com/documentation/arkit/arworldtrack...

I believe there are multiple ways, part of the problem might be Apple doing the Apple thing where their user philosophy bleeds into their tech. They don't want you checking for a specific sensor or device capability they want you to check if whatever feature you want is enabled.


Vindicated and excited. Gradient descent is likely not enough. I love it when we get closer to something but are still missing the answer. I would be very happy if "add more parameters and compute" isn't enough to get us to AGI. It means you need talent to get there, and money alone will not suffice. Bad news for OpenAI and other big firms, good news for science and the curious.

I imagine physicists got very excited with things like the ultraviolet catastrophe, and the irreconcilable nature of quantum mechanics and general relativity. It's these mysteries that keep the world exciting.


There's something ironic about implying that us not having a path to AGI is good news for the curious. If you're superficially curious then sure, we need to unlock another piece of a puzzle, more puzzle pieces means more puzzle solving.

But if you're able to actually take a step back, AGI would be the ultimate source of new puzzles for the curious. We don't even all agree on how to define the "GI", approaching AGI wouldn't be unlike meeting extraterrestrial life sitting on a computer.


I think you misunderstood the parent, who was probably saying that the process for achieving AGI would be more interesting if it isn't just "more compute/training".


I think you misunderstood me since that's exactly my point.

From the trees, it's great for the curious that it might take more than compute and training.

From the forest, it'd be infinitely preferable if AGI were just a matter of more money. There are mysteries we can't even envision yet that would more than make up for any "lost curiosity"


All right, all right, but to me the implication was that only those who have ludicrous amounts of money would be able to play with it. And I don't think that's the most desirable outcome, is it?


Woah. I've never seen someone so self aware of the asch conformity test without literally talking about the asch conformity test.

https://en.wikipedia.org/wiki/Asch_conformity_experiments


Easiest fix for that is show the prompt you gave and the output you want. Force the people who tell you it's easy to actually do it. You'll get one of 3 outcomes.

-They try then fail (you were right)

-They try then succeed (you learn)

-They keep telling you it's easy but don't demonstrate. (You know that this person is full of it and you can ignore them in future)


Yup, always drill down and get/provide more context for when something doesn't align/seems fishy.

I've found WAYYYY too many of my issues were really just communication issues. Shit's hard to navigate socially.

That said, mind sharing the specific tasks/prompts gpt-4 failed you?


“Reflection-Based GPT-4 Agent is State-of-the-Art on Code Gen

Iteratively refines code, shifting “accuracy bottleneck” from correct code gen to correct test gen

HumanEval accuracy:

-Reflexion-based GPT-4 88%

-GPT-4 67.0%

-CodeT 65.8%

-PaLM 26.2%”

with link to code in the Tweet:

https://mobile.twitter.com/johnjnay/status/16393620718075494...

A 21-percentage-point improvement (67% to 88%) after adding a feedback loop and self-reflection to GPT-4, which just went public 12 days ago. (The approach is based on a preprint published 4 days ago.)

Human coders often need a feedback loop and self-reflection to properly “generate” code for problems novel to them as well.

-----

A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?


Someone asked GPT-4 to build a complete app from scratch. It's now on the store. He seems to apply good prompt engineering techniques.

Screenshots, video demo, and process here: https://mobile.twitter.com/mortenjust/status/163927657157489...

It seems plausible to me now that a junior developer position would be hard to find in 2-3 years (I thought it would be ~5 years).


So this is very impressive and looks like a solid lowering of barrier to entry which is great….but that app is around 300 lines of code in one file that fetches data on 5 movies, screenshots are not cropped correctly and the pager doesn’t swipe back to the first dot. I am bit surprised it made it through the review process. Not hating I think it is great to make this stuff more approachable but not convinced junior devs are in danger yet.


GPT-3 was released less than 3 years ago and it was far from capable of this.


Exponential curves are theoretical constructs: all actual phenomena are S-shaped.

The question is only when does the "exponential regime" turn into the flat; and the answer is often fairly obvious if you don't begin from the "time = magic" premise.

There's an entire industry of public (pseudo-)intellectuals whose schtick is to draw logistic phenomena with exponential curves and then cry, "the sky is falling!".
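To illustrate the exponential-versus-S-curve point numerically, here is a minimal sketch. The growth rate, starting value, and ceiling are made-up illustration values, not measurements of any real system; the only claim is that a logistic curve is nearly indistinguishable from an exponential early on and then flattens.

```python
import math

def exponential(t, rate=1.0, x0=1.0):
    """Unbounded exponential growth starting at x0."""
    return x0 * math.exp(rate * t)

def logistic(t, rate=1.0, x0=1.0, cap=1000.0):
    """Logistic growth with the same starting value, limited by a ceiling `cap`."""
    return cap / (1.0 + (cap / x0 - 1.0) * math.exp(-rate * t))

def ratio(t):
    """How much of the exponential prediction the logistic curve actually reaches."""
    return logistic(t) / exponential(t)

if __name__ == "__main__":
    for t in (0, 1, 2, 5, 10, 20):
        print(t, round(exponential(t), 2), round(logistic(t), 2), round(ratio(t), 4))
```

Early samples alone cannot tell the two curves apart, which is why extrapolating from the first few points says little about where the ceiling is.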


>Exponential curves are theoretical constructs: all actual phenomena are S-shaped.

This. I've been thinking much the same recently, although not expressed so succinctly. Nothing in nature is ever exponential (for very long).


On the contrary, few experts expected this performance from an AI this soon too.

If you can identify one or two aspects of the human “general” intelligence that an AI cannot ever possess, even in principle, I think a lot of people would be grateful.


In animals, propositional knowledge is built from procedural knowledge; and it can't really be otherwise.

What AI does at the moment is approximate propositional knowledge with statistical associations, rather than take the procedural route. But this fails because P(A|B) doesn't say whether A causes B, B causes A, A is B, A and B are causally unrelated, etc.
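A small sketch of why conditional probabilities alone cannot settle the causal direction. The probabilities below are arbitrary illustration values (an assumption, not data): one "world" where A is generated first and B depends on it, and a second world where B is generated first and A depends on it, tuned so that both produce exactly the same joint distribution.

```python
from fractions import Fraction as F

def joint_a_causes_b(p_a, p_b_given_a):
    """Joint P(A=a, B=b) when A is generated first and B depends on A."""
    joint = {}
    for a in (0, 1):
        pa = p_a if a == 1 else 1 - p_a
        for b in (0, 1):
            pb = p_b_given_a[a] if b == 1 else 1 - p_b_given_a[a]
            joint[(a, b)] = pa * pb
    return joint

def joint_b_causes_a(p_b, p_a_given_b):
    """Joint P(A=a, B=b) when B is generated first and A depends on B."""
    joint = {}
    for b in (0, 1):
        pb = p_b if b == 1 else 1 - p_b
        for a in (0, 1):
            pa = p_a_given_b[b] if a == 1 else 1 - p_a_given_b[b]
            joint[(a, b)] = pb * pa
    return joint

# World 1: A causes B (illustrative numbers only).
world1 = joint_a_causes_b(F(3, 10), {0: F(1, 5), 1: F(9, 10)})

# World 2: B causes A, with parameters chosen to reproduce the same observations.
p_b = world1[(0, 1)] + world1[(1, 1)]
p_a_given_b = {b: world1[(1, b)] / (world1[(0, b)] + world1[(1, b)]) for b in (0, 1)}
world2 = joint_b_causes_a(p_b, p_a_given_b)

# Every observable statistic, including P(A|B) and P(B|A), is identical.
same_observations = world1 == world2
```

Since the two worlds agree on every observable frequency, no amount of passively sampled data distinguishes them; only intervening (setting A and watching B, or the reverse) would.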

What is the procedural route? To perform actions with your body so as to disambiguate the cases. Animals have causal models of their bodies which are unambiguous and their actions are intentional and goal-directed and effectively "express hypotheses" about the nature of the world. In doing so, they can build actual knowledge of it.

There are at least some good reasons to suppose that "bodies which express hypotheses in their actions" require organic properties to do so: because you have to have adaptation from bottom-up to top-down to really have "the mind" grow the body in the relevant ways.

In other words, every action an animal performs isn't clockwork: in acting, its body and mind change. Every action is a top-down, bottom-up whole change to the animal.


This is a very interesting hypothesis that could be quite true for living beings. What I disagree with is that having an animal-like body is necessary for the process of forming a world model. A simulation could be sufficient. And there is already work on that front. (Also, I would not characterize deep-learning-based AI as trying to form propositional knowledge. In fact, its great performance partly stems from not dealing with propositional knowledge directly.)

If a body is in fact necessary, PaLM-E could be paving a way toward it as well. https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...


You might be interested in this thread by a DeepMind research scientist. https://mobile.twitter.com/AndrewLampinen/status/16396602197...


Sure and it is impressive but the upper bounds of AI products are hard to predict. That said things are changing fast I don’t know what tomorrow will bring.


the paper: https://arxiv.org/pdf/2303.11366.pdf

it reminds me of the thinking-fast vs. thinking-slow dichotomy. Current LLMs are the thinking-fast type. Funnily, people’s complaints about their errors are reminiscent of this. They answer just too quickly and only with their instant-response neural net. A thinking-slow answer would be more akin to a chain-of-thought answer. Allowing the LLM a more flexible platform than CoT prompting might well be the next step. Of course it would also multiply compute cost. So it might not be in your $20 subscription.


A narrower question: can we perhaps stop putting AGI and ChatGPT in the same paragraphs as if they are somehow relevant to each other? Intelligence has very little to do with a glorified Google search trained by statistical crunching of superhuman amounts of data; there is not even a chicken-sized trace of intelligence in ChatGPT that is not a reflection of the training set or of the embedded human-designed models it uses to mimic problem solving and conversation.


Several necessary ingredients of human intelligence are present in GPT-4: complex pattern matching, abstraction from concrete examples and application of the abstract patterns to new examples, pattern interpolation, basic reasoning.

This is evident by its ability to generalize from the training set to new problems within many domains.

It's still unable to generalize as well as a smart human beyond the distribution it was specifically trained on, which is evident by its poor performance on AMC, Leetcode medium and hard, and Codeforces problems. But most humans are not great at these kinds of problems either.

Benchmark and test results: https://openai.com/research/gpt-4


> A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?

We are running towards a brick wall and people are not paying attention. Setting up self-reflection loops today is actually fairly trivial and can be done programmatically, all the model needs is to produce a solution, invoke the evaluation and keep iterating.
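A minimal sketch of such a produce, evaluate, iterate loop. Everything here is hypothetical: `toy_model` is a stand-in for a real LLM call (no actual API is used), and the task and test strings are invented for illustration. It is not the method from the Reflexion preprint, just the bare loop shape.

```python
from typing import Callable, Optional, Tuple

def run_tests(source: str, tests: str) -> Optional[str]:
    """Run candidate code plus its tests; return an error message, or None on success.

    Warning: exec() on model-generated code is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(tests, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def reflect_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a solution, evaluate it, and feed the failure back until it passes."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        feedback = run_tests(candidate, tests)
        if feedback is None:
            return candidate, round_no
    return None, max_rounds

def toy_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong on the first try, right after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

solution, rounds = reflect_loop(toy_model, "write add(a, b)", "assert add(2, 3) == 5")
```

As the Reflexion summary quoted above notes, the loop is only as good as its evaluation step: with weak tests it will happily stop at a wrong answer.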


so how do we know chat gpt didn't have humaneval in training data?


The point is adding a couple components can improve GPT-4 significantly, as shown above. The data it originally trained with is presumably held constant in the evaluation above.


the point is if humaneval was in gpt training data, then this component improved memorization from mediocre to Ok-ish, and actual coding skills still not tested.


From other people’s and my experiences, GPT-4 can do more than simply memorizing. It can at least interpolate and reason a little bit too.

A few other tests show that GPT-4 would achieve much better results than 67% for something it has sufficient training data on like GRE Verbal and AP Macroeconomics.

https://openai.com/research/gpt-4

Yes, it still can’t generalize properly outside its training distribution. However, when armed with feedback and self-reflection, it seems better at that too.


yup. we're (we=the public) far from getting access to the full model. that said, one of the commenters in the twitter thread brings up how OpenAI isn't being fully forthcoming about their methods.

AI Explained has a good summary of many of these topics


Honest question: Why do so many people attribute "thinking", "knowing", "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of the advanced token-based probabilistic sequence generators, also known as LLMs?

LLMs are inherently incapable of any of that - as in mechanically incapable, in the same way a washing machine is incapable of being an airplane.

Now my understanding is that the actual systems we have access to now have other components, with the LLM being the _core_, but not _sole_ component.

Can anybody point me to any papers on those "auxiliary" systems?

I would find it very interesting to see if there are any LLMs with logic components (e.g. Prolog-like facts database and basic rules that would enforce factual/numerical correctness; "The number of humans on Mars is zero." etc.).


Because they don't distinguish between properties of the output and properties of the system which generated it. Indeed, much of the last decade of computer-science-engineering has basically been just insisting that these are the same.

An LLM can generate output which is indistinguishable from a system which reasoned/knew/imagined/etc. -- therefore the "hopeium / sky is falling" manic preachers call its output "reasoned" etc.

Any actual scientist in this field isn't interested in whether measures of a system (its output) are indistinguishable, they're interested in the actual properties of the system.

You don't get to claim the sun goes around the earth just because the sky looks that way.


Do submarines swim? No, but they are faster underwater than all swimmers. Therefore they are the best swimmers despite being unable to swim....

LLMs are producing human level reasoning in many domains, therefore they are the best at reasoning despite being unable to reason...

This whole debate hangs on the definition of "reasoning"


Scientists are extremely interested in measurable results of experiments. I think you are thinking of philosophers.


Can an airplane fly? Can a submarine swim?

Yes, AI may be constructed quite differently from human intelligence. Can it accomplish the same purposes? For some purposes, the answer is a resounding yes as can be seen from its applications around the world by millions of people.

Can an animal ‘think’, ‘understand’, or ‘reason’? Maybe not as well as a homo sapiens. But it’s clear that a raven, a dolphin, or a chimp can do many things we assume require intelligence. (A chimp may even have a slightly larger working memory than a human, according to some research.)

Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?


> Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?

You’re right, AI doesn’t need to be AGI to be useful. Most SEO content on the internet is probably even worse than ChatGPT can do. And LLM could hallucinate another Marvel movie since they’re so similar.

My problem is that people make ungrounded claims about these systems either already having sentience or being just few steps away from it. It’s a religion at this point.


some prompt results are only explainable if chatgpt has the ability to produce some kind of reasoning.

As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.


The reasoning occurred when people wrote the text it was trained on in the first place; its training data is full of the symptoms of imagination, reason, intelligence, etc.

Of course if you statistically sample from that in convincing ways it will convince you it has the properties of the systems (i.e., people) which created its training data.

But on careful inspection, it seems obvious it doesn't.

Bugs Bunny is funny because the writing staff were funny; Bugs himself doesn't exist.


> Bugs Bunny is funny because the writing staff were funny; Bugs himself doesn't exist.

Excellent analogy, and I appreciate analogies (perhaps even a bit too much). Will be using this one. Thank you!


If you “sample” this enough to be reasoning in a general manner, what is exactly the problem here?

Magic “reasoning fairy dust” missing from the formula? I get the argument and I think I agree. See Dreyfus and things like “the world is the model”.

Thing is, the world could contain all intelligent patterns and we are just picking up on them. Composing them instead of creating them. This makes us automatons like AI, but who cares if the end result is the same?


The distribution to sample from mostly doesn't exist.

Data is produced by intelligent agents, it isn't just "out there to be sampled from". That would mean all future questions already have their answers in some training data: they do not.

See for example this exact tweet: pre-2021 coding challenges are excellent, post-2021 are poor. Why? Because post-2021 didnt exist to sample from when the system was built.


At the minimum, chatgpt displays a remarkable ability to maintain a consistent speech throughout a long and complex conversation with a user, taking into account all the internal implicit references.

this to me is the proof it is able to correctly infer meaning, and is clearly a sign of intelligence. (something a drunk human has trouble doing, for example).


"I have seen the output and it matches what I consider to be conversation"

Well yeah, it's been trained to produce output that would look like conversation.


it's not what i meant: you can have a full conversation and then at some point use "it" or "him", and based on the rest of the sentence, it will understand what previous element of the conversation you were mentioning.

This requires at least "some" conceptualisation of the things you're talking about. It's not just statistics.


It does not require conceptualization, pretty sure the "understanding" of previous references comes from this: https://arxiv.org/abs/1706.03762


This is exactly statistics.


> As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.

If there's one field of expertise I trust programmers to not have a clue about it's how human intelligence works.


What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly? This answer is worth billions so why not enlighten us.

You don’t know? But you feel you have “it”, this magical substance, this substrate of reason itself? Reasoning is built out of what, exactly?

Sorry to be that guy, but I fail to see more than word play.


The thing is that we've known how to do reasoning with computers since the 1960's at least. Here:

https://dl.acm.org/doi/10.1145/321250.321253

That's the paper introducing the Resolution principle, which is a sound and complete system of deductive inference, with a single inference rule simple enough that a computer can run it.
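As a rough illustration of how small that single inference rule is, here is a sketch of resolution refutation restricted to propositional logic. This is a simplification: the paper's system works on first-order clauses and needs unification, which is omitted here. The clause encoding (sets of strings, with "~" marking negation) and the example facts are assumptions made for the sketch.

```python
from itertools import combinations

def negate(literal):
    """Flip a literal: 'p' <-> '~p'."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(c1, c2):
    """Return every resolvent of two clauses (each clause is a frozenset of literals)."""
    resolvents = set()
    for literal in c1:
        if negate(literal) in c2:
            resolvents.add(frozenset((c1 - {literal}) | (c2 - {negate(literal)})))
    return resolvents

def entails(premises, goal):
    """Refutation: add the negated goal literal and search for the empty clause."""
    clauses = {frozenset(c) for c in premises} | {frozenset([negate(goal)])}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:      # empty clause: contradiction found
                    return True
                new.add(resolvent)
        if new <= clauses:             # nothing new can be derived
            return False
        clauses |= new

# "human implies mortal" written as the clause (~human OR mortal), plus the fact "human".
premises = [{"~human", "mortal"}, {"human"}]
```

With these premises, `entails(premises, "mortal")` derives the empty clause, while `entails(premises, "immortal")` saturates without finding one. The search always terminates here because a finite set of propositional literals yields only finitely many clauses.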

The paper is from 1965. AI research had reasoning down pat since the 1970's at least. Modern systems have made progress in modelling and prediction, but lost the ability to reason in the process.

Yeah, we totally "scienced that shit" as you say in a comment below. And then there was an AI winter and we threw the science out because there wasn't funding for it. And now we got language models that can't do reasoning because all the funding comes from big tech corps that don't give a shit about sciencing anything but their bottom line.


What makes you so sure diluting things doesn't make them stronger? I mean, you don't know any physics, chemistry or biology -- but it's just word play right?

I mean, there isn't anything called science we might use to study stuff. You can't actually study any intelligent things empirically: what would you study? Like animals, and people and things? That would be mad. No no, it's all just word play.

And you know it's wordplay because you've taken the time to study the philosophy of mind, cognitive science, empirical psychology, neuroscience, biology, zoology and anthropology.

And you've really come to a solid conclusion here: yes, of course, the latest trinket from silicon valley really is all we need to know about intelligence.

That's how the scientific method works, right?

Silicon Valley releases a gimmick and we print that in Nature and all go home. It turns out what Kant was missing was some VC funding -- no need to write the Critique of Pure Reason.


>What makes you so sure diluting things doesn't make them stronger

Alcohol diluted with around 30% water makes it 'stronger' at killing bacteria...

I mean it's easy to say "just science that shit", and then forget we've been spending decades and billions of dollars doing just that.


Let’s all see and be amazed by the absolutely breathtaking achievements of those fields in the domain of AI…


> What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly?

It never fails: When faced with the reality of what the program is your average tech bro will immediately fall back to trying to play their hand at being a neuroscientist, psychologist, and philosopher all at once.


You did not answer a single thing and maybe I am those things. You don’t know me.


Are you any of those things?


>> Honest question: Why so many people attribute "thinking", "knowing", "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of the advanced token-based probabilistic sequence generators, also known as LLMs?

It's very confusing when you come up with some idiosyncratic expression like "advanced token-based probabilistic sequence generators" and then hold it up as if it is a commonly accepted term. The easiest thing for anyone to do is to ignore your comment as coming from someone who has no idea what a large language model is and is just making it up in their mind to find something to argue with.

Why not just talk about "LLMs"? Everybody knows what you're talking about then. Of course I can see that you have tied your "definition" of LLMs very tightly to your assumption that they can't do reasoning etc., so your question wouldn't be easy to ask unless you started from that assumption in the first place.

Which makes it a pointless question to ask, if you've answered it already.

The extravagant hype about LLMs needs to be criticised, but coming up with fanciful descriptions of their function and attacking those fanciful descriptions as if they were the real thing, is not going to be at all impactful.

Seriously, let's try to keep the noise down in this debate we're having on HN. Can't hear myself think around here anymore.


Just curious why you’d respond so much to him and also add nothing to the discussion?


Hang on, how is it fair to ask me why I "add nothing to the discussion" when all your comment does is ask me why I add nothing to the discussion? Is your comment adding something to the discussion?

I think it makes perfect sense to discuss how we discuss, and even try to steer the conversation to more productive directions. I bet that's part of why we have downvote buttons and flag controls. And I prefer to leave a comment than to downvote without explanation, although it gets hard when the conversation grows as large as this one.

Also, can I please separately bitch about how everyone around here assumes that everyone around here is a "he"? I don't see how you can make that guess from the user's name ("drbig"). And then the other user below seems to assume I'm a "him" also, despite my username (YeGoblynQueenne? I guess I could be a "queen" in the queer sense...). Way to go to turn this place into a monoculture, really.


Not him but I am also extremely frustrated by the fact it is impossible to have a real discussion about this topic, especially on HN. Everyone just talks past each other and I get the feeling that a majority of the disagreement is basically about definitions, but since no one defines terms it is hard to tell.


I don't think there's anything inherently different algorithmically or conceptually.

Our brain is just billions of neurons and trillions of connections, with millions of years of evolution making certain structural components of our network look a certain way. The scale makes it impossible to replicate.


What do you mean 'impossible to replicate'. With current technology, or in general?


Possibly both? Certainly the first of the two.


it kind of does “understand” when humans supervise it during training and it is somehow able to relate and give mostly coherent responses. It may not be feeling it but it does seem to “understand” a subject more than a few people


But we do not even know whether GPT-4 is 'just a LLM'. Given the latest addons and the fact it can do some mathematics, I think there is more under the hood. Maybe it can query some reasoning engine.

This is why I think it is so important for OpenAI to be more open about the architecture, so we can understand the weaknesses.


I threw a challenging rendering problem at it and I was pretty impressed with the overall structure and implementation. But as I looked deeper, the flaws became apparent. It simply made up APIs that didn’t exist, and when prompted to fix it, couldn’t figure it out.

Still, despite being fundamentally wrong it did send me down some different paths.


Using APIs that don't exist is the biggest problem I've seen with ChatGPT, and it seems GPT-4 as well.


I asked chatgpt about the api for an old programming game called chipwits.. it invented a whole programming language that it called chiptalk with an amalgam of the original chipwits stuff, missing some bits and adding others, and generated a parser for it, which I implemented and got to work, before figuring out how much was imaginary, after talking to the original chipwits devs. They found it pretty amusing.


> and got to work

Can you elaborate?


I'm fast learning Django and even though it's an extremely well documented space, ChatGPT has sent me down the wrong path more than a handful of times.

This is especially difficult because I don't know when it's wrong and it's so damn confident. I've gotten better at questioning its correctness when the code doesn't perform as expected but initially it cost me upwards of 30min per time.

Still, I would say between ChatGPT and Copilot - I'm WAY further ahead.


chatgpt or gpt4?

public copilot uses gpt3.5, as does non premium chatgpt.


my biggest problem with it is that it doesn't seem to understand its own knowledge. If you talk to it for a while and you go back and forth on a coding problem it will often suddenly start using wrong syntax that doesn't exist. Even though at this point it should already know and have looked up for sure that this syntax can't possibly exist because many times it responded correctly. So in human terms it has read the documentation and must know that this syntax can't possibly exist and yet it doesn't know that 10 sec later. That's currently what makes it seem like a not real intelligence to me.


One of the advantages of Bing, and I guess now ChatGPT with the browsing plugin, is that it's able to search the web for the right API.


To be fair, using APIs that I think should exist, is how I develop most of my APIs.


Except that I wasn't asking it to develop a new API.


It's very likely it was using other languages as "inspiration" given there's very little Zig code out there... so it's maybe natural it would use APIs that don't yet exist... perhaps informing it that it also needs to implement those APIs could work?


Then I guess you're not using it to its fullest potential ;)


We can’t keep blaming the prompter.


A simple metric on confidence interval could do the trick. As the model grows larger, it is getting more difficult to understand what is going on, but that doesn't mean that it needs to be a total black box. At least let it throw some proxy metrics. In due course, we will learn to interpret those metrics and adjust our internal trust model.


You can just ask it to give you confidence in the output on a scale 0 to 1


I wonder if a plugin to let it query API docs would solve this problem.


Also it makes up Python libraries, macOS apps to do certain tasks, etc.


I’ve had very good results from running the code and pasting the errors back into ChatGPT and asking it what to do. Sometimes it corrects itself quite well


Put that in a loop and see if AGI emerges.


>It simply made up APIs that didn’t exist

That has been my experience with Zig. It led me to the conclusion that there are just too many 'non indexed' developer tools in use these days, so there isn't much training data for new topics. But it was happy to hallucinate APIs and their proof of existence.


yea I find it to be wrong a lot when coding. But it's faster for me to fix existing code than to write code from scratch, so it's still better than nothing for me


Same. It seems similar to Copilot in that regard, but better at text-to-code, porting between languages or frameworks, and generating test cases and readmes: https://notes.osteele.com/gpt-experiments/using-chatgpt-to-p...


Most of us are much worse on coding problems not in our training set!

(Looks down at dynamic programming problem involving stdin/stdout and combining two data structures).


The reason we're being kept around still is that you can solve the problem without it ever appearing in your training set, and once you have, it has.


Hints of The Nine Billion Names of God for sure.

https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God


Oh wow - Unsong [1] must have taken some inspiration from that. Into the queue it goes!

[1] https://unsongbook.com/


Allow me to offer you this twitter thread:

https://twitter.com/chaosprime/status/1607895175799373830


What is that exactly? The site doesn't say.


A work of online serial fiction by blogger Scott Alexander, formerly of Slate Star Codex, now of Astral Codex Ten.

Goodreads has a decent intro blurb: https://www.goodreads.com/fr/book/show/28589297-unsong

If you're unsure whether his writing style is your thing, feel free to sample his shorter fiction from his blog:

https://slatestarcodex.com/2015/06/02/and-i-show-you-how-dee...

https://slatestarcodex.com/2015/04/21/universal-love-said-th...

https://astralcodexten.substack.com/p/idol-words


Thanks. The goodreads blurb makes me think it's something like a Salman Rushdie or a Kevin Smith (with less toilet humor) take on things.


A short film adaptation released a year ago.

https://www.youtube.com/watch?v=UtvS9UXTsPI


A better question is, given a significant corpus of complex functionality, can it implement complex code in a language that it knows, but in which it has only seen lower complexity code?

Can it transfer knowledge across languages with shared underlying domains?


I think given that it's been trained on everything ever written, we should suppose the answer is no.

It has always been possible, in the last century, to build a NN-like system: it's a trivial optimization problem.

What was lacking was the 1 exabyte of human-produced training data necessary to bypass actual mechanisms of intelligence (procedural-knowledge generation via exploration of one's environment, etc.).


the implication here is that GPT is just brute force memorizing stuff and that it can't actually work from first principles to solve new problems that are just extensions/variations of concepts it should know from training data it has already seen

on the other hand, even if that's true GPT is still extremely useful because 90%+ of coding and other tasks are just grunge work that it can handle. GPT is fantastic for data processing, interacting with APIs, etc.


No, the implication is that most of us fake it until we make it. And The Peter Principle says we're all always faking something. My comment was just about humanity. ChatGPT isn't worth writing about.


We aren't state machines. We are capable of conscious reasoning which GPT or any computer is not.

We can understand our own limitations, know what to research and how, and how to follow a process to write new code to solve new problems we have never encountered.

Our training set trains our learning and problem solving abilities, not a random forest.


I've been adding C# code completion functionality to my REPL tool, and ended up reverting to the text-davinci model.

The codex (discontinued?) and text-davinci models gave much better results than GPT3.5-turbo, specifically for code completion scenarios. The latest models seem to produce invalid code, mostly having trouble at the boundaries where they start the completion.

My suspicion is that these latter models focus more on conversation semantics than code completion, and completing code "conversationally" vs completing code in a syntactically valid way has differences.

For example, if the last line of code to be completed is a comment, the model will happily continue to write code on the same line as the comment. Not an issue in a conversation model as there is a natural break in a conversation, but when integrating with tooling it's challenging.
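One possible workaround for that specific boundary case, offered purely as an assumption and not something described above: normalize the prompt before sending it, so that a trailing line comment is always terminated with a newline. The helper name `prepare_completion_prompt` and the `//` default (C#-style line comments) are hypothetical choices for this sketch.

```python
def prepare_completion_prompt(code: str, comment_prefix: str = "//") -> str:
    """Terminate a trailing line comment with a newline before requesting a completion.

    Only handles whole-line comments; a comment that follows code on the same
    line (e.g. `int x = 1; // note`) is deliberately left alone.
    """
    if not code or code.endswith("\n"):
        return code
    last_line = code.rsplit("\n", 1)[-1]
    if last_line.lstrip().startswith(comment_prefix):
        return code + "\n"
    return code

# Hypothetical REPL input whose last line is a comment.
prompt = prepare_completion_prompt("var items = new List<int>();\n// sum the items")
```

This only moves the boundary to where a new statement can legally start; whether a given model then respects it would still need to be checked empirically.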

Most likely the issue is that I'm not yet effective at prompt engineering, but I had no issues iterating on prompts for the earlier models. I'm loving the DaVinci model and it's working really well -- I just hope it's not discontinued too soon in favor of later models.


I can corroborate that text-davinci gives much better results than the newer models for tasks involving summarization or extraction of key sentences among a large corpus. I wonder what empirical metrics OpenAI uses to determine performance benchmarks for practical tasks like these. You can see the model in action for analysis of reviews here: https://show.nnext.ai/

[Disclaimer - I work at nnext.ai]


I was just talking about this the other day:

> it's more hacking than careful and well specified engineering, and that could lead down a path of instability in the product where some features get better while others get worse, without understanding exactly why.

https://news.ycombinator.com/threads?id=pffft8888&next=35269...


Will take a bit of time before AI can consistently beat us on coding/proofs but the raw ingredients imo are there. As someone who was skeptical of AGI via just scaling things up even after GPT-3, what convinced me was the chain of thought prompting paper. That shows the LLM can pick up on abstract thought and reasoning patterns that humans use. Only a matter of time before it picks up on all of our reasoning patterns (or maybe it already has and is just waiting to be prompted properly...), is hooked up to a good memory system so it's not limited by the context window and then we can watch it go brrrr

It can still make stupid mistakes in reasoning but I don't think that's fundamentally unsolvable in the current paradigm


> That shows the LLM can pick up on abstract thought and reasoning patterns that humans use.

Does it? I’m still unconvinced it’s more than copying other examples of “show your work”.


It's definitely not just copying verbatim. If you mean it's emulating the reasoning pattern it sees in the training data well...don't humans do that as well to get answers to novel problems?


We don't know all the different ways humans arrive at answers to novel problems.

And while these LLMs aren't literally just copying verbatim, they are literally just token selection machines with sophisticated statistical weighting algorithms biased heavily towards their training sets. That isn't to say they are overfitted, but the sheer scale/breadth gives the appearance of generalization without the substance of it.


Here's an argument that GPT does actually build an internal representation of the game Othello, it's not just token selection: https://thegradient.pub/othello/


Keep in mind that the Othello example is a model specifically trained on only Othello games. I haven’t seen any claims that general purpose models like GPT-4 have internal representations of complex abstract structures like this.


Why wouldn't they? Text-moves of Othello games are presumably a subset of the training data for a general LLM. If anything the general LLM has the chance to derive more robust internal world representations given similarly laid out board games.

This is very reminiscent of position-encoding neurons: https://en.wikipedia.org/wiki/Grid_cell

It is also not surprising that if you force a system to succinctly encode input-output relationships, eventually it discovers the underlying generating process or its equivalent as implied by Kolmogorov complexity theory. Language is just a convenient encoding for inputs and outputs, not fundamental. So yes it is regurgitating statistics, but statistics are non-random because of some non-trivial underlying process, always, and if you can regurgitate those statistics consistently you're guaranteed to have learned a representation of the process. There is no difference and biological systems aren't any different.


This morning I asked GPT-4 to play duck chess with me. Duck chess is a very simple variant of chess with a duck piece (that acts like an impassable brick) that each player moves to an empty square after their normal move. [I gave GPT-4 a more thorough and formal explanation of the rules of course.]

To a human, board state in chess and in duck chess is very simple. It’s chess, but with one square that’s blocked off. Similarly, even a beginner human chess player can understand the basics of duck chess strategy (block your opponent’s development in the beginning, block your opponent’s attacks with the duck to free up your other pieces, etc.).

GPT-4 fell apart, often failing to make legal moves, and never once making a strategically coherent duck placement. To me this suggests that it does not have an internal representation of the 64 squares of the board at all. Even if you set aside the strategic aspect, the only requirement for a duck move to be legal is that you place it on an empty square, which it cannot consistently do, even at the very beginning of the game (it likes to place the duck on d7 as black after 1. …e5, even when its own pawn is there).


It is a matter of degree. GPT-4 may, for various reasons some of which are artificial handicaps, have only a weak grasp of a board representation now. But if it has any such representation at all, that's already a different story than if it did not. I think all evidence points this way, even from other networks, e.g. image classification networks that learn common vision filters. It's a pretty general phenomenon.


No, humans don't do that. If humans did that nothing new would ever be created.


They are remixing not reasoning


It has been proven it creates internal abstract representation models many times. Most trivial one is playing chess or go via text.


The statistical distribution of historical chess games is an approximate statistical model of an actual model of chess.

Its "internal abstract representation" isn't a representation; it's an implicit statistical distribution across historical cases.

Consider the difference between an actual model of a circle (e.g., radius + geometry) and a statistical model over 1 billion circles.

In the former case a person with the actual model can say, for any circle, what its area is. In the latter case, the further you get outside the billion samples, the worse the reported area will be. And even within them, it'll often be a little off.

Statistical models are just associations in cases. They're good approximations of representational models for some engineering purposes; they're often also bad and unsafe.
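
To make the circle comparison concrete, here is a minimal sketch. The "statistical model" is stood in for by a nearest-sample lookup over randomly drawn circles, which is my own simplification for illustration and not a claim about how an LLM works; the sample count and radius range are arbitrary assumptions.

```python
import math
import random


def exact_area(radius: float) -> float:
    """Representational model: the area follows from the geometry."""
    return math.pi * radius ** 2


def make_samples(n: int, max_radius: float, seed: int = 0):
    """Draw n circles with radii in [0, max_radius] and record their areas."""
    rng = random.Random(seed)
    radii = [rng.uniform(0.0, max_radius) for _ in range(n)]
    return [(r, exact_area(r)) for r in radii]


def nearest_sample_area(samples, radius: float) -> float:
    """Stand-in statistical model: report the area of the closest circle seen."""
    return min(samples, key=lambda s: abs(s[0] - radius))[1]


# Assumed setup: 1000 sampled circles, all with radius at most 10.
samples = make_samples(1000, max_radius=10.0)

for r in (5.0, 10.0, 20.0, 100.0):
    err = abs(nearest_sample_area(samples, r) - exact_area(r))
    print(f"radius={r:>6}: absolute error {err:.2f}")
```

Inside the sampled range the lookup is slightly off; outside it the error keeps growing, while the formula is exact everywhere.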


It's not some kind of first order statistical gibberish.

It exhibits internal abstract modeling.

It's a bit silly to argue against it at this time.

To produce answers with quality we see it'd have to use orders of magnitude more memory than it actually does.

It's also easy to test yourself.

A simple way is to create a role-playing scenario with multiple characters in which the same thing is seen differently by different actors at different times, and probe it with questions (e.g. somebody puts X into a bag labelled Y while another person doesn't see it, then ask what each actor thinks is in the bag at a specific time in the scenario).

Or ask for some crazy analogy.

Why am I even saying it, just ask it to give you list of examples how to probe LLM to discover if it creates abstract internal models or not - it'll surely give you a good list.


Most things in life aren’t mathematical objects and therefore don’t have perfect theoretical models anyway. For example, what is a “chair”?


It seems like chain-of-thought will work pretty well when backtracking isn't needed. It can look up or guess the first step, and that gives it enough info to look up or guess the second, and so on.

(This can be helpful for people too.)

If it goes off track it might have trouble recovering, though.

(And that's sometimes true of people too.)

I wonder if LoRA fine-tuning could be used to help it detect when it gets stuck, backtrack, and try another approach? It worked pretty well for training it to follow instructions.

For now, it seems like it's up to the person chatting to see that it's going the wrong way.


The perfect reasoner is upon us ?


I would prefer to say that we've seen a glimpse of what a future world with a perfect reasoner will be like


And I imagine even the glimpse would cause a lot of venture capital to be flowing into AI... and also government/military funds.


Man America is really such a bore: VC, Military, solving problems, weapons, war, getting it done quicker, “freedom”, is there anything else to all this, to life ?

I used to be one of those “why do people always rag on American culture?” types, but I’m getting it.

It makes me laugh how we want to automate everything without the slightest idea of what we’ll be doing once it’s all automated ? Is that the point where the USA figures out we already had a lot of good things to do ? Ha, we’ll invent the matrix and plug ourselves in.

Sorry it’s not personal but it just seems like a never ending grind instilled with the same themes over and over. Now we have “hustle culture with no job prospects due to automation”, cool plan.


Well this approach brings us also modern medicine, so we are not dying in the pool of mud from every trivially treatable malaise. We break atoms, we can almost reach the stars which is probably the only long term way to preserve mankind.

You can't have one without the other. Look at any society in history, either push forward or eventual downfall and ending up as small history lesson. Some say its human nature, some say its nature's nature.


I’m sorry but modern medicine and technologic / scientific advancements happen without the modern American psyche.

If “America” wasn’t a thing , we’d all be completely fine.

In my opinion America is pushing us to “need” to be a multi planetary species. Not the other way around. We’re taking greater risks for the growing need for monetary reward it’s unavoidable.

As others have put it, the Earth is the greatest starship we’ll ever know. We know that the trajectory we’re on will require a backup.


Sorry I was being sarcastic but sounds like you’re actually into it?


This guy seems to laugh away the fact that he gave the prompt in terribly broken and chunked up formats. I don’t think it’s surprising the model did poorly. Maybe the contamination issue is true. But maybe it’s also true that the model does fine on novel codeforce problems when you don’t feed it a garbage prompt?


Unless the formatting is somehow very different for pre 2021 problems it is still a strong signal that it is (up to a point) just parroting a solution it had heard somewhere rather than inferring it some way else.

This is neither good nor bad, that will depend on what you want to use the model for.


That’s not my point. Maybe it is parroting a solution for something it has seen before, but still capable of writing solutions to problems it has not seen before when they’re well specified.

The presence of contamination does not mean the general capability is intrinsically poor.


The prompting might be an issue, but I think the larger sign that it's not quite there yet in terms of symbolic reasoning is that even based on their own paper, GPT-4 gets only a 30 on AMC 10 (out of 150), whereas leaving the test blank would get you a 37.5. And this is on a closed-ended multiple-choice test, so the conditions should be favorable for it to dominate.

Edit: Although this might be unfair considering that LLMs are known to be poor at calculation. (Maybe it would do better at proof-style, like USAMO?). I wonder how well chatGPT with WA integration would do.
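
For anyone checking the arithmetic: the 37.5 figure follows from the AMC 10 scoring rule of 6 points per correct answer, 1.5 per blank and 0 per wrong answer over 25 questions. A small sketch (the function name is mine, and "5 correct, none blank" is just one combination that yields 30, not a claim about what GPT-4 actually answered):

```python
def amc10_score(correct: int, blank: int, total: int = 25) -> float:
    """Score under the 6 / 1.5 / 0 rule for correct / blank / wrong answers."""
    if correct < 0 or blank < 0 or correct + blank > total:
        raise ValueError("correct and blank must be non-negative and sum to at most total")
    return 6.0 * correct + 1.5 * blank


print(amc10_score(correct=0, blank=25))  # all blank
print(amc10_score(correct=5, blank=0))   # one way to land on 30

# Expected score from uniform random guessing among 5 choices per question.
expected_random = 25 * 6.0 * (1 / 5)
print(expected_random)
```

Note that uniform random guessing also comes out at 30 in expectation, which is the same as the reported score.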


>> I don’t think it’s surprising the model did poorly.

But it did poorly only on the problems it hadn't seen before. Was it prompted differently on one kind of problem, compared to the other?


But you can do a task you’ve done before with poor specification too. Sure, maybe it is contamination. But who cares? We only ought to judge the tool on its performance for carrying out good instructions.


On code gen with some reflection baked in (i.e. feedback and memory), it shot up to 88% (from 67%).

There's some contamination to be sure but also still quite a lot to be done to better output


The need for prompt engineering is an indicator of GPT failing the problem, not succeeding at it.


Like how the need for roads is an indication that cars don't solve transportation problems.

People adapt to maximize the utility of their tools; always have, always will.


I disagree. We cannot expect these tools to work if we don’t communicate what we want from them. Can’t do that for humans either.

We shouldn’t be confusing clear communication with “prompt engineering”.


The need for “tools” is a bit of a red flag too.


Why? I'm significantly less productive and competent without tools, and this applies to every human I've ever known.


Maybe the OP is also contaminated by a bias of wanting to find errors in GPT-4?

Finding flaws in GPT-4 and ignoring the fact that we are at the dawn of AI, that seems like a good remedy to calm the AI existential crisis anxiety nowadays.


Another example I encountered today.

<Prompt> show me an example of an enum LifecycleStatus deriving from an abstract class Status with fields status_type String and reason String (with default getters). Use a lombok builder on the enum class and show an example of building a LifecycleStatus enum setting only the status_type. </Prompt>

Answer: ChatGPT tells me it absolutely is possible, and generates the code as well. Except, Java enums cannot extend an abstract class (since they implicitly extend Enum<T>). You can only have an enum implement an interface. Here, I realised that I had not mentioned the word "java" in this prompt (though I had started off the chat with the word "java", so I assume it was in the context).

In any case, my next prompt was:

<Prompt> Is the code example you shared for java? </Prompt>

Answer: Yes, the code example I shared is for Java. Specifically, it is for Java 8 or newer, as it uses the @Builder annotation from Lombok, which was introduced in version 1.16.8 of Lombok.

I continued the conversation and had it try out other methods of achieving the same thing. (abstract class with static inner builder class, enums implementing an interface). But it basically follows your prompt and provides whatever you ask it to provide. When I asked it to have an enum derive an abstract class, it did that for me without any complaints. When I later said, this was not possible, it changed tracks.

I echo other users in saying that, I would not want to use this for learning anything new. There is no point if I have to break my flow and verify after every query. But it is a good rubber duck. The chat interface can be used as a deprocrastination tool - when you are not able to come up with anything fruitful, just ask a few questions. This gets the ball rolling at least.


I wonder if he tried the trick that Microsoft recommended in their GPT-4 evaluation paper, that is, asking GPT to go step by step with explanations. It tends to produce much better results, simply because it is a better fit for the prediction mechanism that GPT uses. It tends to predict better when steps are smaller.



I'm confident that if chatgpt trained on all code submitted to topcoder and code forces, then iteratively refined responses by compiling and evaluating the output, it could get to red topcoder quite easily. Just that programming challenges probably weren't that high of a priority and data is in silos.


I don't understand why anyone is seriously expecting "GPT" to write code? Writing code is really analogous to proving theorems (the Curry-Howard correspondence). IIUC that is not what the GPT "language model" was ever designed for.


I don't think anyone expected LLMs to do this well on code problems. Just a few years ago they were struggling to get the syntax right. But since it does seem to work in many cases, it's only reasonable to explore its limits. But in the end, automated math & programming will most likely not be done using LLMs, but with purpose-built systems. The major players are of course already working on this, including OpenAI [1] and DeepMind [2].

[1] https://openai.com/research/formal-math (feb '22)

[2] https://www.deepmind.com/blog/competitive-programming-with-a... (feb '22)


I ask this earnestly, but have you tried it? It most certainly can write a pretty huge variety of code. Yes, it does often have errors, and it struggles with novel or complex problems (as the article points out), but for well known languages and frameworks it is still insanely impressive.


I tried it yes, it did not go well. First it gave me a snippet that looked correct but was not what I was asking for (I guess similar to a search engine). When I tried to point out why the proposed solution did not solve the problem, it could not modify it correctly. It kept oscillating between two incorrect solutions.


On top of that, I think we might have reached peak training data. Or rather, peak human-generated training data. From here on AI will produce much more data than humans. Consequently, it should become ever harder to train a model on something novel.

I even wonder if humans in 50 years will recognize AI by its particular (early 21st century) style to write and speak.


> I even wonder if humans in 50 years will recognize AI by its particular (early 21st century) style to write and speak.

I wonder if writing and speaking style will be frozen in time if people are bombarded with AI output all day.


A question to those adept in the field. Isn't that expected? These models are trained by making them guess the next word given a sequence of words. This means that they are not modeling causal relationships between the given context (the prompt) and the answer they output. And I believe that for solving new problems, making these kinds of causal connections is necessary. So does the fact that they can actually do better than nothing mean that they have a very basic notion of causality, or just that predicting the next word from human content is a poor proxy for such causal relationships?


This has not been my experience. If you describe what you want, it gets very close to writing correct code.

I’ve walked through a few very complex coding problems and although it took a few attempts, it eventually got there.


Because training data is just weighted data for accuracy.

It is like holding two balls, one vibrant blue and the other navy. You ask it to pick the blue ball and it consistently goes for the vibrant one. This is because it is more accurately blue.

Sure it is some aspect of AI. But hopefully we don't go down the path of assuming it is sentient or capable of decision making on the basis of emotional intelligence.


I'm surprised that so many people here seem to find this surprising.

There's been a lot of "you aren't prompting it correctly" discourse. Some of it valid, some of it not.

But there's also been many examples of people asking chatGPT to make minor modifications to a well-known problem and it completely flounders.


The insidious thing is "you aren't prompting it correctly" is kind of a truism. For every possible output there almost certainly is a prompt that produces it (at worst you can just tell it exactly what to output verbatim). The true believers can already do all their programming via ChatGPT regardless of whether or not there are real productivity gains. Not that different from all the other tools people claim turn them into 10x programmers while others remain unconvinced. So as long people enjoy the format, it's here to stay.


I never use ChatGPT for anything I don't already know how to do.

The biggest thing I've had it writing for me recently though - unit tests. It writes them much quicker and even though I have to tweak them, the shell is done and allows me to focus on more important things after I verify and modify them.


Yeah, I'm an AI skeptic.

I'm suspecting some people try to make money but they cannot uphold their promise.


I’m always asking ChatGPT about questions related to Smalltalk and Scheme and it does a terrible job with them. To be expected, but makes me wonder if it will ever be good enough for more niche programming languages and aspects of programming


It's only a matter of time before chatGPT is trained on so many examples that it can provide a solution to 99.99% of coding prompts. I am curious to see how much that 0.01% will matter.


There will always be new programs that matter; they matter because they are new.


I love how AI apologists trip over themselves to explain away deficiencies in GPT and explain how it will “get better with time”.

Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

Worse, since we expect AI generated content to start becoming widespread and difficult to distinguish from human content, GPT will eventually start cannibalizing its own outputs as training data, leading to shittier and shittier models over time that get overfitted and no longer seem human enough for any practical use.

It’s a fading dream.


Not sure why so many down votes for you. HN is clearly very pro-AI.

I think you're correct about the training data problem. This is starting to remind me of the story "The Last Question" by Asimov. "Insufficient data for meaningful answer". While in that story the AIs kept progressing, I think in reality we will forever be stuck with diminishing quality of training data.

Even just consider post-Copilot Github. Presumably there is now code publicly available that was generated by an AI. Next time somebody slurps up Github to train a new model, some of that code will be included. Overfitting ensues.


How is that over fitting? If code accomplishes a task with given requirements, it satisfies the problem.

So many devs just copy paste code as is.


>Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

One word: AlphaZero. Deepmind ran out of human go games to study, but it turned out that self-play was dramatically better. Your argument only holds if a) there's a linear relationship between the amount of training data and the quality of a model and b) GPT is close to maximally efficient in converting training data into useful weights. Both of these premises are demonstrably false.

GPT-4 is, in the scheme of what's possible, an incredibly primitive model that uses training data very inefficiently. In spite of that, a dumb brute force architecture still managed to vastly exceed everyone's expectations and advance the SOTA by a huge leap.


In go, or similarly chess, the AI can play a stupendous number of games against itself and get accurate feedback for every single game. Everything is there to create your own training set just from knowing the rules. But outside of such games, how does an AI create its own training data when there is no function to tell you how well you are doing? This might be a dumb question, I don't have any idea how LLMs work.


One such function is “what happens next?” which may work as well in the real world as on textual training data. Certainly it’s part of how human babies learn, via schemas.


Creating something is much harder than verifying it.

A simple setup for improving coding skills is the following:

1. GPT is given a coding task to implement as a high level prompt.

2. It generates unit tests to verify that the implementation is correct.

3. It generates code to implement the algorithm.

4. It runs the generated code against the generated unit tests. If there are errors generated by the interpreter/compiler, go back to Step 3, modify the code appropriately and try again.

5. If there are no errors found, take the generated code as a positive example and update the model weights with reinforcement learning.
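
The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: `fake_model` is a hypothetical stand-in that replays canned answers instead of calling any real model, the tests are hand-written rather than generated, and step 5 is reduced to collecting passing code in a list instead of updating weights. It also uses `exec` directly, which would need sandboxing for real model output.

```python
from typing import Callable, Optional


def run_candidate(code: str, tests: str) -> Optional[str]:
    """Run the code, then the tests. Return None on success, else the error text."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # step 3 output: the candidate implementation
        exec(tests, namespace)  # step 2 output: assertions against it
    except Exception as exc:    # syntax errors, crashes, failed assertions
        return f"{type(exc).__name__}: {exc}"
    return None


def refine(generate: Callable[[str, Optional[str]], str], task: str,
           tests: str, max_attempts: int = 5) -> Optional[str]:
    """Step 4: regenerate with the last error as feedback until the tests pass."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(task, feedback)
        feedback = run_candidate(code, tests)
        if feedback is None:
            return code
    return None


# Hypothetical stand-in for the model: a wrong answer first, then a right one.
CANNED = iter([
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
])


def fake_model(task: str, feedback: Optional[str]) -> str:
    return next(CANNED)


TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

positive_examples = []  # step 5 placeholder: examples that would feed training
result = refine(fake_model, "write add(a, b)", TESTS)
if result is not None:
    positive_examples.append(result)
print(len(positive_examples), "positive example(s) collected")
```

The sketch takes the tests as ground truth, so it does nothing to catch the case where the tests themselves are wrong.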


What if it’s wrong at step 2?


The most naive way you could do things could be to procedurally generate immense amounts of python code, then ask the model to predict whether the code will compile, whether it will crash, what its outputs will be given certain inputs, etc.
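
A toy version of that idea, to show what the labels could look like. The generator here only produces small arithmetic expressions and deliberately includes one invalid operator; that grammar and the label fields are my own assumptions for illustration, not anything a lab is known to use.

```python
import random


def random_expression(rng: random.Random, depth: int = 2) -> str:
    """Build a small arithmetic expression; '*/' is deliberately invalid syntax."""
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randint(0, 3))
    op = rng.choice(["+", "-", "*", "//", "*/"])
    left = random_expression(rng, depth - 1)
    right = random_expression(rng, depth - 1)
    return f"({left} {op} {right})"


def label(source: str) -> dict:
    """Ground-truth labels obtained by actually compiling and evaluating."""
    try:
        code = compile(source, "<generated>", "eval")
    except SyntaxError:
        return {"compiles": False, "crashes": None, "output": None}
    try:
        value = eval(code, {"__builtins__": {}})
    except Exception as exc:  # e.g. ZeroDivisionError from '// 0'
        return {"compiles": True, "crashes": type(exc).__name__, "output": None}
    return {"compiles": True, "crashes": None, "output": value}


rng = random.Random(0)
dataset = [(src, label(src)) for src in (random_expression(rng) for _ in range(5))]
for src, lab in dataset:
    print(src, "->", lab)
```

Each pair is a free training example: the model sees the source and is asked to predict the labels, which the interpreter has already verified.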


Code execution is also a good way to collect feedback signals.


Well, there sort of is a linear relationship between the amount of training data and the quality of the model [1]

[1]: https://arxiv.org/abs/2203.15556


That's why I'm advocating against being (almost) unconditionally "for" or "against" a certain technology.

It seems like some people struggle with the notion of something being good in some areas and bad in others.

Why not evaluate it on the merits of what is possible now and extrapolate into the near future? Any prediction beyond that is most likely futile anyway.

Unless you are planning to run a business or work in academia, there is most likely no need to overreact even if something groundbreaking arrives next month. Everything moves more slowly than we expect anyway.

In the end useful technologies will stay while the rest will disappear into the void sooner or later.


The internet has turned people into a mindless mob, also including myself.

Most people tend to take the same side their favourite celebrities do, and then argue for that to the extremes.


This is pretty pessimistic. I don't know what kind of expectations you have about LLMs. Less than 10 years have passed since the original Transformers paper and we're seeing tangible and useful software out of it. I've seen far worse vaporware.

PS: Anyways, regarding your argument, yes, transformer-based models right now are "shit" (by whatever measure you are using, though I still don't know what you are comparing them with. I suppose human level intelligence), but more training data is not the only way to make better models.


It's a good point for text input, but if you go multi-modal and somehow find a way to make good use of audio and video, there's practically unlimited data available.

Also considering that humans will probably still only publish the output that looks good, even that still provides a weak signal on quality.


I'm skeptical that multimodal input can help with programming or logic problems - or even most scientific problems.


Having diagrams (think free body diagrams in static mechanics, or a T-s diagram in thermodynamics) make a lot of non-trivial problems a lot simpler to communicate. And correctly understanding an unambiguous definition of a problem is a major step towards solving it.

If language was enough (or a similar idea, that multimodal input is not useful), college math professors wouldn't use so much chalk making drawings and diagrams to explain their ideas.


> Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available.

This assumes that GPT4 is trained on all currently-available training data.

Is that true?


Will you eat a hat if GPT5 and 6 are huge improvements?


This is such an extraordinary claim. That this technology arms race the like of which we’ve possibly never seen before is somehow going to end at GPT-4 because ‘we’ve run out of training data’. Do you think the entire research staff at OpenAI is just there to figure out how to scale up transformers?


Agreed... I'm just disturbed by how badly people want to believe


"Model designed to guess next tokens from a sequence, based on its training data, fails to guess next tokens from sequences not similar to its training data."

What's the news?


Now I'm curious what kind of code GPT-4 produced here. I thought an LLM would be at least able to detect the suitable algorithm to use based on text analysis?


So it only does great at everything we've ever done? We only need to demonstrate once a new approach, and it eats that too.


Are we ready to hand over the lollipop yet?


I find it interesting that we have a single word for creativity in the context of novel work.

These systems are incredible at generating novel art or poetry. But when it comes to problem solving, they struggle adapting to a novel problem.

With most humans, the ability to adapt to novelty seems to correlate with the ability to generate it.

I have two theories on this.

1. Perhaps these are two very different things. Novel art is closer to having a high performing random number function. Adapting to novel situations is closer to mathematical or strategic thinking.

2. Perhaps we are deeply fooling ourselves by building the greatest Chinese room machine imaginable - and in reality this machine does not build general understanding the same way we do. Douglas Hofstadter would be very disappointed with us if this is the case.

I still don't know which of these two seems more correct. Of course it might be both of them. I would like to see some more dedicated testing on the second point, as it does seem to be experimentally verifiable. For example, if this system truly has the ability to reason, it should be able to reason equally well in contexts where significant training data exists as well as contexts without any training data. We should be able to present the system with isomorphic problems across two different contexts and compare its performance to see if it has flat reasoning abilities.


In GPT-4 defense I do significantly worse(at least at first) on problems not in my training set.


I am skeptical this guy did a correct evaluation. I would like to see someone replicating this.


The fact that it performs at all is amazing. It's like the old dog playing a piano joke.


Usain Bolt performs significantly worse in a marathon than in a sprint. Yes.


Does this even matter? Sure GPT4 can’t solve every conceivable problem out there, but if it can solve the overwhelming majority of problems that humans have already seen then that is a huge improvement over what we had before.


> but if it can solve the overwhelming majority of problems that humans have already seen then that is a huge improvement over what we had before

“the overwhelming majority of problems humans have” are not going to solved by generating code. At this point these LLMs are feeling like blockchain a few years back in the sense people were trying to tell me it was going to solve every problem that’s existed since the dawn of human civilization.


I wouldn’t say it doesn’t matter. You can say it’s a weakness that can be improved.


Thanks for little bit of hopium. But we are still doomed.


I guess the meaning of our existence was always handing the universe over to the AI overlords lol


We are the Borg?


Maybe we really are.

Many sci-fi stories feature heroes encountering an advanced AI doing "something". We usually see ourselves as part of the hero group in these stories but maybe ultimately we will end up as the side note about the civilization that built the AI.


If you're doomed by a fancy markov chain chatbot, you were never gonna make it in the first place.


Please stop using this 'fancy markov chain' cope, it just makes you sound like you have absolutely no awareness of anything.


It may be slightly simplistic, but I don't think calling something that selects tokens to chain together based on a statistical model a "fancy markov chain" is too far off the mark.


> A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

vs

> An attention mechanism allows the modelling of dependencies without regard for the distance in either input or output sequences.

See the difference? In Markov Chains it is enough to know the previous state, while in transformers you need all previous states. It would be a great thing if we could reduce dependency of all previous states, like an RNN, maybe RWKV will do it.


The state for old school text markov chains is N words or similar, so you use the past N words to generate a new word, append it to the last N-1 words and now you've got your next state. That is exactly what these language models do: you feed them a limited number of words as a state, and the next state is that word appended to the last, with words in excess of the model's limit cut off.

The attention layer just looks at that bounded state. GPT-3 for example looks at a few thousand tokens, those are its state, it is bounded so it doesn't look at all previous tokens.
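
Here is roughly what that sliding-window state update looks like for an old school text Markov chain. It is a sketch with a made-up toy corpus; the only point of comparison with a language model is the bounded window, since a transformer computes the next-token distribution with learned weights rather than a lookup table.

```python
import random
from collections import defaultdict


def build_chain(words, order):
    """Map each state (a tuple of `order` words) to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, seed_state, length, rng):
    """Sample a word, append it, drop the oldest word: that is the next state."""
    state = tuple(seed_state)
    output = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # this state was never followed by anything in the corpus
        word = rng.choice(followers)
        output.append(word)
        state = state[1:] + (word,)  # bounded window, like a context limit
    return output


# Toy corpus, assumed purely for illustration.
corpus = "the cat sat on the mat and the cat ran to the mat".split()
chain = build_chain(corpus, order=2)
print(" ".join(generate(chain, ("the", "cat"), 8, random.Random(0))))
```

Raising `order` gives a higher-order chain; the state space grows exponentially with it, which is why nobody builds the table explicitly at context-window sizes.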


If you continue reading that Wikipedia article, you'll reach this point:

> A second-order Markov chain can be introduced by considering the current state and also the previous state, as indicated in the second table.

i.e., a higher-order Markov chain can depend on several of the previous states.

So, if a certain transformer model accepts up to 20k tokens as input, it can certainly be seen as a 20000'th order Markov chain process (whether it is useful to do so or not can be debated, but not the fact that it can be seen as such, since it complies with the definition of a Markov chain).


> makes you sound like you have absolutely no awareness of anything

almost like a fancy markov chain? :)


Sorry, it's not an AI, just an autocomplete with self-attention mechanism. ¯\_(ツ)_/¯


Evolution might play some role.


Cope brother


By climate change, nuclear war, habitat destruction, plastic waste, insect apocalypse, gray goo, gay frogs, terminators, late-stage capitalism, or chat bots? Helps to be specific.



