GPT-4 performs significantly worse on coding problems not in its training data

gateorade · on March 25, 2023

This has been my experience. I’m really impressed by how well GPT-4 seems to be able to interpolate between problems heavily represented in the training data to create what feels like novelty, eg. Creating a combination of pong and conway’s game of life, but it doesn’t seem to be good at extrapolation.

The type of work I do is highly niche. I’ve recently been working on a specific problem for which there are probably only a hundred at most implementations running on production systems, all of them highly proprietary. I would be surprised if there were any implementations in GPTs training set. With that said, this problem is not actually that complicated. A rudimentary implementation can be done in ~100 lines of code.

I asked GPT-4 to write me an implementation. It knew a decent amount about the problem (probably from Wikipedia). If it was actually capable of something close to reasoning it should have been able to write an implementation, but when it actually started writing code it was reluctant to write more than a skeleton. When I pushed it to implement specific details it completely fell apart and started hallucinating. When I gave it specific information about what it was doing wrong it acknowledged that it made a mistake and simply gave me a new equally wrong hallucination.

The experience calmed my existential fears about my job being taken by AI.

softfalcon · on March 25, 2023

This exact scenario is what I described to a friend of mine who is an AI researcher.

He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

My opinion was similar to yours. I felt like the hallucinating the AI does was insufficient in performing true extrapolating thought.

I said this because humans don’t truly have access to infinite knowledge, even when they do, they can’t process all of it. Adding endless information for the AI to feed on doesn’t seem like the solution to figuring out true intelligence. It’s just more of the same hallucinating.

Yet despite lacking knowledge, us humans still come up with consistently original thoughts and expressions of our intelligence daily. With limited information, our minds create new representations of understanding. This seems to be impossible for Chat GPT.

I could be completely wrong, but that discussion solidified for me that my role as a dev still has at least a couple more decades of shelf life left.

It’s nice to hear that others are reaching similar conclusions.

visarga · on March 25, 2023

Current LLMs decode in a greedy manner, token by token. In some cases this is good enough - namely for continuous tasks, but in other cases the end result means the model has to backtrack and try another approach, or edit the response. This doesn't work well with the way we are using LLMs now, but could be fixed. Then you'd get a model that can do discontinuous tasks as well.

>> Write a response that includes the number of words in your response.

> This response contains exactly sixteen words, including the number of words in the sentence itself.

It contains 15 words.

The model would have to plan everything before outputting the first token if it were to solve the task correctly. Works if you follow up with "Explicitly count the words", let it reply, then "Rewrite the answer".

Dzugaru · on March 25, 2023

> but could be fixed

How? The problem is known for a while, for example this article [0] mentions it (as Chain of Thought reasoning). You could think that just having a scratchpad of tokens is enough - you can arguably plan, backtrack and rewrite there [1], right? But this doesn't really work, at least yet - maybe because it wasn't trained for that - and maybe ChatGPT massive logs (probably available only for OpenAI) can help. But the Microsoft report [2] suggests we need a different architerture and/or algorithms? They mention lack of planning and retrospective thinking as a huge problem for GPT-4. Maybe you know some articles on the ideas how to fix this? Backtracking, trying again seems to be linked to human thought - and very well can give us AGI.

[0] https://arxiv.org/abs/2201.11903

[1] https://www.reddit.com/r/ChatGPT/comments/120fi8e/chatgpt_4_...

[2] https://arxiv.org/abs/2303.12712

bertday · on March 25, 2023

You may be shocked to hear this but dijkstra’s short path algorithm is the technical answer to this question. We just don’t use it because it’s expensive.

visarga · on March 25, 2023

Language chains or tool use where it can also call on itself to solve subproblems. If you don't have to do just one round of LLM interaction you can do complex stuff.

yorwba · on March 25, 2023

Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed. So I expect autoregressive generation to remain the default until the recomputation effort can be significantly reduced or hardware advances make the cost bearable.

YeGoblynQueenne · on March 25, 2023

>> Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed.

I can't imagine how training on masked tokens can "easily" solve backtracking, even in theory. Do you have some literature I could read on this?

kastnerkyle · on March 25, 2023

Discrete diffusion with rewriting can work well. It feels loosely similar to backtracking, if you assume n_steps large enough - need to be able to rewrite any non-provided position though I think (not all setups do this). Downside is the noise in discrete diffusion (in simplest case randomizing over all vocabulary space) is pretty harsh and makes things very difficult practically. Don't have an exact reference on the relationship, but feels similar to backtracking type mechanics in my experience. I found things tend to "lock in" quickly once a good path is found, which feels a lot like pathfinding to me.

Some early personal experiments with adding "prefix-style" context by a cross-attention (in the vein of PerceiverAR) seemed like it really helped things along, which would kind of point to search-like behavior as well.

Probably the closest theory I can think of is orderless NADE, which builds on the "all orders" training of https://arxiv.org/abs/1310.1757 , which in my opinion closely relates to BERT and all kinds of other masked language work. There's a lot of other NAR language work I'm skipping here that may be more relevant...

On discrete diffusion:

Continuous diffusion for categorical data shows some promise "walking the boundary" between discrete and continuous diffusion https://arxiv.org/abs/2211.15089 , personally like this direction a lot.

If you have a pre-made embedding space, SSD-LM is a straightforward method https://arxiv.org/abs/2210.17432

SUNDAE worked well for translation https://arxiv.org/abs/2112.06749 and many other tasks.

My own contribution, SUNMASK, worked reasonably well for symbolic music/small datasets (https://openreview.net/forum?id=GIZlheqznkT), but really struggled with anything text or moderately large vocabulary, maybe due to training/compute/arch issues. Personally think large vocabulary discrete diffusion (thinking of the huge vocabs in modern universal LM work) will continue to be a challenge.

Decoding strategies:

As a general aside, I still don't understand how many of the large generative tools aren't exposing more decoding strategies, or hooks to implement them. Beam search with stochastic/diverse group objectives, per-step temperature/top-k/top-p, hooks for things like COLD decoding https://arxiv.org/abs/2202.11705, minimum Bayes risk https://medium.com/mlearning-ai/mbr-decoding-get-better-resu..., check/correct systems during decode based on simple domain rules and previous outputs, etc.

These kinds of decoding tools have always been a huge boost to model performance for me, and having access to add in these hooks to "big API models" would be really nice... though I guess you would need to limit/lock compute use since a full backtracking search would pretty swiftly crash most systems. Maybe the new "plugins" access from OpenAI will allow some of this.

bertday · on March 25, 2023

Backtracking is easily solved with a shortest path algorithm. I don’t see any need for masking if you are simply maximizing likelihood of the entire sequence.

SanderNL · on March 30, 2023

I don't think humans can do this either. What's the problem with producing a result and then fixing it? It's exactly how we do it.

andsoitis · on March 25, 2023

> This exact scenario is what I described to a friend of mine who is an AI researcher. He was convinced that if we trained the AI on enough data, GPT-x would become sentient. My opinion was similar to yours. I felt like the hallucinating the AI does was insufficient in performing true extrapolating thought.

It turns out it isn’t just AIs that hallucinate; AI researchers do as well.

physPop · on March 25, 2023

"researcher".

Majromax · on March 25, 2023

> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Is there enough data?

As I understand it, the latest large language models are trained on almost every piece of available text. GPT-4 is multimodal in part because there isn't an easy way to increase its dataset with more text. In the meantime, text is already quite information dense.

I'm not sure that future models will be able to train on an order of magnitude more information, even if the size of their training sets has a few more zeroes added to the end.

call-me-al · on March 26, 2023

what about all the content not yet in text form (e.g. YouTube videos)?

psychphysic · on March 25, 2023

The threshold for sentience is continually falling.

So he might be right but due to time and not due to improved performance.

I believe in the UK all vertibrates are considered sentient (by law not science). That includes goldfish.

And good luck even getting a goldfish to reverse a linked list. Even after 1000 implementations are provided.

dbsmith83 · on March 25, 2023

Goldfish are likely more intelligent than you give them credit for

https://petkeen.com/can-goldfish-be-trained/

https://www.npr.org/2022/01/11/1072095219/goldfish-driving-c...

delusional · on March 25, 2023

I don't think that when people commonly discuss sentience they mean to include goldfish. I don't think the legal definition (which probably exists due to external legal implications) has any bearing on the intellectual debate of AI sentience.

InfamousRece · on March 25, 2023

Sentience is just the capacity to experience feelings and sensations. Goldfish can do it and AI can’t (so far).

narwally · on March 25, 2023

If I were talking about sentience I would definitely be including goldfish. What about them is so different to us that we would have sentience while they would not?

wslh · on March 25, 2023

> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Not saying your friend is right or wrong, but imagine if civilization gives more information, in realtime, to an AI system through sensors: will be at least sentient as the civilization? Seems like a scifi story, a competitor to G-d.

antonvs · on March 25, 2023

Isaac Asimov wrote a story along those lines, “The Last Question”, which he described as “by far my favorite story of all those I have written.” Full text here:

https://xpressenglish.com/our-stories/the-last-question/

danaris · on March 25, 2023

Some versions of divinity (both from real-world beliefs and sci-fi/fantasy) have it being essentially a gestalt of either all the souls that have ever died, or all those alive now—a kind of "oversoul" or collective consciousness.

While that's an interesting thought experiment, I don't think it can meaningfully apply to any kind of AI we have the capability to make today, even if we could hook it up directly to all our knowledge. Information alone can't make something sentient; it requires a sufficiently complex and sophisticated information processing system, one that can reason about its knowledge and itself.

kaba0 · on March 25, 2023

I’m not at all an expert on the topic, but from what I gathered LLMs are fundamentally limited in the kind of problems they can approximate. They can approximate any integrable function quite well, but we can only come up with limits on a case-by-case basis for non-integrable ones, and I believe most interesting problems are of this latter kind.

Correct me if I’m wrong, but doesn’t it mean that they can’t recursively “think”, on a fundamental basis? And sure I know that you can pass “show your thinking” to GPT, but that’s not general recursion, just “hard-coded to N iterations” basically, isn’t it? And thus no matter how much hardware we throw at it, it won’t be able to surpass this fundamental limit (and without proof, I firmly believe that for a GAI we do need the ability to basically follow through a train of thought)

sebzim4500 · on March 25, 2023

How is it "hard-coded to N iterations"? We don't instruct the model how many lines of working it should show.

Obviously there is a limit to how much it can fit in the context, but that seems to be rising fast (went from 4k to 32k in not that long)

kaba0 · on March 25, 2023

It fundamentally can’t recurse into a thought process. Let’s say I give you a symbol table where each symbol means something and ask you to “evaluate” this list of symbols. You can do that just fine, but even in theory not even GPT-10384 will be able to do that without changing the whole underlying model itself.

sebzim4500 · on March 25, 2023

I don't understand the task. What does evaluating the list of symbols mean?

Do you mean you define a programming language/bytecode and then feed it into the model?

He's an example where GPT-4 did this perfectly for a very sinple language. This was my first attempt, I did not have to do any trial an error.

https://pastebin.com/4YA5wpie

kaba0 · on March 25, 2023

Could you try writing even in this simple language a longer program? Just simply increase the input to 20x or something around that. I’m interested in whether it will break and if it does, at what length.

sebzim4500 · on March 25, 2023

Interesting, it screwed up at step 160. I think it probably ran out of context, if I explicitly told it to output each step in a more compact way it might do better. Or if I had access to the 32k context length it would probably get 4x further.

Actually it might be worth trying to get it to output the original instructions again every 100 steps, so that the instructions are always available in the context. The ChatGPT UI still wouldn't let you output that much at once but the API would.

aiphex · on March 25, 2023

If they aren't already, AIs will be posting content on social media apps. These apps measure the amount of attention you pay to each thing presented to you. If it's more than a picture or a video, but something interactive, then it could also learn how we interact with things in more complex ways. It also gets feedback from us through the comments section. Like biological mutations, AIs will learn which of its (at first) random novel creations we find utility in. It will then better learn what drives us and will learn to create and extrapolate at a much faster pace than us.

danaris · on March 25, 2023

> If they aren't already, AIs will be posting content on social media apps.

No, people will be posting content on social media apps that they asked LLMs to write.

It may be done through a script, or API calls, but it's 100% at the instigation, direct or indirect, of a human.

LLMs have no ability to decide independently to post to social media, even if you do write code to give them the technical capability to make such posts.

galleywest200 · on March 25, 2023

With the new ChatGPT Plugins, it seems they may actually be able to make POST requests to social media APIs soon. It is likely that an LLM could have "I should post a tweet about this" in its training data.

Granted... currently it is likely humans that have written the code that the new Plugins are allowed to call -- but they have given ChatGPT the ability to execute rudimentary Python scrips and even ffmpeg so I think it is only a matter of time before one outputs a Tweet written by its own code.

danaris · on March 25, 2023

> It is likely that an LLM could have "I should post a tweet about this" in its training data.

That only matters if a human has explicitly hooked it up so that when ChatGPT encounters that set of tokens, it executes the "post to Twitter" scripts.

ChatGPT doesn't comprehend the text it's producing, so without humans making specific links between particular bundles of text and the relevant plugin scripts, it will never "decide" to use them.

aiphex · on March 26, 2023

At a high level, all that would have to happen is a person gives GPT, or something like it, access to a social media page and tells it to post to it with the objective of getting the highest level of interaction and followers.

danaris · on March 26, 2023

...which in no way grants GPT sapience, nor would it prove that it has it.

The human is still providing the capability to post, the timing script to trigger posting, and the specific heuristic to be used in determining how to choose what to post.

dmichulke · on March 25, 2023

> Yet despite lacking knowledge, us humans still come up with consistently original thoughts and expressions of our intelligence daily.

I think there is some sampling bias in your observation ;-)

oliveiracwb · on March 25, 2023

More data will only mean more inference. But at some unexpected moment, the newly created "senseBERT" breaks the barrier between intelligence and consciousness.

antonvs · on March 25, 2023

> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

It sounds like he doesn't even understand the basics of what GPT is, or what sentience is. GPT is an impressive manipulator/predictor of language, but we have evidence from all sorts of directions that there's more to sentience or consciousness than that.

braindead_in · on March 25, 2023

I would like to propose a thought experiment concerning the realm of knowledge acquisition. Given that the scope of human imagination is inherently limited, it is inevitable that certain information will remain beyond our grasp; these are the so-called "known unknowns." In the event that an individual generates a piece of knowledge from this inaccessible domain, how might it manifest in our perception? It is likely that such knowledge would appear incomprehensible to us. Consequently, it is worth considering the possibility that the GPT model is not, in fact, experiencing hallucinations; rather, our human understanding is simply insufficient to fully grasp its output.

meindnoch · on March 25, 2023

Yeah. Maybe when a baby says "gabadigoibygee", he is using an extremely efficient language that is too sophisticated for our adult brains to comprehend.

Yeah, maybe.

andsoitis · on March 25, 2023

> In the event that an individual generates a piece of knowledge from this inaccessible domain, how might it manifest in our perception? It is likely that such knowledge would appear incomprehensible to us.

If what a person says cannot be comprehended by any other person, we usually have a special term for it.

pharmakom · on March 25, 2023

But the hallucinated code doesn’t work.

ChatGTP · on March 25, 2023

This is ridiculously “meta”, but I’ve said the same thing, at some point GPT-x will be useless as it will be beyond our comprehension, that’s if it’s actually “smart”.

My honest opinion is the hallucinations are just gibberish, but are they useful gibberish? Maybe we’re saying the same thing ?

andsoitis · on March 25, 2023

> GPT-x will be useless as it will be beyond our comprehension, that’s if it’s actually “smart”.

Things don’t have to be comprehensible before they’re useful. But they have to work to be useful.

goatlover · on March 25, 2023

Not hard to check whether code compiles or runs.

vasco · on March 25, 2023

> The experience calmed my existential fears about my job being taken by AI.

The issue is that among all the 100k+ software engineers, many don't really do anything novel. How many startups are employing dozens of engineers to create online accessible CRUDs to replace a spreadsheet?

In the company I work for I'd say we have about 15 developers or about 3 teams doing interesting work, and everyone else builds integrations, CRUDs, moves a button there and back in "an experiment", ads a new upsell, etc. All these last parts could be done by a PM or good UX person alone, given good enough tools.

The other parts I'm not worried about either.

yohannesk · on March 25, 2023

For the type of engineers you describe the hard part I think is communication with other devs, communication with product owners, understanding the problem, suggesting different ways of solving the problem, figuring out which department personnel (outside other devs) to talk to about a little detail that you don't have... it's not writing the code which is hard, atleast from my experience

noduerme · on March 25, 2023

Yes. I won't be worried until the day Joe CEO can write a prompt like "build me an app that lets me know where my employees are at all times," and GPT responds with a list of questions about how Joe imagines this being physically implemented, and then calls up the legal department to clear its methods.

zeroonetwothree · on March 25, 2023

I think this is closer than you expect

oblio · on March 25, 2023

The question is... writing the code is a very small part of the job.

Figuring out what code to write is one of the big parts.

Fixing it when it breaks in many creative ways is the other big part.

How good is ChatGPT at fixing bugs? Security bugs or otherwise?

vasco · on March 25, 2023

Sure but the other parts you don't need an engineering degree for, the other parts amount to design / product work, not engineering.

oblio · on March 25, 2023

1. You don't need an engineering degree for software development in many, many cases. So I don't understand your argument.

2. Engineers design stuff :-) I'm not sure what you mean with "product work". Also, engineers debug and fix stuff :-)

zeroonetwothree · on March 25, 2023

Product work = figuring out what to build

oblio · on March 25, 2023

Product work is a fractal and you don't want "product people" designing things past the 2nd or 3rd fractal step, in my experience.

sterlind · on March 25, 2023

I had a similar experience. I wanted it to write code to draw arcs on a world map, with different bends rather than going on a straight bearing. I did all the tricks, told it to explain its chain of thought, gave it a list of APIs to use (with d3-geo), simplified and simplified and spent a couple hours trying to reframe it.

It just spit out garbage. Because (afaict) there aren't really examples of that specific thing on the Internet. And it's just been weirdly bad at all the cartography-related programming problems I've thrown at it, in general.

And yeah, I'm much less worried about it replacing me now. It's just not.. lucid, yet.

laurels-marts · on March 25, 2023

GPT-4 is reasonably good at D3 and drawing arcs on a projection (e.g. orthographic) is not that unique, you’ll find examples of it on observable. However I wonder if you broke down the problem into a small enough task. It performs best if you provide a clear but brief problem description with a code snippet that already kind of does what you want (e.g. using straight lines) and then just ask it to modify your code to calculate arcs instead. The combination of clear description + code I found decreases the likelihood of it getting confused about what you’re asking and hallucinate. If you give it a very long-winded request with no code as basis for it then good luck.

sterlind · on March 25, 2023

I did try the code snippet technique, but unfortunately it got it wrong. For example, I gave it code that drew arcs but didn't follow the shortest great-circle distance, and it gave me several plausible-looking approaches that were completely wrong (e.g. telling ctx.arc to draw counterclockwise, which does the wrong thing because it needs to use projections instead.)

I eventually just asked it to compute coordinates to a point c perpendicular to the midpoint on the great arc between a and b, such that the angle between ab and ac is alpha. I tried for hours, asking it to work out equations and name the mathematical identities it used etc. but it was all gibberish.

codechicago277 · on March 25, 2023

So the closer you come to writing the code for it the better it does

noduerme · on March 25, 2023

I imagine that creative approaches to spacial problem solving would be one of the harder areas for it - not just because there are by definition fewer public examples of one-off or original solutions, but also because one has to visualize things in space before figuring out how to code it. These bots don't have a concept of space. I'm thinking of DALL-E (et. al) having problems with "an X above Y, behind Z".

v4dok · on March 25, 2023

GPT4 has its hands tied behind its back. It does not have active learning and it does not have a robust system of memory or a reward/punishment mechanism. We only now start seeing work on this side [1]

It might not know more than you about your niche. I don't. I would search and I would try to reason, but if I was forced to give a token by token output that is answering the question as truthfully as possible, I might have started saying bullshit as well.

I don't think that the fact that gpt doesn't know things or does some things wrong is sufficient to save dev work from automation.

[1]: https://github.com/noahshinn024/reflexion-human-eval

blablabla123 · on March 25, 2023

> The experience calmed my existential fears about my job being taken by AI.

Same for me. I didn't try GPT-4 yet, and not on code from work anyway but GPT-3 seems borderline useless at this point. The hallucinations are quite significant. Also I tried to produce advice for Agile development with references and as stated in other articles the links where either 404s or even completely unrelated articles.

Still I'm taking this seriously. Just considering the leaps that happened with AlphaGo/AlphaZero or autonomous driving, that was considered unthinkable in the respective domains before.

zeroonetwothree · on March 25, 2023

Even if AI only takes over “easy” programming jobs, it might still create a huge downward pressure on compensation.

After all, just look at manufacturing. Compared to 1970 we produce 5x the real output but employ only 50% the people. The same will likely happen to fields like programming as AI improves.

olivermuty · on March 25, 2023

For the crap devs maybe, but high skill devs and arcitechts will be able to charge more than ever to oversee all of this «productivity» from the AIs.

nimbix · on March 25, 2023

I asked it to write a trivial c#/dotnet example of two actors where one sends a ping message and the other responds with pong. It couldn't get the setup stage right, called several methods that don't exist, and and had a cyclic dependency between actors that would probably take some work to resolve.

Event after several iterations of giving it error messages and writing explanations of what's not working, it didn't even get past the first issue. Sometimes it would agree that it needs to fix something, but would then print back code with exactly the same problem.

toss1 · on March 25, 2023

Yes, exactly this.

I wrote some questions in the specialist legal field of someone in my household, then started to get into more specialist questions, and then specifically asked about a paper that she wrote innovating a new technique in the field.

The general question answers were very impressive to the attny. The specialist questions started turning up errors and getting concepts backwards - bad answers.

When I got to summarizing the paper with the new technique, it could not have been more wrong. It got the entire concept backwards and wrong, barfing generic and wrong phrases, and completely ignored the long list of citations.

Worse yet, to the point of hilariously bad, when asked for the author, date, and employer of the paper, it was entirely hallucinating. Literally, the line under the title was the date, and after that was "Author: [name], [employer]". It just randomly put up dates and names (or combinations of real names) of mostly real authors and law firms in the region. Even when pointed out the errors, it would apologize, and then confidently spout a new error. Eventually it got the date correct, and that stuck, but even when prompted with "Look at where it says 'Author: [fname]" and tell me the full name and employer, it would hallucinate a last name and employer. Always with the complete confidence of a drunken bullshit artist.

Similar for my field of expertise.

So, yes, for anything real, we really need to keep it in the middle-of-the-road zone of maximum training. Otherwise, it will provide BS (of course if it is BS we want, it'll produce it on an industrial scale!).

willbudd · on March 25, 2023

Yeah, in that sense I think one of the next logical steps will be providing on-demand lightweight learning/finetuning of LLM versions/forks (maybe as LoRAs?) as an API and integrated UX based on user chat feedback, while abstracting away all the technical hyperparameter and deployment details involved in a DIY setup. With a lucrative price tag of course.

funstuff007 · on March 25, 2023

> but it doesn’t seem to be good at extrapolation.

This is true to varying degrees for every statistical model ever.

gateorade · on March 25, 2023

Yeah that’s basically my point. The hype on HN/Twitter/etc. forget this.

lostmsu · on March 25, 2023

What would you be able to write with similar requests, if you'd only ever be allowed to use Notepad, and never run compiler/linter/tests, and not allowed to use Internet?

SketchySeaBeast · on March 25, 2023

Given I don't have petabytes of information accessible for instant retrieval (including perfect copies of my language of choice's entire API) I don't think that's comparable. I wouldn't need the entire if I'd memorized a large portion of it.

lostmsu · on March 25, 2023

GPTs don't have access to petabytes of information, that's the point. Only to some internal representation.

mannykannot · on March 25, 2023

Unlike current LLMs, your typical competent programmer would not hallucinate.

m3kw9 · on March 25, 2023

Quants jobs are safe because if it’s public there’s no edge

dorkwood · on March 25, 2023

I feel vindicated reading this. Yesterday in a separate thread I claimed that it was wrong on 80% of the coding problems I gave it, and received the response from multiple readers that I was probably phrasing my questions poorly.

I started to believe them, too. Unfortunately, my brain is structured in such a way that a unanimous verdict from a few strangers is enough to make me think I’m probably the one who’s wrong. I need to make note of these events as a way to remind myself that this isn’t always the case.

mrbombastic · on March 25, 2023

I think part of the issue is that your mileage will vary greatly depending on what your problem domain and language of choice is. People working with languages and problems chatgpt works well in have a hard time believing the hard fails in other domains and vice versa. I wrote a python script the other day to delete some old xcode devices lower than a certain ios version complete witth options with a just a few back and forths with chatGPT. My knowledge of python is extremely basic and the code just worked out of the box. Then yesterday I asked for the code to tell if a device is lidar enabled in Objective-C and it failed to give me compilable code 4 times in a row until I finally gave up and went back to the docs. The correct answer is one line. I for one am pretty excited about this, things that a lot of people have done before should be easy, leaves more brain space for the tough stuff.

dpkirchner · on March 25, 2023

What was the answer, if you don't mind? ChatGPT (3.5) days to use isCollaborationEnabled on the ARWorldTrackingConfiguration class which doesn't seem quite right based on the docs.

I wonder if this is a GI/GO problem. Apple's poor documentation of features being the garbage in.

mrbombastic · on March 25, 2023

Sure, this was what I ended up using:

https://developer.apple.com/documentation/arkit/arworldtrack...

I believe there are multiple ways, part of the problem might be Apple doing the Apple thing where their user philosophy bleeds into their tech. They don't want you checking for a specific sensor or device capability they want you to check if whatever feature you want is enabled.

anonytrary · on March 25, 2023

Vindicated and excited. Gradient descent is likely not enough. I love it when we get closer to something but are still missing the answer. I would be very happy if "add more parameters and compute" isn't enough to get us to AGI. It means you need talent to get there, and money alone will not suffice. Bad news for OpenAI and other big firms, good news for science and the curious.

I imagine physicists got very excited with things like the ultraviolet catastrophe, and the irreconcilable nature of quantum mechanics and general relativity. It's these mysteries that keep the world exciting.

BoorishBears · on March 25, 2023

There's something ironic about implying that us not having a path to AGI is good news for the curious. If you're supervisiously curious then sure, we need to unlock another piece of a puzzle, more puzzle pieces means more puzzle solving.

But if you're able to actually take a step back, AGI would be the the ultimate source of new puzzles for the curious. We don't even all agree on how to define the "GI", approaching AGI wouldn't be unlike meeting extraterrestrial life sitting on a computer.

MayeulC · on March 25, 2023

I think you misunderstood the parent, who was probably saying that the process for achieving AGI would be more interesting if it isn't just "more compute/training".

BoorishBears · on March 25, 2023

I think you misunderstood me since that's exactly my point.

From the trees, it's great for the curious that it might take more than compute and training.

From the forest, it'd be infinitely preferable if AGI were just a matter of more money. There are mysteries we can't even envision yet that would more than make up for any "lost curiosity"

MayeulC · on March 30, 2023

Al right, all right, but to me the implication was that only those who have ludicrous amounts of money would be able to play with it. And I don't think that's the most desirable outcome, is it?

IIAOPSW · on March 25, 2023

Woah. I've never seen someone so self aware of the asch conformity test without literally talking about the asch conformity test.

https://en.wikipedia.org/wiki/Asch_conformity_experiments

sjducb · on March 25, 2023

Easiest fix for that is show the prompt you gave and the output you want. Force the people who tell you it's easy to actually do it. You'll get one of 3 outcomes.

-They try then fail (you were right) -They try then succeed (you learn) -They keep telling you it's easy but don't demonstrate. (You know that this person is full of it and you can ignore them in future)

hughesjj · on March 25, 2023

Yup, always drill down and get/provide more context for when something doesn't align/seems fishy.

I've found WAYYYY too many of my issues were really just communications issues. Shits hard to navigate socially.

That said, mind sharing the specific tasks/prompts gpt-4 failed you?

nopinsight · on March 25, 2023

“Reflection-Based GPT-4 Agent is State-of-the-Art on Code Gen

Iteratively refines code, shifting “accuracy bottleneck” from correct code gen to correct test gen

HumanEval accuracy:

-Reflexion-based GPT-4 88%

-GPT-4 67.0%

-CodeT 65.8%

-PaLM 26.2%”

with link to code in the Tweet:

https://mobile.twitter.com/johnjnay/status/16393620718075494...

21% improvement after adding a feedback loop and self-reflection to GPT-4, which just went public 12 days ago. (The approach is based on a preprint published 4 days ago.)

Human coders often need a feedback loop and self-reflection to properly “generate” code for problems novel to them as well.

-----

A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?

nopinsight · on March 25, 2023

Someone asked GPT-4 to build a complete app from scratch. It's now on the store. He seems to apply good prompt engineering techniques.

Screenshots, video demo, and process here: https://mobile.twitter.com/mortenjust/status/163927657157489...

It seems plausible to me now that a junior developer position would be hard to find in 2-3 years (I thought it would be ~5 years).

mrbombastic · on March 25, 2023

So this is very impressive and looks like a solid lowering of barrier to entry which is great….but that app is around 300 lines of code in one file that fetches data on 5 movies, screenshots are not cropped correctly and the pager doesn’t swipe back to the first dot. I am bit surprised it made it through the review process. Not hating I think it is great to make this stuff more approachable but not convinced junior devs are in danger yet.

nopinsight · on March 25, 2023

GPT-3 was released less than 3 years ago and it was far from capable of this.

mjburgess · on March 25, 2023

Exponential curves are theoretical constructs: all actual phenomena are S-shaped.

The question is only when does the "exponential regime" turn into the flat; and the answer is often fairly obvious if you don't begin from the "time = magic" premise.

There's an entire industry of public (pseudo)-intellectual who's schtick is to draw logistic phenomena with exponential curves and then cry, "the sky is falling!".

automatic6131 · on March 25, 2023

>Exponential curves are theoretical constructs: all actual phenomena are S-shaped.

This. I've been thinking much the same recently, although not expressed so succinctly. Nothing in nature is ever exponential (for very long).

nopinsight · on March 25, 2023

On the contrary, few experts expected this performance from an AI this soon too.

If you can identify one or two aspects of the human “general” intelligence that an AI cannot ever possess, even in principle, I think a lot of people would be grateful.

mjburgess · on March 25, 2023

In animals, propositional knowledge is built from procedural knowledge; and it can't really be otherwise.

What AI does at the moment is approximate propositional knowlegde with statistical associations, rather than take the procedural route. But this fails because P(A|B) doesnt say whether A causes B, B causes A, A is B, A and B are causally unrelated, etc.

What is the procedural route? To perform actions with your body so as to disambiguate the cases. Animals have causal models of their bodies which are unambiguous and their actions are intentional and goal-directed and effectively "express hypotheses" about the nature of the world. In doing so, they can build actual knowledge of it.

There's at least some good reasons to suppose that "bodies which express hypotheses in their actions" require organic properties to do so: becuase you have to have adaption from bottom-up to top-down to really have "the mind" grow the body in the relevant ways.

In other words, every action an animal performs isnt clockwork: in acting, it's body and mind change. Every action is a top-down, bottom-up whole change to the animal.

nopinsight · on March 25, 2023

This is a very interesting hypothesis that could be quite true for living beings. What I disagree with is that having an animal-like body is necessary for the process of forming a world model. A simulation could be sufficient. And there is already work on that front. (Also, I would not characterize deep-learning-based AI as trying to form propositional knowledge. In fact, its great performance partly stems from not dealing with propositional knowledge directly.)

If a body is in fact necessary, PaLM-E could be paving a way toward it as well. https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal...

nopinsight · on March 25, 2023

You might be interested in this thread by a DeepMind’s research scientist. https://mobile.twitter.com/AndrewLampinen/status/16396602197...

mrbombastic · on March 25, 2023

Sure and it is impressive but the upper bounds of AI products are hard to predict. That said things are changing fast I don’t know what tomorrow will bring.

isaacfrond · on March 25, 2023

the paper: https://arxiv.org/pdf/2303.11366.pdf

it reminds me of the thinking-fast vs. thinking-slow dichotomy. Current llms are the thinking fast type. Funnily people’s complains about its errors are reminiscent of this. It answers just to quick and only with its instant response neural net. A thinking slow answer would be more akin to a chain of thought answer. Allowing the llm a more flexible platform than CoT promptin might well be the next step. Of course it would als multiply compute cost. So it might not be in your 20$ subscription

cornholio · on March 25, 2023

A narrower question: can we perhaps stop putting AGI and ChatGPT in the same paragraphs as if they are somehow relevant to each other? Intelligence has very little to do with a glorified Google search trained by statistical crunching of superhuman amounts of data; there is not even a chicken-sized trace of intelligence in ChatGPT that is not a reflection of the training set or of the embedded human-designed models it uses to mimicry problem solving and conversation.

nopinsight · on March 25, 2023

Several necessary ingredients of human intelligence are present in GPT-4: complex pattern matching, abstraction from concrete examples and apply the abstract patterns to new examples, pattern interpolation, basic reasoning.

This is evident by its ability to generalize from the training set to new problems within many domains.

It's still unable to generalize as well as a smart human beyond the distribution it was specifically trained on, which is evident by its poor performance on AMC, Leetcode medium and hard, and Codeforces problems. But most humans are not great at these kinds of problems either.

Benchmark and test results: https://openai.com/research/gpt-4

PartiallyTyped · on March 25, 2023

> A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?

We are running towards a brick wall and people are not paying attention. Setting up self-reflection loops today is actually fairly trivial and can be done programmatically, all the model needs is to produce a solution, invoke the evaluation and keep iterating.

riku_iki · on March 25, 2023

so how do we know chat gpt didn't have humaneval in training data?

nopinsight · on March 25, 2023

The point is adding a couple components can improve GPT-4 significantly, as shown above. The data it originally trained with is presumably held constant in the evaluation above.

riku_iki · on March 25, 2023

the point is if humaneval was in gpt training data, then this component improved memorization from mediocre to Ok-ish, and actual coding skills still not tested.

nopinsight · on March 25, 2023

From other people’s and my experiences, GPT-4 can do more than simply memorizing. It can at least interpolate and reason a little bit too.

A few other tests show that GPT-4 would achieve much better results than 67% for something it has sufficient training data on like GRE Verbal and AP Macroeconomics.

https://openai.com/research/gpt-4

Yes, it still can’t generalize properly outside its training distribution. However, when armed with feedback and self-reflection, it seems better at that too.

hughesjj · on March 25, 2023

yup. we're (we=the public) far from getting access to the full model. that said one of the commentors in the twitter thread brings up how openapi isnt being fully forthcoming about their methods.

AI Explained has a good summary of many of these topics

drbig · on March 25, 2023

Honest question: Why so many people attribute "thinking", "knowing, "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of the advanced token-based probabilistic sequence generators, also known as LLMs?

LLMs are inherently incapable of any that - as in mechanically incapable, in the same way a washing machine is incapable of being an airplane.

Now my understanding is that the actual systems we have access to now have other components, with the LLM being the _core_, but not _sole_ component.

Can anybody point me to any papers on those "auxiliary" systems?

I would find it very interesting to see if there are any LLMs with logic components (e.g. Prolog-like facts database and basic rules that would enforce factual/numerical correctness; "The number of humans on Mars is zero." etc.).

mjburgess · on March 25, 2023

Because they don't distinguish between properties of the output and properties of the system which generated it. Indeed, much of the last decade of computer-science-engineering has basically been just insisting that these are the same.

An LLM can generate output which is indistinguishable from a system which reasoned/knew/imagined/etc. -- therefore the "hopeium / sky is falling" manic preachers call its output "reasoned" etc.

Any actual scientist in this field isn't interested in whether measures of a system (its output) are indistinguishable, they're interested in the actual properties of the system.

You don't get to claim the sun goes around the earth just because the sky looks that way.

sjducb · on March 25, 2023

Do submarines swim? No, but they are faster underwater than all swimmers. Therefore they are the best swimmers despite being unable to swim....

LLMs are producing human level reasoning in many domains, therefore they are the best at reasoning despite being unable to reason...

This whole debate hangs on the definition of "reasoning"

sebzim4500 · on March 25, 2023

Scientists are extremely interested in measurable results of experiments. I think you are thinking of philosophers.

nopinsight · on March 25, 2023

Can an airplane fly? Can a submarine swim?

Yes, AI may be constructed quite differently from human intelligence. Can it accomplish the same purposes? For some purposes, the answer is a resounding yes as can be seen from its applications around the world by millions of people.

Can an animal ‘think’, ‘understand’, or ‘reason’? Maybe not as well as a homo sapiens. But it’s clear that a raven, a dolphin, or a chimp can do many things we assume require intelligence. (A chimp may even have a slightly larger working memory than a human, according to some research.)

Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?

f6v · on March 25, 2023

> Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?

You’re right, AI doesn’t need to be AGI to be useful. Most SEO content on the internet is probably even worse than ChatGPT can do. And LLM could hallucinate another Marvel movie since they’re so similar.

My problem is that people make ungrounded claims about these systems either already having sentience or being just few steps away from it. It’s a religion at this point.

bsaul · on March 25, 2023

some prompts results are only explainable if chatgpt has the ability to produce some kind of reasoning.

As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.

mjburgess · on March 25, 2023

The reasoning occurred when people wrote the text it was trained on in the first place; it's training data is full of the symptoms of imagination, reason, intelligence, etc.

Of course if you statistically sample from that in convincing ways it will convince you it has the properties of the systems (ie., people) which created its training data.

But on careful inspection, it seems obvious it doesnt.

Bugs bunny is funny because the writing staff were funny; bugs himself doesnt exist.

drbig · on March 25, 2023

> Bugs bunny is funny because the writing staff were funny; bugs himself doesnt exist.

Excellent analogy, and I appreciate analogies (perhaps even a bit too much). Will be using this one. Thank you!

SanderNL · on March 25, 2023

If you “sample” this enough to be reasoning in a general manner, what is exactly the problem here?

Magic “reasoning fairy dust” missing from the formula? I get the argument and I think I agree. See Dreyfus and things like “the world is the model”.

Thing is, the world could contain all intelligent patterns and we are just picking up on them. Composing them instead of creating them. This makes us automatons like AI, but who cares if the end result is the same?

mjburgess · on March 25, 2023

The distribution to sample from mostly doesn't exist.

Data is produced by intelligent agents, it isn't just "out there to be sampled from". That would mean all future questions already have their answers in some training data: they do not.

See for example this exact tweet: pre-2021 coding challenges are excellent, post-2021 are poor. Why? Because post-2021 didnt exist to sample from when the system was built.

bsaul · on March 25, 2023

At the minimum, chatgpt displays a remarkable ability to maintain a consistent speech throughout a long and complex conversation with a user, taking into account all the internal implicit references.

this to me is the proof it is able to correctly infer meaning, and is clearly a sign of intelligence. (something a drunk human has trouble doing, for example).

krainboltgreene · on March 25, 2023

"I have seen the output and it matches what I consider to be conversation"

Well yeah, it's been trained to produce output that would look like conversation.

bsaul · on March 25, 2023

it's not what i meant : you can have a full conversation and then at some point use "it" or "him" , and based on the rest of the sentence, it will understand what previous element of the conversation you were mentionning..

This requires at least "some" conceptualisation of the things you're talking about. It's not just statistics.

aent · on March 25, 2023

It does not require conceptualization, pretty sure the "understanding" of previous references comes from this: https://arxiv.org/abs/1706.03762

krainboltgreene · on March 25, 2023

This is exactly statistics.

krainboltgreene · on March 25, 2023

> As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.

If there's one field of expertise I trust programmers to not have a clue about it's how human intelligence works.

SanderNL · on March 25, 2023

What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly? This answer is worth billions so why not enlighten us.

You don’t know? But you feel you have “it”, this magical substance, this substrate of reason itself? Reasoning is build out of what, exactly?

Sorry to be that guy, but I fail to see more than word play.

YeGoblynQueenne · on March 25, 2023

The thing is that we've known how to do reasoning with computers since the 1960's at least. Here:

https://dl.acm.org/doi/10.1145/321250.321253

That's the paper introducing the Resolution principle, which is a sound and complete system of deductive inference, with a single inference rule simple enough that a computer can run it.

The paper is from 1965. AI research had reasoning down pat since the 1970's at least. Modern systems have made progress in modelling and prediction, but lost the ability to reason in the progress.

Yeah, we totally "scienced that shit" as you say in a comment below. And then there was an AI winter and we threw the science out because there wasn't funding for it. And now we got language models that can't do reasoning because all the funding comes from big tech corps that don't give a shit about sciencing anything but their bottom line.

mjburgess · on March 25, 2023

What makes you so sure diluting things doesn't make them stronger? I mean, you don't know any physics, chemistry or biology -- but it's just word play right?

I mean, there isnt anything called science we might used to study stuff. You can't actually study any intelligent things empirically: what would you study? Like animals, and people and things? That would be mad. No no, it's all just word play.

And you know it's wordplay because you've taken the time to study the philosophy of mind, cognitive science, empirical psychology, neuroscience, biology, zoology and anthropology.

And you've really come to a solid conclusion here: yes, of course, the latest trinket from silicon valley really is all we need to know about intelligence.

That's how the scientific method works, right?

Sillicon Valley releases a gimmik and we print that in Nature and all go home. It turns out what Kant was missing was some VC funding -- no need to write the critique of pure reason.

pixl97 · on March 25, 2023

>What makes you so sure diluting things doesn't make them stronger

Alcohol diluted with around 30% water makes it 'stronger' at killing bacteria...

I mean it's easy to say "just science that shit", and then forget we've been spending decades and billions of dollars doing just that.

SanderNL · on March 25, 2023

Let’s all see and be amazed by the absolutely breathtaking achievements of those fields in the domain of AI…

krainboltgreene · on March 25, 2023

> What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly?

It never fails: When faced with the reality of what the program is your average tech bro will immediately fall back to trying to play their hand at being a neuroscientist, psychologist, and philosopher all at once.

SanderNL · on March 25, 2023

You did not answer a single thing and maybe I am those things. You don’t know me.

krainboltgreene · on March 25, 2023

Are you any of those things?

YeGoblynQueenne · on March 25, 2023

>> Honest question: Why so many people attribute "thinking", "knowing, "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of the advanced token-based probabilistic sequence generators, also known as LLMs?

It's very confusing when you come up with some idiosyncratic expression like "advanced token-based probabilistic sequence generators" and then hold it up as if it is a commonly accepted term. The easiest thing for anyone to do is to ignore your comment as coming from someone who has no idea what a large language model is and is just making it up in their mind to find something to argue with.

Why not just talk about "LLMs"? Everybody knows what you're talking about then. Of course I can see that you have tied your "definition" of LLMs very tightly to your assumption that they can't do reasoning etc., so your question wouldn't be easy to ask unless you started from that assumption in the first place.

Which makes it a pointless question to ask, if you've answered it already.

The extravagant hype about LLMs needs to be criticised, but coming up with fanciful descriptions of their function and attacking those fanciful descriptions as if they were the real thing, is not going to be at all impactful.

Seriously, let's try to keep the noise down in this debate we're having on HN. Can't hear myself think around here anymore.

m3kw9 · on March 25, 2023

Just curious why you’d respond so much to him and also add nothing to the discussion?

YeGoblynQueenne · on March 25, 2023

Hang on, how is it fair to ask me why I "add nothing to the discussion" when all your comment does is ask me why I add nothing to the discussion? Is your comment adding something to the discussion?

I think it makes perfect sense to discuss how we discuss, and even try to steer the conversation to more productive directions. I bet that's part of why we have downvote buttons and flag controls. And I prefer to leave a comment than to downvote without explanation, although it gets hard when the conversation grows as large as this one.

Also, can I please separately bitch about how everyone around here assumes that everyone around here is a "he"? I don't see how you can make that guess from the user's name ("drbig"). And then the other user below seems to assume I'm a "him" also, despite my username (YeGoblynQueenne? I guess I could be a "queen" in the queer sense...). Way to go to turn this place into a monoculture, really.

sebzim4500 · on March 25, 2023

Not him but I am also extremely frustrated by the fact it is impossible to have a real discussion about this topic, especially on HN. Everyone just talks past each other and I get the feeling that a majority of the disagreement is basically about definitions, but since no one defines terms it is hard to tell.

Madmallard · on March 25, 2023

I don't think there's anything inherently different algorithmically or conceptually.

Our brain is just billions of neurons and trillions of connections, with millions of years of evolution making certain structural components of our network look a certain way. The scale makes it impossible to replicate.

sebzim4500 · on March 25, 2023

What do you mean 'impossible to replicate'. With current technology, or in general?

Madmallard · on March 25, 2023

Possibly both? Certainly the first of the two.

m3kw9 · on March 25, 2023

it kind of does “understand” when humans supervise it during training and they are able somehow relate and give mostly coherent responses. It may not be feeling it but it does seem to “understand” a subject more than a few people

whazor · on March 25, 2023

But we do not even know whether GPT-4 is 'just a LLM'. Given the latest addons and the fact it can do some mathematics, I think there is more under the hood. Maybe it can query some reasoning engine.

This is why I think it is so important for OpenAI be more open about the architecture, so we can understand the weaknesses.

hesdeadjim · on March 25, 2023

I threw a challenging rendering problem at it and I was pretty impressed with the overall structure and implementation. But as I looked deeper, the flaws became apparent. It simply made up APIs that didn’t exist, and when prompted to fix it, couldn’t figure it out.

Still, despite being fundamentally wrong it did send me down some different paths.

jasfi · on March 25, 2023

Using APIs that don't exist is the biggest problem I've seen with ChatGPT, and it seems GPT-4 as well.

exodontist · on March 25, 2023

I asked chatgpt about the api for an old programming game called chipwits.. it invented a whole programming language that it called chiptalk with an amalgam of the original chipwits stuff, missing some bits and adding others, and generated a parser for it, which I implemented and got to work, before figuring out how much was imaginary, after talking to the original chipwits devs. They found it pretty amusing.

carbocation · on March 25, 2023

> and got to work

Can you elaborate?

irjustin · on March 25, 2023

I'm fast learning Django and even though it's an extremely well documented space, ChatGPT has sent me down the wrong path more than a handful of times.

This is especially difficult because I don't know when it's wrong and it's so damn confident. I've gotten better at questioning its correctness when the code doesn't perform as expected but initially it cost me upwards of 30min per time.

Still, I would say between ChatGPT and Copilot - I'm WAY further ahead.

hughesjj · on March 25, 2023

chatgpt or gpt4?

public copilot uses gpt3.5, as does non premium chatgpt.

Nathanba · on March 25, 2023

my biggest problem with it is that it doesn't seem to understand its own knowledge. If you talk to it for a while and you go back and forth on a coding problem it will often suddenly start using wrong syntax that doesn't exist. Even though at this point it should already know and have looked up for sure that this syntax can't possibly exist because many times it responded correctly. So in human terms it has read the documentation and must know that this syntax can't possibly exist and yet it doesn't know that 10 sec later. That's currently what makes it seem like a not real intelligence to me.

vitorgrs · on March 25, 2023

One of the advantages of Bing, and do guess now ChatGPT with browsing plugin, is that it's able to search on the web for the right API.

saulpw · on March 25, 2023

To be fair, using APIs that I think should exist, is how I develop most of my APIs.

jasfi · on March 25, 2023

Except that I wasn't asking it to develop a new API.

brabel · on March 25, 2023

It's very likely it was using other languages' as "inspiration" given there's very little Zig code out there... so it's maybe natural it would use APIs that don't yet exist... perhaps informing it that it also needs to implement those APIs could work?

saulpw · on March 25, 2023

Then I guess you're not using it to its fullest potential ;)

bamboozled · on March 25, 2023

We can’t keep blaming the prompter.

xkgt · on March 25, 2023

A simple metric on confidence interval could do the trick. As the model grows larger, it is getting more difficult to understand what is going on, but that doesn't mean that it needs to be a total black box. At least let it throw some proxy metrics. In due course, will learn to interpret those metrics and adjust our internal trust model.

kozikow · on March 25, 2023

You can just ask it to give you confidence in the output on a scale 0 to 1

ALittleLight · on March 25, 2023

I wonder if a plugin to let it query API docs would solve this problem.

behnamoh · on March 25, 2023

Also it makes up Python libraries, macOS apps to do certain tasks, etc.

splatzone · on March 25, 2023

I’ve had very good results from running the code and pasting the errors back into ChatGPT and asking it what to do. Sometimes it corrects itself quite well

behnamoh · on March 25, 2023

Put that in a loop and see if AGI emerges.

whitehexagon · on March 25, 2023

>It simply made up APIs that didn’t exist

That has been my experience with Zig. It led me to the conclusion that there are just too many 'non indexed' developer tools in use these days, so there isnt much training data for new topics. But it was happy to hallucinate API's and their proof of existence.

rel2thr · on March 25, 2023

yea I find it to be wrong a lot when coding But its faster for me to fix existing code than to write code from scratch so its still better than nothing for me

osteele · on March 25, 2023

Same. It seems similar to Copilot in that regard, but better at text-to-code, porting between languages or frameworks, and generating test cases and readmes: https://notes.osteele.com/gpt-experiments/using-chatgpt-to-p...

RhysU · on March 25, 2023

Most of us are much worse on coding problems not in our training set!

(Looks down at dynamic programming problem involving stdin/stdout and combining two data structures).

whatshisface · on March 25, 2023

The reason we're being kept around still is that you can solve the problem without it ever appearing in your training set, and once you have, it has.

toomuchtodo · on March 25, 2023

Hints of The Nine Billion Names of God for sure.

https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God

dymk · on March 25, 2023

Oh wow - Unsong [1] must have taken some inspiration from that. Into the queue it goes!

[1] https://unsongbook.com/

LordDragonfang · on March 25, 2023

Allow me to offer you this twitter thread:

https://twitter.com/chaosprime/status/1607895175799373830

PrimeMcFly · on March 25, 2023

What is that exactly? The site doesn't say.

LordDragonfang · on March 25, 2023

A work of online serial fiction by blogger Scott Alexander, formerly of Slate Star Codex, now of Astral Codex Ten.

Goodreads has a decent intro blurb: https://www.goodreads.com/fr/book/show/28589297-unsong

If you're unsure whether his writing style is your thing, feel free to sample his shorter fiction from his blog:

https://slatestarcodex.com/2015/06/02/and-i-show-you-how-dee...

https://slatestarcodex.com/2015/04/21/universal-love-said-th...

https://astralcodexten.substack.com/p/idol-words

PrimeMcFly · on March 25, 2023

Thanks. The goodreads blurb makes me think it's something like Salmon Rushdie or a Kevin Smith (with less toilet humor) take on things.

passion__desire · on March 25, 2023

A short film adaptation released a year ago.

https://www.youtube.com/watch?v=UtvS9UXTsPI

yarg · on March 25, 2023

A better question is, given a significant corpus of complex functionality, can it implement complex code in a language that it knows, but in which it has only seen lower complexity code?

Can it transfer knowledge across languages with shared underlying domains?

mjburgess · on March 25, 2023

I think given that it's been trained on everything ever written, we should suppose the answer is no.

It has always been possible, in the last century, to build a NN-like system: it's a trivial optimization problem.

What was lacking was the 1 exabyte of human-produced training data necessary to bypass actual mechanisms of intelligence (procedural-knowledge generation via exploration of one's environment, etc.).

ren_engineer · on March 25, 2023

the implication here is that GPT is just brute force memorizing stuff and that it can't actually work from first principles to solve new problems that are just extensions/variations of concepts it should know from training data it has already seen

on the other hand, even if that's true GPT is still extremely useful because 90%+ of coding and other tasks are just grunge work that it can handle. GPT is fantastic for data processing, interacting with APIs, etc.

RhysU · on March 26, 2023

No, the implication is that most of us fake it until we make it. And The Peter Principle says we're all always faking something. My comment was just about humanity. ChatGPT isn't worth writing about.

tus666 · on March 25, 2023

We aren't state machines. We are capable of conscious reasoning which GPT or any computer is not.

We can understand our own limitations, know what to research and how, and how to follow a process to write new code to solve new problems we have never encountered.

Our training set trains our learning and problem solving abilities, not a random forest.

waf · on March 25, 2023

I've been adding C# code completion functionality to my REPL tool, and ended up reverting to the text-davinci model.

The codex (discontinued?) and text-davinci models gave much better results than GPT3.5-turbo, specifically for code completion scenarios. The latest models seem to produce invalid code, mostly having trouble at the boundaries where they start the completion.

My suspicion is that these latter models focus more on conversation semantics than code completion, and completing code "conversationally" vs completing code in a syntactically valid way has differences.

For example, if the last line of code to be completed is a comment, the model will happily continue to write code on the same line as the comment. Not an issue in a conversation model as there is a natural break in a conversation, but when integrating with tooling it's challenging.

Most likely the issue is that I'm not yet effective at prompt engineering, but I had no issues iterating on prompts for the earlier models. I'm loving the DaVinci model and it's working really well -- I just hope it's not discontinued too soon in favor of later models.

wakahiu · on March 25, 2023

I can corroborate that text-davinci gives much better results than for tasks involving summarization or extraction of key sentences among a large corpus. I wonder what empirical metrics OpenAI uses to determine performance benchmarks for practical tasks like these. You can see the model in action for analysis of reviews here: https://show.nnext.ai/

[Disclaimer - I work at nnext.ai]

pffft8888 · on March 25, 2023

I was just talking about this the other day:

> it's more hacking than crareful and well specified engineering, and that could lead down a path of instability in the product where some features get better while others get worse, without understanding exactly why.

https://news.ycombinator.com/threads?id=pffft8888&next=35269...

letitgo12345 · on March 25, 2023

Will take a bit of time before AI can consistently beat us on coding/proofs but the raw ingredients imo are there. As someone who was skeptical of AGI via just scaling things up even after GPT-3, what convinced me was the chain of thought prompting paper. That shows the LLM can pick up on abstract thought and reasoning patterns that humans use. Only a matter of time before it picks up on all of our reasoning patterns (or maybe it already has and is just waiting to be prompted properly...), is hooked up to a good memory system so it's not limited by the context window and then we can watch it go brrrr

It can still make stupid mistakes in reasoning but I don't think that's fundamentally unsolvable in the current paradigm

danielheath · on March 25, 2023

> That shows the LLM can pick up on abstract thought and reasoning patterns that humans use.

Does it? I’m still unconvinced it’s more than copying other examples of “show your work”.

letitgo12345 · on March 25, 2023

It's definitely not just copying verbatim. If you mean it's emulating the reasoning pattern it sees in the training data well...don't humans do that as well to get answers to novel problems?

sidlls · on March 25, 2023

We don't know all the different ways humans arrive at answers to novel problems.

And while these LLMs aren't literally just copying verbatim, they are literally just token selection machines with sophisticated statistical weighting algorithms biased heavily towards their training sets. That isn't to say they are overfitted, but the sheer scale/breadth gives the appearance of generalization without the substance of it.

Izkata · on March 25, 2023

Here's an argument that GPT does actually build an internal representation of the game Othello, it's not just token selection: https://thegradient.pub/othello/

kweingar · on March 25, 2023

Keep in mind that the Othello example is model specifically trained on only Othello games. I haven’t seen any claims that general purpose models like GPT-4 have internal representations of complex abstract structures like this.

pishpash · on March 25, 2023

Why wouldn't they? Text-moves of Othello games are presumably a subset of the training data for a general LLM. If anything the general LLM has the chance to derive more robust internal world representations given similarly laid out board games.

This is very reminiscent of position-encoding neurons: https://en.wikipedia.org/wiki/Grid_cell

It is also not surprising that if you force a system to succinctly encode input-output relationships, eventually it discovers the underlying generating process or its equivalent as implied by Kolmogorov complexity theory. Language is just a convenient encoding for inputs and outputs, not fundamental. So yes it is regurgitating statistics, but statistics are non-random because of some non-trivial underlying process, always, and if you can regurgitate those statistics consistently you're guaranteed to have learned a representation of the process. There is no difference and biological systems aren't any different.

kweingar · on March 25, 2023

This morning I asked GPT-4 to play duck chess with me. Duck chess is a very simple variant of chess with a duck piece (that acts like an impassable brick) that each player moves to an empty square after their normal move. [I gave GPT-4 a more thorough and formal explanation of the rules of course.]

To a human, board state in chess and in duck chess is very simple. It’s chess, but with one square that’s blocked off. Similarly, even a beginner human chess player can understand the basics of duck chess strategy (block your opponent’s development in the beginning, block your opponent’s attacks with the duck to free up your other pieces, etc.).

GPT-4 fell apart, often failing to make legal moves, and never once making a strategically coherent duck placement. To me this suggests that it does not have an internal representation of the 64 squares of the board at all. Even if you set aside the strategic aspect, the only requirement for a duck move to be legal is that you place it on an empty square, which it cannot consistently do, even at the very beginning of the game (it like to place the duck on d7 as black after 1. …e5, even when its own pawn is there).

pishpash · on March 25, 2023

It is a matter of degree. GPT-4 may, for various reasons some of which are artificial handicaps, have only a weak grasp of a board representation now. But if it has any such representation at all, that's already a different story than if it did not. I think all evidence points this way, even from other networks, e.g. image classification networks that learn common vision filters. It's a pretty general phenomenon.

qlm · on March 25, 2023

No, humans don't do that. If humans did that nothing new would ever be created.

croes · on March 25, 2023

They are remixing not reasoning

mirekrusin · on March 25, 2023

It has been proven it creates internal abstract representation models many times. Most trivial one is playing chess or go via text.

mjburgess · on March 25, 2023

The statistical distribution of historical chess games is a approximate statistical model of an actual model of chess.

It's "internal abstract representation" isnt a representation; it's an implicit statistical distribution across historical cases.

Consider the difference between an actual model of a circle (eg., radius + geometry) and a statistical model over 1 billion circles.

In the former case a person with the actual model can say, for any circle, what it's area is. In the latter case, the further you get outside the billion samples, the worse the area will report. And even within them, it'll often be a little off.

Statistical models are just associations in cases. They're good approximations of representational models for some engineering purposes; they're often also bad and unsafe.

mirekrusin · on March 26, 2023

It's not some kind of first order statistical gibberish.

It exhibits internal abstract modeling.

It's a bit silly to argue against it at this time.

To produce answers with quality we see it'd have to use orders of magnitude more memory than it actually does.

It's also easy to test yourself.

Simple way is to create some role playing scenario with multiple characters when same thing is seen differently by different actors at different time and probe it with questions (ie. somebody puts X into bag labelled Y, other person doesn't see it and asking what different actors think is in the bag at specific time in the scenario etc).

Or ask for some crazy analogy.

Why am I even saying it, just ask it to give you list of examples how to probe LLM to discover if it creates abstract internal models or not - it'll surely give you a good list.

zeroonetwothree · on March 25, 2023

Most things in life aren’t mathematical objects and therefore don’t have perfect theoretical models anyway. For example, what is a “chair”?

skybrian · on March 25, 2023

It seems like chain-of-thought will work pretty well when backtracking isn't needed. It can look up or guess the first step, and that gives it enough info to look up or guess the second, and so on.

(This can be helpful for people too.)

If it goes off track it might have trouble recovering, though.

(And that's sometimes true of people too.)

I wonder if LoRA fine-tuning could be used to help it detect when it gets stuck, backtrack, and try another approach? It worked pretty well for training it to follow instructions.

For now, it seems like it's up to the person chatting to see that it's going the wrong way.

bamboozled · on March 25, 2023

The perfect reasoner is upon us ?

letitgo12345 · on March 25, 2023

I would prefer to say that we've seen a glimpse of what a future world with a perfect reasoner will be like

epicureanideal · on March 25, 2023

And I imagine even the glimpse would cause a lot of venture capital to be flowing into AI... and also government/military funds.

ChatGTP · on March 25, 2023

Man America is really such a bore: VC, Military, solving problems, weapons, war, getting it done quicker, “freedom”, is there anything else to all this, to life ?

I used to be one of those “why do people always rag on American culture?” types, but I’m getting it.

It makes me laugh how we want to automate everything without the slightest idea of what we’ll be doing once it’s all automated ? Is that the point where the USA figures out we already had a lot of good things to do ? Ha, we’ll invent the matrix and plug ourselves in.

Sorry it’s not personal but it just seems like a never ending grind instilled with the same themes over and over. Now we have “hustle culture with no job prospects due to automation”, cool plan.

saiya-jin · on March 25, 2023

Well this approach brings us also modern medicine, so we are not dying in the pool of mud from every trivially treatable malaise. We break atoms, we can almost reach the stars which is probably the only long term way to preserve mankind.

You can't have one without the other. Look at any society in history, either push forward or eventual downfall and ending up as small history lesson. Some say its human nature, some say its nature's nature.

ChatGTP · on March 25, 2023

I’m sorry but modern medicine and technologic / scientific advancements happen without the modern American psyche.

If “America” wasn’t a thing , we’d all be completely fine.

In my opinion America is pushing us to “need” to be a multi planetary species. Not the other way around. We’re taking greater risks for the growing need for monetary reward it’s unavoidable.

As others have put it, the Earth is the greatest starship we’ll ever know. We know that the trajectory we’re on will require a backup.