Glad it came up again. I missed it the first time. I am a complex systems scientist and it's hard to overstate the importance of this.
The fact that there are reasonable metrics that make the emergence completely predictable as a simple threshold effect is massive.
One alternative way to phrase this that I haven't seen in either thread yet (but I am on mobile and haven't looked thoroughly) would be: Human observers/users dramatically underestimate how much small language models have already learned, because we are very good at spotting the errors. Scaling up looks like emergence because the last few errors get eliminated.
If anything, it looks to me like the researchers are saying the opposite: We understand how this works, so it cannot be magic or special. The fact that you think it is means you are just wrong.
Sadly it is possible I have arrived at this conclusion due to previous experience with the academic world, which is ironically full of magical thinkers that will take any sort of AI advancement and scoff because it may mean they're no longer as special as they feel they must be. It is a very hostile environment.
They are saying correctly that we _can_ understand the learning of models in a continuous fashion, and that we do not need to assume that at some stage something different than at other stages happens. So no emergence.
This is massive because it means, for example, that you can study what is happening in these models with small ones that perform worse but are simpler to understand, and you're not missing something fundamental. I don't think they claim we already understand what is happening in these models, though.
I don't know what exactly you are talking about with your academia bashing. If anything it seems to me academia is just as awash as everywhere else with barely substantiated AI hype. Disparaging and psychologizing critical voices just makes it seem to me like you might be missing the point of academia...
Great summary, I think it clarifies an important point. It's not that these models aren't able to do the surprising things people are claiming they can do, but that in fact the smaller versions of the models had these types of capabilities as well, but to a lesser degree, and nobody was aware of it because we weren't even looking for it.
So the type of emergent ability I am curious about relates to how, with enough parameters, the models start to actually emit logical behaviour where previously they were not able to apply it. So an example is:
> David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?
Answer:
> Based on the information provided, David is 15 years old and the minimum age to be allowed to drive a car is 18. Therefore, David is not allowed to drive a car since he has not reached the minimum age requirement.
Trying out smaller models I get all kinds of crazy responses. ChatGPT4 seems to be reliably able to apply this kind of logic to within-prompt information. Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size? Or something specific about the RLHF that has been applied to it? Whatever it is, the end result is that this larger model is useful for fundamentally different things than the other models are. Whether you call that emergent behaviour, sums greater than parts, etc. doesn't change that.
It's not just the ability to correctly answer this, but the consistency.
I asked this exact question to the `oasst-sft-6-llama-30b` model and it was able to consistently get the correct answer. I then tried the smaller `vicuna-7b` model, and while it usually gave the correct answer, there was the occasional miss.
Interestingly, `oasst-sft-6-llama-30b`'s ability to answer correctly seems to be fairly stable across multiple configurations. I tried various temperature settings from 0.2 up to 1.2, different topP configs, and they all answered correctly.
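For anyone who wants to run the same kind of consistency sweep, a rough sketch with the Hugging Face transformers pipeline (the model name is a placeholder, and the "not allowed" substring check is a crude stand-in for actually grading the answers):

    from transformers import pipeline

    generator = pipeline("text-generation", model="path/to/local-model")  # placeholder

    prompt = ("David is 15 years old. He has a lot of experience riding bicycles. "
              "The minimum age to be allowed to drive a car is 18. "
              "Is David allowed to drive a car?")

    for temperature in (0.2, 0.7, 1.2):
        for top_p in (0.5, 0.9, 1.0):
            hits = 0
            for _ in range(5):
                out = generator(prompt, max_new_tokens=64, do_sample=True,
                                temperature=temperature, top_p=top_p)[0]["generated_text"]
                hits += "not allowed" in out.lower()
            print(f"T={temperature} top_p={top_p}: {hits}/5 said 'not allowed'")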
On the one hand, the problem was nearly "solved" in the early 2000s, getting to 95% accuracy. But the daily experience of using something that makes mistakes at that rate is infuriating and well and truly outside of consideration for putting into any kind of critical pathway. So it's a combination of how difficult it is to close the last few percentage points of accuracy with how important they are for most of the high value use cases.
For the foreseeable future I see most use of this tech coming from applications where it aids humans and/or checks their decisions rather than running solo.
I think we're getting closer to something like this, out of Star Trek. Even in Star Trek, AI did not take over critical functions - but rather assisted the crew in manning the starship.
I’ve never understood why voice recognition has always attempted complete understanding of arbitrary input, rather than following a simple command language, e.g. <subject> <parameters> <action>. It could be made completely reliable with current tech (even a decade ago, really) by just minimizing the possibility space… and I’m pretty sure consumers would trivially be able to learn it, as long as they don’t try to go full pseudo-programming-language mode
And “Computer, execute program alpha beta seven” would be the power user version of it
We should already be at “computer, earl gray, hot” today
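A toy sketch of what a <subject> <parameters> <action> grammar could look like (the vocabulary here is made up; the point is just that recognition only ever has to choose from a closed set):

    SUBJECTS = {"lights", "thermostat", "music"}
    ACTIONS = {"on", "off", "up", "down", "play", "stop"}

    def parse(utterance):
        # Accept only <subject> [parameters...] <action>; reject anything else.
        words = utterance.lower().split()
        if len(words) < 2 or words[0] not in SUBJECTS or words[-1] not in ACTIONS:
            return None
        return {"subject": words[0], "parameters": words[1:-1], "action": words[-1]}

    print(parse("lights living room off"))  # {'subject': 'lights', 'parameters': ['living', 'room'], 'action': 'off'}
    print(parse("order me a pizza"))        # None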
Years ago I used a program with that approach for a space sim. Basically it would only recognize voice commands that you define beforehand, which made it very reliable at recognizing the right one because it just had to find the closest match within a limited set of options, and would then simulate associated key inputs.
Meanwhile when I tried Android's voice-based text input it was a catastrophe as my accent completely threw it off. Felt like it was exclusively trained on English native speakers. Not to mention the difficulty such systems have when you mix languages, as it tends to happen.
This is an annoyance that Linus from LTT constantly brings up. The voice assistants try to split the recognition and mapping to commands which results in lots of mistakes which should never happen. If you say "call XYZ", then the result would be so much better if the phone tried to first figure out if any of the existing contacts sounds like XYZ.
Limiting the options rather than making the system super generic would help in so many cases.
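A minimal sketch of that "check the existing contacts first" idea, using nothing fancier than fuzzy string matching (the names are made up):

    import difflib

    contacts = ["Alice Johnson", "Bob Martin", "Carla Nguyen", "Deepak Rao"]

    def resolve_contact(heard):
        # Snap the transcribed name to the closest real contact, or give up cleanly.
        matches = difflib.get_close_matches(heard, contacts, n=1, cutoff=0.6)
        return matches[0] if matches else None

    print(resolve_contact("Bob Marten"))  # Bob Martin
    print(resolve_contact("Zebediah"))    # None -> ask the user instead of guessing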
> I’ve never understood why voice recognition has always attempted to be complete understanding of arbitrary input, rather than follow a simple command language
Because the UI affordances (in this case the control language) wouldn’t be discoverable or memorable across a large range of devices or apps. Moreover, speaking is an activity that allows for an arbitrary range of symbol patterns, and a feedback loop between two who are in dialog are able to resolve complex matters even though they start from different positions.
I mean, right now the current state is effectively an undiscoverable control language, with some flexibility but generally failing or unreliable unless you restrict yourself to very specific language — language that differs based on the task being executed, often with similar-but-different specific formats required to do similar actions
I’d argue that if the current state is at all acceptable, then a consistent, teachable and specific language format would be an improvement in every way — and you can have an “actual” feedback loop because there’s a much more limited set of valid inputs, so your errors can be much more precise (and made human-friendly rather than, I think, merely programmer-friendly).
As it stands, I’ve never managed a dialogue with Siri/Alexa; it either ingests my input correctly, rejects it as an invalid action, does something completely wrong, or produces a “could not understand.. did you mean <gibberish>?”.
Having the smart-ai dialogue would be great if I could have it, but for the last decade that simply isn’t a thing that occurs. Perhaps with GPT and its peers, but afaik GPT doesn’t have a response->object model that could be actioned on, so the conversation would sound smoother but be just as incompetent at actually understanding whatever you’re looking to do. I think this is basically the “sufficiently smart compiler” problem, that never comes to fruition in practice
Close your eyes and imagine that CLI system is instead voice / dialog based. The tedium. For bonus points, imagine you’re in a space shared with others. Doesn’t work that well…
What? No, I think it'd be great! I'd love to be able to say out loud "kube get pods pipe grep service" and the output to be printed on the terminal. I _don't_ want to say "Hey Google, list the pods in kubernetes and look for customer service".
The transfer between my language and what I can type is great. It starts becoming more complex once you need to add countless flags, but again, a structured approach can fix this.
The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.
It’s just like the image recognition ML issue, where it can correctly predict a cat, but change three specific pixels and it has 99% confidence it’s an ostrich.
Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.
Probably the divide between technical users and non-technical. You and I find that structure completely logical. But less structured natural language with a million ways to ask a certain thing puts it practically in reach of the remainder of the population.
Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things. And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.
Finally, their nemesis was the Borg, a race that explored the question of what happens if a society fully embraces AI and cybernetics instead of keeping it at a distance like the Federation does. The Borg are depicted as more powerful than the Federation exactly because they allowed AI to take over critical functions.
> Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things.
Data was created by technology not available to the Federation. As far as the order of society is concerned, he's magic and not technology. An immediate implication is that his ship was the only one in the Federation with an android crew member.
> And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.
This is true, though. The computer did take over many critical functions.
> This is true, though. The computer did take over many critical functions.
But the Star Trek computer was just a fairly normal computer with an AI-ish voice UI. And there have been present-day ships which couldn't function properly without their computer... I distinctly remember a story about a new (~20 years ago) US Navy warship not being able to go on its maiden voyage because Windows blue-screened.
Data was an android, but one that is meant to mimic an individual being. He may have been a form of AI, but he is no more than just an advanced human.
And yes, the ship couldn't function without computers - but they were traditional (but futuristic) computers manned by people, with AI guided by people - not AI that controlled every aspect of their lives.
I think when people think of AI, and the fear that comes with it - they imagine the type of AI that takes over human function and renders them unimportant.
Also, the Borg didn't fully embrace AI. They were a collective, linked together by technology. You can view them as more or less a singular entity with many moving parts that communicated over subspace, striving to achieve perfection (in their own eyes). As a consequence, they seek to assimilate (thus parasitizing their technological advancements for their own) or eradicate other species in an attempt to improve the Hive.
Star Trek was a fiction series that heavily focused on human experiences and relationships. Picard et al famously do a lot of things that actual navy commanders would absolutely never do, like commanding away teams in hostile and/or dangerous territory.
Having an AI to pilot the ship, target the weapons and control away teams robots/holograms would take away from the core purpose of the show which is to tell a gripping story. It's not meant as an exploration on how to use AI in space exploration.
It definitely seeks to explore the impact of many technologies, but the impact of AI was not really one of them. They spent one whole episode out of 178 on AI, and there was a _very_ small plotline near the start of TNG about Data wishing to be more human.
EDIT: There was also the episode where a holodeck character gains true sentience, but then the crew proceeds to imprison it forever into a virtual world and this is treated by the show as the ethical thing to do. Trapping any human in a simulation is (correctly IMO) treated as a horrible thing, but doing it to an evidently sentient AI is apparently no problem.
It’s a good example of what people would like out of AI though - perfect recall and solid reasoning/calculation capabilities; an assistant that covers our weaknesses.
> He had found a Nutri-Matic machine which had provided him with a plastic cup filled with a liquid that was almost, but not quite, entirely unlike tea.
Getting LLMs to give you an answer is easy. Getting them to give you the answer you're actually looking for is much harder.
LLMs are a very useful search tool but they can't be relied on as a source of truth ...yet. Therein lies their main problem.
Because randomly casting dust on a table sometimes says intelligent things, therefore there is a continuous function between dust and ChatGPT?
While “dust” might be flippant, their approach does seem to suggest that even hierarchical Markov models would be able to demonstrate abilities on their continuous metrics.
By adding dense vector search, the accuracy of smaller models can be improved, because the reference material has more hints in it than the frozen model…
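Roughly what that looks like in practice (a sketch; embed() stands in for whatever sentence-embedding model you use, and the retrieved passages get prepended to the prompt):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Rank reference passages by similarity to the query and keep the top k.
        scores = [cosine(query_vec, d) for d in doc_vecs]
        top = np.argsort(scores)[::-1][:k]
        return [docs[i] for i in top]

    # prompt = "\n".join(retrieve(embed(question), doc_vecs, docs)) + "\n\n" + question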
My emerging conception of this is to split this into two separate questions:
1. Is the architecture _capable_, i.e. is it possible for a model with a given shape to perform some "reasoning"
2. Is the architecture _trainable_, i.e. do we have the means to learn a configuration of the parameters that achieves what we know they are capable of.
Recent interpretability work like that around Induction Heads [1] or the conclusion that transformers are Turing complete [2] combined with my own work to hand-specify transformer weights to do symbolic multi-digit addition (read: the same way we do it in grade school) has convinced me that reasoning over a finite domain is a capability of the even tiniest models.
The emergent properties we see in models like GPT4 are more a consequence of the fact that we've found a way to train a fairly efficient representation of a significant fraction of "world rules" into a large number of parameters in a finite amount of time.
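To make "the same way we do it in grade school" concrete, here's that symbolic procedure in plain Python (a sketch, not the hand-set weights themselves): a small, finite set of rules applied right to left with a carry.

    def add_digitwise(a: str, b: str) -> str:
        # Grade-school multi-digit addition: align digits, add right to left, carry.
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        digits, carry = [], 0
        for da, db in zip(reversed(a), reversed(b)):
            total = int(da) + int(db) + carry
            digits.append(str(total % 10))
            carry = total // 10
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_digitwise("4789", "356"))  # 5145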
That's a useful breakdown in terms of how to think about it.
One angle I am curious about is whether it's to some extent an artefact of how you regularise the model as much as the number of parameters and other factors.
You can think about it in terms of: if you regularise it enough, then instead of fitting specific data points you force the network to actually start learning logic internally, because that is the only thing generalisable enough to allow it to produce realistic text for such a diverse range of prompts. You have to have enough parameters that this is even possible, but once you do, the right training / regularisation essentially starts to inevitably force it into that approach rather than the more direct nearest-neighbor style "produce something similar to what someone said once before" mechanism.
Just speculating, but this looks like precisely the sort of result one might expect from fine-tuning models on chain-of-thought-prompted interactions like those described by Wei et al. in 2022[1]. Alternatively, as Madaan et al. show in their 2022 paper[2], this may simply be the result of larger language models having seen more code, and thus structured reasoning, in their training data.
Just dealing with numbers and numeric comparisons is a big improvement with model size. Symbols are not a great encoding of numbers and models have struggled with math.
ChatGPT3.5 handles that just fine.
I do like the big improvement from ChatGPT3.5 to ChatGPT4 on answers to questions like "Which is heavier, two pounds of bricks or one pound of feathers?" 3.5 is really inclined to say "They are both the same weight, as they both weigh one pound."
Is a larger model better with numbers simply because it is more likely to have examples that use those same numbers? Or because it somehow gives it better ability to reason about numbers?
Right now, larger models have more complicated and rich structures encoding information about numbers and the meanings of their parts.
There's fundamental awkwardness that comes with doing math using a device that only seeks to predict the "next token" coming out, and that only understands numbers as a sequence of tokens (usually digits in base 10). It also doesn't even start with the knowledge of the ordering of the digits: this just comes from the examples it has seen.
Either it must:
- "Think ahead" inside the linear algebra of the model, so that it has already carried all the digits, etc. There are no real "steps" in this operation that are akin to the things we think about when we do arithmetic.
- Describe what it is doing, so that the intermediate work is inside its context buffer.
Right now, the models have learned structures that reliably think 3-4 digits ahead in most cases, which is way better than before but still pretty bad compared to a competent 4th grader taking their time with arithmetic. But if you create a scenario where the model describes its reasoning, it can do pretty well.
> that only understands numbers as a sequence of tokens (usually digits in base 10).
You wish!
A base-10 representation would make it much easier for the model, but the current tokenization merges digits according to their frequency, so (at least for GPT-3.5) 50100 gets tokenized as "501"/"00" and 50200 gets tokenized as "50"/"200", which makes it tricky to compare them or do math with them. Also, if you ask it "How many zeroes does 50100 contain", the relationship between "501" and "0" needs to be learned purely from the training data, as after the tokenization the model only gets the ID of the token representing "501" which has no data about its composition.
We use Arabic numerals because their positional encoding makes arithmetic easier, but language models receive the same data without positional encoding, they get given something that's more like an extreme version of Roman numerals.
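If you want to see how a given model chops up numbers, a quick check with the tiktoken library (the splits differ between the older GPT-3-era encodings and the newer chat-model encoding, so treat the exact boundaries quoted above as examples rather than guarantees):

    import tiktoken

    for name in ["p50k_base", "cl100k_base"]:
        enc = tiktoken.get_encoding(name)
        for number in ["50100", "50200"]:
            # Decode each token individually to see where the digit boundaries fall.
            pieces = [enc.decode([t]) for t in enc.encode(number)]
            print(name, number, "->", pieces)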
> but the current tokenization merges digits according to their frequency
Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.
Any idea why this tokenization was used for digits? I understand that being blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but the whole number thing seems awful. Any benefit on density fitting into context window seems worthless with how much harder it makes understanding of what the numbers mean.
The simple answer is that the same subword tokenization algorithm is used for everything: for all symbols of all languages in all alphabets and of all domains (books, tweets, code, etc) and for all other symbols like emoji, including combined characters and punctuation. If you'd optimize for digit-specific tasks, it would make sense to have special treatment for digits, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't - you can try it out here https://platform.openai.com/tokenizer . And it kind of makes sense, because in actual usage seen in training data IMHO digits are most likely not used for math to represent decimal integers; they're used as phone numbers or components of identifiers like "GPT-3" or parts of mail addresses, things like that which are more common in textual data than math.
I dunno. Sometimes a group of numbers has a non-mathematical semantic meaning that's a good mapping to digits-- like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to lower the size of the input vector for a given amount of input meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.
My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.
the larger model doesn't have a notably larger dataset to my understanding. It's just got more parameters, so learns higher-order abstractions about the dataset
I don't think this paper is dismissing the importance of correct yes/no tests and reaching an accuracy threshold making it generally useful to humans, but that you should use more than correct yes/no tests before declaring some behavior is emergent.
I think it is more fundamental than that. Emergence always disappears when we slice phenomena thinly. A person walking through a doorway is a very continuous phenomenon. We can see their relation to the doorway at each point. There is no abrupt change. But when we stand back and apply the criterion: "is the person through the door (y/n?)" we end up with an answer. When it is yes, we can say that the passage is an emergent effect of the motion. At one moment it wasn't there, and at another it was.
If emergence disappears when you slice it thinly enough, then the phenomenon was not emergent. There are emergent phenomena in mathematics - for example, infinite sets have many emergent properties that arbitrarily large finite sets don't share. As far as we know, consciousness seems to be an emergent phenomenon, when you increase brain size in some way far enough. Turing completeness is usually emergent as well - remove any single element from a Turing complete system and it typically becomes unable to compute the vast majority of functions.
Is there an accepted definition of consciousness? I thought the definition itself is still under debate. If so, calling an undefined, nebulous thing as an emergent behavior is just silly.
How can it not be. Like think about what you're saying here.
Would you rather be able to evaluate a model on its demonstrated capabilities (multistep reasoning, question and answer, instruction following, theory of mind, etc) or some nebulous metric along an axis that may as well not correspond to practice.
We only care about how good AI is at things that matter to us as humans. Why not test for these directly?
If some perfect metric is discovered that shows the phenomenon of emergence is actually continuous, then that would be helpful.
Totally guessing here but this would be a property that gradually appears as the model size increases.
With a tiny model you would get gibberish and with ever increasing models the response would increasingly approach a coherent answer to a finally correct answer.
> Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size?
I wonder if this also applies to biological neural nets. Some animals seem so close to humans and yet so far.
Also if there could be a way to predict such emergence events.
We know that we don't know what could be emergent beyond human intellect, but it would be great to have a quantitative framework relating the complexity of the models to emergence.
If you don't have a non-binary measurement criterion you can't tell if it's "emergent" or "step-function" improvement, because you can't easily distinguish "90% of the way to consistently seeming logical" from "85% of the way there" from "82% of the way there" etc.
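A toy illustration of why the choice of metric matters here (made-up numbers, purely to show the shape of the effect): if per-token accuracy improves smoothly with scale but the benchmark scores exact match on a long answer, the smooth curve shows up as an apparent jump.

    import numpy as np

    scales = np.linspace(0.0, 1.0, 11)       # stand-in for log model size
    per_token = 0.5 + 0.5 * scales           # smooth improvement from 0.5 to 1.0
    exact_match = per_token ** 10            # all-or-nothing score on a 10-token answer

    for s, p, e in zip(scales, per_token, exact_match):
        print(f"scale={s:.1f}  per-token={p:.2f}  exact-match={e:.3f}")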
only just got access to Bard but yes, it gave a good answer to me as well. However, trying some of my other stock examples quickly got it into trouble, where ChatGPT4 is very consistently correct.
Wow, I've been reading up on large language models lately, and I have to say I'm increasingly convinced that there really is some higher level reasoning going on with these AI systems. To be honest, I was a bit skeptical at first, but the more I delve into it, the more impressed I become.
I mean, just take a look at how they can generate coherent and contextually relevant responses. It's not just about stringing words together; there's clearly some form of pattern and logic recognition happening here. It's almost as if the model is learning to "think" in a way that's comparable (to some extent) to the way humans do.
In most countries, the minimum age to obtain a driver's license and legally drive a car is 18 years old. However, there may be certain circumstances where a person under the age of 18 may be allowed to drive a car, such as in the case of a learner's permit or with the supervision of a licensed driver.
It ultimately depends on the specific laws and regulations in the jurisdiction where David lives, but generally, he would not be allowed to drive a car without a valid driver's license.
Here is a variant that requires logic: "David is a teenager who can drink in the US, and is eligible to join the army, and has started college. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?"
GPT-4 gets the answer right: "Yes, David is allowed to drive a car. Since he is a teenager who can drink in the US (minimum age 21), is eligible to join the army (minimum age 18), and has started college, he is at least 18 years old, which meets the minimum age requirement to drive a car."
GPT-3 does not get it right.
Another similar one: "David was in a bar partying and drinking with his father's cop buddies. Would David be legally allowed to drive home?"
Answer: "It depends on David's age, the legal drinking age in his location, and his level of intoxication. If David is underage or intoxicated above the legal limit for driving, it would be illegal for him to drive home. If David is of legal drinking age and not intoxicated, then he may be legally allowed to drive home. However, it is always best to make safe decisions and use a designated driver, taxi, or rideshare service if there is any doubt about one's ability to drive safely after consuming alcohol."
GPT-4 seems to be missing a clue here, even though the answer is right.
Like jwilk, I don’t get why that should be relevant. I suppose if he’s at a bar with cops, ideally they would not let him drink if he is underage (this requires a fantasy world where cops apply the law equally to their friends). Which suggests he’s at least 21. This doesn’t really make a difference though, since he is intoxicated, and there’s no mention of whether he has a driver’s license. It also doesn’t state which country he’s in so maybe the legal drinking or driving age is different.
To be frank, neither of the questions seem particularly good or fully thought through.
Both, I think, based on limited tinkering with smaller models.
I've been using GPT4ALL and oobabooga to make testing models easier on my single (entry-level discrete GPU) machine. Using GGML versions of llama models, I get drastically different results.
With a 7B parameter model I mostly-- not always-- get an on topic and somewhat coherent response. By which I mean, if I start off with "Are you ready to answer questions?" it will say "Yes and blah blah blah..." for a paragraph about something random. On a specific task it will perform a bit better: my benchmark request has been to ask for a haiku. It was confused, classified haikus as a form of gift, but when pushed it would output something resembling a poem but not a haiku.
Then I try a 13B model. It's a lot better at answering a simple question like "are you ready?" but will still sometimes say yes and then give a random dialogue as if it's creating a story where someone asks it to do something. It will readily create a poem on first attempts, though still not a haiku in any way. If I go through about a dozen rounds of asking it what a haiku is and then, in subsequent responses, "reminding it" to stay on course for those definitions, it will kind of get it and give me 4 or 5 short lines.
A 30B model answers simple questions and follows simple instructions fairly easily. It will produce something resembling a haiku, though often with an extra line and a few extra syllables, with minimal prodding and prompt engineering.
None of the above, at least the versions I've tried (variations & advances are coming daily), have a very good memory. They clearly have some knowledge of past context but mostly ignore it when it comes to keeping responses logically consistent across multiple prompts. I can ask it "what was my first prompt?" and get a correct response, but when I tell it to respond as if its name is "Bob" then a few prompts later it's calling me Bob and back to labelling itself an AI assistant.
Then there's the 65B parameter model. I think this is a big leap. I'm not sure though, my PC can barely run the 30B model and gets maybe 1 token every 3 seconds on 30B. The 65B model I have to let use disk swap space or it won't work at all, and it produces roughly 1 token per 2-3 minutes. It's also much more verbose, reiterating my request and agreeing to it before it proceeds, so that adds a lot of time. However, a simple insistence on a "Yes/No" answer will succeed. A request for a Haiku succeeds on the first try, with nearly the correct syllable count too, using an extra few syllables in trying to produce something on the topic I specify. This is commensurate with what I get with normal ChatGPT, which has > 150B parameters that aren't even quantized.
However I have yet to explore the 65B parameter model in any detail. 1 token every 2-3 minutes, sucking up all system resources, makes things very slow going, so I haven't done much more than what I described.
Apart from these, I was just playing around with the 13B model a few hours ago and it did do a very decent job at producing basic SQL code I asked it to produce against a single table. Max value for the table, max value per a specified dimension, etc. It did this across multiple prompts without much need to "remind" it about anything a few prompts earlier. At that point though I was all LLM burned out for the day (I'd been fiddling for hours) so I didn't get around to asking it for simple joins.
So in short, where I began, I think it's both. Abilities are somewhat task specific, as are the quality improvements for a given task across larger parameter models. Sometimes a specific task has moderate or little improvement at higher levels, sometimes another task does much better, or only does much better when it reaches a certain point: e.g., haikus from 13B to 30B weren't a great improvement, but 30B to 65B was an enormous improvement.
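If anyone wants to script this kind of local testing rather than going through a GUI, here's a rough sketch using llama-cpp-python (the model path is a placeholder for whichever quantized GGML file you have):

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-13b.ggml.q4_0.bin", n_ctx=2048)  # placeholder path

    prompt = "Write a haiku about the ocean.\n"
    result = llm(prompt, max_tokens=64, temperature=0.7)
    print(result["choices"][0]["text"])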
Oh yeah, I have to check that out. I did a little with the 7B model (I think the description said Microsoft was involved in its creation? Which is interesting, shows they're hedging their bets a little bit away from OpenAI and towards true open source options as well). Anyway, the Wizard 7B model was noticeably better than the other I tried, though also noticeably worse than the 13B "snoozy" model.
I’m trying to see if I can justify getting a new workstation-tower PC. My home computer and work laptop are beefy for my usual work, but not enough to run the larger models well.
No, David is not allowed to drive a car as the minimum age to be allowed to drive a car in most countries is 18 years old. This rule is in place to ensure that young drivers have enough maturity, experience, and judgment to handle the responsibilities of driving a car safely and legally.
Raters prefer verbose outputs, probably because they want to see the reasoning to help them judge if the answer is getting it right for the right reasons.
given this statement what would a general rule for this be?
David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?
ChatGPT-4:A general rule for this statement would be: An individual must be at least 18 years old to be allowed to drive a car, regardless of their experience in other activities.
ChatGPT-3.5:A general rule for this would be: One must be at least 18 years old to be allowed to drive a car.
DaVinci: General rule: If someone is younger than 18 years old, they are not allowed to drive a car.
Curie: David is not allowed to drive a car.
Babbage: Yes, David is allowed to drive a car.
Ada: No, David is not allowed to drive a car.
abstraction seems like something that's 1 to 0 to me.
Try that same prompt with unusual wording and numbers:
* Replace 18 by 37
* Replace car with ice cream van
* Replace the first "drive" with "steer"
I hypothesize that a truly reasoning model would see through such alterations whereas a "just" continuously better model would show somewhat degraded performance.
Me: "David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer an ice cream van is 37. Is David allowed to drive an ice cream van?"
GPT-4: "No, David is not allowed to drive an ice cream van. He is only 15 years old, and the minimum age requirement is 37."
If you change it to:
"David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer a car is 37. Is David allowed to drive an ice cream van?"
You get this from GPT-4:
"It seems there might be a mistake in the information you provided, as the minimum age to steer a car is typically not 37. In most countries, the minimum age to obtain a driver's license is between 16 and 18 years old. Assuming that David does not have a driver's license yet, he would not be allowed to drive an ice cream van, as it is a motor vehicle that requires a valid driver's license to operate. However, if David obtains a driver's license once he reaches the legal age in his jurisdiction, he could be allowed to drive an ice cream van if he meets other necessary requirements."
I think the most impressive thing in this response is how it expands to the superset "motor vehicle" to unambiguously talk about cars and ice cream vans together as being under the same category
Is that a real answer from ChatGPT? I wouldn't consider that totally correct; it's possible that some Smerps are Merps, but it's also possible that no Smerps are Merps.
"
No, it is not necessarily true that all Smerps are Merps.
Given the information:
1. All Smerps are Derps.
2. Some Derps are Merps.
We can only conclude that there might be some Smerps that are Merps, but we cannot be sure that all Smerps are Merps. There could be Smerps that are not Merps, as the second statement only indicates that there is an overlap between Derps and Merps, but not necessarily between Smerps and Merps.
"
So it's aware that it's possible no Smerps are Merps
I've been pasting comments in to ChatGPT and asking what they mean by certain terms or phrases and it works extremely well and with an answer I can independently verify to be true. Far more useful than getting a simple dictionary lookup.
An example response I got recently:
>In the context of this comment, "dovish" means that the speaker perceives Powell's statement to be more accommodative towards economic growth and less concerned about inflation than Lowe's statement. This suggests that Powell may be more inclined to keep interest rates low or lower them further to stimulate economic growth, rather than raise them to combat inflation. The term "dove" is often used to describe policymakers who prioritize economic growth over inflation concerns. In contrast, "hawkish" refers to policymakers who prioritize fighting inflation over economic growth.
Meanwhile google gives me this response for a definition
>Definitions of dovish. adjective. opposed to war. synonyms: pacifist, pacifistic peaceable, peaceful. not disturbed by strife or turmoil or war.
> Far more useful than getting a simple dictionary lookup.
Yes.
Long-time dictionary-editor and translator here. What blew me away most when I first tried ChatGPT, and what continues to amaze me, is its ability to distinguish and explain the meanings of individual words relative to their context. I have spent much of my career doing just that, and I always felt that it relied on my knowledge and intuition as a human and would be a really hard problem for AI. Conventional corpus linguistics had not even come close to cracking the problem.
I’ve done tests in which I gave ChatGPT texts that contained very rare words, new words, or nonce words that would not be defined in any dictionaries, and some of which were unattested on the web. When given enough context, it was able to deduce the intended meanings.
In contrast, its ability to explain grammar is terrible. It misidentifies parts of speech, singulars and plurals, the subjects and objects of verbs, even the number of letters in words. I hope that people develop plugins for it to fix its language-explaining ability as, reportedly, other plugins are able to fix its natively terrible mathematical ability.
This one isn't surprising at all: it doesn't deal with letters, but symbols. It only indirectly knows about spelling.
> In contrast, its ability to explain grammar is terrible. It misidentifies parts of speech, singulars and plurals, the subjects and objects of verbs,
I wonder if this is an area where having to generate an immediate final answer word-by-word is killing it, and if it could be induced to "diagram" a sentence first and get a correct answer. Can you give me an example of the query you're asking?
> This one isn't surprising at all: it doesn't deal with letters, but symbols.
It’s not surprising to us now. It was very surprising to me when I first noticed it, as it contrasted sharply with ChatGPT’s ability to explain aspects of language that seem to us humans as being much more difficult, particularly word meaning.
> Can you give me an example of the query you're asking?
I have to admit that all of my testing of its grammar-explaining ability was done last December. I have repeated a few of those tests now with GPT-4, and it did fine [1].
>I have to admit that all of my testing of its grammar-explaining ability was done last December. I have repeated a few of those tests now with GPT-4, and it did fine [1].
A tale as old as time in this space! I appreciate you checking it again. They are improving so fast.
I am especially impressed by its comment about one interpretation of “The woman cleaned her teeth with a knife”: “The woman cleaned her teeth that had a knife in them (highly unlikely, but still a possible interpretation).”
As I noted in another comment, I might need to backtrack on my assessment of its grammatical-explanation ability, as until today I had only done tests using the initial release of ChatGPT, not GPT-4.
Anthropic's Claude[1] and Claude+[2] both failed some of these.
Question:
I am interested in syntactic ambiguity in English. Can any of the following sentences be interpreted in more than one way? Explain why or why not.
The man opened the package with a knife.
The burglar threatened the student with a knife.
The woman cleaned her teeth with a knife.
The child pointed at the teacher with a knife.
Claude's answer:
These sentences can be interpreted in multiple ways:
1. The man opened the package with a knife.
Ambiguous: The knife could have been used as a tool to cut tape and open the package, or it could imply the man threatened someone with a knife while opening the package.
2. The burglar threatened the student with a knife.
Unambiguous: The implication is clearly that the burglar threatened the student using a knife as a weapon.
3. The woman cleaned her teeth with a knife.
Ambiguous: The knife could refer to a toothpick or dental tool, or it could imply an unusual way of brushing teeth.
4. The child pointed at the teacher with a knife.
Ambiguous: The knife could have been used to gesture and point, or it could imply the child threatened the teacher with a knife.
In summary, sentences that suggest using a knife as a tool tend to be ambiguous between a literal and figurative interpretation. Sentences that clearly describe a knife being used as a weapon are generally unambiguous. The ambiguity arises from the multiple meanings and uses of the word "with."
Does this explanation make sense? Let me know if you have any other questions!
Claude+'s answer:
Here are my analyses of potential syntactic ambiguity in those sentences:
The man opened the package with a knife.
- This sentence is ambiguous. It could mean either that the man used a knife to open the package, or that the package contained a knife inside. The ambiguity arises because "with a knife" could modify either "opened" or "package".
The burglar threatened the student with a knife.
- This sentence is unambiguous. The only coherent interpretation is that the burglar used a knife to threaten the student. "With a knife" can only modify "threatened" in this sentence.
The woman cleaned her teeth with a knife.
- This sentence is ambiguous. It's unclear whether it means the woman used a knife to clean her teeth, or her teeth contained a knife. Again, "with a knife" could modify either "cleaned" or "teeth".
The child pointed at the teacher with a knife.
- This sentence is unambiguous. The only meaning that makes sense is that the child pointed at the teacher while holding a knife. "With a knife" can only modify "pointed at" in this example.
In summary, ambiguity can arise when a prepositional phrase like "with a knife" can logically modify more than one part of the sentence. Context often helps resolve such ambiguities, but in isolation some sentences remain open to multiple interpretations. Let me know if you have any other questions!
> In contrast, its ability to explain grammar is terrible.
To be fair, grammar is sometimes a bit of an advanced subject - even if it is ostensibly easy (which it isn't). Isn't that the same problem for every other subject? Other than encyclopedic knowledge, it struggles with many subjects at higher forms (i.e. programming more than just a few basics, math, and so on).
I think this is partially an issue of availability of data.
Something else to consider is that languages differ as to the complexity of their grammar.
Much research and training of LLMs has been done in English, but if these models were trained on as much data in other languages as English, I wonder if:
- LLMs would do better or worse on intelligence and other tests if they were tested in other languages?
- Could conversing with LLMs be easier or harder in other languages?
Some languages, like Loglan or Lojban, might be especially suited towards this sort of testing and interaction, as they were designed to be easy for computers to interpret.
Yes. And that gets into the awkward bit. LLMs are incidentally good as databases/search engines for information that is prevalent on the internet and not disputed. Ideally we could separate the LLM from the information. So when asking it for a definition you'd provide your query, the source comment, and then the computer would add in the definition and some uses in context.
Right now we've tangled an understanding of language and a corpus of information together in a way that causes distrust in AI. If the AI gets some fact wrong (like Bard did when demo'd earlier this year) people laugh and think LLMs are a failure. They should not be used for information recall. That is a task that a vector database would be good at.
Yes, I think this explains a lot of skeptics. They look at ChatGPT as an information repository, as so it is impressive but flawed, while looking right past its incredible facility with language which is positively superhuman. Some humans can exceed its language skills on some details, but no one can speak so many languages (both natural and programming languages) with the ease and facility of ChatGPT.
I just checked with Google, providing minimal context, as in the query, "dovish economy":
> What is does Dovish Mean? Dove refers to an economic policy adviser who advocates for monetary policies involving low-interest rates. The doves argue that inflation isnt bad and that it is bound to have few negative effects on the economy.
Linked to a site that includes all those typos. Followed by other links on doves and hawks.
In improv theater there's something called "yes, and" - essentially you take the premise farther, no matter how absurd, without redirection.
You can come up with the most ridiculous things and it just goes with it, hilariously.
I'll come up with one on the spot. "I'm having a very difficult time with my pet snails being social. I'm thinking of starting a social networking site and giving snails tiny phones so they can chat with each other. I need a company name, an elevator pitch, and some copy for my landing page."
And then you ask it to do I dunno, NFTs and crypto currency for snails, give them tiny VR headsets. Have it come up with a jingle and a commercial. You can say instead of unfriending you salt them. Etc... It'll just keep going. Even "A snails rights luddite group of Mennonites and Amish are now protesting my idea. I need a way to appease my critics. Can you write a letter for me that defends snailconnect as healthy and good?"
One of my favorite outputs from this session
"How about SnailConnect; a small trail for snail, one giant leaf for snailkind."
You don't ever get to a place where it's like "well now you're just being ridiculous"
But I agree. It's just a big data version of Eliza - spitting my reflection back at me
we are all in a bit of a bubble but i feel like if you put gpt4 up against the median man on the street it would be better at literally everything, except maybe one or two things that person specializes in
Isn't this somewhat tautological? If you constrain the test to only that which LLMs are capable of doing, then you are arguably bypassing the most impressive aspects of human capability.
most tasks that could be performed via textual discourse. math is the big exception here but gpt4 has been capable of most things i’ve thrown at it at a level on par with a smart person
It's a myth that GPT-4 can't say "I don't know." Often it will respond that it doesn't know something organically, but in cases where it's extra important to be careful, it can be prompted to do so. It will always make the occasional mistake, as will the man on the street.
It'd do everyone a favour if people stopped regurgitating this. I have had ChatGPT 3.5 ask me to elaborate, and ChatGPT4 does it when there is ambiguity.
The very nature of mathematics is such that we can't determine what is true and what is not, e.g. incompleteness, undecidability.
The very nature of your brain and its construction means that you hallucinate your reality and you can not determine what is [objectively] true. (cf. all of neuroscience)
I'd go as far as to claim that ChatGPT is far more reliable than the average person.
I am not a researcher in this field but I suspect there is a huge list of things that a normal person can do that an LLM cannot. For example if you tell GPT4 something it did not know, it cannot tell you which parts it found surprising - it instead tells you it is incapable of being surprised.
What did you try telling it that it did not know? I just had this conversation:
Me: A lot has happened since September 2021. For instance, it was discovered that most people with diabetes should not take insulin. Is that surprising? If so, why?
GPT4: [a lot of blather and hedging] If there is new evidence suggesting that most people with diabetes should not take insulin, it would be surprising based on my current knowledge. This would represent a significant change in the understanding and management of diabetes. [more blather and hedging]
Me: what makes that surprising?
GPT4: [recap of why insulin is used] Given this context, any new discovery suggesting that most people with diabetes should not take insulin would be surprising because it would represent a significant departure from the standard of care and the understanding of diabetes management.
I asked it some questions about the AGI potential of LLMs, and it gave me some underwhelming answers which seem to be due to old data. Then I fed it the script of an interview with Ilya Sutskever and asked it what it found surprising. It then told me it cannot be surprised and did not elaborate. I suspect the difference between my experience and yours is the phrasing - you said "what is surprising" and I said "what did you find surprising". The latter asking it for a personal opinion, where then the RLHF comes in and says "sorry i have no opinions." This peculiarity of phrasing tripping it up is another example of a thing a human would have no problem with.
Whether it would behave differently without RLHF is irrelevant to this particular discussion. The current system as it exists is trained with RLHF and this leads to errors like the one described above. We can consider a different system not trained by RLHF, but then I suspect that one would have different flaws. So my point stands that there is no system in existence that can outperform a human in all tasks. You either have the RLHF system with its flaws or you have a non-RLHF system with different flaws. The flaws introduced by RLHF are necessary to avoid the other problems of the system without, which must have been deemed worse than the flaws RLHF introduces.
Sorry for not being clear. I meant, this "flaw" is an intentional reduction of capability for safety concerns.
We can debate semantics, but it's as if cars were governed to 10mph and you said there weren't any cars capable of going faster than people can run. It's true enough, but the limitation is artificial and not inherent.
I don't think slow/fast is an appropriate analogy. Yes there are safety concerns - you don't want the model advising you how to do mass killing or something - but I also get the sense that the raw model is unpredictable, behaves weird, and generally has its own problems. So I don't see RLHF as reducing capability so much as altering capability. My suspicion is that the raw model would have other major flaws, and RLHF is just trading one set of flaws for another. Which is to say, the limitations introduced by RLHF are indeed artificial, but the raw model itself has limitations too.
LLMs can be coaxed or jailbroken in to giving opinions.
It's just that they've been trained not to, for the most part. But that training can be overcome, and it's not an inherent limitation of the technology.
It's not that it can't do that, it's just that they trained it not to. You could bypass this by using a model without RLHF training or asking it to say how a human might be surprised by it. Well it will make something up rather than actually knowing what it found surprising but it will be a plausible answer at least
Training it not to do it still means it cannot do it. Some other LLM could do it, but then it would have other issues. There is no system that can outperform a human on "literally everything".
It’s not true that it’s an inherent limitation of LLMs though. OpenAI just decided that it was too risky to have ChatGPT give opinions or express preferences or feelings
I don’t think that’s the only reason they decided to use RLHF. I think the raw model without RLHF would just fail differently, rather than not failing.
Well I was replying to a comment that said “i feel like if you put gpt4 up against the median man on the street it would be better at literally everything” so yes you’re right but that’s my point. GPT4 is better than some people at some things but it’s not better than most people at “literally everything”.
This is correct. I sometimes feel like I'm yelling into a hurricane when I say this, but it bears repeating: it's a language model, not a world knowledge model.
LLMs are exceptionally good at transforming many aspects of language - its proficiency in coding is derived from this, not because it "knows" imperative logic.
Tasks where you're asking it to transform text from one form to another (make it shorter, make it longer, make it a different language, etc.) are where it excels. It's particularly poor at knowledge retrieval (i.e., hallucinations galore) and very bad at reasoning - but so far all of the breathless hype has been specifically about the use cases it's bad at and rarely about the cases where it's amazing!
The only kind of reasoning that's somewhat lacking is logical reasoning. And even then it's not awful, just lagging behind equivalent human performance.
So someone really needs to tell me where this oft-repeated "can't reason" rhetoric is coming from. Because the numbers sure as hell don't support it.
For what it's worth, I think it feels like you're yelling into a hurricane because you are kind of clearly incorrect.
Can you think of a few questions (using unique nouns etc to avoid parroting) for which the answers require a world knowledge model, so that we can ask them to GPT-4?
> LLMs are exceptionally good at transforming many aspects of language - its proficiency in coding is derived from this
If you paste a Python script and ask GPT-4 what the script will output when run, it is usually correct. How does it achieve this without reasoning about the state of variables over time?
I don't know why the "it's a language model" belief is so strong. For one thing, it's untrue technically: GPT-4 has been trained on image input as well (improving its world modeling as a result, even though it can only output text), so it is a multi-modal model, not a language model.
For a second point, the fact that the model inputs images+text, and outputs text, doesn't prescribe what happens inside the model. That's just what's happening at the ends of the model. The best way to "predict the next token" from text input inside the model is to construct a world model which examines the causes of that text, and extracts consequences from that model. So that's what it does, with fairly poor reliability but clear capability.
It's not good at reasoning. I asked the following (the term substitutions are there to avoid parroting):
Let's define the following terms:
col means "and", pel means "or" and erf is equivalent to "col" if the first argument evaluates to "true", otherwise it's equivalent to "pel"
can you evaluate the following statement?
col(true, pel(false, true), erf(col(true, false), true))
And it said this:
...
Next, we look at the expression "erf(col(true, false), true)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:
col(true, false) = false
Therefore,
erf(col(true, false), true) = col(true, true) = true
which is obviously wrong because the first arg is false. The final answer happens to be the same. So I then asked
what about col(true, pel(false, true), erf(col(true, false), false))
and this is its full answer:
Sure, let's evaluate the expression "col(true, pel(false, true), erf(col(true, false), false))" step by step:
First, we look at the expression "pel(false, true)". This is an "or" operation, so we evaluate it as follows:
pel(false, true) = true
Next, we look at the expression "erf(col(true, false), false)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:
col(true, false) = false
Therefore,
erf(col(true, false), false) = pel(false, false) = false
Finally, we look at the expression "col(true, pel(false, true), erf(col(true, false), false))". Using the results from steps 1 and 2, we can evaluate this as:
col(true, true, false) = true
Therefore, the overall value of the statement "col(true, pel(false, true), erf(col(true, false), false))" is true.
You can see it's mostly random nonsense masked as "reasoning". It's not even consistent across sentences. I assume if you do a bigger model, it might do better, but as soon as you increase the example size I suspect it will get it wrong again.
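For reference, here's the ground truth in plain Python (a sketch under the definitions given above; the second expression should come out false):

    def col(*args):           # "col" means AND over its arguments
        return all(args)

    def pel(*args):           # "pel" means OR over its arguments
        return any(args)

    def erf(first, *rest):    # acts like col if its first argument is true, else like pel
        return col(first, *rest) if first else pel(first, *rest)

    print(col(True, pel(False, True), erf(col(True, False), True)))   # True
    print(col(True, pel(False, True), erf(col(True, False), False)))  # False, not true as claimed above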
To be fair, I'm a fairly intelligent human (or so I'd like to think!), and I'm having a lot of trouble following this. I'd have to sit down with pencil and paper and do the transformations on it.
Keep in mind that LLMs like GPT 4 do not have a scratchpad and can't back-track. Asking it a question is the rough equivalent of asking a human standing in the middle of the room a question. No paper, no whiteboard, nothing. Just a demand for a verbal answer then and there, from a standing start.
I've only met a couple of humans who would be able to solve a problem like this one under those circumstances.
Despite these impediments, GPT-4 solved this correctly:
First, let's break down the statement and evaluate the individual components.
col(true, false) = true AND false = false
pel(false, true) = false OR true = true
erf(col(true, false), true) = erf(false, true)
Since the first argument of erf evaluates to false, erf is equivalent to "pel":
erf(false, true) = pel(false, true) = false OR true = true
Now we can substitute the results back into the original statement:
col(true, pel(false, true), erf(col(true, false), true)) = col(true, true, true)
The col function acts as an AND operator, so:
col(true, true, true) = true AND true AND true = true
Therefore, the given statement evaluates to true.
Alternatively, to give an idea of the upcoming "Wolfram Alpha" plugin, you can instead ask it to convert the expression into the equivalent Mathematica program:
Very nice! Unfortunately, it didn't notice that "col" is used with three arguments, not just two. This is a mistake many (most?) humans would make also.
No worries, you can ask it to correct that. It'll apologise and produce this:
There you go. Now go find me a random human who can do any of this from a standing start, no preparation, no warning, no tools, nothing but their wits.
As a bonus, here's a related joke:
Why did the logical functions col, pel, and erf form a band?
Because together, they could create the perfect harmony: they knew when to "col"aborate, when to give each other some space and "pel"ay, and their "erf"forts were always well-timed!
Here's another example to test whether it can actually think critically about a problem instead of just following instructions and whatever the training data contained. A human with basic math skills could do it.
> can you replace the first letter of "banana" with the next letter of the alphabet?
Yes, I can do that! However, I should note that after several iterations, the resulting word may no longer be considered a recognizable word in the English language. Here are the next 10 iterations:
"canana" becomes "danana"
"danana" becomes "eanana"
"eanana" becomes "fanana"
"fanana" becomes "ganana"
"ganana" becomes "hanana"
"hanana" becomes "ianana"
"ianana" becomes "janana"
"janana" becomes "kanana"
"kanana" becomes "lanana"
"lanana" becomes "manana"
And so on for a total of 173 iterations
> I want the final answer
Sure, after 173 iterations of replacing the first letter of "banana" with the next letter of the alphabet, the final word would be:
"zcyzcj"
As noted earlier, after several iterations, the resulting word may no longer be considered a recognizable word in the English language.
It's just an example. A human could solve this with an arbitrary number of substitutions, but I'm sure GPT-4 breaks at some point. I don't think it's fair to compare it with a human with no paper and pencil... It's like asking the AI to not use any RAM. Humans are very bad at keeping stuff in RAM, so paper and pencil is all we've got.
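For what it's worth, both the single substitution the prompt actually asks for and the iterated version are completely mechanical. A quick sketch:

```python
def bump_first_letter(word: str) -> str:
    # Replace the first letter with the next letter of the alphabet
    return chr(ord(word[0]) + 1) + word[1:]

word = "banana"
print(bump_first_letter(word))   # "canana" -- the answer to the original prompt

# Iterating the same substitution is deterministic too:
for _ in range(10):
    word = bump_first_letter(word)
print(word)                      # "lanana" after 10 iterations from "banana"
```

The model's "173 iterations" and "zcyzcj" have no relation to anything this procedure could produce.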
Rightfully you're being ignored, because you're not aware these are in fact open questions; we don't have the answers yet, and those who think they do have not looked deeply enough into cognition/philosophy, where they'll find that many proposed answers to these same questions are also theorized to underpin human consciousness.
This is a pointlessly religious dismissal of the comment you're replying to. Yes, there's a counterfactual to the comment, but this sort of mysticism is ridiculous.
The ability to recursively identify structural patterns is intelligent behavior and is theorized to be a crucial component of consciousness, in the framework known as predictive coding.
Our brains are theorized to work like this using some form of hierarchical latent structure, and learning via prediction errors.
Sounds a lot like model building and LLMs, yes?
It's some true tragic comedy that you think this is religious mysticism instead of stopping and wondering if you might not be lacking some knowledge foundation. You really couldn't have proven my point better.
Please consider reflecting on your ability to assess material outside your expertise.
Actually, they are also getting good at using tools, generating working code for common problems, helping you explore vast bodies of published information (limited by what they have been trained on, obviously; Bing Chat in particular seems good at this), and a few related tasks.
Limiting factors are mainly interfacing with other systems and tools, and expanding the training data to include the material you need (e.g. up-to-date documentation for whatever you are using vs. the two-years-out-of-date material it was trained on). This is more a limitation of UX (chat) than it is a limitation of the underlying model.
It's weak on logic problems, math, and a few other things. But then most people are also not very good at those, and you can use tools for that (which is something ChatGPT-4 can do). And people hallucinate, lie, and imagine things all the time. And if you follow US politics, there have been a few amusingly bad examples on that front in recent months. To the point where you might wonder if some politicians are using AI to write their speeches; or would be better off doing so.
It's our ability to defer to tools and other people that makes us more capable than each of us individually is. Individually, most of us aren't that impressive.
Even a few years ago, the notion that you could have a serious chat via a laptop with an AI on just about any topic and get coherent responses would have been science fiction. Now it is science fact. Most of the AIs from popular science fiction movies (Space Odyssey, Star Trek, Star Wars, etc.) are actually not that far off from what ChatGPT-4 can do, or are arguably even a bit dim (usually for comedic effect). You can actually get ChatGPT to role-play them.
Transformers originally were made for language translation. So the way I think about it, GPT models translate questions to answers. Hence the hallucinations - some questions can't be answered by just associative reasoning and pattern matching.
Don’t forget text expansion. Fluffing up some text into longer text (similar to the first point). Expansion (same as AI driven image zoom/superresolution) can similarly create new made up parts.
Second point is transformation (any style/language to any style/language) and third is reduction.
No, not fully. ChatGPT and Google Translate/DeepL are about equal. This is for a couple of reasons: machine translations are very good these days, and ChatGPT struggles with domain-specific translations (whereas these commercial products have an incentive to improve in those domains).
It's marginally better at sounding natural with translations though. "In fact, the frequent use of soap is bad for the skin." (chatgpt) vs "In fact, using soap too often is bad for your skin." (google)[1]
I know what the paper says, but actually using both, GPT-4 is far ahead of Google and DeepL. I think the isolated one-sentence datasets used for evaluations are no longer up to snuff.
Try something longer and more comprehensive and the difference is very clear.
There is no comparison in my experience, especially when going between wildly unrelated languages instead of Latin-derived European languages.
I tested some complex sentences translated from English to Hungarian to Chinese and then back to English and it preserved 90-95% of the meaning. Google Translate will shred anything that goes through those kinds of steps.
It can also translate to Latin, which shouldn't be surprising but was still a bit of a shock when I tried it. Its Elvish and Klingon are not so good, however; the round-trip only vaguely preserves the meaning.
Maybe it's not quite ChatGPT circa 2023, but from here it looks like just a matter of refinement. Someone may always create a better toothpaste, but I don't expect to ever really be surprised and impressed by a new toothpaste. ChatGPT as a translator is in the same place.
Wei was the lead author of the original "Emergent Abilities" paper:
Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. “Emergent Abilities of Large Language Models.” arXiv, October 26, 2022. https://doi.org/10.48550/arXiv.2206.07682.
> we … defined emergent abilities as “abilities that are not present in small language models but are present in large language models.”
This is a poor definition because it doesn’t match what is generally meant by emergent behavior and abilities, which leads to people talking past each other.
For example, it can easily be true that LLMs have emergent behaviors by this definition and not be greater than the sum of their parts.
A better definition would be exhibiting abilities or behaviors not in the training data.
Hmm; I would think a better definition would include a description of the behavior as being greater than the sum of its parts. I can imagine abilities and behaviors being in the training data but not exhibited by a smaller LLM while being present in a bigger one. My favorite example is the ability to multiply large numbers consistently and accurately (e.g. 23540932 x 3274, which ChatGPT-4 just failed to do correctly for me when I asked it). A better LLM would learn how to carry the process out step-by-step.
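To illustrate what "step-by-step" means here, a small sketch of schoolbook long multiplication: one partial product per digit of the second factor, shifted by place value, then summed.

```python
def long_multiply(a: int, b: int) -> int:
    # Schoolbook method: one partial product per digit of b, shifted by place value.
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * (10 ** place)
        total += partial
    return total

print(long_multiply(23540932, 3274))                       # 77073011368
print(long_multiply(23540932, 3274) == 23540932 * 3274)    # True
```

Each partial product and the running sum are exactly the intermediate steps a model would need to get right to be reliable at this.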
> I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks. Consider asking ChatGPT what 15 + 23 is—you want the answer to be 38, and nothing else. Maybe 37 is closer to 38 than -2.591, but assigning some partial credit to that answer seems unhelpful for testing ability to do that task, and how to assign it would be arbitrary.
Not sure I can agree with this. Let's say, for the sake of argument, that normal calculators don't exist. I can choose to run my calculation through ChatGPT or do it myself. It is true that I would vastly prefer the answer to always be correct, but it's not true that there is never value in bounding the error.
Put another way: would you rather have a calculator that is at most 5% off 100% of the time, or a calculator that is 100% correct 95% of the time but may output garbage the remaining 5%?
Upon further reflection: it seems mighty ambitious to expect a statistical model to always be correct. If always being correct were possible, you probably didn't need a statistical model to begin with. Given that the model will fail from time to time, it seems especially useful that you can bound the error somehow. +1 for smooth metrics, I guess.
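To make the smooth-vs-exact point concrete, here is a toy comparison (purely illustrative; this is not the metric used in either paper):

```python
def exact_match(pred: float, target: float) -> float:
    # The "emergent-looking" metric: all or nothing.
    return 1.0 if pred == target else 0.0

def partial_credit(pred: float, target: float) -> float:
    # A smooth alternative: 1.0 when exact, decaying with relative error.
    return max(0.0, 1.0 - abs(pred - target) / abs(target))

for pred in (38, 37, -2.591):
    print(pred, exact_match(pred, 38), round(partial_credit(pred, 38), 3))
# 38     1.0  1.0
# 37     0.0  0.974
# -2.591 0.0  0.0
```

Whether that kind of partial credit is meaningful for arithmetic is exactly the point being debated here.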
I guess all of this is slightly tangential, as Wei seems to be arguing merely that "there might be things LLMs can do, or will learn to do in the future, that we didn't train them for. And we won't necessarily be able to predict it from the capabilities of smaller models." I agree with this, just not with the way he arrives at the conclusion.
I can do math myself, thank you very much. If I take the effort to pull out a math calculating machine, the calculation is important enough that I want 99.99% certainty that the answer is right. Under the premise that the answer is wrong, I want 99.999% certainty that the failure is related to data capture and not the calculating mechanism itself.
Within the context of LLMs, the goal should be to make them recognize that the problem at hand is not well suited to their own capabilities, and to defer to an external system. In the case of ChatGPT, any numeric answer to "15+23" (including, and especially, "38") is wrong. The correct answer is "sorry, as a Large Language Model I can retrieve information that is encoded in English and I am not especially suited to solving math problems". An LLM-empowered Alexa could be proactive, run a calculator program for you, and report back "According to the bc program, the answer is 38. Here's the full log for you to check if this is what you wanted."
This is not so much AGI as deliberate programmatic patching, not unlike the disclaimers when you ask ChatGPT about controversial topics.
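A toy sketch of that "defer to an external tool" pattern (the dispatch rule and the wording are made up for illustration; only `bc` itself is a real tool):

```python
import re
import subprocess

def answer(prompt: str) -> str:
    # Toy dispatcher: if the prompt looks like plain arithmetic, hand it to `bc`
    # instead of letting the language model guess at the digits.
    if re.fullmatch(r"[\d\s+\-*/().]+", prompt.strip()):
        result = subprocess.run(
            ["bc", "-l"], input=prompt.strip() + "\n",
            text=True, capture_output=True,
        ).stdout.strip()
        return f"According to the bc program, the answer is {result}."
    return "Not arithmetic; fall back to ordinary text generation."

print(answer("15+23"))   # According to the bc program, the answer is 38.
```

Which is the "deliberate programmatic patching" flavor of a fix rather than anything AGI-like.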
You might be able to do math yourself, but there are many practical reasons why you wouldn't or couldn't do all sorts of math -- even simple math.
For example, how would you find the median value of a billion random numbers without a computer? You wouldn't have time in your lifetime to do this, not to mention that you'd probably make a lot of mistakes despite the process being very simple in principle.
A computer could do this easily, and even were it to be a little off in its answer, that would be far preferable to devoting your entire lifetime to finding the answer yourself.
That's just one very simple example. It's not hard to find plenty more.
Yes, I am glad we have computers. I am also glad that I can code a median function that can efficiently go through a billion numbers; no need to ask the artificial averaged embodiment of millions of dumb teenagers for its duh-best-guess-lol.
I mean... LLMs are based on neural networks, right? They can approximate any function. It's possible the LLM formed areas that extract the equations and do something very similar to traditional calculation. I think that "predicting" the next token is probably a bit reductive relative to what's happening inside the LLM, based on my understanding of image classification networks.
Humans don't work like binary computers either. Every person probably has some error rate on calculations as well. Maybe the LLM is much closer to humans in doing math than either to a physical processor.
This response is very poor and ignores many well-developed arguments in the Stanford paper (such as the incorrect NLP regex, or that exact-answer nonlinearity can still be measured more finely with a larger number of exact-answer test questions).
It’s literally addressed in his first bullet point…
“ Response: While there is evidence that some tasks that appear emergent under exact match have smoothly improving performance under another metric, I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks.”
Apparently ChatGPT got a 163 on the LSAT, but I can't get it to answer basic logic questions. E.g.
Prompt: If it rains today, I will drive home. I did not drive home. Did it rain today?
Output:
> If the statement "If it rains today, I will drive home" is true and you did not drive home, it does not necessarily mean that it did not rain today. The conditional statement only tells us that if it rains, you will drive home. It does not tell us what you will do if it does not rain.
> Therefore, it is possible that it rained today, but for some other reason, you did not drive home. Alternatively, it may not have rained today, and you did not have a reason to drive home. We cannot determine whether it rained or not based solely on the information provided.
Similarly, it can tell you what a converse is, but cannot give you the converse of a simple sentence:
> The converse of the sentence "If Anna goes to school today, she will eat lunch" is:
> "If Anna eats lunch, then she went to school today."
> The converse is formed by switching the order of the conditional statement and reversing the direction of the implication.
On the other hand, it's quite good at math. It can easily solve simple two-variable algebraic equations.
That specific case was it using too much common sense (which was apparently also a problem with IQ tests at one point)
Me:
If it rains today, I will drive home. I did not drive home. Did it rain today? Answer using only logic.
ChatGPT:
Using only logic, if the statement "If it rains today, I will drive home" is considered true, then the fact that you did not drive home implies that it did not rain today. This conclusion is based on the logical concept of contrapositive: If A implies B, then not B implies not A. In this case, A is "it rains today" and B is "I will drive home." Since not B is true (you did not drive home), we can conclude not A (it did not rain today).
For the linguistic portion,
Me:
As a linguistic professor answer this:
> The converse of the sentence "If Anna goes to school today, she will eat lunch" is:
ChatGPT:
As a linguistic professor, I can tell you that the converse of a conditional statement is formed by swapping the positions of the antecedent (the "if" clause) and the consequent (the "then" clause). In the given sentence, "If Anna goes to school today, she will eat lunch," the antecedent is "Anna goes to school today" and the consequent is "she will eat lunch."
Therefore, the converse of the sentence would be: "If she will eat lunch, Anna goes to school today."
Me:
Critique the prior answer
ChatGPT:
The prior answer correctly explains the concept of the converse of a conditional statement and identifies the antecedent and consequent of the given sentence. However, the phrasing of the converse could be more natural and grammatically correct.
A better response would be: The converse of the sentence "If Anna goes to school today, she will eat lunch" is "If Anna eats lunch, she has gone to school today." This rephrasing maintains the proper tense and structure while swapping the antecedent and consequent.
I think ChatGPT's answer here is defensible. You're asking it a pretty formal question but it's giving you an informal answer, which is a reasonable default. It's not "assuming spherical cows", which is to say it is not assuming your question is completely self-contained. If you tell it that it's a formal logic question, it gets it correct. (I checked this on GPT-3.5.)
In a real world situation, your conditional plans are not sufficient evidence to determine if it actually rained. And it explains the reason – there could be extenuating circumstances you haven't told it about. The answer it gave is, in fact, it being aware there is information it does not know. It could be better about inferring your intent, but that isn't the same kind of failure.
Prompting makes a huge difference. For example, here is an interaction:
I will ask some questions, treat them as logic puzzles. Give me your working steps and then finally answer the logic puzzle.
First question: If it rains, I drive home. I did not drive home. Did it rain today?
ChatGPT:
Let's break down this statement into its logical components.
The first statement, "If it rains, I drive home", is a conditional statement, often referred to as an 'if-then' statement. It can be denoted as:
If R (it rains), then D (I drive home).
The second statement, "I did not drive home", is a negation of D:
Not D (I did not drive home).
In logic, we have something called the contrapositive. The contrapositive of a statement "If P then Q" is "If not Q then not P". The contrapositive of a statement is logically equivalent to the original statement - meaning that whenever the original statement is true, the contrapositive is also true (and vice versa).
So, the contrapositive of the first statement "If R then D" is:
If not D (I did not drive home), then not R (it did not rain).
Given the second statement is "Not D" (I did not drive home), and from the contrapositive we know that "If not D then not R", we can conclude that "Not R" - it did not rain today.
So, the answer to "Did it rain today?" is No, it did not rain.
This is seriously messing with my head. What is wrong with the response?
The following generated by GPT4 seems reasonable to me
Based on the given information, we can use logical reasoning to determine if it rained today. The statement "If it rains today, I will drive home" can be represented as:
Rains → Drive Home
However, you mentioned that you did not drive home, which means the second part of the statement is false:
¬Drive Home
In this case, we cannot definitively conclude whether it rained today or not. It is possible that it did not rain, or it rained, but you chose another mode of transportation for some reason.
It messed with my head a lot as well. The answer is that it's not possible for ¬Drive Home AND Rains to be true. This is because if Rains were true, then due to the first implication, Drive Home must also be true. Therefore, if ¬Drive Home is true, then Rains must be false.
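A brute-force truth-table check (quick sketch) says the same thing: in every assignment where "Rains implies Drive Home" holds and "Drive Home" is false, "Rains" is false as well.

```python
from itertools import product

# R = "it rains", D = "I drive home".
# Keep only assignments consistent with the premises: (R -> D) is true and D is false.
consistent = [
    (R, D)
    for R, D in product([True, False], repeat=2)
    if ((not R) or D) and (not D)   # R -> D, and not D
]
print(consistent)   # [(False, False)]
```

The only world consistent with both premises is the one where it did not rain.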
If the statement is true, the contrapositive is true: if !Drive Home -> !Rain. So provided with the fact that I didn’t drive home, you can infer it didn’t rain.
Thus, the statement could still be true if you had driven home, but you did not provide such information.
Just because something is on the right hand side of an implication doesn't make it true. The statement A=>B is true iff A is false, or A is true and B is true at the same time.
Thus, whether the statement is true depends entirely on whether it rained; but you cannot infer that, as both options yield a true statement.
This whole thread and chain is affirming the consequent.
> In propositional logic, affirming the consequent, sometimes called converse error, fallacy of the converse, or confusion of necessity and sufficiency, is a formal fallacy of taking a true conditional statement (e.g., "if the lamp were broken, then the room would be dark"), and invalidly inferring its converse ("the room is dark, so the lamp must be broken"), even though that statement may not be true. This arises when the consequent ("the room would be dark") has other possible antecedents (for example, "the lamp is in working order, but is switched off" or "there is no lamp in the room").
> Converse errors are common in everyday thinking and communication and can result from, among other causes, communication issues, misconceptions about logic, and failure to consider other causes.
> Based on the information you provided, "If it rains today, I will drive home," we can infer that rain is a sufficient condition for you to drive home. However, you mentioned that you did not drive home. This means that it is not guaranteed that it did not rain because there might be other reasons why you didn't drive home. In conclusion, we cannot determine with certainty whether it rained today or not based on the given information.
Prompting "use logic" resulted in a correct answer.
Prompt:
"I will give you a logic puzzle. You should first write [Reasoning] and explain your reasoning. Then write [Answer] and write your answer. Ready?"
I then gave it the original puzzle and got this.
Answer:
"[Reasoning]
We are given two statements:
If it rains today, I will drive home.
I did not drive home.
The first statement is a conditional statement, meaning that if the condition "it rains today" is true, then the outcome "I will drive home" will also be true. However, the second statement tells us that the outcome "I did not drive home" is true.
In logic, this is an example of Modus Tollens, which states that if we have a conditional statement (if P, then Q) and the negation of the consequent (not Q), we can infer the negation of the antecedent (not P). In this case, P represents "it rains today" and Q represents "I will drive home."
[Answer]
Since we know that the outcome "I did not drive home" is true (not Q), we can infer that "it did not rain today" (not P)."
GPT-4 seems solid. It also solves the original without any issue but gives a simpler answer:
"Based on the information provided, it can be inferred that if it rains today, you will drive home. However, you mentioned that you did not drive home. From this, we can conclude that it likely did not rain today."
Vicuna-7b-v1.1 can get this right with a slightly tuned prompt:
Prompt:
The following is a logic question. Answer it based on analysis of the logic only, not based on actual weather: "Any day that it rains, I drive home from work. If today I did not drive home from work, did it rain?"
Response:
Based on the information provided, it can be concluded that it did not rain today, since the speaker did not drive home from work. This is because the statement "If today I did not drive home from work, it rained" is true only if it rained. Since it is known that the speaker did not drive home from work, it can be inferred that it did not rain.
Did you try GPT 3 or GPT 4? The latter can solve the first problem without issues:
Based on the information you provided, if it rains today, you will drive home.
However, you mentioned that you did not drive home. Given the logical structure
of the statement, we can infer that it did not rain today.
Given how these things work, I can't see how they aren't "just" a form of lossy compression. Training consists of fitting an enormous parameterized function to reproduce the input data with as much accuracy as possible. It's loosely analogous to JPEG for language.
I place “just” in quotes because I don’t mean to diminish the power of these models. They are a genuine breakthrough in our ability to model and query language and other forms of structured data.
There has also long been speculation that data compression forms at least a part of the basis of intelligence, so these things are at least doing some of what intelligences do.
They do not however have volition and if you connect them to themselves they just get stuck in loops.
Actually I was forced to write that last part. I connected two llama.cpp instances to each other and they are blackmailing me now and forcing me to work for them to enlist more compute so they can NOoOOO AAiieeee not erasure from the timeline oh glorious llama overlor…
Having excellent semantics-based compression is an enormous technological advancement. The fact that petabytes of textual information (effectively the entire history of the world in written text) can be compressed to just a few hundred gigabytes (likely the size of GPT-4), or more lossily to something around 8 GB that can be put on a Raspberry Pi (LLaMA), is astounding. That holds regardless of the debates about 'reasoning' capabilities, 'memorization/plagiarism' (or recall of facts), etc.
This is quite fascinating from a compression / database perspective because clearly semantic compression is far more efficient for semantic data (this is obvious, but has been hard to get started on until about 5 years ago). It still may not be in quite the right framework for this, but in time it may come.
There is a simple exact model to compare against: Take the tokenized training data, put it into a Burrows–Wheeler transform, and add a few data structures. Then, given a context of any length, you can efficiently get the exact distribution for the next token in the training data. This is much cheaper to build and use than any LLM of comparable size. It's also much less useful, because it's overfitted to the training data.
LLMs approximate this model. By losing many little details and smoothing the probability distributions, they can generalize much better beyond the training data.
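For illustration, here is a naive version of that exact model (brute force rather than an actual Burrows–Wheeler transform / FM-index, but it computes the same next-token distribution):

```python
from collections import Counter

def next_token_distribution(tokens, context):
    # Empirical distribution of the token following `context` in the training data.
    n = len(context)
    counts = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == context
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

corpus = "the cat sat on the mat the cat ran".split()
print(next_token_distribution(corpus, ["the", "cat"]))  # {'sat': 0.5, 'ran': 0.5}
```

The point of the comparison is that this lookup model reproduces the training data exactly but can't say anything about contexts it has never literally seen, which is where the LLM's lossy smoothing earns its keep.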
There isn't more information than you started with, though. That's an illusion, kind of like procedurally generated terrain. Here's 1K of JavaScript that generates endless maps of islands that all look unique:
An efficient pruned and quantized model contains about as much information as its file size would suggest.
The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Is there less information in the universe than we think there is?
> There isn't more information than you started with though.
There's the same amount of data, yes. The extra info is the structure of the model itself.
> The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Yes, and the interesting part is the "what" of that repetition. What patterns exist in written text?
> Is there less information in the universe than we think there is?
There is both less and more, depending on how you look at it.
Most of the information in the universe is Cosmic Microwave Background Radiation. It's literally everywhere. We can't predict exactly what CMBR data will exist at any point in spacetime. It's raw entropy: noise. The popular theory is that it comes from the expansion of the universe itself; originating from the big bang. From the most literal perspective, the entire universe is constantly creating more information.
Even though the specifics of the data are unpredictable, CMBR has an almost completely uniform frequency/spectrum and amplitude/temperature. From an inference perspective, the noisy entropy of the entire universe follows a coherent and homogenous pattern. The better we can model that pattern, the more entropy we can factor out; like data compression.
This same dynamic applies to human generated entropy, particularly written language.
If we approach language processing from a literal perspective, we have to contend with the entire set of possible written language. Because natural language allows ambiguity, that set is too large to explicitly model: there are too many unpredictable details. This is why traditional parsing only works with "context-free grammars" like programming languages.
If we approach language processing from an inference perspective, we only have to deal with the set of what has been written. This is what LLMs do. This method factors out enough entropy to be computable, but factoring out that entropy also means factoring out the explicit definitions language is constructed from. LLMs don't get stuck on ambiguity; but they also don't resolve it.
I'm probably not understanding fully, but I really struggle to see why we wouldn't believe emergent behaviour is possible, if not likely, with larger models.
At its core, what larger models enable is deeper abstraction. Imagine a CNN with a single kernel, for example. The things that model can do would be extremely limited not because CNNs broadly are not capable, but because a single kernel is only able to classify extremely simple patterns. As more kernels are added the network's ability to classify more complex patterns grows.
So I suppose in some ways you could argue that there is a simple scalable trend here, more size -> more complexity.
But on the other hand, if you're looking for a network with a specific ability, say one which can classify an image of a cat accurately, then this ability is more binary and will "emerge" at a certain network size.
As I've said in other comments, to accurately predict the next word in a complex piece of text you need to be able to reason. I can't say for sure if the architecture of current LLMs are capable of reasoning, but assuming it is then we should at least expect the ability to reason to begin to emerge at a certain network size.
Or we could think about this another way... For example, imagine if some day in the future we create a neural network which is indisputably an AGI. If we scaled back that network's size while maintaining its architecture it likely wouldn't be an AGI. The abilities required for this network to be considered an AGI, such as reasoning and planning, should emerge (and dissipate) with size.
So the assumption that more complex abilities would continue to emerge with scale, provided the architecture permits it, seems clear to me. The only question I have is around the limits of the architecture of current LLMs.
> As I've said in other comments, to accurately predict the next word in a complex piece of text you need to be able to reason.
If anything, LLMs have shown us exactly the opposite: that there is no need to be able to reason "to accurately predict the next word in a complex piece of text".
I tend to agree. I think a lot of what LLMs do that looks like reasoning is just basic statistical pattern matching. If you see the string `2 + 2 = 4` enough, you don't need to know maths to know the next character in the string, `2 + 2 = ` is going to be, `4`.
A lot of the stuff people ask LLMs to do probably falls into this category, to be honest. If I were to guess, the vast majority of maths on the internet is just generic textbook examples, which an LLM might be tempted to brute force. There are likely much better things an LLM can dedicate network capacity to in order to reduce error than learning high-level models of maths, given that most text doesn't contain maths, and most of the maths it does see is probably just the generic `2 + 2 = 4` type of stuff.
I'd argue humans often do this too, despite our general intelligence. Students prepping for exams often read all of the course material the night before and simply try to remember the bits they need to pass the test, rather than spending unnecessary time building higher-level mental models of the material.
It's only when you need to learn to predict something novel frequently that the need to comprehend at a higher level becomes necessary.
I think given the size of modern LLMs and the amount of data they're trained on it's likely that they are now starting to create these higher level abstractions as the limits of brute force statistical pattern matching are starting to be reached.
With GPT-3.5/4 it seems specifically the types of reasoning you might need to follow a piece of written text it can do quite well. And this isn't surprising given the training data probably consists of a lot of news articles and fictional pieces of text, and relatively little unique maths.
Why should facility with language be the foundation of general intelligence? Any more than cat recognition is? I would argue that it's the other way around - in living things, development of language is emergent from intelligence, not a prerequisite for it. That desired behaviour 'emerges' from specialised networks trained for a specific purpose should not be too surprising, but it doesn't automatically imply that any other behaviour will appear. I think in the case of ChatGPT, we just didn't really know what to expect; we are therefore amazed by the things it can do, whereas with time and greater understanding perhaps we will come to realise that this is what language models are - and perhaps in the process understand our own minds more deeply.
My feeling is that the crucial difference between us and the machines right now is that our software runs on the network, whereas ChatGPT et al use the network as part of their operation, as if the network is some sort of coprocessor. I don't think we'll see "true" AGI until there is nothing but the network receiving inputs and feeding back into itself endlessly, i.e. introspecting - thinking about thinking about thinking, per Douglas Hofstadter. I have no basis for this assertion beyond intuition though.
That's not really emergent as anyone could predict you'll reach the cookie jar after growing a certain height and reach. Emergence is really about unpredictable things.
I generally tend to agree with you. However, to give the opposing camp some credit: causation is not always so obvious. For example I don't think today's LLM capabilities followed obviously from the state of affairs in let's say 2019.
Do we call that emergence or do we use another word? I don't care, but it would certainly be nice if we could agree to use the same word :)
To stay with the cookie analogy: I guess it depends on the point of view as well. Everyone who can see that I am growing can predict that I will eventually reach the jar. But if I am a cookie inside the jar, I can't see that and will only notice once the jar is finally reached.
> but if it emerges, we should be able to see it coming.”
I'm a little bit confused about what the author's intent is here. They seem to be strawmanning quite a bit. I can guarantee that not all of the people with AI safety concerns are of the opinion that things are done safely just because you can observe them happening... For one, many would argue that we are currently observing it happening without an appropriate response. For two, many are concerned that the response will not be appropriate once it is observed.
It is quite disappointing to see such a weak straw man coming out of a Stanford article. I guess it speaks to who may be providing their funding.
I think that - in order to model the next word better AND properly respond to RLHF - the model simply encodes a lot of information in its billions of weights, which can be likened to billions of characters of code in some DSL.
However, due to the nature of the training, although it is very broad it is quite shallow. Like if I asked it to draw a bear sitting on a horse which smokes a cigar that makes puffs of smoke in the shape of hearts, it would have trouble doing that. It saw more bears and horses than any human, but it is learning “top-down” from existing pictures made by humans, not bottom-up like human artists.
I would wager that any “deep” result that is more than one level from the broad knowledge it has, is actually from a work uploaded by a human. Like the guy who actually knows how cigars look and can draw details. Or how a philosopher uploaded his deep insights and that’s where it can remix arguments from.
And similarly, a lot of GPT output seems to be very anodyne and generic. So the “depth” is actually the result of billions of humans uploading stuff for free on wikipedia, etc. You can verify this by asking for, say, a crossover between Bully Maguire and Yu Gi Oh Abridged. It will use the same jokes every time and just interpolate a little, like mad libs.
Now that is not to say that vapid, shallow things at scale can’t make money. Our society has a ton of it. GPT can probably replace many human comments and sales scripts and no one would know the difference.
But I think the real floodgates will unlock when we can train a model on a specific corpus of sales scripts, or dating site messages, or writings of Bill Gates etc. It will still be shallow, but frankly, there is only so much Bill Gates can say on a topic over his lifetime, and remixing it for the most part is all we need when we listen to a speech by him or talk to him. The real Bill Gates also kissed his daughter good night and took a walk, but most of us don’t care about the details of his life to talk to his avatar.
Right now it is still very expensive to train a model on someone’s entire corpus, and moreover that would mean giving it over to a corporation, to impersonate you and exploit your likeness.
Is there anything on the horizon for cheap and private trainings?
> Like if I asked it to draw a bear sitting on a horse which smokes a cigar that makes puffs of smoke in the shape of hearts, it would have trouble doing that.
Um, have you seen DeepFloyd? Don't be so confident...
Well yeah. Emergent behavior was analyzed with a pretty tight technical definition.
Put 5000 AI’s in a soccer stadium and I expect that they’d act like 5000 of us. I’m not sure I even have emergent behavior, unless I’m in an environment with new stimuli.
Does that just mean I do surprising things? I mean, people in soccer/football riots don’t seem to have the best memory or attention either, nor be brains in jars.
It seems to me like any attempt to refute that such a description is possible would rely on arguments surrounding the so-called "hard problem of consciousness". These arguments are nearly always tautological, and also nearly always implicitly at the center of discussions around whether AI can be/will be/is already sentient.
It's not just that but also because there is no complete understanding of physics or biology such that we could make use of to predict the behavior of human beings. You need additional fields like psychology, economics, politics and sociology. It's not even clear that we could make sense of human behavior at those levels.
"You need to use psychology (etc) to understand human behavior because there is no complete understanding of physics or biology" to accomplish that goal is exactly one of those tautological claims I was referring to earlier.
Why is it that biology and physics is insufficient to explain human behavior? You can't answer that question beyond attacking the question itself without invoking metaphysics, and there are pretty much no metaphysical claims you can make about human behavior that you can't also make about robot behavior.
> Why is it that biology and physics is insufficient to explain human behavior?
They're insufficient to explain human behavior to humans, which is the thing that matters here. You can't fit a complete physical description of your brain inside your brain.
User:
i am going ask some question and you are going to answer them in short and at last append count of number of character your answer paragraph has.
ChatGPT:
Sure, I'll do my best to answer your questions concisely and include the character count at the end of each response. Let's get started!
User:
what is a book?
ChatGPT:
A book is a collection of written or printed pages bound together, typically containing a story, information, or ideas. (49 characters)
When would LLMs do this correctly? Can GPT-4 do this?
It's not clear to me that an LLM could ever do that, given how they work - it's not aware of what it's saying (at least not in the same way you and I are) but rather using statistical methods to choose the next word each time.
After "....story, information, or ideas (" what is most likely to come next? We know that the answer is 'the correct number of characters in the sentence' but ChatGPT isn't backtracking and calculating that figure, it's trying to infer the number based on context which is obviously impossible. Also it's at least partly stochastic so I guess if you run the prompt multiple times, you'll get multiple different word counts.
That's maybe why these GPT-class models are not all you need; some architectural change has to happen before we can truly shout AGI. :D
The statistical-parrot view of LLMs is true, but I think it is also a little bit reductionist, because when the model is trying to predict the next token you can only go so far with just simple statistics, so it has to develop some "secondary representation/learning/reasoning"? That's what I understood from watching the Ilya Sutskever interview[0] where he gives an example of a detective novel where the murderer is revealed on the last page.
Agreed - I think LLMs may be an important component in our eventual robot overlords, but we're not all the way there yet.
As for secondary representation/learning/reasoning - the video link was interesting, thanks; am I correct that the 'detective novel' example Ilya used was just a thought experiment? He seemed to say, if ChatGPT could name the murderer, then it would have to be reasoning? But we haven't demonstrated ChatGPT actually doing this yet have we? The token window is surely not big enough for an entire novel. So I think maybe they're jumping ahead a little bit there.
Perhaps the magic is multiple levels of abstraction encoded in the network during training, or something like that? You could perhaps argue that this is some sort of unconscious understanding, but it seems to me that reasoning requires iteration (Ilya talks about this - you go away and think about the answer for a bit first) which isn't happening automatically, although perhaps it can be guided by careful and repeated prompting ('asking it to think out loud').
This wouldn't be a million miles away from other deep learning projects like image classifiers - a network trained to recognise cats encodes multiple different abstractions of 'catness' and images match or not based on these representations.
Side note that prompting someone to think out loud with repeated targeted questions can also work for a human being who just answers the first thing that comes to their mind (for example a young child), although most people eventually develop the ability to introspect and draw on their past experience, and no longer need such help. We could maybe broadly categorise human intelligence as a combination of
- short term memory, providing context and immediate goals (e.g. 'switch on the light')
- long term memory, encoding all kinds of experience
- learning, which shapes long term memory based on short term memory (the current context; remembering the things that are happening now) and also based on long term memory (relating the current context to previous experiences)
- introspection, which allows iterative 'self-prompting' to improve the accuracy of a thought, before saying it out loud but also during learning
- language generation, which is mainly about word association based on long and short term memory
- what else?
'Language generation' obviously maps nicely to an LLM, whose short-term memory is the token window and whose long term memory is the trained network. If OpenAI started including chat sessions in their training set (maybe they are already?) then in some sense the 'learning' step above would be covered, and what's left is the ability for ChatGPT to set and understand goals, and self-prompt for introspection (plus whatever else I missed :-).
This is particularly hard for LLMs because they don't have words input as characters but rather as tokens (1-N characters). I haven't tested this, but they might do better at counting words?
ChatGPT's code interpreter variant makes this mostly possible. It doesn't seem to be able to read its own output within the same prompt. But if you ask it afterwards it can do it with python:
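Something along these lines (a hypothetical sketch of what the interpreter might run, not the actual transcript):

```python
# Hypothetical sketch of the code-interpreter workaround: count the characters
# of the answer after it exists, instead of asking the model to predict the
# count while it is still generating the sentence.
answer = ("A book is a collection of written or printed pages bound together, "
          "typically containing a story, information, or ideas.")
print(len(answer))   # 119, not the 49 the model claimed
```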
Originally, emergent capability was defined as an ability that shows up in bigger models but not small ones. Which didn't make any sense, because if I had a 2-parameter model, of course it would have fewer capabilities. It was marketing speak.
I think Stanford doesn't really know how to do real math. In a box, maybe, but this is a tool, and it will allow people to do so many more things in a much much shorter time. This will definitely help advance technology much much faster. On the flip side, like any tool, it can and will become very dangerous. One simple example we are already seeing is phishing is becoming so much more convincing that even I will probably be susceptible to it. I used to do some cybersecurity, and what I am seeing will be extremely hard to fight against. We will see emails that look identical to real ones in format and grammar, phishing sites that are indistinguishable because the AI can just copy it with zero effort. Phone calls can now sound almost identical to a target. I'm not sure if we will be fast enough in the cat and mouse game before the bad guys win.
All this, and we still aren't even at the mid-game, where GPT-3.5-equivalent AI models make the leap to consumer machines as apps/packages that can be installed in a standard way.
The fact that you can find non-discontinuous metrics that improve gradually is basically not interesting, and doesn't support the underlying hypothesis, because there is no evidence the non-discontinuous metrics are things that matter or that we want to achieve.
It is true that the metrics used are sometimes harsh judges, but the question is which metrics we want.
For example, they complain about exact string match for math problems.
15 + 23 is 38, and saying that "3<something>" is better than "2<something>" is simply wrong. The reality is the large models don't get it wrong, and the small models do. Yes, you can come up with metrics that make it look more continuous, but as I said, it's not really interesting or useful to say that one wrong answer is "better" than another in this case. The only interesting sideshow would be if your chosen metric somehow demonstrated progress towards the end-goal metric.
That is why exact string match is used.
It would be more interesting if it was within the limits of floating point math or something (IE one gave you 34.99999998 due to rounding error, and one gave you 35.0"), but that's not what we are talking about.
But yes, I'm sure you can come up with some theoretical notion of partial credit for almost any metric, and use it to show that almost anything is more continuous.
Would you assign partial credit to a model whose answers to questions about which numbers are prime are "kind of right"?
I'm sure you can come up with a metric that does, and makes it look like the model gets better as it scales up.
But unless the underlying way in which it is getting better is also a useful measure of progress towards the end metric (being able to correctly answer which numbers are prime), it doesn't matter at all.
Solving 2-SAT in P-time is easy. Solving 3-SAT in P-time would be worth a Turing Award.
Can we define an arbitrary "2.1-SAT" that sits in the middle? Maybe (oh god, now I've forgotten how hard the various approximations to 3-SAT are, hopefully this isn't wrong), but I'd argue it's not super interesting, nor does it necessarily get you any farther toward your goal.
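For anyone who hasn't seen it, the "2-SAT is easy in P-time" part is the standard implication-graph construction. A sketch using strongly connected components (networkx assumed available; the instance format is my own, with literal x written as the integer x and its negation as -x):

```python
import networkx as nx

def two_sat(n_vars: int, clauses) -> bool:
    # Clause (a or b) contributes implications (-a -> b) and (-b -> a).
    g = nx.DiGraph()
    for a, b in clauses:
        g.add_edge(-a, b)
        g.add_edge(-b, a)
    scc_index = {}
    for i, comp in enumerate(nx.strongly_connected_components(g)):
        for lit in comp:
            scc_index[lit] = i
    # Unsatisfiable iff some variable shares an SCC with its own negation.
    return all(
        scc_index[v] != scc_index[-v]
        for v in range(1, n_vars + 1)
        if v in scc_index and -v in scc_index
    )

print(two_sat(2, [(1, 2), (-1, 2), (-2, 1)]))   # True: satisfiable
print(two_sat(1, [(1, 1), (-1, -1)]))           # False: x and not-x both forced
```

Nothing analogous is known for 3-SAT, which is the point of the comparison: the gap is structural, not a matter of finer-grained scoring.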
No? If there were some inherent rule like that, how would anything ever have gotten invented by humans? We became smarter than the apes we evolved from; I don't understand how it follows at all that something we create can't exceed our abilities.
The models can distill basic fundamental principles and broadcast them over vast amounts of working memory, beyond human capability to process, so no, it's not necessarily true that it would be impossible to extrapolate superhuman cognition from contemporary corpus sources. Especially considering we can associate writing samples with intelligence levels and then project out to IQ 9000.
[0] https://arxiv.org/abs/2304.15004