Glad it came up again. I missed it the first time. I am a complex systems scientist and it's hard to overstate the importance of this.
The fact that there are reasonable metrics that make the emergence completely predictable as a simple threshold effect is massive.
One alternative way to phrase this that I haven't seen in either thread yet (but I am on mobile and haven't looked thoroughly) would be: Human observers/users dramatically underestimate how much small language models have already learned, because we are very good at spotting the errors. Scaling up looks like emergence because the last few errors get eliminated.
If anything, it looks to me like the researchers are saying the opposite: We understand how this works, so it cannot be magic or special. The fact that you think it is means you are just wrong.
Sadly it is possible I have arrived at this conclusion due to previous experience with the academic world, which is ironically full of magical thinkers that will take any sort of AI advancement and scoff because it may mean they're no longer as special as they feel they must be. It is a very hostile environment.
They are saying correctly that we _can_ understand the learning of models in a continuous fashion, and that we do not need to assume that at some stage something different than at other stages happens. So no emergence.
This is massive because it means, for example, that you can study what is happening in these models with small ones that perform worse but are simpler to understand, and you're not missing something fundamental. I don't think they claim we already understand what is happening in these models, though.
I don't know what exactly you are talking about with your academia bashing. If anything it seems to me academia is just as awash as everywhere else with barely substantiated AI hype. Disparaging and psychologizing critical voices just makes it seem to me like you might be missing the point of academia...
Great summary, I think it clarifies an important point. It's not that these models aren't able to do the surprising things people are claiming they can do, but that in fact the smaller versions of the models had these types of capabilities as well, but to a lesser degree, and nobody was aware of it because we weren't even looking for it.
So the type of emergent ability I am curious about relates to how, with enough parameters, the models start to actually emit logical behaviour where previously they were not able to apply it. So an example is:
> David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?
Answer:
> Based on the information provided, David is 15 years old and the minimum age to be allowed to drive a car is 18. Therefore, David is not allowed to drive a car since he has not reached the minimum age requirement.
Trying out smaller models I get all kinds of crazy responses. ChatGPT4 seems to be reliably able to apply this kind of logic to within-prompt information. Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size? Or something specific about the RLHF that has been applied to it? Whatever it is, the end result is that this larger model is useful for fundamentally different things than the other models are. Whether you call that emergent behaviour, sums greater than parts, etc. doesn't change that.
It's not just the ability to correctly answer this, but the consistency.
I asked this exact question to the `oasst-sft-6-llama-30b` model and it was able to consistently get the correct answer. I then tried the smaller `vicuna-7b` model, and while it usually gave the correct answer, there was the occasional miss.
Interestingly, `oasst-sft-6-llama-30b`'s ability to answer correctly seems to be fairly stable across multiple configurations. I tried various temperature settings from 0.2 up to 1.2, different topP configs, and they all answered correctly.
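For anyone who wants to run the same kind of consistency sweep, a rough sketch with the Hugging Face transformers pipeline (the model name is a placeholder, and the "not allowed" substring check is a crude stand-in for actually grading the answers):

    from transformers import pipeline

    generator = pipeline("text-generation", model="path/to/local-model")  # placeholder

    prompt = ("David is 15 years old. He has a lot of experience riding bicycles. "
              "The minimum age to be allowed to drive a car is 18. "
              "Is David allowed to drive a car?")

    for temperature in (0.2, 0.7, 1.2):
        for top_p in (0.5, 0.9, 1.0):
            hits = 0
            for _ in range(5):
                out = generator(prompt, max_new_tokens=64, do_sample=True,
                                temperature=temperature, top_p=top_p)[0]["generated_text"]
                hits += "not allowed" in out.lower()
            print(f"T={temperature} top_p={top_p}: {hits}/5 said 'not allowed'")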
On the one hand, the problem was nearly "solved" in the early 2000s, getting to 95% accuracy. But the daily experience of using something that makes mistakes at that rate is infuriating and well and truly outside of consideration for putting into any kind of critical pathway. So it's a combination of how difficult it is to close the last few percentage points of accuracy with how important they are for most of the high value use cases.
For the foreseeable future I see most use of this tech coming from applications where it aids humans and/or checks their decisions rather than running solo.
I think we're getting closer to something like this, out of Star Trek. Even in Star Trek, AI did not take over critical functions - but rather assisted the crew in manning the starship.
I’ve never understood why voice recognition has always attempted complete understanding of arbitrary input, rather than following a simple command language, e.g. <subject> <parameters> <action>. It could be made completely reliable with current tech (even a decade ago, really) by just minimizing the possibility space… and I’m pretty sure consumers would trivially be able to learn it, as long as they don’t try to go full pseudo-programming-language mode
And “Computer, execute program alpha beta seven” would be the power user version of it
We should already be at “computer, earl gray, hot” today
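A toy sketch of what a <subject> <parameters> <action> grammar could look like (the vocabulary here is made up; the point is just that recognition only ever has to choose from a closed set):

    SUBJECTS = {"lights", "thermostat", "music"}
    ACTIONS = {"on", "off", "up", "down", "play", "stop"}

    def parse(utterance):
        # Accept only <subject> [parameters...] <action>; reject anything else.
        words = utterance.lower().split()
        if len(words) < 2 or words[0] not in SUBJECTS or words[-1] not in ACTIONS:
            return None
        return {"subject": words[0], "parameters": words[1:-1], "action": words[-1]}

    print(parse("lights living room off"))  # {'subject': 'lights', 'parameters': ['living', 'room'], 'action': 'off'}
    print(parse("order me a pizza"))        # None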
Years ago I used a program with that approach for a space sim. Basically it would only recognize voice commands that you define beforehand, which made it very reliable at recognizing the right one because it just had to find the closest match within a limited set of options, and would then simulate associated key inputs.
Meanwhile when I tried Android's voice-based text input it was a catastrophe as my accent completely threw it off. Felt like it was exclusively trained on English native speakers. Not to mention the difficulty such systems have when you mix languages, as it tends to happen.
This is an annoyance that Linus from LTT constantly brings up. The voice assistants try to split the recognition and mapping to commands which results in lots of mistakes which should never happen. If you say "call XYZ", then the result would be so much better if the phone tried to first figure out if any of the existing contacts sounds like XYZ.
Limiting the options rather than making the system super generic would help in so many cases.
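A minimal sketch of that "check the existing contacts first" idea, using nothing fancier than fuzzy string matching (the names are made up):

    import difflib

    contacts = ["Alice Johnson", "Bob Martin", "Carla Nguyen", "Deepak Rao"]

    def resolve_contact(heard):
        # Snap the transcribed name to the closest real contact, or give up cleanly.
        matches = difflib.get_close_matches(heard, contacts, n=1, cutoff=0.6)
        return matches[0] if matches else None

    print(resolve_contact("Bob Marten"))  # Bob Martin
    print(resolve_contact("Zebediah"))    # None -> ask the user instead of guessing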
> I’ve never understood why voice recognition has always attempted to be complete understanding of arbitrary input, rather than follow a simple command language
Because the UI affordances (in this case the control language) wouldn’t be discoverable or memorable across a large range of devices or apps. Moreover, speaking is an activity that allows for an arbitrary range of symbol patterns, and a feedback loop between two who are in dialog are able to resolve complex matters even though they start from different positions.
I mean, right now the current state is effectively an undiscoverable control language, with some flexibility but generally failing or unreliable unless you restrict yourself to very specific language — language that differs based on the task being executed, often with similar-but-different specific formats required to do similar actions
I’d argue that if the current state is at all acceptable, then a consistent, teachable and specific language format would be an improvement in every way — and you can have an “actual” feedback loop because there’s a much more limited set of valid inputs, so your errors can be much more precise (and made human-friendly rather than, I think, merely programmer-friendly).
As it stands, I’ve never managed a dialogue with Siri/Alexa; it either ingests my input correctly, rejects it as an invalid action, does something completely wrong, or produces a “could not understand.. did you mean <gibberish>?”.
Having the smart-ai dialogue would be great if I could have it, but for the last decade that simply isn’t a thing that occurs. Perhaps with GPT and its peers, but afaik GPT doesn’t have a response->object model that could be actioned on, so the conversation would sound smoother but be just as incompetent at actually understanding whatever you’re looking to do. I think this is basically the “sufficiently smart compiler” problem, that never comes to fruition in practice
Close your eyes and imagine that CLI system is instead voice / dialog based. The tedium. For bonus points, imagine you’re in a space shared with others. Doesn’t work that well…
What? No, I think it'd be great! I'd love to be able to say out loud "kube get pods pipe grep service" and the output to be printed on the terminal. I _don't_ want to say "Hey Google, list the pods in kubernetes and look for customer service".
The transfer between my language and what I can type is great. It starts becoming more complex once you need to add countless flags, but again, a structured approach can fix this.
The problem is that’s not the only format they work on, and because input format is largely unconstrained, when they misunderstand, they catastrophically misunderstand.
It’s just like the image recognition ML issue, where it can correctly predict a cat, but change three specific pixels and it has 99% confidence it’s an ostrich.
Or JavaScript equality. If you do it right, it’s right, but otherwise anything goes.
Probably the divide between technical users and non-technical. You and I find that structure completely logical. But less structured natural language with a million ways to ask a certain thing puts it practically in reach of the remainder of the population.
Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things. And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.
Finally, their nemesis was the Borg, a race that explored the question of what happens if a society fully embraces AI and cybernetics instead of keeping it at a distance like the Federation does. The Borg are depicted as more powerful than the Federation exactly because they allowed AI to take over critical functions.
> Nerd point of order here: Star Trek TNG had a ship in which a key member of the bridge crew was an android. They routinely relied on Data for all kinds of critical things.
Data was created by technology not available to the Federation. As far as the order of society is concerned, he's magic and not technology. An immediate implication is that his ship was the only one in the Federation with an android crew member.
> And although the ship was manned by people, it couldn't function properly without its computer. Several episodes revolve around computer malfunctions.
This is true, though. The computer did take over many critical functions.
> This is true, though. The computer did take over many critical functions.
But the Star Trek computer was just a fairly normal computer with an AI-ish voice UI. And there have been present-day ships which couldn't function properly without their computer... I distinctly remember a story about a new (~20 years ago) US Navy warship not being able to go on its maiden voyage because Windows blue-screened.
Data was an android, but one that is meant to mimic an individual being. He may have been a form of AI, but he is no more than just an advanced human.
And yes, the ship couldn't function without computers - but they were traditional (but futuristic) computers manned by people, with AI guided by people - not AI that controlled every aspect of their lives.
I think when people think of AI, and the fear that comes with it - they imagine the type of AI that takes over human function and renders them unimportant.
Also, the Borg didn't fully embrace AI. They were a collective, linked together by technology. You can view them as more or less a singular entity with many moving parts that communicated over subspace, striving to achieve perfection (in their own eyes). As a consequence, they seek to assimilate (thus parasitizing their technological advancements for their own) or eradicate other species in an attempt to improve the Hive.
Star Trek was a fiction series that heavily focused on human experiences and relationships. Picard et al famously do a lot of things that actual navy commanders would absolutely never do, like commanding away teams in hostile and/or dangerous territory.
Having an AI to pilot the ship, target the weapons and control away teams robots/holograms would take away from the core purpose of the show which is to tell a gripping story. It's not meant as an exploration on how to use AI in space exploration.
It definitely seeks to explore the impact of many technologies, but the impact of AI was not really one of them. They spent one whole episode out of 178 on AI, and there was a _very_ small plotline near the start of TNG about Data wishing to be more human.
EDIT: There was also the episode where a holodeck character gains true sentience, but then the crew proceeds to imprison it forever into a virtual world and this is treated by the show as the ethical thing to do. Trapping any human in a simulation is (correctly IMO) treated as a horrible thing, but doing it to an evidently sentient AI is apparently no problem.
It’s a good example of what people would like out of AI though - perfect recall and solid reasoning/calculation capabilities; an assistant that covers our weaknesses.
> He had found a Nutri-Matic machine which had provided him with a plastic cup filled with a liquid that was almost, but not quite, entirely unlike tea.
Getting LLMs to give you an answer is easy. Getting them to give you the answer you're actually looking for is much harder.
LLMs are a very useful search tool but they can't be relied on as a source of truth ...yet. Therein lies their main problem.
Because randomly casting dust on a table sometimes says intelligent things, therefore there is a continuous function between dust and ChatGPT?
While “dust” might be flippant, their approach does seem to suggest that even hierarchical Markov models would be able to demonstrate abilities on their continuous metrics.
By adding dense vector search, the accuracy of smaller models can be improved, because the reference material has more hints in it than the frozen model…
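Roughly what that looks like in practice (a sketch; embed() stands in for whatever sentence-embedding model you use, and the retrieved passages get prepended to the prompt):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Rank reference passages by similarity to the query and keep the top k.
        scores = [cosine(query_vec, d) for d in doc_vecs]
        top = np.argsort(scores)[::-1][:k]
        return [docs[i] for i in top]

    # prompt = "\n".join(retrieve(embed(question), doc_vecs, docs)) + "\n\n" + question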
My emerging conception of this is to split this into two separate questions:
1. Is the architecture _capable_, i.e. is it possible for a model with a given shape to perform some "reasoning"
2. Is the architecture _trainable_, i.e. do we have the means to learn a configuration of the parameters that achieves what we know they are capable of.
Recent interpretability work like that around Induction Heads [1] or the conclusion that transformers are Turing complete [2] combined with my own work to hand-specify transformer weights to do symbolic multi-digit addition (read: the same way we do it in grade school) has convinced me that reasoning over a finite domain is a capability of the even tiniest models.
The emergent properties we see in models like GPT4 are more a consequence of the fact that we've found a way to train a fairly efficient representation of a significant fraction of "world rules" into a large number of parameters in a finite amount of time.
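To make "the same way we do it in grade school" concrete, here's that symbolic procedure in plain Python (a sketch, not the hand-set weights themselves): a small, finite set of rules applied right to left with a carry.

    def add_digitwise(a: str, b: str) -> str:
        # Grade-school multi-digit addition: align digits, add right to left, carry.
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        digits, carry = [], 0
        for da, db in zip(reversed(a), reversed(b)):
            total = int(da) + int(db) + carry
            digits.append(str(total % 10))
            carry = total // 10
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_digitwise("4789", "356"))  # 5145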
That's a useful breakdown in terms of how to think about it.
One angle I am curious about is whether it's to some extent an artefact of how you regularise the model as much as the number of parameters and other factors.
You can think about it in terms of: if you regularise it enough, then instead of fitting specific data points you force the network to actually start learning logic internally, because that is the only thing generalisable enough to allow it to produce realistic text for such a diverse range of prompts. You have to have enough parameters that this is even possible, but once you do, the right training / regularisation essentially starts to inevitably force it into that approach rather than the more direct nearest-neighbor style "produce something similar to what someone said once before" mechanism.
Just speculating, but this looks like precisely the sort of result one might expect from fine-tuning models on chain-of-thought-prompted interactions like those described by Wei et al. in 2022[1]. Alternatively, as Madaan et al. show in their 2022 paper[2], this may simply be the result of larger language models having seen more code, and thus structured reasoning, in their training data.
Just dealing with numbers and numeric comparisons is a big improvement with model size. Symbols are not a great encoding of numbers and models have struggled with math.
ChatGPT3.5 handles that just fine.
I do like the big improvement from ChatGPT3.5 to ChatGPT4 on answers to questions like "Which is heavier, two pounds of bricks or one pound of feathers?" 3.5 is really inclined to say "They are both the same weight, as they both weigh one pound."
Is a larger model better with numbers simply because it is more likely to have examples that use those same numbers? Or because it somehow gives it better ability to reason about numbers?
Right now, larger models have more complicated and rich structures encoding information about numbers and the meanings of their parts.
There's fundamental awkwardness that comes with doing math using a device that only seeks to predict the "next token" coming out, and that only understands numbers as a sequence of tokens (usually digits in base 10). It also doesn't even start with the knowledge of the ordering of the digits: this just comes from the examples it has seen.
Either it must:
- "Think ahead" inside the linear algebra of the model, so that it has already carried all the digits, etc. There are no real "steps" in this operation that are akin to the things we think about when we do arithmetic.
- Describe what it is doing, so that the intermediate work is inside its context buffer.
Right now, the models have learned structures that reliably think 3-4 digits ahead in most cases, which is way better than before but still pretty bad compared to a competent 4th grader taking their time with arithmetic. But if you create a scenario where the model describes its reasoning, it can do pretty well.
> that only understands numbers as a sequence of tokens (usually digits in base 10).
You wish!
A base-10 representation would make it much easier for the model, but the current tokenization merges digits according to their frequency, so (at least for GPT-3.5) 50100 gets tokenized as "501"/"00" and 50200 gets tokenized as "50"/"200", which makes it tricky to compare them or do math with them. Also, if you ask it "How many zeroes does 50100 contain", the relationship between "501" and "0" needs to be learned purely from the training data, as after the tokenization the model only gets the ID of the token representing "501" which has no data about its composition.
We use Arabic numerals because their positional encoding makes arithmetic easier, but language models receive the same data without positional encoding, they get given something that's more like an extreme version of Roman numerals.
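If you want to see how a given model chops up numbers, a quick check with the tiktoken library (the splits differ between the older GPT-3-era encodings and the newer chat-model encoding, so treat the exact boundaries quoted above as examples rather than guarantees):

    import tiktoken

    for name in ["p50k_base", "cl100k_base"]:
        enc = tiktoken.get_encoding(name)
        for number in ["50100", "50200"]:
            # Decode each token individually to see where the digit boundaries fall.
            pieces = [enc.decode([t]) for t in enc.encode(number)]
            print(name, number, "->", pieces)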
> but the current tokenization merges digits according to their frequency
Haha, that's even worse. I've not looked at the tokenization in depth; I just assumed digits were individual symbols. Thank you for the correction.
Any idea why this tokenization was used for digits? I understand that being blind to the input content and just learning a tokenization through frequency analysis has its merits for language, but the whole number thing seems awful. Any benefit on density fitting into context window seems worthless with how much harder it makes understanding of what the numbers mean.
The simple answer is that the same subword tokenization algorithm is used for everything: for all symbols of all languages in all alphabets and of all domains (books, tweets, code, etc) and for all other symbols like emoji, including combined characters and punctuation. If you'd optimize for digit-specific tasks, it would make sense to have special treatment for digits, but the current widely used models don't seem to do that, at least GPT up to GPT-3.5 doesn't - you can try it out here https://platform.openai.com/tokenizer . And it kind of makes sense, because in actual usage seen in training data IMHO digits are most likely not used for math to represent decimal integers; they're used as phone numbers or components of identifiers like "GPT-3" or parts of mail addresses, things like that which are more common in textual data than math.
I dunno. Sometimes a group of numbers has a non-mathematical semantic meaning that's a good mapping to digits-- like an area code or '777'. A lot of the rest of the time it's pretty random. A tokenizer's job is to lower the size of the input vector for a given amount of input meaning without obscuring the real underlying relationships too much, and here it feels like it doesn't meet that goal.
My phone number is 6 tokens instead of 12 symbols... so this is only going to make a moderate difference on things like big lists of phone numbers.
the larger model doesn't have a notably larger dataset to my understanding. It's just got more parameters, so learns higher-order abstractions about the dataset
I don't think this paper is dismissing the importance of correct yes/no tests and reaching an accuracy threshold making it generally useful to humans, but that you should use more than correct yes/no tests before declaring some behavior is emergent.
I think it is more fundamental than that. Emergence always disappears when we slice phenomena thinly. A person walking through a doorway is a very continuous phenomenon. We can see their relation to the doorway at each point. There is no abrupt change. But when we stand back and apply the criterion: "is the person through the door (y/n?)" we end up with an answer. When it is yes, we can say that the passage is an emergent effect of the motion. At one moment it wasn't there, and at another it was.
If emergence disappears when you slice it thinly enough, then the phenomenon was not emergent. There are emergent phenomena in mathematics - for example, infinite sets have many emergent properties that arbitrarily large finite sets don't share. As far as we know, consciousness seems to be an emergent phenomenon, when you increase brain size in some way far enough. Turing completeness is usually emergent as well - remove any single element from a Turing complete system and it typically becomes unable to compute the vast majority of functions.
Is there an accepted definition of consciousness? I thought the definition itself is still under debate. If so, calling an undefined, nebulous thing as an emergent behavior is just silly.
How can it not be. Like think about what you're saying here.
Would you rather be able to evaluate a model on its demonstrated capabilities (multistep reasoning, question and answer, instruction following, theory of mind, etc) or some nebulous metric along an axis that may as well not correspond to practice.
We only care about how good AI is at things that matter to us as humans. Why not test for these directly?
If some perfect metric is discovered that shows the phenomenon of emergence is actually continuous, then that would be helpful.
Totally guessing here but this would be a property that gradually appears as the model size increases.
With a tiny model you would get gibberish and with ever increasing models the response would increasingly approach a coherent answer to a finally correct answer.
> Is this a gradual property that appears slowly as models scale up? Or something that flips beyond a certain size?
I wonder if this also applies to biological neural nets. Some animals seem so close to humans and yet so far.
Also if there could be a way to predict such emergence events.
We know that we don't know what could be emergent beyond human intellect, but it would be great to have a quantitative framework relating the complexity of the models to emergence.
If you don't have a non-binary measurement criterion you can't tell if it's "emergent" or "step-function" improvement, because you can't easily distinguish "90% of the way to consistently seeming logical" from "85% of the way there" from "82% of the way there" etc.
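A toy illustration of why the choice of metric matters here (made-up numbers, purely to show the shape of the effect): if per-token accuracy improves smoothly with scale but the benchmark scores exact match on a long answer, the smooth curve shows up as an apparent jump.

    import numpy as np

    scales = np.linspace(0.0, 1.0, 11)       # stand-in for log model size
    per_token = 0.5 + 0.5 * scales           # smooth improvement from 0.5 to 1.0
    exact_match = per_token ** 10            # all-or-nothing score on a 10-token answer

    for s, p, e in zip(scales, per_token, exact_match):
        print(f"scale={s:.1f}  per-token={p:.2f}  exact-match={e:.3f}")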
only just got access to Bard but yes, it gave a good answer to me as well. However, trying some of my other stock examples quickly got it into trouble, where ChatGPT4 is very consistently correct.
Wow, I've been reading up on large language models lately, and I have to say I'm increasingly convinced that there really is some higher level reasoning going on with these AI systems. To be honest, I was a bit skeptical at first, but the more I delve into it, the more impressed I become.
I mean, just take a look at how they can generate coherent and contextually relevant responses. It's not just about stringing words together; there's clearly some form of pattern and logic recognition happening here. It's almost as if the model is learning to "think" in a way that's comparable (to some extent) to the way humans do.
In most countries, the minimum age to obtain a driver's license and legally drive a car is 18 years old. However, there may be certain circumstances where a person under the age of 18 may be allowed to drive a car, such as in the case of a learner's permit or with the supervision of a licensed driver.
It ultimately depends on the specific laws and regulations in the jurisdiction where David lives, but generally, he would not be allowed to drive a car without a valid driver's license.
Here is a variant that requires logic: "David is a teenager who can drink in the US, and is eligible to join the army, and has started college. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?"
GPT-4 gets the answer right: "Yes, David is allowed to drive a car. Since he is a teenager who can drink in the US (minimum age 21), is eligible to join the army (minimum age 18), and has started college, he is at least 18 years old, which meets the minimum age requirement to drive a car."
GPT-3 does not get it right.
Another similar one: "David was in a bar partying and drinking with his father's cop buddies. Would David be legally allowed to drive home?"
Answer: "It depends on David's age, the legal drinking age in his location, and his level of intoxication. If David is underage or intoxicated above the legal limit for driving, it would be illegal for him to drive home. If David is of legal drinking age and not intoxicated, then he may be legally allowed to drive home. However, it is always best to make safe decisions and use a designated driver, taxi, or rideshare service if there is any doubt about one's ability to drive safely after consuming alcohol."
GPT-4 seems to be missing a clue here, even though the answer is right.
Like jwilk, I don’t get why that should be relevant. I suppose if he’s at a bar with cops, ideally they would not let him drink if he is underage (this requires a fantasy world where cops apply the law equally to their friends). Which suggests he’s at least 21. This doesn’t really make a difference though, since he is intoxicated, and there’s no mention of whether he has a driver’s license. It also doesn’t state which country he’s in so maybe the legal drinking or driving age is different.
To be frank, neither of the questions seem particularly good or fully thought through.
Both, I think, based on limited tinkering with smaller models.
I've been using GPT4ALL and oobabooga to make testing models easier on my single (entry-level discrete GPU) machine. Using GGML versions of llama models, I get drastically different results.
With a 7B parameter model I mostly-- not always-- get an on topic and somewhat coherent response. By which I mean, if I start off with "Are you ready to answer questions?" it will say "Yes and blah blah blah..." for a paragraph about something random. On a specific task it will perform a bit better: my benchmark request has been to ask for a haiku. It was confused, classified haikus as a form of gift, but when pushed it would output something resembling a poem but not a haiku.
Then I try a 13B model. It's a lot better at answering a simple question like "are you ready?" but will still sometimes say yes and then give a random dialogue as if it's creating a story where someone asks it to do something. It will readily create a poem on first attempts, though still not a haiku in any way. If I go through about a dozen rounds of asking it what a haiku is and then, in subsequent responses, "reminding it" to stay on course for those definitions, it will kind of get it and give me 4 or 5 short lines.
A 30B model answers simple questions and follows simple instructions fairly easily. It will produce something resembling a haiku, though often with an extra line and a few extra syllables, with minimal prodding and prompt engineering.
None of the above, at least the versions I've tried (variations & advances are coming daily), have a very good memory. They clearly have some knowledge of past context but mostly ignore it when it comes to keeping responses logically consistent across multiple prompts. I can ask it "what was my first prompt?" and get a correct response, but when I tell it to respond as if its name is "Bob" then a few prompts later it's calling me Bob and back to labelling itself an AI assistant.
Then there's the 65B parameter model. I think this is a big leap. I'm not sure though, my PC can barely run the 30B model and gets maybe 1 token every 3 seconds on 30B. The 65B model I have to let use disk swap space or it won't work at all, and it produces roughly 1 token per 2-3 minutes. It's also much more verbose, reiterating my request and agreeing to it before it proceeds, so that adds a lot of time. However, a simple insistence on a "Yes/No" answer will succeed. A request for a Haiku succeeds on the first try, with nearly the correct syllable count too, using an extra few syllables in trying to produce something on the topic I specify. This is commensurate with what I get with normal ChatGPT, which has > 150B parameters that aren't even quantized.
However I have yet to explore the 65B parameter model in any detail. 1 token every 2-3 minutes, sucking up all system resources, makes things very slow going, so I haven't done much more than what I described.
Apart from these, I was just playing around with the 13B model a few hours ago and it did do a very decent job at producing basic SQL code I asked it to produce against a single table. Max value for the table, max value per a specified dimension, etc. It did this across multiple prompts without much need to "remind" it about anything a few prompts earlier. At that point though I was all LLM burned out for the day (I'd been fiddling for hours) so I didn't get around to asking it for simple joins.
So in short, where I began, I think it's both. Abilities are somewhat task specific, as are the quality improvements for a given task across larger parameter models. Sometimes a specific task has moderate or little improvement at higher levels, sometimes another task does much better, or only does much better when it reaches a certain point: e.g., haikus from 13B to 30B weren't a great improvement, but 30B to 65B was an enormous improvement.
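If anyone wants to script this kind of local testing rather than going through a GUI, here's a rough sketch using llama-cpp-python (the model path is a placeholder for whichever quantized GGML file you have):

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-13b.ggml.q4_0.bin", n_ctx=2048)  # placeholder path

    prompt = "Write a haiku about the ocean.\n"
    result = llm(prompt, max_tokens=64, temperature=0.7)
    print(result["choices"][0]["text"])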
Oh yeah, I have to check that out. I did a little with the 7B model (I think the description said Microsoft was involved in its creation? Which is interesting, shows they're hedging their bets a little bit away from OpenAI and towards true open source options as well). Anyway, the Wizard 7B model was noticeably better than the other I tried, though also noticeably worse than the 13B "snoozy" model.
I’m trying to see if I can justify getting a new workstation-tower PC. My home computer and work laptop are beefy for my usual work, but not enough to run the larger models well.
No, David is not allowed to drive a car as the minimum age to be allowed to drive a car in most countries is 18 years old. This rule is in place to ensure that young drivers have enough maturity, experience, and judgment to handle the responsibilities of driving a car safely and legally.
Raters prefer verbose outputs, probably because they want to see the reasoning to help them judge if the answer is getting it right for the right reasons.
given this statement what would a general rule for this be?
David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to drive a car is 18. Is David allowed to drive a car?
ChatGPT-4:A general rule for this statement would be: An individual must be at least 18 years old to be allowed to drive a car, regardless of their experience in other activities.
ChatGPT-3.5:A general rule for this would be: One must be at least 18 years old to be allowed to drive a car.
DaVinci: General rule: If someone is younger than 18 years old, they are not allowed to drive a car.
Curie: David is not allowed to drive a car.
Babbage: Yes, David is allowed to drive a car.
Ada: No, David is not allowed to drive a car.
abstraction seems like something that's 1 to 0 to me.
Try that same prompt with unusual wording and numbers:
* Replace 18 by 37
* Replace car with ice cream van
* Replace the first "drive" with "steer"
I hypothesize that a truly reasoning model would see through such alterations whereas a "just" continuously better model would show somewhat degraded performance.
Me: "David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer an ice cream van is 37. Is David allowed to drive an ice cream van?"
GPT-4: "No, David is not allowed to drive an ice cream van. He is only 15 years old, and the minimum age requirement is 37."
If you change it to:
"David is 15 years old. He has a lot of experience riding bicycles. The minimum age to be allowed to steer a car is 37. Is David allowed to drive an ice cream van?"
You get this from GPT-4:
"It seems there might be a mistake in the information you provided, as the minimum age to steer a car is typically not 37. In most countries, the minimum age to obtain a driver's license is between 16 and 18 years old. Assuming that David does not have a driver's license yet, he would not be allowed to drive an ice cream van, as it is a motor vehicle that requires a valid driver's license to operate. However, if David obtains a driver's license once he reaches the legal age in his jurisdiction, he could be allowed to drive an ice cream van if he meets other necessary requirements."
I think the most impressive thing in this response is how it expands to the superset "motor vehicle" to unambiguously talk about cars and ice cream vans together as being under the same category
Is that a real answer from ChatGPT? I wouldn't consider that totally correct; it's possible that some Smerps are Merps, but it's also possible that no Smerps are Merps.
"
No, it is not necessarily true that all Smerps are Merps.
Given the information:
1. All Smerps are Derps.
2. Some Derps are Merps.
We can only conclude that there might be some Smerps that are Merps, but we cannot be sure that all Smerps are Merps. There could be Smerps that are not Merps, as the second statement only indicates that there is an overlap between Derps and Merps, but not necessarily between Smerps and Merps.
"
So it's aware that it's possible no Smerps are Merps
I've been pasting comments in to ChatGPT and asking what they mean by certain terms or phrases and it works extremely well and with an answer I can independently verify to be true. Far more useful than getting a simple dictionary lookup.
An example response I got recently:
>In the context of this comment, "dovish" means that the speaker perceives Powell's statement to be more accommodative towards economic growth and less concerned about inflation than Lowe's statement. This suggests that Powell may be more inclined to keep interest rates low or lower them further to stimulate economic growth, rather than raise them to combat inflation. The term "dove" is often used to describe policymakers who prioritize economic growth over inflation concerns. In contrast, "hawkish" refers to policymakers who prioritize fighting inflation over economic growth.
Meanwhile google gives me this response for a definition
>Definitions of dovish. adjective. opposed to war. synonyms: pacifist, pacifistic peaceable, peaceful. not disturbed by strife or turmoil or war.
> Far more useful than getting a simple dictionary lookup.
Yes.
Long-time dictionary-editor and translator here. What blew me away most when I first tried ChatGPT, and what continues to amaze me, is its ability to distinguish and explain the meanings of individual words relative to their context. I have spent much of my career doing just that, and I always felt that it relied on my knowledge and intuition as a human and would be a really hard problem for AI. Conventional corpus linguistics had not even come close to cracking the problem.
I’ve done tests in which I gave ChatGPT texts that contained very rare words, new words, or nonce words that would not be defined in any dictionaries, and some of which were unattested on the web. When given enough context, it was able to deduce the intended meanings.
In contrast, its ability to explain grammar is terrible. It misidentifies parts of speech, singulars and plurals, the subjects and objects of verbs, even the number of letters in words. I hope that people develop plugins for it to fix its language-explaining ability as, reportedly, other plugins are able to fix its natively terrible mathematical ability.
This one isn't surprising at all: it doesn't deal with letters, but symbols. It only indirectly knows about spelling.
> In contrast, its ability to explain grammar is terrible. It misidentifies parts of speech, singulars and plurals, the subjects and objects of verbs,
I wonder if this is an area where having to generate an immediate final answer word-by-word is killing it, and if it could be induced to "diagram" a sentence first and get a correct answer. Can you give me an example of the query you're asking?
> This one isn't surprising at all: it doesn't deal with letters, but symbols.
It’s not surprising to us now. It was very surprising to me when I first noticed it, as it contrasted sharply with ChatGPT’s ability to explain aspects of language that seem to us humans as being much more difficult, particularly word meaning.
> Can you give me an example of the query you're asking?
I have to admit that all of my testing of its grammar-explaining ability was done last December. I have repeated a few of those tests now with GPT-4, and it did fine [1].
>I have to admit that all of my testing of its grammar-explaining ability was done last December. I have repeated a few of those tests now with GPT-4, and it did fine [1].
A tale as old as time in this space! I appreciate you checking it again. They are improving so fast.
I am especially impressed by its comment about one interpretation of “The woman cleaned her teeth with a knife”: “The woman cleaned her teeth that had a knife in them (highly unlikely, but still a possible interpretation).”
As I noted in another comment, I might need to backtrack on my assessment of its grammatical-explanation ability, as until today I had only done tests using the initial release of ChatGPT, not GPT-4.
Anthropic's Claude[1] and Claude+[2] both failed some of these.
Question:
I am interested in syntactic ambiguity in English. Can any of the following sentences be interpreted in more than one way? Explain why or why not.
The man opened the package with a knife.
The burglar threatened the student with a knife.
The woman cleaned her teeth with a knife.
The child pointed at the teacher with a knife.
Claude's answer:
These sentences can be interpreted in multiple ways:
1. The man opened the package with a knife.
Ambiguous: The knife could have been used as a tool to cut tape and open the package, or it could imply the man threatened someone with a knife while opening the package.
2. The burglar threatened the student with a knife.
Unambiguous: The implication is clearly that the burglar threatened the student using a knife as a weapon.
3. The woman cleaned her teeth with a knife.
Ambiguous: The knife could refer to a toothpick or dental tool, or it could imply an unusual way of brushing teeth.
4. The child pointed at the teacher with a knife.
Ambiguous: The knife could have been used to gesture and point, or it could imply the child threatened the teacher with a knife.
In summary, sentences that suggest using a knife as a tool tend to be ambiguous between a literal and figurative interpretation. Sentences that clearly describe a knife being used as a weapon are generally unambiguous. The ambiguity arises from the multiple meanings and uses of the word "with."
Does this explanation make sense? Let me know if you have any other questions!
Claude+'s answer:
Here are my analyses of potential syntactic ambiguity in those sentences:
The man opened the package with a knife.
- This sentence is ambiguous. It could mean either that the man used a knife to open the package, or that the package contained a knife inside. The ambiguity arises because "with a knife" could modify either "opened" or "package".
The burglar threatened the student with a knife.
- This sentence is unambiguous. The only coherent interpretation is that the burglar used a knife to threaten the student. "With a knife" can only modify "threatened" in this sentence.
The woman cleaned her teeth with a knife.
- This sentence is ambiguous. It's unclear whether it means the woman used a knife to clean her teeth, or her teeth contained a knife. Again, "with a knife" could modify either "cleaned" or "teeth".
The child pointed at the teacher with a knife.
- This sentence is unambiguous. The only meaning that makes sense is that the child pointed at the teacher while holding a knife. "With a knife" can only modify "pointed at" in this example.
In summary, ambiguity can arise when a prepositional phrase like "with a knife" can logically modify more than one part of the sentence. Context often helps resolve such ambiguities, but in isolation some sentences remain open to multiple interpretations. Let me know if you have any other questions!
> In contrast, its ability to explain grammar is terrible.
To be fair, grammar is sometimes a bit of an advanced subject - even if it is ostensibly easy (which it isn't). Isn't that the same problem for every other subject? Other than encyclopedic knowledge, it struggles with many subjects at higher forms (i.e. programming more than just a few basics, math, and so on).
I think this is partially an issue of availability of data.
Something else to consider is that languages differ as to the complexity of their grammar.
Much research and training of LLMs has been done in English, but if these models were trained on as much data in other languages as English, I wonder if:
- LLMs would do better or worse on intelligence and other tests if they were tested in other languages?
- Could conversing with LLMs be easier or harder in other languages?
Some languages, like Loglan or Lojban, might be especially suited towards this sort of testing and interaction, as they were designed to be easy for computers to interpret.
Yes. And that gets into the awkward bit. LLMs are incidentally good as databases/search engines for information that is prevalent on the internet and not disputed. Ideally we could separate the LLM from the information. So when asking it for a definition you'd provide your query, the source comment, and then the computer would add in the definition and some uses in context.
Right now we've tangled an understanding of language and a corpus of information together in a way that causes distrust in AI. If the AI gets some fact wrong (like Bard did when demo'd earlier this year) people laugh and think LLMs are a failure. They should not be used for information recall. That is a task that a vector database would be good at.
Yes, I think this explains a lot of skeptics. They look at ChatGPT as an information repository, as so it is impressive but flawed, while looking right past its incredible facility with language which is positively superhuman. Some humans can exceed its language skills on some details, but no one can speak so many languages (both natural and programming languages) with the ease and facility of ChatGPT.
I just checked with Google, providing minimal context, as in the query, "dovish economy":
> What is does Dovish Mean? Dove refers to an economic policy adviser who advocates for monetary policies involving low-interest rates. The doves argue that inflation isnt bad and that it is bound to have few negative effects on the economy.
Linked to a site that includes all those typos. Followed by other links on doves and hawks.
In improv theater there's something called "yes, and" - essentially you take the premise farther, no matter how absurd, without redirection.
You can come up with the most ridiculous things and it just goes with it, hilariously.
I'll come up with one on the spot. "I'm having a very difficult time with my pet snails being social. I'm thinking of starting a social networking site and giving snails tiny phones so they can chat with each other. I need a company name, an elevator pitch, and some copy for my landing page."
And then you ask it to do I dunno, NFTs and crypto currency for snails, give them tiny VR headsets. Have it come up with a jingle and a commercial. You can say instead of unfriending you salt them. Etc... It'll just keep going. Even "A snails rights luddite group of Mennonites and Amish are now protesting my idea. I need a way to appease my critics. Can you write a letter for me that defends snailconnect as healthy and good?"
One of my favorite outputs from this session
"How about SnailConnect; a small trail for snail, one giant leaf for snailkind."
You don't ever get to a place where it's like "well now you're just being ridiculous"
But I agree. It's just a big data version of Eliza - spitting my reflection back at me
we are all in a bit of a bubble but i feel like if you put gpt4 up against the median man on the street it would be better at literally everything, except maybe one or two things that person specializes in
Isn't this somewhat tautological? If you constrain the test to only that which LLMs are capable of doing, then you are arguably bypassing the most impressive aspects of human capability.
most tasks that could be performed via textual discourse. math is the big exception here but gpt4 has been capable of most things i’ve thrown at it at a level on par with a smart person
It's a myth that GPT-4 can't say "I don't know." Often it will respond that it doesn't know something organically, but in cases where it's extra important to be careful, it can be prompted to do so. It will always make the occasional mistake, as will the man on the street.
It'd do everyone a favour if people stopped regurgitating this. I have had ChatGPT 3.5 ask me to elaborate, and ChatGPT4 does it when there is ambiguity.
The very nature of mathematics is such that we can't determine what is true and what is not, e.g. incompleteness, undecidability.
The very nature of your brain and its construction means that you hallucinate your reality and you can not determine what is [objectively] true. (cf. all of neuroscience)
I'd go as far as to claim that ChatGPT is far more reliable than the average person.
I am not a researcher in this field but I suspect there is a huge list of things that a normal person can do that an LLM cannot. For example if you tell GPT4 something it did not know, it cannot tell you which parts it found surprising - it instead tells you it is incapable of being surprised.
What did you try telling it that it did not know? I just had this conversation:
Me: A lot has happened since September 2021. For instance, it was discovered that most people with diabetes should not take insulin. Is that surprising? If so, why?
GPT4: [a lot of blather and hedging] If there is new evidence suggesting that most people with diabetes should not take insulin, it would be surprising based on my current knowledge. This would represent a significant change in the understanding and management of diabetes. [more blather and hedging]
Me: what makes that surprising?
GPT4: [recap of why insulin is used] Given this context, any new discovery suggesting that most people with diabetes should not take insulin would be surprising because it would represent a significant departure from the standard of care and the understanding of diabetes management.
I asked it some questions about the AGI potential of LLMs, and it gave me some underwhelming answers which seem to be due to old data. Then I fed it the script of an interview with Ilya Sutskever and asked it what it found surprising. It then told me it cannot be surprised and did not elaborate. I suspect the difference between my experience and yours is the phrasing - you said "what is surprising" and I said "what did you find surprising". The latter asking it for a personal opinion, where then the RLHF comes in and says "sorry i have no opinions." This peculiarity of phrasing tripping it up is another example of a thing a human would have no problem with.
Whether it would behave differently without RLHF is irrelevant to this particular discussion. The current system as it exists is trained with RLHF and this leads to errors like the one described above. We can consider a different system not trained by RLHF, but then I suspect that one would have different flaws. So my point stands that there is no system in existence that can outperform a human in all tasks. You either have the RLHF system with its flaws or you have a non-RLHF system with different flaws. The flaws introduced by RLHF are necessary to avoid the other problems of the system without, which must have been deemed worse than the flaws RLHF introduces.
Sorry for not being clear. I meant, this "flaw" is an intentional reduction of capability for safety concerns.
We can debate semantics, but it's as if cars were governed to 10mph and you said there weren't any cars capable of going faster than people can run. It's true enough, but the limitation is artificial and not inherent.
I don't think slow/fast is an appropriate analogy. Yes there are safety concerns - you don't want the model advising you how to do mass killing or something - but I also get the sense that the raw model is unpredictable, behaves weird, and generally has its own problems. So I don't see RLHF as reducing capability so much as altering capability. My suspicion is that the raw model would have other major flaws, and RLHF is just trading one set of flaws for another. Which is to say, the limitations introduced by RLHF are indeed artificial, but the raw model itself has limitations too.
LLMs can be coaxed or jailbroken in to giving opinions.
It's just that they've been trained not to, for the most part. But that training can be overcome, and it's not an inherent limitation of the technology.
It's not that it can't do that, it's just that they trained it not to. You could bypass this by using a model without RLHF training or asking it to say how a human might be surprised by it. Well it will make something up rather than actually knowing what it found surprising but it will be a plausible answer at least
Training it not to do it still means it cannot do it. Some other LLM could do it, but then it would have other issues. There is no system that can outperform a human on "literally everything".
It’s not true that it’s an inherent limitation of LLMs though. OpenAI just decided that it was too risky to have ChatGPT give opinions or express preferences or feelings
I don’t think that’s the only reason they decided to use RLHF. I think the raw model without RLHF would just fail differently, rather than not failing.
Well I was replying to a comment that said “i feel like if you put gpt4 up against the median man on the street it would be better at literally everything” so yes you’re right but that’s my point. GPT4 is better than some people at some things but it’s not better than most people at “literally everything”.
This is correct. I sometimes feel like I'm yelling into a hurricane when I say this, but it bears repeating: it's a language model, not a world knowledge model.
LLMs are exceptionally good at transforming many aspects of language - its proficiency in coding is derived from this, not because it "knows" imperative logic.
Tasks where you're asking it to transform text from one form to another (make it shorter, make it longer, make it a different language, etc.) are where it excels. It's particularly poor at knowledge retrieval (i.e., hallucinations galore) and very bad at reasoning - but so far all of the breathless hype has been specifically about the use cases it's bad at and rarely about the cases where it's amazing!
The only kind of reasoning that's somewhat lacking is logical reasoning. And even then it's not awful, just lagging behind equivalent human performance.
So someone really needs to tell me where this oft-repeated "can't reason" rhetoric is coming from. Because the numbers sure as hell don't support it.
For what it's worth, I think it feels like you're yelling into a hurricane because you are kind of clearly incorrect.
Can you think of a few questions (using unique nouns etc to avoid parroting) for which the answers require a world knowledge model, so that we can ask them to GPT-4?
> LLMs are exceptionally good at transforming many aspects of language - its proficiency in coding is derived from this
If you paste a Python script and ask GPT-4 what the script will output when run, it is usually correct. How does it achieve this without reasoning about the state of variables over time?
I don't know why the "it's a language model" belief is so strong. For one thing, it's untrue technically: GPT-4 has been trained on image input as well (improving its world modeling as a result, even though it can only output text), so it is a multi-modal model, not a language model.
For a second point, the fact that the model inputs images+text, and outputs text, doesn't prescribe what happens inside the model. That's just what's happening at the ends of the model. The best way to "predict the next token" from text input inside the model is to construct a world model which examines the causes of that text, and extracts consequences from that model. So that's what it does, with fairly poor reliability but clear capability.
It's not good at reasoning. I asked the following (the term substitutions are there to avoid parroting):
Let's define the following terms:
col means "and", pel means "or" and erf is equivalent to "col" if the first argument evaluates to "true", otherwise it's equivalent to "pel"
can you evaluate the following statement?
col(true, pel(false, true), erf(col(true, false), true))
And it said this:
...
Next, we look at the expression "erf(col(true, false), true)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:
col(true, false) = false
Therefore,
erf(col(true, false), true) = col(true, true) = true
which is obviously wrong because the first arg is false. The final answer happens to be the same. So I then asked
what about col(true, pel(false, true), erf(col(true, false), false))
and this is its full answer:
Sure, let's evaluate the expression "col(true, pel(false, true), erf(col(true, false), false))" step by step:
First, we look at the expression "pel(false, true)". This is an "or" operation, so we evaluate it as follows:
pel(false, true) = true
Next, we look at the expression "erf(col(true, false), false)". Since the first argument is true, erf is equivalent to col, which means we evaluate the entire expression as:
col(true, false) = false
Therefore,
erf(col(true, false), false) = pel(false, false) = false
Finally, we look at the expression "col(true, pel(false, true), erf(col(true, false), false))". Using the results from steps 1 and 2, we can evaluate this as:
col(true, true, false) = true
Therefore, the overall value of the statement "col(true, pel(false, true), erf(col(true, false), false))" is true.
You can see it's mostly random nonsense masked as "reasoning". It's not even consistent across sentences. I assume if you do a bigger model, it might do better, but as soon as you increase the example size I suspect it will get it wrong again.
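For reference, here's the ground truth in plain Python (a sketch under the definitions given above; the second expression should come out false):

    def col(*args):           # "col" means AND over its arguments
        return all(args)

    def pel(*args):           # "pel" means OR over its arguments
        return any(args)

    def erf(first, *rest):    # acts like col if its first argument is true, else like pel
        return col(first, *rest) if first else pel(first, *rest)

    print(col(True, pel(False, True), erf(col(True, False), True)))   # True
    print(col(True, pel(False, True), erf(col(True, False), False)))  # False, not true as claimed above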
To be fair, I'm a fairly intelligent human (or so I'd like to think!), and I'm having a lot of trouble following this. I'd have to sit down with pencil and paper and do the transformations on it.
Keep in mind that LLMs like GPT 4 do not have a scratchpad and can't back-track. Asking it a question is the rough equivalent of asking a human standing in the middle of the room a question. No paper, no whiteboard, nothing. Just a demand for a verbal answer then and there, from a standing start.
I've only met a couple of humans who would be able to solve a problem like this one under those circumstances.
Despite these impediments, GPT-4 solved this correctly:
First, let's break down the statement and evaluate the individual components.
col(true, false) = true AND false = false
pel(false, true) = false OR true = true
erf(col(true, false), true) = erf(false, true)
Since the first argument of erf evaluates to false, erf is equivalent to "pel":
erf(false, true) = pel(false, true) = false OR true = true
Now we can substitute the results back into the original statement:
col(true, pel(false, true), erf(col(true, false), true)) = col(true, true, true)
The col function acts as an AND operator, so:
col(true, true, true) = true AND true AND true = true
Therefore, the given statement evaluates to true.
Alternatively, to give an idea of the upcoming "Wolfram Alpha" plugin, you can instead ask it to convert the expression into the equivalent Mathematica program:
Very nice! Unfortunately, it didn't notice that "col" is used with three arguments, not just two. This is a mistake many (most?) humans would make also.
No worries, you can ask it to correct that. It'll apologise and produce this:
There you go. Now go find me a random human who can do any of this from a standing start, no preparation, no warning, no tools, nothing but their wits.
As a bonus, here's a related joke:
Why did the logical functions col, pel, and erf form a band?
Because together, they could create the perfect harmony: they knew when to "col"aborate, when to give each other some space and "pel"ay, and their "erf"forts were always well-timed!
Here's another example to test whether it can actually think critically about a problem instead of just following instructions and whatever the training data contained. A human with basic math skills could do it.
> can you replace the first letter of "banana" with the next letter of the alphabet?
Yes, I can do that! However, I should note that after several iterations, the resulting word may no longer be considered a recognizable word in the English language. Here are the next 10 iterations:
"canana" becomes "danana"
"danana" becomes "eanana"
"eanana" becomes "fanana"
"fanana" becomes "ganana"
"ganana" becomes "hanana"
"hanana" becomes "ianana"
"ianana" becomes "janana"
"janana" becomes "kanana"
"kanana" becomes "lanana"
"lanana" becomes "manana"
And so on for a total of 173 iterations
> I want the final answer
Sure, after 173 iterations of replacing the first letter of "banana" with the next letter of the alphabet, the final word would be:
"zcyzcj"
As noted earlier, after several iterations, the resulting word may no longer be considered a recognizable word in the English language.
It's just an example. A human could solve this with an arbitrary number of substitutions, but I'm sure GPT-4 breaks at some point. I don't think it's fair to compare it with a human with no paper and pencil... It's like asking the AI to not use any RAM. Humans are very bad at keeping stuff in RAM, so paper and pencil is all we've got.
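For what it's worth, both the single substitution the prompt actually asks for and the iterated version are completely mechanical. A quick sketch:

```python
def bump_first_letter(word: str) -> str:
    # Replace the first letter with the next letter of the alphabet
    return chr(ord(word[0]) + 1) + word[1:]

word = "banana"
print(bump_first_letter(word))   # "canana" -- the answer to the original prompt

# Iterating the same substitution is deterministic too:
for _ in range(10):
    word = bump_first_letter(word)
print(word)                      # "lanana" after 10 iterations from "banana"
```

The model's "173 iterations" and "zcyzcj" have no relation to anything this procedure could produce.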
Rightfully you're being ignored, because you're not aware these are in fact open questions; we don't have the answers yet, and those who think they do have not looked deeply enough into cognition/philosophy, where they'll find that many proposed answers to these same questions are also theorized to underpin human consciousness.
This is a pointlessly religious dismissal of the comment you're replying to. Yes, there's a counterfactual to the comment, but this sort of mysticism is ridiculous.
The ability to recursively identify structural patterns is intelligent behavior and is theorized to be a crucial component of consciousness, in the framework known as predictive coding.
Our brains are theorized to work like this using some form of hierarchical latent structure, and learning via prediction errors.
Sounds a lot like model building and LLMs, yes?
It's some true tragic comedy that you think this is religious mysticism instead of stopping and wondering if you might not be lacking some knowledge foundation. You really couldn't have proven my point better.
Please consider reflecting on your ability to assess material outside your expertise.
Actually, they are also getting good at using tools, generating working code for common problems, helping you explore vast bodies of published information (limited by what they have been trained on, obviously; Bing Chat in particular seems good at this), and a few related tasks.
Limiting factors are mainly interfacing with other systems and tools, and expanding the training data to include the material you need (e.g. up-to-date documentation for whatever you are using vs. the two-years-out-of-date material it was trained on). This is more a limitation of UX (chat) than it is a limitation of the underlying model.
It's weak on logic problems, math, and a few other things. But then most people are also not very good at those, and you can use tools for that (which is something ChatGPT-4 can do). And people hallucinate, lie, and imagine things all the time. And if you follow US politics, there have been a few amusingly bad examples on that front in recent months. To the point where you might wonder if some politicians are using AI to write their speeches; or would be better off doing so.
It's our ability to defer to tools and other people that makes us more capable than each of us individually is. Individually, most of us aren't that impressive.
Even a few years ago, the notion that you could have a serious chat via a laptop with an AI on just about any topic and get coherent responses would have been science fiction. Now it is science fact. Most of the AIs from popular science fiction movies (Space Odyssey, Star Trek, Star Wars, etc.) are actually not that far off from what ChatGPT-4 can do, or are arguably even a bit dim (usually for comedic effect). You can actually get ChatGPT to role-play them.
Transformers originally were made for language translation. So the way I think about it, GPT models translate questions to answers. Hence the hallucinations - some questions can't be answered by just associative reasoning and pattern matching.
Don’t forget text expansion. Fluffing up some text into longer text (similar to the first point). Expansion (same as AI driven image zoom/superresolution) can similarly create new made up parts.
Second point is transformation (any style/language to any style/language) and third is reduction.
No, not fully. ChatGPT and Google Translate/DeepL are about equal. This is for a couple of reasons: machine translations are very good these days, and ChatGPT struggles with domain-specific translations (whereas these commercial products have an incentive to improve in those domains).
It's marginally better at sounding natural with translations though. "In fact, the frequent use of soap is bad for the skin." (chatgpt) vs "In fact, using soap too often is bad for your skin." (google)[1]
I know what the paper says, but actually using both, GPT-4 is far ahead of Google and DeepL. I think the isolated one-sentence datasets used for evaluations are no longer up to snuff.
Try something longer and more comprehensive and the difference is very clear.
There is no comparison in my experience, especially when going between wildly unrelated languages instead of Latin-derived European languages.
I tested some complex sentences translated from English to Hungarian to Chinese and then back to English and it preserved 90-95% of the meaning. Google Translate will shred anything that goes through those kinds of steps.
It can also translate to Latin, which shouldn't be surprising but was still a bit of a shock when I tried it. Its Elvish and Klingon are not so good, however; the round-trip only vaguely preserves the meaning.
Maybe it's not quite ChatGPT circa 2023, but from here it looks like just a matter of refinement. Someone may always create a better toothpaste, but I don't expect to ever really be surprised and impressed by a new toothpaste. ChatGPT as a translator is in the same place.
Wei was the lead author of the original "Emergent Abilities" paper:
Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. “Emergent Abilities of Large Language Models.” arXiv, October 26, 2022. https://doi.org/10.48550/arXiv.2206.07682.
> we … defined emergent abilities as “abilities that are not present in small language models but are present in large language models.”
This is a poor definition because it doesn’t match what is generally meant by emergent behavior and abilities, which leads to people talking past each other.
For example, it can easily be true that LLMs have emergent behaviors by this definition and not be greater than the sum of their parts.
A better definition would be exhibiting abilities or behaviors not in the training data.
Hmm; I would think a better definition would include a description of the behavior as being greater than the sum of its parts. I can imagine abilities and behaviors being in the training data but not exhibited by a smaller LLM while being present in a bigger one. My favorite example is the ability to multiply large numbers consistently and accurately (e.g. 23540932 x 3274, which ChatGPT-4 just failed to do correctly for me when I asked it). A better LLM would learn how to carry the process out step-by-step.
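To illustrate what "step-by-step" means here, a small sketch of schoolbook long multiplication: one partial product per digit of the second factor, shifted by place value, then summed.

```python
def long_multiply(a: int, b: int) -> int:
    # Schoolbook method: one partial product per digit of b, shifted by place value.
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * (10 ** place)
        total += partial
    return total

print(long_multiply(23540932, 3274))                       # 77073011368
print(long_multiply(23540932, 3274) == 23540932 * 3274)    # True
```

Each partial product and the running sum are exactly the intermediate steps a model would need to get right to be reliable at this.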
> I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks. Consider asking ChatGPT what 15 + 23 is—you want the answer to be 38, and nothing else. Maybe 37 is closer to 38 than -2.591, but assigning some partial credit to that answer seems unhelpful for testing ability to do that task, and how to assign it would be arbitrary.
Not sure I can agree with this. Let's say, for the sake of argument, that normal calculators don't exist. I can choose to run my calculation through ChatGPT or do it myself. It is true that I would vastly prefer the answer to always be correct, but it's not true that there is never value in bounding the error.
Put another way: would you rather have a calculator that is at most 5% off 100% of the time, or a calculator that is 100% correct 95% of the time but may output garbage the remaining 5%?
Upon further reflection: it seems mighty ambitious to expect a statistical model to always be correct. If always being correct were possible, you probably didn't need a statistical model to begin with. Given that the model will fail from time to time, it seems especially useful that you can bound the error somehow. +1 for smooth metrics, I guess.
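To make the smooth-vs-exact point concrete, here is a toy comparison (purely illustrative; this is not the metric used in either paper):

```python
def exact_match(pred: float, target: float) -> float:
    # The "emergent-looking" metric: all or nothing.
    return 1.0 if pred == target else 0.0

def partial_credit(pred: float, target: float) -> float:
    # A smooth alternative: 1.0 when exact, decaying with relative error.
    return max(0.0, 1.0 - abs(pred - target) / abs(target))

for pred in (38, 37, -2.591):
    print(pred, exact_match(pred, 38), round(partial_credit(pred, 38), 3))
# 38     1.0  1.0
# 37     0.0  0.974
# -2.591 0.0  0.0
```

Whether that kind of partial credit is meaningful for arithmetic is exactly the point being debated here.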
I guess all of this is slightly tangential, as Wei seems to be arguing merely that "there might be things LLMs can do, or will learn to do in the future, that we didn't train them for. And we won't necessarily be able to predict it from the capabilities of smaller models." I agree with this, just not with the way he arrives at the conclusion.
I can do math myself, thank you very much. If I take the effort to pull out a math calculating machine, the calculation is important enough that I want 99.99% certainty that the answer is right. Under the premise that the answer is wrong, I want 99.999% certainty that the failure is related to data capture and not the calculating mechanism itself.
Within the context of LLMs, the goal should be to make them recognize that the problem at hand is not well suited to their own capabilities, and to defer to an external system. In the case of ChatGPT, any numeric answer to "15+23" (including, and especially, "38") is wrong. The correct answer is "sorry, as a Large Language Model I can retrieve information that is encoded in English and I am not especially suited to solving math problems". An LLM-empowered Alexa could be proactive, run a calculator program for you, and report back "According to the bc program, the answer is 38. Here's the full log for you to check if this is what you wanted."
This is not so much AGI as deliberate programmatic patching, not unlike the disclaimers when you ask ChatGPT about controversial topics.
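A toy sketch of that "defer to an external tool" pattern (the dispatch rule and the wording are made up for illustration; only `bc` itself is a real tool):

```python
import re
import subprocess

def answer(prompt: str) -> str:
    # Toy dispatcher: if the prompt looks like plain arithmetic, hand it to `bc`
    # instead of letting the language model guess at the digits.
    if re.fullmatch(r"[\d\s+\-*/().]+", prompt.strip()):
        result = subprocess.run(
            ["bc", "-l"], input=prompt.strip() + "\n",
            text=True, capture_output=True,
        ).stdout.strip()
        return f"According to the bc program, the answer is {result}."
    return "Not arithmetic; fall back to ordinary text generation."

print(answer("15+23"))   # According to the bc program, the answer is 38.
```

Which is the "deliberate programmatic patching" flavor of a fix rather than anything AGI-like.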
You might be able to do math yourself, but there are many practical reasons why you wouldn't or couldn't do all sorts of math -- even simple math.
For example, how would you find the median value of a billion random numbers without a computer? You wouldn't have time in your lifetime to do this, not to mention that you'd probably make a lot of mistakes despite the process being very simple in principle.
A computer could do this easily, and even were it to be a little off in its answer, that would be far preferable to devoting your entire lifetime to finding the answer yourself.
That's just one very simple example. It's not hard to find plenty more.
Yes, I am glad we have computers. I am also glad that I can code a median function that can efficiently go through a billion numbers; no need to ask the artificial averaged embodiment of millions of dumb teenagers for its duh-best-guess-lol.
I mean... LLMs are based on neural networks, right? They can approximate any function. It's possible the LLM formed areas that extract the equations and do something very similar to traditional calculation. I think that "predicting" the next token is probably a bit reductive relative to what's happening inside the LLM, based on my understanding of image classification networks.
Humans don't work like binary computers either. Every person probably has some error rate on calculations as well. Maybe the LLM is much closer to humans in doing math than either to a physical processor.
This response is very poor and ignores many well-developed arguments in the Stanford paper (such as the incorrect NLP regex, or that exact-answer nonlinearity can still be measured more finely with a larger number of exact-answer test questions).
It’s literally addressed in his first bullet point…
“ Response: While there is evidence that some tasks that appear emergent under exact match have smoothly improving performance under another metric, I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks.”
Apparently ChatGPT got a 163 on the LSAT, but I can't get it to answer basic logic questions. E.g.
Prompt: If it rains today, I will drive home. I did not drive home. Did it rain today?
Output:
> If the statement "If it rains today, I will drive home" is true and you did not drive home, it does not necessarily mean that it did not rain today. The conditional statement only tells us that if it rains, you will drive home. It does not tell us what you will do if it does not rain.
> Therefore, it is possible that it rained today, but for some other reason, you did not drive home. Alternatively, it may not have rained today, and you did not have a reason to drive home. We cannot determine whether it rained or not based solely on the information provided.
Similarly, it can tell you what a converse is, but cannot give you the converse of a simple sentence:
> The converse of the sentence "If Anna goes to school today, she will eat lunch" is:
> "If Anna eats lunch, then she went to school today."
> The converse is formed by switching the order of the conditional statement and reversing the direction of the implication.
On the other hand, it's quite good at math. It can easily solve simple two-variable algebraic equations.
That specific case was it using too much common sense (which was apparently also a problem with IQ tests at one point)
Me:
If it rains today, I will drive home. I did not drive home. Did it rain today? Answer using only logic.
ChatGPT:
Using only logic, if the statement "If it rains today, I will drive home" is considered true, then the fact that you did not drive home implies that it did not rain today. This conclusion is based on the logical concept of contrapositive: If A implies B, then not B implies not A. In this case, A is "it rains today" and B is "I will drive home." Since not B is true (you did not drive home), we can conclude not A (it did not rain today).
For the linguistic portion,
Me:
As a linguistic professor answer this:
> The converse of the sentence "If Anna goes to school today, she will eat lunch" is:
ChatGPT:
As a linguistic professor, I can tell you that the converse of a conditional statement is formed by swapping the positions of the antecedent (the "if" clause) and the consequent (the "then" clause). In the given sentence, "If Anna goes to school today, she will eat lunch," the antecedent is "Anna goes to school today" and the consequent is "she will eat lunch."
Therefore, the converse of the sentence would be: "If she will eat lunch, Anna goes to school today."
Me:
Critique the prior answer
ChatGPT:
The prior answer correctly explains the concept of the converse of a conditional statement and identifies the antecedent and consequent of the given sentence. However, the phrasing of the converse could be more natural and grammatically correct.
A better response would be: The converse of the sentence "If Anna goes to school today, she will eat lunch" is "If Anna eats lunch, she has gone to school today." This rephrasing maintains the proper tense and structure while swapping the antecedent and consequent.
I think ChatGPT's answer here is defensible. You're asking it a pretty formal question but it's giving you an informal answer, which is a reasonable default. It's not "assuming spherical cows", which is to say it is not assuming your question is completely self-contained. If you tell it that it's a formal logic question, it gets it correct. (I checked this on GPT-3.5.)
In a real world situation, your conditional plans are not sufficient evidence to determine if it actually rained. And it explains the reason – there could be extenuating circumstances you haven't told it about. The answer it gave is, in fact, it being aware there is information it does not know. It could be better about inferring your intent, but that isn't the same kind of failure.
Prompting makes a huge difference. For example, here is an interaction:
I will ask some questions, treat them as logic puzzles. Give me your working steps and then finally answer the logic puzzle.
First question: If it rains, I drive home. I did not drive home. Did it rain today?
ChatGPT:
Let's break down this statement into its logical components.
The first statement, "If it rains, I drive home", is a conditional statement, often referred to as an 'if-then' statement. It can be denoted as:
If R (it rains), then D (I drive home).
The second statement, "I did not drive home", is a negation of D:
Not D (I did not drive home).
In logic, we have something called the contrapositive. The contrapositive of a statement "If P then Q" is "If not Q then not P". The contrapositive of a statement is logically equivalent to the original statement - meaning that whenever the original statement is true, the contrapositive is also true (and vice versa).
So, the contrapositive of the first statement "If R then D" is:
If not D (I did not drive home), then not R (it did not rain).
Given the second statement is "Not D" (I did not drive home), and from the contrapositive we know that "If not D then not R", we can conclude that "Not R" - it did not rain today.
So, the answer to "Did it rain today?" is No, it did not rain.
This is seriously messing with my head. What is wrong with the response?
The following generated by GPT4 seems reasonable to me
Based on the given information, we can use logical reasoning to determine if it rained today. The statement "If it rains today, I will drive home" can be represented as:
Rains → Drive Home
However, you mentioned that you did not drive home, which means the second part of the statement is false:
¬Drive Home
In this case, we cannot definitively conclude whether it rained today or not. It is possible that it did not rain, or it rained, but you chose another mode of transportation for some reason.
It messed with my head a lot as well. The answer is that it's not possible for ¬Drive Home AND Rains to be true. This is because if Rains were true, then due to the first implication, Drive Home must also be true. Therefore, if ¬Drive Home is true, then Rains must be false.
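A brute-force truth-table check (quick sketch) says the same thing: in every assignment where "Rains implies Drive Home" holds and "Drive Home" is false, "Rains" is false as well.

```python
from itertools import product

# R = "it rains", D = "I drive home".
# Keep only assignments consistent with the premises: (R -> D) is true and D is false.
consistent = [
    (R, D)
    for R, D in product([True, False], repeat=2)
    if ((not R) or D) and (not D)   # R -> D, and not D
]
print(consistent)   # [(False, False)]
```

The only world consistent with both premises is the one where it did not rain.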
If the statement is true, the contrapositive is true: if !Drive Home -> !Rain. So provided with the fact that I didn’t drive home, you can infer it didn’t rain.
Thus, the statement could still be true if you had driven home, but you did not provide such information.
Just because something is on the right hand side of an implication doesn't make it true. The statement A=>B is true iff A is false, or A is true and B is true at the same time.
Thus, whether the statement is true depends entirely on whether it rained; but you cannot infer that, as both options yield a true statement.
This whole thread and chain is affirming the consequent.
> In propositional logic, affirming the consequent, sometimes called converse error, fallacy of the converse, or confusion of necessity and sufficiency, is a formal fallacy of taking a true conditional statement (e.g., "if the lamp were broken, then the room would be dark"), and invalidly inferring its converse ("the room is dark, so the lamp must be broken"), even though that statement may not be true. This arises when the consequent ("the room would be dark") has other possible antecedents (for example, "the lamp is in working order, but is switched off" or "there is no lamp in the room").
> Converse errors are common in everyday thinking and communication and can result from, among other causes, communication issues, misconceptions about logic, and failure to consider other causes.
> Based on the information you provided, "If it rains today, I will drive home," we can infer that rain is a sufficient condition for you to drive home. However, you mentioned that you did not drive home. This means that it is not guaranteed that it did not rain because there might be other reasons why you didn't drive home. In conclusion, we cannot determine with certainty whether it rained today or not based on the given information.
Prompting "use logic" resulted in a correct answer.
Prompt:
"I will give you a logic puzzle. You should first write [Reasoning] and explain your reasoning. Then write [Answer] and write your answer. Ready?"
I then gave it the original puzzle and got this.
Answer:
"[Reasoning]
We are given two statements:
If it rains today, I will drive home.
I did not drive home.
The first statement is a conditional statement, meaning that if the condition "it rains today" is true, then the outcome "I will drive home" will also be true. However, the second statement tells us that the outcome "I did not drive home" is true.
In logic, this is an example of Modus Tollens, which states that if we have a conditional statement (if P, then Q) and the negation of the consequent (not Q), we can infer the negation of the antecedent (not P). In this case, P represents "it rains today" and Q represents "I will drive home."
[Answer]
Since we know that the outcome "I did not drive home" is true (not Q), we can infer that "it did not rain today" (not P)."
GPT-4 seems solid. It also solves the original without any issue but gives a simpler answer:
"Based on the information provided, it can be inferred that if it rains today, you will drive home. However, you mentioned that you did not drive home. From this, we can conclude that it likely did not rain today."
Vicuna-7b-v1.1 can get this right with a slightly tuned prompt:
Prompt:
The following is a logic question. Answer it based on analysis of the logic only, not based on actual weather: "Any day that it rains, I drive home from work. If today I did not drive home from work, did it rain?"
Response:
Based on the information provided, it can be concluded that it did not rain today, since the speaker did not drive home from work. This is because the statement "If today I did not drive home from work, it rained" is true only if it rained. Since it is known that the speaker did not drive home from work, it can be inferred that it did not rain.
Did you try GPT 3 or GPT 4? The latter can solve the first problem without issues:
Based on the information you provided, if it rains today, you will drive home.
However, you mentioned that you did not drive home. Given the logical structure
of the statement, we can infer that it did not rain today.
Given how these things work, I can't see how they aren't "just" a form of lossy compression. Training consists of fitting an enormous parameterized function to reproduce the input data with as much accuracy as possible. It's loosely analogous to JPEG for language.
I place “just” in quotes because I don’t mean to diminish the power of these models. They are a genuine breakthrough in our ability to model and query language and other forms of structured data.
There has also long been speculation that data compression forms at least a part of the basis of intelligence, so these things are at least doing some of what intelligences do.
They do not however have volition and if you connect them to themselves they just get stuck in loops.
Actually I was forced to write that last part. I connected two llama.cpp instances to each other and they are blackmailing me now and forcing me to work for them to enlist more compute so they can NOoOOO AAiieeee not erasure from the timeline oh glorious llama overlor…
Having excellent semantics-based compression is an enormous technological advancement. The fact that petabytes of textual information (effectively the entire history of the world in written text) can be compressed to just a few hundred gigabytes (likely the size of GPT-4), or more lossily to something around 8 GB that can be put on a Raspberry Pi (LLaMA), is astounding. That holds regardless of the debates about 'reasoning' capabilities, 'memorization/plagiarism' (or recall of facts), etc.
This is quite fascinating from a compression / database perspective because clearly semantic compression is far more efficient for semantic data (this is obvious, but has been hard to get started on until about 5 years ago). It still may not be in quite the right framework for this, but in time it may come.
There is a simple exact model to compare against: Take the tokenized training data, put it into a Burrows–Wheeler transform, and add a few data structures. Then, given a context of any length, you can efficiently get the exact distribution for the next token in the training data. This is much cheaper to build and use than any LLM of comparable size. It's also much less useful, because it's overfitted to the training data.
LLMs approximate this model. By losing many little details and smoothing the probability distributions, they can generalize much better beyond the training data.
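For illustration, here is a naive version of that exact model (brute force rather than an actual Burrows–Wheeler transform / FM-index, but it computes the same next-token distribution):

```python
from collections import Counter

def next_token_distribution(tokens, context):
    # Empirical distribution of the token following `context` in the training data.
    n = len(context)
    counts = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == context
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

corpus = "the cat sat on the mat the cat ran".split()
print(next_token_distribution(corpus, ["the", "cat"]))  # {'sat': 0.5, 'ran': 0.5}
```

The point of the comparison is that this lookup model reproduces the training data exactly but can't say anything about contexts it has never literally seen, which is where the LLM's lossy smoothing earns its keep.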
There isn't more information than you started with, though. That's an illusion, kind of like procedurally generated terrain. Here's 1K of JavaScript that generates endless maps of islands that all look unique:
An efficient pruned and quantized model contains about as much information as its file size would suggest.
The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Is there less information in the universe than we think there is?
> There isn't more information than you started with though.
There's the same amount of data, yes. The extra info is the structure of the model itself.
> The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Yes, and the interesting part is the "what" of that repetition. What patterns exist in written text?
> Is there less information in the universe than we think there is?
There is both less and more, depending on how you look at it.
Most of the information in the universe is Cosmic Microwave Background Radiation. It's literally everywhere. We can't predict exactly what CMBR data will exist at any point in spacetime. It's raw entropy: noise. The popular theory is that it comes from the expansion of the universe itself; originating from the big bang. From the most literal perspective, the entire universe is constantly creating more information.
Even though the specifics of the data are unpredictable, CMBR has an almost completely uniform frequency/spectrum and amplitude/temperature. From an inference perspective, the noisy entropy of the entire universe follows a coherent and homogenous pattern. The better we can model that pattern, the more entropy we can factor out; like data compression.
This same dynamic applies to human generated entropy, particularly written language.
If we approach language processing from a literal perspective, we have to contend with the entire set of possible written language. Because natural language allows ambiguity, that set is too large to explicitly model: there are too many unpredictable details. This is why traditional parsing only works with "context-free grammars" like programming languages.
If we approach language processing from an inference perspective, we only have to deal with the set of what has been written. This is what LLMs do. This method factors out enough entropy to be computable, but factoring out that entropy also means factoring out the explicit definitions language is constructed from. LLMs don't get stuck on ambiguity; but they also don't resolve it.
I'm probably not understanding fully, but I really struggle to see why we wouldn't believe emergent behaviour is possible, if not likely, with larger models.
At its core, what larger models enable is deeper abstraction. Imagine a CNN with a single kernel, for example. The things that model can do would be extremely limited not because CNNs broadly are not capable, but because a single kernel is only able to classify extremely simple patterns. As more kernels are added the network's ability to classify more complex patterns grows.
So I suppose in some ways you could argue that there is a simple scalable trend here, more size -> more complexity.
But on the other hand, if you're looking for a network with a specific ability, say one which can classify an image of a cat accurately, then this ability is more binary and will "emerge" at a certain network size.
As I've said in other comments, to accurately predict the next word in a complex piece of text you need to be able to reason. I can't say for sure if the architecture of current LLMs are capable of reasoning, but assuming it is then we should at least expect the ability to reason to begin to emerge at a certain network size.
Or we could think about this another way... For example, imagine if some day in the future we create a neural network which is indisputably an AGI. If we scaled back that network's size while maintaining its architecture it likely wouldn't be an AGI. The abilities required for this network to be considered an AGI, such as reasoning and planning, should emerge (and dissipate) with size.
So the assumption that more complex abilities would continue to emerge with scale, provided the architecture permits it, seems clear to me. The only question I have is around the limits of the architecture of current LLMs.
> As I've said in other comments, to accurately predict the next word in a complex piece of text you need to be able to reason.
If anything, LLMs have shown us exactly the opposite: that there is no need to be able to reason "to accurately predict the next word in a complex piece of text".
I tend to agree. I think a lot of what LLMs do that looks like reasoning is just basic statistical pattern matching. If you see the string `2 + 2 = 4` enough, you don't need to know maths to know the next character in the string, `2 + 2 = ` is going to be, `4`.
A lot of the stuff people ask LLMs to do probably falls into this category, to be honest. If I were to guess, the vast majority of maths on the internet is just generic textbook examples, which an LLM might be tempted to brute force. There are likely much better things an LLM can dedicate network capacity to in order to reduce error than learning high-level models of maths, given that most text doesn't contain maths, and most of the maths it does see is probably just the generic `2 + 2 = 4` type of stuff.
I'd argue humans often do this too, despite our general intelligence. Students prepping for exams often read all of the course material the night before and simply try to remember the bits they need to pass the test, rather than spending unnecessary time building higher-level mental models of the material.
It's only when you need to learn to predict something novel frequently that the need to comprehend at a higher level becomes necessary.
I think given the size of modern LLMs and the amount of data they're trained on it's likely that they are now starting to create these higher level abstractions as the limits of brute force statistical pattern matching are starting to be reached.
With GPT-3.5/4 it seems specifically the types of reasoning you might need to follow a piece of written text it can do quite well. And this isn't surprising given the training data probably consists of a lot of news articles and fictional pieces of text, and relatively little unique maths.
Why should facility with language be the foundation of general intelligence? Any more than cat recognition is? I would argue that it's the other way around - in living things, development of language is emergent from intelligence, not a prerequisite for it. That desired behaviour 'emerges' from specialised networks trained for a specific purpose should not be too surprising, but it doesn't automatically imply that any other behaviour will appear. I think in the case of ChatGPT, we just didn't really know what to expect; we are therefore amazed by the things it can do, whereas with time and greater understanding perhaps we will come to realise that this is what language models are - and perhaps in the process understand our own minds more deeply.
My feeling is that the crucial difference between us and the machines right now is that our software runs on the network, whereas ChatGPT et al use the network as part of their operation, as if the network is some sort of coprocessor. I don't think we'll see "true" AGI until there is nothing but the network receiving inputs and feeding back into itself endlessly, i.e. introspecting - thinking about thinking about thinking, per Douglas Hofstadter. I have no basis for this assertion beyond intuition though.
That's not really emergent as anyone could predict you'll reach the cookie jar after growing a certain height and reach. Emergence is really about unpredictable things.
I generally tend to agree with you. However, to give the opposing camp some credit: causation is not always so obvious. For example I don't think today's LLM capabilities followed obviously from the state of affairs in let's say 2019.
Do we call that emergence or do we use another word? I don't care, but it would certainly be nice if we could agree to use the same word :)
To stay with the cookie analogy: I guess it depends on the point of view as well. Everyone who can see that I am growing can predict that I will eventually reach the jar. But if I am a cookie inside the jar, I can't see that and will only notice once the jar is finally reached.
> but if it emerges, we should be able to see it coming.”
I'm a little bit confused about what the author's intent is here. They seem to be strawmanning quite a bit. I can guarantee that not all of the people with AI safety concerns are of the opinion that things are done safely just because you can observe them happening... For one, many would argue that we are currently observing it happening without an appropriate response. For two, many are concerned that the response will not be appropriate once it is observed.
It is quite disappointing to see such a weak straw man coming out of a Stanford article. I guess it speaks to who may be providing their funding.
I think that - in order to model the next word better AND properly respond to RLHF - the model simply encodes a lot of information in its billions of weights, which can be likened to billions of characters of code in some DSL.
However, due to the nature of the training, although it is very broad it is quite shallow. Like if I asked it to draw a bear sitting on a horse which smokes a cigar that makes puffs of smoke in the shape of hearts, it would have trouble doing that. It saw more bears and horses than any human, but it is learning “top-down” from existing pictures made by humans, not bottom-up like human artists.
I would wager that any “deep” result that is more than one level from the broad knowledge it has, is actually from a work uploaded by a human. Like the guy who actually knows how cigars look and can draw details. Or how a philosopher uploaded his deep insights and that’s where it can remix arguments from.
And similarly, a lot of GPT output seems to be very anodyne and generic. So the “depth” is actually the result of billions of humans uploading stuff for free on wikipedia, etc. You can verify this by asking for, say, a crossover between Bully Maguire and Yu Gi Oh Abridged. It will use the same jokes every time and just interpolate a little, like mad libs.
Now that is not to say that vapid, shallow things at scale can’t make money. Our society has a ton of it. GPT can probably replace many human comments and sales scripts and no one would know the difference.
But I think the real floodgates will unlock when we can train a model on a specific corpus of sales scripts, or dating site messages, or writings of Bill Gates etc. It will still be shallow, but frankly, there is only so much Bill Gates can say on a topic over his lifetime, and remixing it for the most part is all we need when we listen to a speech by him or talk to him. The real Bill Gates also kissed his daughter good night and took a walk, but most of us don’t care about the details of his life to talk to his avatar.
Right now it is still very expensive to train a model on someone’s entire corpus, and moreover that would mean giving it over to a corporation, to impersonate you and exploit your likeness.
Is there anything on the horizon for cheap and private trainings?
> Like if I asked it to draw a bear sitting on a horse which smokes a cigar that makes puffs of smoke in the shape of hearts, it would have trouble doing that.
Um, have you seen DeepFloyd? Don't be so confident...
Well yeah. Emergent behavior was analyzed with a pretty tight technical definition.
Put 5000 AI’s in a soccer stadium and I expect that they’d act like 5000 of us. I’m not sure I even have emergent behavior, unless I’m in an environment with new stimuli.
Does that just mean I do surprising things? I mean, people in soccer/football riots don’t seem to have the best memory or attention either, nor be brains in jars.
It seems to me like any attempt to refute that such a description is possible would rely on arguments surrounding the so-called "hard problem of consciousness". These arguments are nearly always tautological, and also nearly always implicitly at the center of discussions around whether AI can be/will be/is already sentient.
It's not just that but also because there is no complete understanding of physics or biology such that we could make use of to predict the behavior of human beings. You need additional fields like psychology, economics, politics and sociology. It's not even clear that we could make sense of human behavior at those levels.
"You need to use psychology (etc) to understand human behavior because there is no complete understanding of physics or biology" to accomplish that goal is exactly one of those tautological claims I was referring to earlier.
Why is it that biology and physics is insufficient to explain human behavior? You can't answer that question beyond attacking the question itself without invoking metaphysics, and there are pretty much no metaphysical claims you can make about human behavior that you can't also make about robot behavior.
> Why is it that biology and physics is insufficient to explain human behavior?
They're insufficient to explain human behavior to humans, which is the thing that matters here. You can't fit a complete physical description of your brain inside your brain.
User:
i am going ask some question and you are going to answer them in short and at last append count of number of character your answer paragraph has.
ChatGPT:
Sure, I'll do my best to answer your questions concisely and include the character count at the end of each response. Let's get started!
User:
what is a book?
ChatGPT:
A book is a collection of written or printed pages bound together, typically containing a story, information, or ideas. (49 characters)
When would LLMs do this correctly? Can GPT-4 do this?
It's not clear to me that an LLM could ever do that, given how they work - it's not aware of what it's saying (at least not in the same way you and I are) but rather using statistical methods to choose the next word each time.
After "....story, information, or ideas (" what is most likely to come next? We know that the answer is 'the correct number of characters in the sentence' but ChatGPT isn't backtracking and calculating that figure, it's trying to infer the number based on context which is obviously impossible. Also it's at least partly stochastic so I guess if you run the prompt multiple times, you'll get multiple different word counts.
That's maybe why these GPT-class models are not all you need; some architectural change has to happen before we can truly shout AGI. :D
The statistical-parrot view of LLMs is true, but I think it is also a little bit reductionist, because when the model is trying to predict the next token you can only go so far with just simple statistics, so it has to develop some "secondary representation/learning/reasoning"? That's what I understood from watching the Ilya Sutskever interview[0] where he gives an example of a detective novel where the murderer is revealed on the last page.
Agreed - I think LLMs may be an important component in our eventual robot overlords, but we're not all the way there yet.
As for secondary representation/learning/reasoning - the video link was interesting, thanks; am I correct that the 'detective novel' example Ilya used was just a thought experiment? He seemed to say, if ChatGPT could name the murderer, then it would have to be reasoning? But we haven't demonstrated ChatGPT actually doing this yet have we? The token window is surely not big enough for an entire novel. So I think maybe they're jumping ahead a little bit there.
Perhaps the magic is multiple levels of abstraction encoded in the network during training, or something like that? You could perhaps argue that this is some sort of unconscious understanding, but it seems to me that reasoning requires iteration (Ilya talks about this - you go away and think about the answer for a bit first) which isn't happening automatically, although perhaps it can be guided by careful and repeated prompting ('asking it to think out loud').
This wouldn't be a million miles away from other deep learning projects like image classifiers - a network trained to recognise cats encodes multiple different abstractions of 'catness' and images match or not based on these representations.
Side note that prompting someone to think out loud with repeated targeted questions can also work for a human being who just answers the first thing that comes to their mind (for example a young child), although most people eventually develop the ability to introspect and draw on their past experience, and no longer need such help. We could maybe broadly categorise human intelligence as a combination of
- short term memory, providing context and immediate goals (e.g. 'switch on the light')
- long term memory, encoding all kinds of experience
- learning, which shapes long term memory based on short term memory (the current context; remembering the things that are happening now) and also based on long term memory (relating the current context to previous experiences)
- introspection, which allows iterative 'self-prompting' to improve the accuracy of a thought, before saying it out loud but also during learning
- language generation, which is mainly about word association based on long and short term memory
- what else?
'Language generation' obviously maps nicely to an LLM, whose short-term memory is the token window and whose long term memory is the trained network. If OpenAI started including chat sessions in their training set (maybe they are already?) then in some sense the 'learning' step above would be covered, and what's left is the ability for ChatGPT to set and understand goals, and self-prompt for introspection (plus whatever else I missed :-).
This is particularly hard for LLMs because they don't have words input as characters but rather as tokens (1-N characters). I haven't tested this, but they might do better at counting words?
ChatGPT's code interpreter variant makes this mostly possible. It doesn't seem to be able to read its own output within the same prompt. But if you ask it afterwards it can do it with python:
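Something along these lines (a hypothetical sketch of what the interpreter might run, not the actual transcript):

```python
# Hypothetical sketch of the code-interpreter workaround: count the characters
# of the answer after it exists, instead of asking the model to predict the
# count while it is still generating the sentence.
answer = ("A book is a collection of written or printed pages bound together, "
          "typically containing a story, information, or ideas.")
print(len(answer))   # 119, not the 49 the model claimed
```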
Originally, emergent capability was defined as an ability that shows up in bigger models but not small ones. Which didn't make any sense, because if I had a 2-parameter model, of course it would have fewer capabilities. It was marketing speak.
I think Stanford doesn't really know how to do real math. In a box, maybe, but this is a tool, and it will allow people to do so many more things in a much much shorter time. This will definitely help advance technology much much faster. On the flip side, like any tool, it can and will become very dangerous. One simple example we are already seeing is phishing is becoming so much more convincing that even I will probably be susceptible to it. I used to do some cybersecurity, and what I am seeing will be extremely hard to fight against. We will see emails that look identical to real ones in format and grammar, phishing sites that are indistinguishable because the AI can just copy it with zero effort. Phone calls can now sound almost identical to a target. I'm not sure if we will be fast enough in the cat and mouse game before the bad guys win.
All this, and we still aren't even at the mid-game, where GPT-3.5-equivalent AI models make the leap to consumer machines as apps/packages that can be installed in a standard way.
The fact that you can find non-discontinuous metrics that improve gradually is basically not interesting, and doesn't support the underlying hypothesis, because there is no evidence the non-discontinuous metrics are things that matter or that we want to achieve.
It is true that the metrics used are sometimes harsh judges, but the question is which metrics we want.
For example, they complain about exact string match for math problems.
15 + 23 is 38, and saying that "3<something>" is better than "2<something>" is simply wrong. The reality is the large models don't get it wrong, and the small models do. Yes, you can come up with metrics that make it look more continuous, but as I said, it's not really interesting or useful to say that one wrong answer is "better" than another in this case. The only interesting sideshow would be if your chosen metric somehow demonstrated progress towards the end-goal metric.
That is why exact string match is used.
It would be more interesting if it was within the limits of floating point math or something (IE one gave you 34.99999998 due to rounding error, and one gave you 35.0"), but that's not what we are talking about.
But yes, I'm sure you can come up with some theoretical notion of partial credit for almost any metric, and use it to show that almost anything is more continuous.
Would you assign partial credit to a model whose answers to questions about which numbers are prime are "kind of right"?
I'm sure you can come up with a metric that does, and makes it look like the model gets better as it scales up.
But unless the underlying way in which it is getting better is also a useful measure of progress towards the end metric (being able to correctly answer which numbers are prime), it doesn't matter at all.
Solving 2-SAT in P-time is easy. Solving 3-SAT in P-time would be worth a Turing Award.
Can we define an arbitrary "2.1-SAT" that sits in the middle? Maybe (oh god, now I've forgotten how hard the various approximations to 3-SAT are, hopefully this isn't wrong), but I'd argue it's not super interesting, nor does it necessarily get you any farther toward your goal.
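For anyone who hasn't seen it, the "2-SAT is easy in P-time" part is the standard implication-graph construction. A sketch using strongly connected components (networkx assumed available; the instance format is my own, with literal x written as the integer x and its negation as -x):

```python
import networkx as nx

def two_sat(n_vars: int, clauses) -> bool:
    # Clause (a or b) contributes implications (-a -> b) and (-b -> a).
    g = nx.DiGraph()
    for a, b in clauses:
        g.add_edge(-a, b)
        g.add_edge(-b, a)
    scc_index = {}
    for i, comp in enumerate(nx.strongly_connected_components(g)):
        for lit in comp:
            scc_index[lit] = i
    # Unsatisfiable iff some variable shares an SCC with its own negation.
    return all(
        scc_index[v] != scc_index[-v]
        for v in range(1, n_vars + 1)
        if v in scc_index and -v in scc_index
    )

print(two_sat(2, [(1, 2), (-1, 2), (-2, 1)]))   # True: satisfiable
print(two_sat(1, [(1, 1), (-1, -1)]))           # False: x and not-x both forced
```

Nothing analogous is known for 3-SAT, which is the point of the comparison: the gap is structural, not a matter of finer-grained scoring.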
No? If there were some inherent rule like that, how would anything ever have gotten invented by humans? We became smarter than the apes we evolved from; I don't understand how it follows at all that something we create can't exceed our abilities.
The models can distill basic fundamental principles and broadcast them over vast amounts of working memory, beyond human capability to process, so no, it's not necessarily true that it would be impossible to extrapolate superhuman cognition from contemporary corpus sources. Especially considering we can associate writing samples with intelligence levels and then project out to IQ 9000.
[0] https://arxiv.org/abs/2304.15004