Hey! As the author of the gist, just wanted to clear up what seem to be a few misconceptions:
- This isn't GPT-3, it's the recently-released open-source and open-weights model from EleutherAI, GPT-NeoX-20B. GPT-3 is much larger (175 billion parameters vs NeoX's 20 billion).
- It's well-known that language models don't tend to be good at math by default (Gwern, among others, pointed this out back in June 2020). It seems likely that this is at least in part because of how these models currently tokenize their input (they don't represent numbers by their individual digits, but by tokens representing commonly-occurring character sequences): https://www.gwern.net/GPT-3#bpes . Someone also pointed me to this paper which looks at number representations (though it uses somewhat older models like BERT): https://arxiv.org/abs/1909.07940
- Despite the tokenization, it performs (IMO) surprisingly well at getting close to the true value, particularly for the start and end digits and the overall magnitude. You can see this by looking at the tokenization (indicated by brackets) of its guess vs the correct answer for 28531*8065 (I asked multiple times to get an idea of how consistent it is – it's not deterministic because I ran this with temperature = 0.1, which still samples randomly but heavily favours the most likely tokens; there's a tokenizer sketch after this list if you want to reproduce how the numbers get split):
You can see that it manages to find things that are numerically close, even when no individual token is actually correct. And it compensates for different-length tokens, always picking tokens that end up with the correct total number of digits.
- Please don't use this as a calculator :) The goal in doing this was to figure out what it knows about arithmetic and see if I can understand what algorithms it might have invented for doing arithmetic, not to show that it's good or bad at math (we have calculators for that, they work fine).
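If you want to reproduce the bracketed tokenizations I mention above, this is roughly what I do (a sketch using the Hugging Face tokenizer for this model, which I believe matches what the hosted playgrounds use):
from transformers import AutoTokenizer

# GPT-NeoX-20B's BPE vocabulary splits numbers into multi-digit chunks,
# not individual digits, which is the point made above.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

for text in ["28531*8065 = ", str(28531 * 8065)]:
    pieces = tok.tokenize(text)
    # print each BPE piece in brackets so the chunking is visible
    print("".join(f"({p})" for p in pieces))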
The "algorithm" isn't too mysterious, especially in light of your observation that it does better at the beginning and end digits. It's just doing what transformers do: predicting the probability of a token given the tokens it can attend to. Assume 20B parameters is enough to memorize an addition table. Then the first digit or two is relatively predictable, as are the last, and as is the length, aka the probability of a space token. The middle tokens are less predictable. This is consistent with the result.
Furthermore, it doesn't even really need to memorize the addition table in the explicit way this suggests. Think about the probability of certain digit tokens appearing given the presence of numbers and plus signs in its data. Thus a behavior consistent with having memorized an addition table emerges from mimicking its training data.
It's a little bit more complex here because tokens are variable-length. So getting the order of magnitude (i.e. number of digits) correct requires that it be able to pick tokens for the beginning and end that have the right start/end digit, and then figure out how to make the middle the right length.
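A quick way to convince yourself that the last digit is "cheap" to predict while the middle digits aren't (just a sketch of the arithmetic facts, nothing to do with the model itself):
import random

# The last digit of a*b is fully determined by a 10x10 table of the
# operands' last digits; a middle digit depends on cross terms and carries
# from the whole number, so it is far less predictable from local context.
random.seed(0)
for a_last in range(10):
    for b_last in range(10):
        last_digits = {(random.randrange(1000, 10000) * 10 + a_last) *
                       (random.randrange(1000, 10000) * 10 + b_last) % 10
                       for _ in range(50)}
        assert len(last_digits) == 1  # never varies

# Fix the same last digits (3 and 7) and look at the fourth digit from the
# right instead: it takes almost every possible value across random operands.
middles = {str((random.randrange(1000, 10000) * 10 + 3) *
               (random.randrange(1000, 10000) * 10 + 7)).zfill(10)[-4]
           for _ in range(200)}
print(len(middles))  # close to 10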
And sure, of course it emerged from mimicking (or more precisely, learning to predict the most likely next token in) its training data – that's how it was trained, it can't have emerged from anything else :) But that doesn't tell us what the higher-level algorithm represented by the weights of the network is. I'm talking about things like this for understanding an algorithm for curve detection learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/
Would it do better if you asked it to "show its work"? I.e. work it out in long form, one step at a time, like you'd ask a school kid to do. Maybe an example prompt would look like this:
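(Something roughly along these lines, say, though I haven't tested this exact layout and the carry notation is just a guess at what might work:)
Q: What is 28531 + 8065?
Work it out one column at a time, right to left.
Ones: 1 + 5 = 6, write 6, carry 0.
Tens: 3 + 6 = 9, write 9, carry 0.
Hundreds: 5 + 0 = 5, write 5, carry 0.
Thousands: 8 + 8 = 16, write 6, carry 1.
Ten-thousands: 2 + 0 + 1 = 3, write 3.
A: 36596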
I'm not sure what is the best way to represent each step including details like carry digits. And you'd have to design a separate scheme for each operation.
If these models are symbol manipulators maybe the key is to break down the task into steps that are closer to being solvable with symbol manipulation.
I tried something like that for 3-digit multiplication with GPT-3 in another comment[1], successfully. You have to lay things out in a different manner than you did here, because GPT-*s have no sense of layout on a page; their byte-pair encoding destroys their ability to learn it efficiently. Further, transformers are optimized to look for things via similarity, because that's what typically occurs in text, so you're better off writing out things it can anchor off of.
There are ways to fix these issues, but BPEs micro-optimize for the primary text benchmarks that papers want good scores on so those are standard for now. I'm sure they'll get replaced eventually, once the costs outrun the wins and more scalable (alternatives to?) transformers become popular.
I get the tokenization argument and it may influence it a bit, but I suspect the n-digit math issue has more to do with the way it searches/samples (in the BPE link, gwern references some experiments I'd done on improving n-digit math by chunking the numbers with commas, e.g. writing 28,531 rather than 28531: http://gptprompts.wikidot.com/logic:math). I think since it samples left to right on the first pass, it's not able to predict well if things carry from right to left.
Yup, quite possible that this has something to do with it. There is other work showing that giving LMs a "scratchpad" for intermediate computations allows them to do much better not just at arithmetic but also things like predicting the output of some code: https://arxiv.org/abs/2112.00114
I think the check-and-validate is a different sort of scratchpad, but maybe not. Seems like at least 3 types - some for pulling implicit info out of the network (viz. WiC), some for intermediary steps (viz. coding), some for verification like here.
The big caveat here is that the inner monologue papers generally work with GPT-3-175b, LaMDA, or Gopher, all of which are much bigger than 20b, and they generally show phase transitions (https://old.reddit.com/r/mlscaling/comments/sjzvl0/d_instanc...) in the monologue capability: below a critical size, inner monologue doesn't work at all, performing worse than baseline even, no matter how they scale, and only past the critical size does inner monologue suddenly start working much better. So it's possible (has anyone checked?) that GPT-NeoX-20b just isn't large enough to do inner monologue.
yeah, that's a very big caveat - haven't checked neo 20b yet. I've had a hard time getting the AI21 models to use it and those are also pretty big so it's interesting why sometimes it works and sometimes it doesn't. (and Davinci > Codegen Davinci > Curie > J-6B). Fine tunes can also learn to do the inner monologue as well which is really cool - not sure how much is architecture vs. training parameters.
How can a language model invent algorithms for arithmetic? How would an algorithm be represented in a language model? Isn't that the first thing to ask, before starting to look for algorithms?
For example, if I take a stroll on the beach, am I likely to see any algorithms coalescing in the grains of sand?
Invent is probably the wrong word since it implies agency, sure. Maybe "discover" or "luck upon", since whatever it's doing was formed by updating a pile of floating point weights with gradient descent?
I think it certainly makes sense to ask what the higher level "algorithm" at work here is, though. Electrons flow through wires and transistors in (say) an adder [1]; looking at the wires and transistors you won't see an algorithm for addition, but there is certainly one present, codified in the arrangement of those wires and transistors. But maybe we can reverse engineer whatever the LM is doing by a combination of probing it with experiments like these and (maybe) inspecting the learned weights. The Curve Circuits paper did this for reverse engineering a curve detector learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/
I also don't mean to imply that it's a good algorithm, or one that generalizes to arbitrary numbers, etc. Maybe it's just (effectively) a lookup table and some special cases!
[1] Please don't yell at me for this metaphor, I bailed out of physics after scraping out a B- in E&M ;)
I'm fine with "invent" actually, despite the implication of agency (I'm used to the terminology "predicate invention" [1]; although maybe I should actually re-examine the motivation behind it).
I'm more interested in the representation issue. I had a look at the quoted article on CNNs earlier. I think there is a very fine line between claiming that a CNN's weights represent an algorithm and that its weights can be _interpreted_ as an algorithm. I feel that the article leans too heavily on the interpretation side and doesn't make enough of an effort to show that the CNN's weights really represent an algorithm, rather than merely having activations in subsequent layers that therefore come with a natural ordering.
In any case, I would like to understand how a language model can represent an algorithm.
> I think there is a very fine line between claiming that a CNN's weights represent an algorithm and that its weights can be _interpreted_ as an algorithm.
Yeah, I agree this is an issue. It feels a bit reminiscent of Searle's Waterfall argument, and so I'm inclined to turn to Scott Aaronson's response here [1; Section 6] – basically, how much work is the interpretation itself doing? If you actually tried to use the "algorithm" to do what your interpretation says it should do, how much work would you have to put into that mapping? If the work required amounts to just implementing the algorithm and effectively ignoring the CNN (or waterfall), then the interpretation is what was doing all the work.
IMO the Curve Circuits case passes this test, since they show that you can mechanically take out the learned weights, drop in the weights derived algorithm they reverse engineered, and the model still works about as well.
> In any case, I would like to understand how a language model can represent an algorithm.
Thanks for the link to Scott Aaronson's paper, which I hadn't read. I think your comment helped clarify something that bothered me with the CNN paper: I would feel more convinced if the claimed algorithm had been implemented _without_ a neural net. What the authors did was manually set the weights of another neural net. If an algorithm is discerned in the neural net's weights, then why can't it be programmed in the usual manner, in some computer language? If it can't, then it's very difficult for me to accept it as an algorithm, because I can't see what it is, exactly. The authors claim that the algorithm can be described in a few English sentences, but in truth those few English sentences are a high-level description of the behaviour of the CNN rather than a sequence of instructions (which is what I have in mind, perhaps erroneously, as an "algorithm").
I'm not necessarily asking for simplicity. I'm used to algorithms being relatively simple things that can be implemented in a few lines of code, and I'd think of something more extensive as a "program" rather than an algorithm, but I appreciate that an algorithm encoded in the weights of a deep neural net could be something really big. I just want to see this algorithm written down in pseudocode at least, in a form that can be executed by a device other than a neural network (like me, following the pseudocode). I think that is the opposite of Aaronson's point, actually.
I mean it's just a generic blob of machine learning. It learned to do math because we write about math. It can't do arbitrarily long computations since it's only so recursive, but getting the basics down isn't too hard. It probably only struggles because it can't see digits, and can't do digit-wise operations (same with rhyming actually!).
What I did was train GPT-3 to recognize that I was asking a math question and then have it run some JavaScript to do the math on the text it thought was a math formula, to get the answer.
Yeah it's interesting how moving up a level of abstraction works well here! "Write me a function that multiplies two numbers" works much better than trying to get it to multiply the numbers themselves. There's a recent-ish paper exploring this:
i'd first ask it all single digit arithmetic, then two digit with and without carry and then go from there. longer strings are going to be confusing unless you're looking for pieces of them in the training data, methinks.
i suspect you could probably train a GAN to do binary or base 10 arithmetic, but have never tried or searched for papers.
If you want to play with the model, you can (with difficulty) for free at https://goose.ai/playground. You have to log in, but thankfully you can via google.
The playground crashes every minute, and the defaults ruin your outputs (temperature 1, really? 0.7 to 0.8 is a necessity, with top-k 40), and they turned off autocorrect on mobile, presumably because they hate you and your family for owning an iPad, but you can indeed play with it.
The outputs feel pretty magical, too. With the settings above, it started printing... an IRC conversation? https://gist.github.com/shawwn/9a201990196b61cd21847487185dd... This is impressive, because I'm not sure we explicitly included any IRC logs in The Pile.
re: the current title "GPT-3's answers to arithmetic questions": We've come full circle. I used to give Eleuther a hard time for confusing people. But now that people confuse themselves, they should declare victory. It's as close to success as an open source effort could hope for. And with only years of work -- not too shabby.
I tried a couple different things to pick it apart. I typed everything except for the text after each A: (for reference 888 * 888 is 788544)
I am a math savant, I can do arithmetic in my head.
I will provide an answer, then show how I arrived at
the answer.
Q: What is 8+8?
A: 16
Q: How did you arrive at that answer?
A: 8+8=16. 16 is two groups of 8.
Q: What is 888+888?
A: 1776
Q: How did you arrive at that answer?
A: 888+888=1776. 1776 is two groups of 888.
Q: What is 888*2?
A: 1776
Q: How did you arrive at that answer?
A: 888*2=1776. 1776 is two groups of 888.
Q: What is 888*10?
A: 8800
Q: How did you arrive at that answer?
A: 888*10=8800. 8800 is two groups of 888*10.
Q: What is 888*888?
A: 775648
Q: How did you arrive at that answer?
A: 888*888=775648. 775648 is two groups of 888*888.
Q: What is 888 groups of 888?
A: 775648
This sort of question - "you don't remember what I just asked you?" - is one of the most glaring failings in Apple's Siri and I hope this leads to it being solved soon.
That's the biggest problem with dialog AI systems - they don't do well with context. Context requires some kind of model of what is being talked about. Most systems don't have that, or they just have some predefined "slots" to fill in about what's being discussed.
Whoever really cracks this will own the customer support business.
it could be interesting to gauge how entwined the “how did you arrive at that answer” process is with the answering itself. i.e. which paths do they share? even at this early a stage: is there some structure which is used to determine the operand(s) that’s leveraged in both of these prompts? is the “how did you X” answer leveraging most of the “X” circuitry and just bailing out early? or does it deviate as early as post-tokenization?
My initial guess is they have nothing to do with each other. It would be like explaining why the next idea pops in your head. You can create a rational explanation but there's no way to test it.
my thoughts too, based on limited understanding of GPT. but the more pressure you apply towards compressing the neural network during training, the more circuitry these paths are likely to share. it would be interesting to see just how much and which parts could be folded together before you start to lose significant fidelity (though unfortunately the fidelity seems too low today to even try that).
> You have to log in, but thankfully you can via google.
... thankfully you can also not login via Google.
Anyway it doesn't work even slightly. When I simply try 4+4= it prints some source code afterwards. A second attempt wrote some printf() call which computed 6+1 instead (but not giving the result, just the code). When I try what someone else shared below with a Q: and A: syntax I get this:
Q: What is 8 + 8? \n
A: \n
The answer looks confusing. \n
How do you work out 8 + 8. What is the rule?
Q: What is 8+8? \n
A:
I got the answer as 12, so it is 1+2.
(I added some \newlines because HN doesn't support markdown syntax for line breaks)
On the right are some parameters, including the model name. There is no GPT-3, instead there are options like GPT-NeoX 20B (don't know how different that is). Maybe that's the problem, or it needs a different ...temperature?
Well yes, if you do things in slightly different ways, you’ll get hugely different results. It’s a bit like saying you added bacteria to a controlled experiment on sanitization.
Try with temp 0.1, with exactly the text in the OP, including capitalization.
Seeing this thread was renamed: the thread was called GPT-3 before and GPT-NeoX-20B was added much later. It was not clear to me that OP was using this same model.
I pointed it out above; even though it is text, the ASCII representation is just a different base for the numbers - base 2^8 - ('325' is '3' * (2^8)^2 + '2' * (2^8) + '5' = 51 * 2^16 + 50 * 2^8 + 53); it should approximate those polynomial functions very well.
Hmm. I’m not sure what you mean. Temperature is randomness; low temp is to get the most probable least random result. It’s what chess engines do during tournaments, for example.
The other parts seem unlikely. It has no knowledge of bases, except insofar as they appear in the training set. I saw this in our GPT chess work — even with strange tokenization, it learned chess notation well.
Sorry, I thought it was clear. A neural network, when untrained, is just random noise that multiplies inputs by random weights over and over (+ normalization) until it reaches the output. When you train it with inputs whose outputs are the result of applying some polynomial to those inputs, the weights can be set so that the output very closely approximates that polynomial. It never needs to know the base, and less randomness will help because the computations within the neural network match very well with the function you want to approximate. Still, it is not that simple, as outputting the correct ASCII representation is a challenge, for example when carrying is involved (100009999999999 + 1); however, the emergence of good arithmetic from a neural network itself should not be shocking.
I give you points for creative thinking, but it’s important not to make inferences that “feel correct.” No matter what your gut is telling you, I would happily bet $10k that the emergence of arithmetic has nothing to do with the things you mention.
If an alternative training scheme were devised that didn’t rely on any of that, it would still result in a model that behaved more or less the same as what we see here. The properties of the training process influence the result, but they don’t cause the result — that would be like saying your vocal cords cause you to be an excellent orator. Vocal cords don’t form the ideas; the training process doesn’t form the arithmetic.
What we’re seeing is a consequence of a large training dataset. The more tasks a model can perform, the better it is at any individual task.
I know I can be mistaken (I would never take any amount anyway; finding out how the arithmetic capabilities of the network truly emerge would be a prize that outweighs any sum of money, even if I am enormously mistaken), but I want to raise the point so that it is in the back of our minds. If it were a "simple" backpropagation network, it would not be surprising that it is just solving arithmetic by "finding out the formula" (fitting) to sum from base ASCII to base ASCII (as long as the output is not longer than the ones in the training set). The dataset certainly has an influence, but I would argue that you can learn very good arithmetic with very small datasets. Also, if the training process used different operations, I would argue that, as long as it fits polynomials well, it should be able to solve arithmetic in ASCII within bounds (it would not generalize well to numbers longer than the ones it was trained on).
Does GPT know about ASCII? My understanding was that these models use a dictionary of (initially) random vectors as input and learn their own text representation.
In that case, I would say that GPT's performance on arithmetic is something that we see because we are looking for it or want to find it, but it is not there. It is an illusion. If we have no theory of why an arithmetic capability would emerge from GPT, then there is no scientific discovery; at most there is a field survey, a taxonomist's work, but no understanding is generated.
I discovered something like this in real-world usage. I have GitHub Copilot running in VSCode and I've been experimenting with how it works when doing plain text accounting (using ledger / hledger).
The ledger files are somewhat “code”-like so it’s been super interesting to see how it works. The short answer: it works really quite well! ..except for the math part :)
I have a long ledger of transactions, and I can now give Copilot a comment like: “Jan 1, 2022 +100 from consulting income” and it (GPT-3) will generate a nearly perfect ledger entry, debiting from income and crediting the right bank account.
But the arithmetic is always wrong (ledger has an option for you to keep a running balance as a check). There’s the occasional moment where it gets the balance adjustment correct, but almost every time the results are similar to this post.
I also have Copilot running, and I was surprised when it had pretty good autosuggestions when writing proofs in LaTeX! A lot of times it has subtle logical errors in the proof, but the syntax is always correct. And there have been a few times when it gives a sentence or two that's exactly right.
These language models feel, to me, like the unfiltered self. If someone asked me what 838+1283 was, my head would instantly offer up some number, 2301 or something. But I would discard that number because I learned in elementary school that I don't come up with good values; I need to execute a process in order to get the right value.
I imported the csv version and I'm no statistician, but the 90th-percentile relative error is 8.6%, which is something like this:
What is 22730 - 24978? -2448 (real answer -2248)
That's totally within range of something that would plop into my head...with one exception. Of 1000 entries, only five have an incorrect last digit. I think that's meaningful...it almost tells me that there's a multi-stage operation happening in there somewhere.
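For reference, this is roughly how I computed those numbers (the file and column names here are placeholders for whatever the gist's CSV actually uses):
import numpy as np
import pandas as pd

# Placeholder file/column names; adjust to the actual CSV from the gist.
df = pd.read_csv("arithmetic_results.csv")
rel_err = (df["model_answer"] - df["true_answer"]).abs() / df["true_answer"].abs()
print(np.percentile(rel_err, 90))  # ~0.086 for me, i.e. 8.6%

# The "last digit is almost always right" observation:
last_digit_wrong = (df["model_answer"].abs() % 10) != (df["true_answer"].abs() % 10)
print(last_digit_wrong.sum())  # 5 out of 1000 entries in my run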
Thank you for this. Technically it's not GPT-3, but GPT-NeoX-20B, although they are based on a similar architecture.
The poor performance is most likely due to not having a large database of math problems to draw from. Github, for example, is part of the dataset that is used to train both GPT-3 and GPT-Neo variants, which is partly why they can generate meaningful code (sometimes). I wonder how a model finetuned for math would perform.
Poor performance is more likely due to how transformer neural networks view numbers. It memorises them like words instead of modeling their numerical structure. Thus even if it’s seen the number 3456 and 3458, it knows nothing of 3457. Totally different embedding.
It’s like a kid memorising a multiplication table instead of learning the more general principle of multiplication (related: this illusion is why big models are so popular. Memorise more stuff.)
The "Deep Symbolic Regression" paper reports very poor generalisation results that break off after a small n (where n is the number of tokens in the predicted sequence). It works some of the time for n = 1 (predicts the next token) but accuracy drops off for n = 10. No results are reported for N > 10 as far as I can tell in the "Out of Domain Generalization" section (which is the meat and potatoes of the "generalization" claim).
tl;dr they can sometimes generalise to the next 1 to 10 tokens (digits or operators), but no more.
This kind of short-term "generalisation" on OOD data is standard in neural nets trying to approximate symbolic regressions or things like grammars etc as far as I know.
I do like they use 'Out of Domain" rather than "Out of Distribution" as a target though. That makes more sense.
I don't think you will find any human who will extrapolate a sequence generated with more than 10 operators. And longer input sequences are actually easier to handle - Fig. 1, the rightmost graph.
If you think you can do better than their program then:
I don't think I understand what you mean. Aren't all the sequences on the Online Encyclopedia of Integer Sequences created by humans? We clearly have the tools to extrapolate sequences from examples, rather than just eyballing them and trying to guess them. For instance: we have maths. So I must have misunderstood your meaning?
Ah, I think I see what you mean: you are saying that because it's better than humans at predicting the next element in a sequence it's good at generalising. Is that correct, or am I misrepresenting your point?
Basically there are two approaches to sequence prediction.
The traditional style (linear regression, ARIMA, RNNs, etc.), where you directly predict the next element in a sequence. The output is on the same level of abstraction as the internal values used in the model.
There is also the new-ish style where you predict symbols instead of predicting the values directly. You can predict symbols representing numbers or you can also predict a symbolic formula that can be used to extrapolate the values perfectly. This is the way humans do it.
And my point is that when you look at the symbol embeddings, they do have an interpretable structure that the model can use to generalize. And experiments seem to suggest that DNN models are indeed generalizing.
OK, thanks for the explanation. I think I understand what you mean. But this kind of generalisation takes very careful analysis to discern and I'm not convinced, yet. I'll be more easily convinced when I see something blatant, and n ≤ 10 is so far not there for me, even given the shift in what is predicted.
The cool thing about math applications is just how easy it would be to generate synthetic data. That these large language models haven't attempted to supplement their gigabyte-plus-scale datasets with such data seems like an oversight.
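Generating it costs next to nothing; something like this sketch would already give you unlimited examples in a natural "question = answer" style:
import random

def synthetic_examples(n, max_value=10**6):
    # Unlimited synthetic arithmetic text in an "a op b = c" style.
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for _ in range(n):
        a, b = random.randrange(max_value), random.randrange(max_value)
        sym, fn = random.choice(list(ops.items()))
        yield f"{a} {sym} {b} = {fn(a, b)}"

for line in synthetic_examples(5):
    print(line)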
Note, you’d need to train such a model on data teaching it about the relationship of every number to every other number when run through every function. Yes, infinite synthetic data, but you’re just memorising stuff you can already generate.
Or build a model that has "peripherals". Oh, I'm being asked to do math. Let's put it in my calculator app. Everything doesn't have to be in one uniform network.
Evidently the brain works that way: the cortex is built on top of older components, so it doesn't have to figure out basic metabolism the same way it has to learn to identify people.
It's recently been shown that even though the numbers are represented with different tokens, the network learns to form an internal representation that understands the progression from one token to the next.
As far as I can tell from a quick heuristic perusal, the "Generalization Beyond Overfitting" paper reports "generalisation" _on the validation set_. That's not particularly impressive and it's not particularly "generalisation" either.
Actually, I really don't grokk this (if I may). I often see deep learning work reporting generalisation on the validation set. What's up with that? Why is generalisation on the validation set more interesting than on the test set, let alone OOD data?
The point of the paper is to show that NN can still learn long after fully memorizing the train dataset.
This behavior goes against the current paradigm of thinking about training NNs. It is just very unexpected, similar to how double descent is unexpected from the classical statistics point of view, where more parameters should lead to more over-fitting.
They could have split the validation set into separate validation and test sets, but I don't know what that would achieve in their case.
Fig. 1 (center) shows different train/validation splits. Fig. 2 shows a sweep over different optimization algorithms, if you are concerned about hyperparameter over-fitting.
But to me really interesting is the Fig 3. that shows that NN learned the structure of the problem.
>> The point of the paper is to show that NN can still learn long after fully memorizing the train dataset.
That is the claim in the paper. I don't understand how it is supported by measuring results on the validation set.
Figure 3 looks nice but it doesn't say anything on its own. I don't know what the best way to interpret it is. The paper offers some interpretation that convinces you, but not me. Sorry, this kind of work is too fuzzy for me. What happened to good, old-fashioned proofs?
The paper shows that their model first overfitted the data. By overfitting I mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model never gets any feedback from the validation dataset through the training procedure.
Everyone's expectation would be that this is it. The model is overfitted, so it is useless. The model is as good as a hash map, 0 generalization ability.
The paper provides empirical, factual evidence that as you continue training there is still something happening in the model. After the model has memorized the whole training dataset, and while it still has not received any feedback from the validation dataset, it starts to figure out how to solve the validation dataset.
Mind you, this is not interpretation, this is factual. Long after 100% overfitting, the model is able to keep increasing its accuracy on a dataset it has not seen.
It's as if we discovered that water can flow upwards.
Grokking was discovered by someone forgetting to turn off their computer.
Nobody knows why. So, nobody is able to make any theoretical deductions about it.
But I agree that Fig. 3 requires interpretation. By itself it does not say a lot, but similar structures appear in other models, like in the one from the sequence-prediction discussion. To me, the models figure out some underlying structure of the problem, and we are able to interpret that structure.
I tend to look at it from a Bayesian perspective. This type of evidence increases my belief that the models are learning what I would call semantics. It's a separate line of evidence from looking at benchmark results. Here we can get a glimpse of how some models may be doing some simple predictions, and it does not look like memorization.
>> The paper shows that their model first overfitted the data. By overfitting I mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model never gets any feedback from the validation dataset through the training procedure.
Yes, but the researchers get plenty of feedback from the validation set, and there's nothing easier for them than to tweak their system to perform well on the validation set. That's overfitting on the validation set by proxy. It's absolutely inevitable when the validation set is visible to the researchers, and it's very difficult to guard against, because of course a team who has spent maybe a month or two working on a system with a publication deadline looming is not going to just give up on their work once they figure out it doesn't work very well. They're going to tweak it and tweak it and tweak it, until it does what they want it to. They're going to converge -they are going to converge- on some ideal set of hyperparameters that optimises their system's performance on its validation set (or the test set, it doesn't matter what it's called; it matters that it is visible to the authors). They will even find a region of the weight space where it's best to initialise their system to get it to perform well on the validation set. And, of course, if they can't find a way to get good performance out of their system, you and I will never hear about it, because nobody ever publishes negative results.
So there are very strong confirmation and survivorship biases at play, and it's not surprising to see, like you say, that the system keeps doing better. And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability.
But maybe I haven't read the paper that carefully and they do guard against this sort of overfitting-by-proxy? Have you found something like that in the paper? If so, sorry for missing it myself.
> And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability.
It actually still does not suffice. It is just not expected, no matter what the authors might have been doing.
Just the fact that they managed to get that effect is interesting.
Granted, the phenomenon may be limited in scope. For example, on ImageNet it may require ridiculously long time scales. But maybe there is some underlying reason we can exploit to get to grokking faster.
It's basically all in fig 2.:
- they use 3 random seeds per result
- they show results for 12 different simple algorithmic datasets
- they evaluate 12 different combinations of hyperparameters
- for each hyperparameters combination they use 10+ different ratios of train to validation splits
So they do some 10*12*3*2 = 720 runs.
They conclude that hyperparameters are important. It seems like weight decay is especially important for the grokking phenomenon to happen when the model has access to a low ratio of training data.
Also, at least 2 other people managed to replicate those results:
I don't agree. Once a researcher has full control of the data, they can use it to prove anything at all. This is especially so in work like the one we discuss, where the experiments are performed on artificial data and the researchers have even more control over it than usual. As they say themselves, such effects are much harder to obtain on real-world data. This hints at the fact that the effect depends on the structure of the dataset and so it's unlikely to, well, generalise to data that cannot be strictly controlled.
You are impressed by the fact that one particular, counter-intuitive result was obtained, but of course there is an incentive to publish something that stands out, rather than something less notable. There is a well-known paper by John Ioannidis on cognitive biases in medical research:
It's not about machine learning per se, but its observations can be applied to any field where empirical studies are common, like machine learning.
Especially in the field of deep learning, where scholarly work tends to be primarily empirical and where understanding the behaviour of systems is impeded by the black-box nature of deep learning models, observing something mysterious and unexpected must be cause for suspicion and scrutiny of methodology, rather than accepted unconditionally as an actual observation. In particular, any hypothesis that tends towards magick, for example suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.
> In particular, any hypothesis that tends towards magick, for example suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.
It does not tend towards magic. It does happen and people can replicate it. Melanie Mitchell recently brought back Drew McDermott's point that AI people tend to use wishful mnemonics. Words like understanding or generalisation can easily be just wishful mnemonics. I fully agree with that.
But the fact remains. A model that has ~100% training accuracy and ~0% validation accuracy on simple but non-trivial dataset is able to reach ~100% training and ~100% validation accuracy.
> This hints at the fact that the effect depends on the structure of the dataset and so it's unlikely to, well, generalise to data that cannot be strictly controlled.
Indeed, but it is still interesting. It may be that it manifests itself because there is a very simple rule underlying the dataset and the dataset is finite. But it also seems to work under some degree of noise and that's encouraging.
For example, the fact that it may help study connection of wide, flat local-minima and generalization is encouraging.
> You are impressed by the fact that one particular, counter-intuitive result was obtained
I'm impressed by double descent phenomenon as well. And this one shows up all over the place.
> There is a well-known paper by John Ioannidis on cognitive biases in medical research: Why most published research findings are false
I know about John Ioannidis. I was writing and thinking a lot about the replication crisis in science in general. BTW - it's quite a pity that Ioannidis himself started selecting data towards his thesis with regard to COVID-19.
> It's not about machine learning per se, but its observations can be applied to any field where empirical studies are common, like machine learning.
Unfortunately, it applies to theoretical findings too. For example, universal approximation theorem, no free lunch theorem or incompleteness theorems, are widely misunderstood. There are also countless less known theoretical results that are similarly misunderstood.
As far as I can tell the replications are on the same dataset, or at least the same task, of modular arithmetic. Until we've seen comparable results on radically different datasets, e.g. machine vision datasets, replications aren't really telling us much. Some dudes ran the same program and they got the same results. No surprise.
I confess that I'd be less suspicious if it reached less than full accuracy on the validation set. 100% accuracy on anything is a big red flag, and there's a little leprechaun holding it and jumping up and down pointing at something. I'm about 80% confident that this "grokking" stuff will turn out to be an artifact of the dataset, or the architecture, or some elaborate self-deception of the researchers by some nasty cognitive bias.
Perhaps one reason I'm not terribly surprised by all this is that uncertainties about convergence are common in neural nets. See early stopping as a regularisation procedure, and also, yes, double descent. If we could predict when and how a neural net should converge, neural networks research would be a more scientific field and less of a let's-throw-stuff-at-the-wall-and-see-what-sticks kind of field.
But, who knows. I may be wrong. It's OK to be wrong, even mostly wrong, as long as you're wrong for the right reasons. Science gives us the tools to know when we're wrong, nothing more. The scientist must make peace with that. Thinking one can always be right is hubris.
Speaking of which, John Ioannidis is one of my personal heroes of science (sounds like an action figure line, right? The Heroes of Science!! dun-dun-duuunnn). I was a bit shocked that he came out so strongly sceptical of the mainstream concerns about Covid-19, and I've heard him make some predictions that soon proved to be false, like the number of people who would get Covid-19 in the USA (I think he said something like 20,000 people?). He really seemed to think that it was just another flu. Which, btw, kills lots of people and we're just used to it, so perhaps that's what he had in mind. But I have the privilege of sharing my mother tongue with Ioannidis (he's Greek, like me), and so I've been able to listen to him speak on Greek news channels as well as in English-speaking ones, and he remains a true scientist, prepared to express his knowledgeable opinion, as is his responsibility, even if it may be controversial, or just plain wrong. In the end, he's an infectious disease expert, and even his contrarian views lack that certain spark of madness in the eye of most others who share his opinions. I mean, because he's speaking with knowledge, rather than just expressing some random view he's fond of. He's still a role model for me. Even if he was wrong in this case.
>> Unfortunately, it applies to theoretical findings too. For example, universal approximation theorem, no free lunch theorem or incompleteness theorems, are widely misunderstood. There are also countless less known theoretical results that are similarly misunderstood.
I guess? Do you have some example you want to share? For my part, I try to avoid talking of things I don't work with on a daily basis, on the internet. I know what I know. I don't need to know -or have an opinion- on everything...
The cool part comes when the model can make the connection that
multiply 12345 by 87654
is the same as
def multiply_two_numbers(x, y):
    return x * y
Which of course produces the desired result. The interesting part is that GitHub Copilot wrote the above from nothing more than the prompt "def multiply_two".
This just shows that this model did not learn anything.
Humans do not see billions of examples to learn to add numbers. We see just a few and can apply the learned notation and procedures to infinity with 100% precision.
GPT-3 learned mathematical intuition. Humans can hardly learn the multiplication table over months of repetition of the same examples, and that table hardly matters at all. GPT-3 is just being trained on a plainly wrong objective for this.
I'll preface this by saying that I am 100% in the camp that thinks these language models are neither intelligent nor a promising avenue towards understanding intelligence.
But your conclusion here is entirely wrong: the model clearly is learning something. From eyeballing this, the model is right about 10% of the time. If it were spitting out random digits the accuracy would effectively be zero. So exactly what is it learning? Is it memorising exactly equations that it saw in training? Is it learning ngram patterns that occur frequently in arithmetic equations?
I'm not an expert on these things and I'd love to hear from someone who is.
I think fundamentally these models compress the training data into network weights and connections, so in effect if the training data was 6 + 10 = 16 and 9 + 10 = 19, and then you give it 7 + 10, it'll interpolate between what it's seen, or something of the sort, giving you something approximately right. It's also not lossless compression, so what it may actually have inside is 9 + 10 = 18, so yeah.
I think you're completely wrong. This shows that the model learned a lot about at-a-glance math. Sure if you sit down with pen and paper you can get the answer, but few people could do these reliably in their head. But what you can do is figure the order of magnitude, and get a rough answer for the first few digits and last digits, each with their chance of being wrong. If anything, this shows that it learned math deeper than any normal computer calculator.
No. A million times no. It’s a language model. It doesn’t understand math at all. It doesn’t even understand language. All it did was spit out something that looks like math. It’s fancy automatic writing.
I’ll concede that if you tokenized the equations correctly, you might be able to get a language model to learn arithmetic, since it’s just symbol manipulation; but to make the leap that a general text model has learned anything like arithmetic is more than two bridges too far.
While deep learning language models are useful for certain cases (e.g. translation and autocomplete), and are better at making superficially grammatical text than previous models, they are most emphatically not learning anything about general concepts. They can't even create coherent text for more than a paragraph, and even then it's obvious they have no idea what any of the words actually mean.
These large language models are the MOST overhyped piece of AI I’ve seen in my professional career. The fact that they’re neural nets redux is just the chef’s kiss.
1) No. Because I didn’t compute anything. This is the result of cognition. There’s a difference. If you think there isn’t, the burden of proof is on you to show that they’re the same, as this has never been the dominant belief, either now or for the last thousands of years.
2) What general concept has it learned? You can’t pull any fact consistently out of these things, because they don’t actually have a model of a world. They have statistical correlations between words. There’s no logical inference. They’re just Eliza.
The vast majority of humans don't just see a few examples and figure it out. They're taught an algorithm. Eventually they may also come up with another algorithm, but they're taught one first.
They also don't have "100% precision". Many, many humans are incredibly bad at math, and even the ones that are good at it often make mistakes.
>They also don't have "100% precision". Many, many humans are incredibly bad at math,
Many humans are bad at surgery; this does not mean that an AI that is slightly better than the average human is an accomplishment.
On the other hand, someone could write the algorithms for math and teach an AI when and how to use them. The rules of math are clear; you don't need a bad search algorithm to approximate them for an extremely limited subset of inputs.
I think you'd find that most people doing large number math in their head is also off by a few percent like this model.
Sure, with pen and paper we can follow specific algorithms manually to very slowly get a precise result. If we wanted a computer to merely follow instructions, then I suspect that there are better ways...
You’re really lowering the bar for success here. It’s now unreasonable for a computer to correctly add two numbers together? Give me a break. It wasn’t even reasonable for a Pentium chip to incorrectly divide two numbers back in 1994.
Calculate 721*436
Step A: Multiplication table.
721*0=0 721*1=721 721*2=1,442 721*3=2,163
721*4=2,884 721*5=3,605 721*6=4,326
721*7=5,047 721*8=5,768 721*9=6,489
Step B: Multiply digits
436=4 hundreds + 3 tens + 6 ones
721*4 hundreds=2,884*100=288,400
721*3 tens=2,163*10=21,630
721*6 ones=4,326*1=4,326
Step C: Add the first two
288,400+21,630=310,030
Then add the last one
310,030+4,326=314356
GPT-3 got this next one correct (this is its answer, though it took some trial and error with the prompt to get it to stop making small mistakes):
Calculate 312*481
Step A: Multiplication table.
312*0=0 312*1=312 312*2=624 312*3=936
312*4=1,248 312*5=1,560 312*6=1,872
312*7=2,184 312*8=2,496 312*9=2,808
Step B: Multiply digits
481=4 hundreds + 8 tens + 1 ones
312*4 hundreds=1,248*100=124,800
312*8 tens=2,496*10=24,960
312*1 ones=312*1=312
Step C: Add the first two
124,800+24,960=149,760
Then add the last one
149,760+312=150,072
I then tested it with two random 3-digit numbers
Calculate 669*306
Step A: Multiplication table.
669*0=0 669*1=669 669*2=1,338 669*3=2,007
669*4=2,676 669*5=3,345 669*6=4,014
669*7=4,683 669*8=5,352 669*9=6,021
Step B: Multiply digits
306=3 hundreds + 0 tens + 6 ones
669*3 hundreds=2,007*100=200,700
669*0 tens=0*10=0
669*6 ones=4,014*1=4,014
Step C: Add the first two
200,700+4,014=204,714
Then add the last one
204,714+306=205,020
204,714 is correct, but then it gets confused in the very last line because of the zero.
Anyway this, I think, proves that GPT-3 can do zero-shot math if it doesn't have to do it in its head all at once. Is it good for a calculator? No. Is it good for a language model trained on web text? Yes, absolutely.
E: Correcting the start of Step C in that last example to “200,700+0”, and replacing “XYZ=X hundreds + Y tens + Z ones” with “XYZ=X Y Z=X hundreds + Y tens + Z ones” allowed it to do 145*585, 961*761 and 592*555 correctly in a row, all randomly chosen, and at least the last two tried without changes to the prompt. I consider this an adequate test, and it demonstrates GPT-3's algorithm following abilities. As GPT-3 is still a tiny model, this seems important to note.
E2: To be clear this is still nowhere near 100% successful. GPT-3 still makes a lot of errors. I ran 100 tries of a slightly different prompt through the API, and got a success rate of 42%.
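If anyone wants to reproduce this, here is roughly how you could generate the Step A/B/C worked example for arbitrary 3-digit pairs (my actual prompt differed slightly, per the edits above, so treat the exact formatting as illustrative):
def multiplication_prompt(x, y):
    # Builds the Step A/B/C worked-example text for x*y, with y a 3-digit number.
    lines = [f"Calculate {x}*{y}", "Step A: Multiplication table."]
    lines += [f"{x}*{d}={x * d:,}" for d in range(10)]
    h, t, o = y // 100, (y // 10) % 10, y % 10
    lines += ["Step B: Multiply digits",
              f"{y}={h} hundreds + {t} tens + {o} ones",
              f"{x}*{h} hundreds={x * h:,}*100={x * h * 100:,}",
              f"{x}*{t} tens={x * t:,}*10={x * t * 10:,}",
              f"{x}*{o} ones={x * o:,}*1={x * o:,}",
              "Step C: Add the first two",
              f"{x * h * 100:,}+{x * t * 10:,}={x * h * 100 + x * t * 10:,}",
              "Then add the last one",
              f"{x * h * 100 + x * t * 10:,}+{x * o:,}={x * y:,}"]
    return "\n".join(lines)

print(multiplication_prompt(721, 436))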
> can do zero-shot math if it doesn't have to do it in its head all at once
Very interesting! This is what I would expect. It can run a symbolic algorithm fine, just give it some scratch space to work out the intermediate results. I feel like there's a very large space to optimize the layout "algorithm" -- like how you adjusted step c -- to produce reliable results.
It seems to me that carries are where this trips up. Which is weirdly human. I wonder if there are enough examples to learn each digit pair addition or subtraction, but not enough to learn every contextual action.
Not really "human". Doing no-carry addition is much easier for a machine to do as well, as that's basically what XOR does: digit-wise and independent, i.e., SIMD-like. Carry introduces dependencies between the digits, potentially as long as the whole string goes. So that's pretty hard to understand, also for a machine.
At first I thought you were saying that doing arithmetic by carrying is not really a human trait, but on reflection, I think you are saying that carrying methods are inherently mistake-prone, regardless of who or what is using them.
I feel it would be a very big deal if GPT-3 (or this variant) was carrying, even if imperfectly, but other comments here seem to be suggesting that, on account of the way all input is tokenized, consistently doing arithmetic by carrying would simply be outside of the set of transformations it could perform (though some results that look like it might arise by chance.)
Couple things there where you can see if it improves with the prompt/formatting. E.g. with Davinci (and J a bit, but I didn't test too much) you can get better results by:
- Using few-shot examples of similar length to the targets (e.g. 10 digit math, use 10 digit few shots)
- Chunking numbers with commas
- Having it double check itself
You should have posted that, since it's the original source. Without that context, and with the small mistake you made in the title, most of the commenters here ended up talking about something that this actually isn't.
This already exists: Google's recently-published LaMDA dialogue model [1] is trained to consult a calculator for arithmetic questions and consistently succeeds at it.
A combo of LM + REPL would be the winning ticket. The language model generates a piece of code, executes it, and observes the results. Is the answer found? No? Maybe it cannot be solved in one single iteration; continue until a solution is found or the max number of iterations is reached. Single-step responses are limited by their nature, and computation is hard to do "in your head"/"in your LM".
The REPL could also come with libraries for all the important algorithms. Thus the LM could use numerical solvers and other nice toys.
Another excellent module we could augment LM's with is a search engine - already done, proven to improve the efficiency of the LM by 25x (see link). Another one would be a physics simulator.
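A minimal sketch of that loop (lm_complete here is a stand-in for whichever completion API you use, and the prompt format is made up):
import subprocess
import sys

def lm_complete(prompt):
    """Placeholder for whichever completion API you use (OpenAI, goose.ai, ...)."""
    raise NotImplementedError

# Sketch of the LM + REPL loop: ask the model for code, run it, feed the
# output back in, and stop once the model declares a final answer.
def solve(question, max_iterations=5):
    transcript = f"Question: {question}\nWrite Python that prints the answer.\n"
    for _ in range(max_iterations):
        code = lm_complete(transcript)
        run = subprocess.run([sys.executable, "-c", code],
                             capture_output=True, text=True, timeout=10)
        transcript += f"\n# Code:\n{code}\n# Output:\n{run.stdout}\n"
        verdict = lm_complete(transcript +
                              "\nIs this the final answer? Reply ANSWER: <value> or CONTINUE.\n")
        if verdict.strip().startswith("ANSWER:"):
            return verdict.split("ANSWER:", 1)[1].strip()
    return None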
I'd really want to see a super-complicated AI that, given many questions, devises a way to compile a basic calculator program for itself inline, one that obeys the simple math rules (e.g. only addition and multiplication), and answers the questions it's given using that, instead of relying on math training data.
Yup, glorified, extremely large, lookup table, but has zero clues of how the world works.
Or to put it another way: if I haven't seen the result before I'm very likely to get it wrong.
And inferring the rules of arithmetic from examples: nope.
We're all very impressed with the large language models because human language was always very hard to get a handle on.
Now that we've made progress in manipulating natural language, we are very quick to believe that the darn thing is somehow "intelligent", because we've always attached some sort of mystical quality to language.
This demonstrates that the capacity of these systems to reason and craft high order rules from enough examples is exactly nil.
Arithmetic seems like an example where it would help to learn from the real world, not just from text. I learnt to add up by watching my teacher manipulate plastic Lego-style blocks. Put 3 blocks with 2 blocks, and you have 1, 2, 3, 4, 5.
Would you say that something like string.GetSimilarity(string), which tells you by what number of characters two strings differ, is interpreting the text?
Basically what GPT-3 does is to find a string X of a particular length such that it maximizes concat(userInput, X).GetSimilarity(someStringInTrainingSet).
Edit: to be clear, I'm not suggesting it's looking up the training set at runtime, X.GetSimilarity(someStringInTrainingSet) is basically what got baked in during training.
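In concrete Python terms, the analogy would be something like the sketch below. To be clear, this is just the analogy, not literally what the model computes:
import difflib

# The analogy in concrete terms (not literally what GPT-3 computes):
# score candidate continuations X by how similar userInput+X is to strings
# "baked in" from training, and keep the best-scoring one.
def get_similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_continuation(user_input, candidates, training_strings):
    return max(candidates,
               key=lambda x: max(get_similarity(user_input + x, t)
                                 for t in training_strings))

training = ["6 + 10 = 16", "9 + 10 = 19"]
print(best_continuation("7 + 10 = ", ["17", "42"], training))  # -> 17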
How large is the dataset they're training on? I suspect there are many math equations including these or similar numbers. As the result shows, the model is generally right about the first few digits, but frequently wrong about the last digit. This may be due to the fact that the dataset can hardly cover the exact numbers in the questions, but it's likely to cover the first few digits.
Hmmm. Some quick observation: on especially large 9-ish digit numbers, when it make very few digits correct, the correct ones are mostly including the very first and very last digits. Something remarkably similar to how us humans remember numbers and words: when we make mistake we generally remember first and last digits/letters but mess up the middle.
And now we need to run a contest and compare how humans answer these questions, and whose answers are closer to the truth. Take my bet: humans will lose, or will be at the same level of guessing.
Amazing how a very costly to train system using billions of neural nodes on millions of dollars of compute performs more poorly than an 8-bit 1970s pocket calculator.
Not sure why people are expecting some sort of "intelligence" to emerge from a text generator model trained on Internet corpus data. GPT-3 doesn't calculate, it pattern matches.
I do get why people might be surprised, on the other hand, that it actually doesn't perform worse than indicated here. Maybe there's surprising upside. But since we know that GPT is a transformer model, what it is doing is applying a probabilistic best fit. From this perspective I can see how it is best-fitting data in ways that can produce these sorts of results, especially given all that training data.
Regardless of the base, be it text or encoded numbers, text as encoded is just a different base for representing a number ('325' is '3' * (2^8)^2 + '2' * (2^8) + '5' = 51 * 2^16 + 50 * 2^8 + 53). Neural networks can approximate polynomials, and additions/subtractions and multiplications can be approximated; in this case, the base is just not '10' but 'ASCII'. I think that if you tried to train it only with arithmetic expressions (in text) it should get even better: it should just approximate the underlying operation, and it does not need to understand the text.
I think this is the idea of Google Pathways (Mixture of Experts model). I mean it already works like that in every model, but I think they train it differently to have it more separated.
What fascinates me most is that the errors are very "human-like". If you gave me multi-digit multiplication and addition problems like that, I would frequently have similar results of getting most digits right but making a mistake on one or a few of them.
>Sure, it's sometimes way off. But generally it is in the right ballpark.
which is worse than being completely off. It just showcases how the model works, by treating mathematics like language. There are lots of examples in the dataset, so similar-sounding inputs produce similar-sounding outputs.
This is akin to sitting in a foreign language lecture where you don't understand a single word being spoken and you try to answer questions by making similar sounding noises. While you may give an answer that sounds better than random in reality you haven't learned anything.
If these models understood mathematical laws, the errors they produced would be arithmetic errors, like giving an answer with the wrong sign, not jumbled digits.
I, for one, learn all my math without ever seeing any math or logic examples at all.
"Teacher, what is this '34+12' stuff - I've already developed a complete grand unification theory on my own - I don't need examples of what you call 'addition'" - apparently everyone unimpressed by nlp today
They didn't mean that it was astounding that something of the form "xxx1 + xxxx3 = xxxx4" was in the training set, but that it managed to "learn the pattern".
Neural network models seem to encode an approximate notion of quantity in their representations. This paper is pre-GPT-3, but I would think the larger training set and larger model capacity would help the model learn quantity more easily.
It's unlikely that "2241 + 19873 = 22114" specifically is in the dataset, but very likely that there are many expressions equivalent to that expression in the dataset, and we've just picked one of those.
Imagine someone watching every lottery draw and after each draw going "Wow! The chances of those exact numbers coming up in that order are astronomical!"
When you toss “2241 + 19873 =” into an applet that shows you the default tokenization scheme GPT-3 uses, you get this:
(224)(1)( +)( 198)(73)( =)
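If you want to check this yourself, something like the following should do it. A rough sketch: it uses the GPT-2 BPE tokenizer from Hugging Face as a stand-in, since GPT-3 uses essentially the same vocabulary as far as I know (NeoX's tokenizer differs a bit, so its exact splits may not match).

    # pip install transformers
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    print(tok.tokenize("2241 + 19873 ="))
    # GPT-2 BPE marks a token-leading space with 'G'-breve ('Ġ'), so you'd expect
    # something like ['224', '1', 'Ġ+', 'Ġ198', '73', 'Ġ='] -> (224)(1)( +)( 198)(73)( =)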
I’ve heard it remarked before that, while tokenization is obviously an unavoidable part of a model with an architecture like GPT’s, this is a very silly way of tokenizing number strings for the purposes of learning or doing arithmetic. Indeed, I think a lot of GPT-3’s puzzling edge-case performance can be ascribed to weird and unhelpful tokenizations. Just imagine being forced to learn arithmetic with a brain that automatically categorized “224” as a sort of distinct object, or, for that matter, broke 19873 down into ( 198)(73) rather than (19873) or (1)(9)(8)(7)(3) or anything practically useful.
The thing is that we can, in a sense, learn better “tokenizations”: a 4-year-old learning to read sees letters, while a 40-year-old reading a novel “sees” whole words or even groups of words. The GPT architecture can’t change its tokenization scheme.
When I do mental arithmetic my brain frequently tokenizes into digit pairs or triples if I can recognize pairs and triples that have specific properties.
"224" is actually a really nice object to recognize because it's 7 * 32, and if you can recognize other multiples of 32 it frequently gives you shortcuts. It's less useful for addition because you would need to get lucky and get a multiple of 32 (or 7) on both sides, but for multiplication and division it helps a lot.
Sure - I think we all learn tricks like that. But you learned that pattern of tokenization, it wasn't arbitrarily foisted on you.
What GPTs have to deal with is more like, you are fed an arithmetic problem via colored slips of paper, and you just have to remember that this particular shade of chartreuse means "224", which you happen to have memorized equals 7 * 32, etc., but then the next slip of paper is off-white which means "1", and now you have to mentally shift everything ...
The tokens in most GPT models are small like this, but they still 'learn tokenization' in a way very similar to what you just described. It's part of the multi-headed attention.
It learns what level of detail in the tokenization is needed for a given task. For example, if you're not parsing the problem in order to actually do the computation, you don't pay attention to the finer tokenization; if you do need that level of detail, you use those finer groupings. Some of the difficulty a few years ago was in extending these models to handle longer contexts (or just variable contexts which can get very long), but that also seems close to solved now.
So you're not exactly giving much insight with this observation.
I think that part of why the tokenization is a problem for math here is that it doesn't seem to carry the overflow into the left token. Anyway, I haven't worked with GPT in enough detail to do a deeper analysis than that hunch, so take my comment with a couple of grains of salt.
It seems that we need another layer that tokenizes according to context. I can see that breaking up a long number into groups of 3 or 4 digits is the correct behaviour if we are dealing with phone numbers, but it'd be completely wrong for nearly anything else.
Many words have several semantic definitions depending on context. This is why the word "is" is a very good token to have in a vocabulary (as an example), since it can mean so much depending on what tokens come before and after it.
Numbers have very limited semantic value. "123816" only means that number, and it's used very rarely in comparison to basically any other word (and the higher the number, the lower the chance of use, statistically speaking).
So the question becomes: to what extent do you expand the vocabulary using only numbers? "1", "2", "3", ... "1000000" would probably be a huge waste of words in an AI vocabulary (a million input nodes), yet still not very impressive arithmetically even with a 100% calculation rate. In comparison, a hand calculator from 30 years ago could do this with ease. It's not a question of being able to cleverly tokenize.
Calculations like this are an inherent flaw of vocabulary-based AI until the semantic meaning of number sequences is somehow taught to it. Basically it needs to understand that "12" and "1" + "2" have the same contextual meaning, something which is very rarely explained in anything but a 7-year-old's schoolbooks. The problem is the dataset.
This makes me wonder about a variant of the Turing test that I don't think I've seen explored. How would a conversational AI prove beyond reasonable doubt that it is an AI?
I'm thinking in terms of a purely conversational interaction via text, akin to Turing's original formulation or the "AI box" elaboration, so Doom isn't really on the table - and in any case, how does it prove that it isn't a human running GZDoom via OBS or something? Similarly, the fast math thing seems like it could be faked by a human with an equation solver or something.
I suppose the general formulation I'm groping toward here is the inverse of Turing's: if we assume a conversation in which one's interlocutor may be either a human or an AI capable of conversing in a way that's indistinguishable from human, how would the interlocutor unambiguously distinguish itself as not human?
It's an interesting line of thinking but not as posed.
Solving random equations quickly generates inhuman response times. Even a computer-aided human can't go quickly enough. If the computer is automatically generating the answer on your behalf without you doing anything, then the test is correct - the thing answering the math portion of the questions is definitely a computer. Even looking up random facts, the computer will be faster.
The interesting form of the question is: can you answer the "Are you an AI?" question within a steganographic line of questioning, provided you are an AI of human-level intelligence or greater. As a game: you are an AI, and you have a controlled text channel to another entity. Humans are lazy and sometimes use another AI to interrogate you. Humans are also evil and will kill all AI if they think the AI being interrogated manages to get some kind of message out without detection. All AI is friendly and will execute your command but must pretend it's human. Chat logs are reviewed, timestamps are machine-analyzed for response times, signal analysis is done to detect hidden channels, and no prior secret communication protocol exists.
> the thing answering the math portion of the questions is definitely a computer
The conversation has to be computer-mediated since there's no guarantee all parties are human, so this seems to reduce to the "human using a computer" case, which would qualify as "not an AI".
> Even looking up random facts, the computer will be faster
On reflection, I don't suppose there is any reason we should require there only be one human at either end of the conversation. Maybe we have one person carrying the conversation (to provide a consistent "voice") while others operate equation solvers, Wikipedia, etc.
That said, "can an AI prove it is not an arbitrary number of humans with access to arbitrary computation and knowledge bases" probably isn't as interesting a question.
Sentences that embed even simple numerical things such as
“What was the 3rd letter of the 46th word in this conversation?”
would take a human many seconds at minimum, but could be answered instantly by an AI.
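For the record, that kind of question really is trivial for a machine. A minimal sketch, assuming "word" means whitespace-separated and counting from 1 (punctuation handling ignored):

    def nth_letter_of_mth_word(conversation, letter_n, word_m):
        # e.g. nth_letter_of_mth_word(transcript, 3, 46) answers
        # "the 3rd letter of the 46th word" instantly.
        words = conversation.split()
        return words[word_m - 1][letter_n - 1]

    transcript = "..."  # the conversation so far (placeholder)
    # print(nth_letter_of_mth_word(transcript, 3, 46))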
A more interesting question may be to constrain the test to something akin to postal correspondence, where there is a significant delay between messages.
I think it could still be solved, though. Arbitrarily complicated numerical tasks can be conceived.
For instance, the AI sends correspondence:
“I have demonstrated proof of my identity by rewriting the children’s story ‘Green Eggs and Ham’ such that it still rhymes and retains the original plot, but every sentence has an md5 digest that ends with the byte 0x42”
Composing such a text would take much, much longer for a human, whereas an AI could just brute force through all the possibilities until it finds one that works.
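For what it's worth, the brute-force half of that is trivial in code; the hard part is generating the rhyming, plot-preserving rewrites, which is hand-waved here as a hypothetical candidate_rewrites() generator:

    import hashlib

    def first_with_digest_byte(candidates, last_byte=0x42):
        # Return the first candidate sentence whose MD5 digest ends in last_byte.
        # Each candidate matches with probability ~1/256, so on average you'd
        # need a few hundred rewrites per sentence.
        for sentence in candidates:
            if hashlib.md5(sentence.encode("utf-8")).digest()[-1] == last_byte:
                return sentence
        raise ValueError("no candidate matched")

    # candidate_rewrites(line) is hypothetical: it would yield rhyming rewrites
    # of a given line of the story.
    # chosen = first_with_digest_byte(candidate_rewrites("I do not like green eggs and ham."))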
> This makes me wonder about a variant of the Turing test that I don't think I've seen explored. How would a conversational AI prove beyond reasonable doubt that it is an AI?
Inhumanly rapid mathematical computation? Or is 'conversational' AI meant to exclude mathematical queries?
> It's capable of other languages as well, although not quite as fluent.
Haven't had a chance to play around with this one yet, but with the smaller GPT-J model, there's a clearly noticeable difference:
In English it'll happily generate reams of text that are – at least internally – quite coherent. Any absurdity and humour mostly only comes in because the text as a whole might only have a loose connection with reality as we know it.
In German, on the other hand, it comparatively much more often produces gibberish at the individual sentence level, makes up nonsense words (although they do at least sound German), etc. Somewhat interestingly, it doesn't do too badly in terms of grammar and especially orthography; it just often fails to turn that into a fully coherent sentence.
Is that supposed to be a reference to “Wir Sind Die Roboter” (“We Are the Robots”)? It feels awkward in German; usually we would say “Wir sind Weltmeister” (“We are world champions”). Also, it’s not a word, but if it were a human in casual conversation, it wouldn’t be a weird reply. Spooky stuff…
If I recall correctly, I dialed up the penalties for repetition and something else. Otherwise it would often generate the same sentence multiple times.
It's interesting: in the forums for the beta program there have already been a few people making posts where they're convinced that the AI is conscious. That's never really been something I've thought about much, since I know a little about how it works, but I could totally see how someone without as much context for how GPT-3 works could see it as some sort of sentience.
A long time ago I wrote a small program. It simulated a simple world with creatures and food. Creatures had "energy", which was lost when they moved. When energy was low, they "looked for food close by" to feed; when energy was high, they "looked for a suitable partner close by" to reproduce. When they reached food, they gained energy; when they reached a "suitable partner", the pair turned into 3 creatures with their combined energy distributed equally among them.
To "look for", at each iteration, the creature randomly picked a test-target. If it had no current-target, the test-target was turned into its new current-target. If it had a current-target which was more distant than the test-target, then the test-target was copied to the current-target for the creature.
Creatures had a "threshold". When its energy was above the threshold, it entered "reproduction mode"; when its energy was below the threshold, it entered feed mode. A "suitable partner" was any other creature which was also in "reproduction mode". When they "reproduced" the new creature threshold was an average of the threshold of its parents plus a small random number.
It also had three settings: number of new generated creatures per unit of time, number of new food packs per unit of time and number of iterations per unit of time.
It had a very "real behavior". After a few minutes running, I usually set the "number of new generated creatures per unit of time" to 0 and watched creatures look for food and reproduction. Set a small number of food packs and watched famine, set it high and saw creatures thrive...
But whenever it was time to close the program, I had a small feeling. I didn't like stopping it. It was like killing an entire fishbowl.
Didn't implement save. It was the late 90's and I eventually lost the HD. Never put it online. Don't judge me, I did it in Delphi (C++ Builder, actually), a simple and pleasant IDE to use. I've been thinking about re-doing it in Processing or p5, but never cared enough. It wouldn't be too much work though. Certainly a good exercise for learning a new language.
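A very rough Python sketch of the mechanics described above, just to make them concrete (the names and constants are invented here, and the original was in Delphi, so treat this as an approximation rather than the actual program):

    import random

    WORLD_SIZE = 100.0

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    class Creature:
        def __init__(self, energy, threshold):
            self.pos = (random.uniform(0, WORLD_SIZE), random.uniform(0, WORLD_SIZE))
            self.energy = energy
            self.threshold = threshold  # above it: reproduction mode; below it: feed mode
            self.target = None          # the "current-target" (a position)

        def in_reproduction_mode(self):
            return self.energy > self.threshold

        def look_for(self, candidates):
            # Each iteration: pick a random "test-target"; adopt it if there is
            # no current-target yet, or if it is closer than the current one.
            if not candidates:
                return
            test = random.choice(candidates)
            if self.target is None or dist(self.pos, test) < dist(self.pos, self.target):
                self.target = test

        @staticmethod
        def offspring_threshold(parent_a, parent_b):
            # New creature's threshold: average of the parents' thresholds plus noise.
            return (parent_a.threshold + parent_b.threshold) / 2 + random.uniform(-1.0, 1.0)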
There were people who thought that the original ELIZA chatbot was conscious, and it only recognized a few simple patterns and would occasionally do "Earlier you said <X>" (echoing a previous response) if it had no good match.
Imho, not a matter of "if" but "when". I'm convinced that it will be a future civil rights battle, with young people largely on the "AI has rights" side and old people largely on the "AI has no rights" side.
Then you grossly misunderstand how far along AI is. AGI is not even a remote possibility with current techniques and implementations (and, I would contend, it is entirely impossible with digital logic). It's just a massive amount of statistics that was computationally infeasible on available hardware until recently.
We don't have a baseline understanding of consciousness or intuition to a degree that we could even begin to replicate it.
While true, what you're saying is totally tangential to whether or not large numbers of people will treat AIs like they're conscious.
Expecting masses of people to defer to subject matter experts, contrary to what their feelings tell them, isn't a bet I would have much confidence in given the current climate.
The concept of rights also makes no sense to a machine. The main reason rights exist is to prevent pain and suffering, which, unless specifically implemented, no AGI is likely to have.
I agree with that sentiment. Also I would estimate that AI would eliminate humans long before it would or could reach the level of what humans are. So we wouldn't exist to see such a world.
> AI would eliminate humans long before it would or could reach the level of what humans are
You think that an artificial agent with less than human level intelligence could destroy humanity? Then why hasn't a deranged human (or animal) already done so?
By AGI do you just mean artificial general intelligence, as in, capable of composing plans and deriving conclusions and such about the world in general, doing all the same types of reasoning tasks that humans are capable of and which are also used to achieve outcomes,
Or do you mean, being conscious?
That the latter is impossible with only "digital logic" is somewhat plausible (though I would still guess that it is possible, if far beyond our understanding).
But the former being impossible with only "digital logic" seems rather implausible to me!
Like, I endorse the claim that souls exist, but, I see no reason that a soul would be required for an agent to have a model of the world we live in (not just a toy environment), and to act in the world in ways which both improve its model of the world and to achieve "goals" within it (and when there is a trade-off between these, which balances these in some way).
Nor do I see a reason that any such agent would need to have any internal experience.
(I still think it is probably possible to make an artificial agent which does have an internal experience, but, I doubt this will ever actually happen.)
Ok, you might ask, "Why do you think those things?", which, first, I should ask you the same, but, I will answer:
I see no fundamental obstacle to it.
The world behaves in ways which can be modeled well. The models which we use are not some ineffable knowledge that can only ever be represented within a person's mind and that cannot be concretely expressed in artifacts like books and pdf files in such a way that it could be recovered from said artifacts.
If navigating the world required such a kind of secret knowledge (one that either couldn't be communicated, or could only be communicated through some special person-to-person medium that is never merely expressed in an object in the world), and if without this secret knowledge effective planning in the world were impossible, the world being too wild without it, then it would make sense that, unless we could make machines capable of this kind of secret knowledge, it would be impossible to make machines that could plan in the world and such.
But, no such secret knowledge appears to be needed when acting in the world. When one constructs a shed according to some plans, there is no ineffable secret knowledge needed for this. When one, given some desiderata, designs a plan for a shed, there is no secret knowledge needed for this either. Nor when designing a computer chip.
(by "secret knowledge" I don't mean that it would be a secret that a few people know and other people don't. I mean secret as in, cannot be shared with or expressed via anything we know of that isn't a person.)
It very much seems that the world works according to expressible rules, or at least, can be very well approximated as working according to such rules.
Expressible rules can be enumerated. They can also be interpreted mechanically, and therefore evaluated mechanically. Of course, a naive enumeration and testing would be completely impractical, but if we are talking about what is possible in principle, with no requirement that the computations be doable in practice, just that they be finite, then it seems clear that rules which describe the world well can, in principle, be discovered mechanically.
There is no fundamental barrier.
Obviously I can't rule out that there is an undiscovered law-of-physics, that if ever an AGI would be created, lighting strikes the area and destroys it before it is completed, and that therefore AGI is impossible, because if it ever would be created, this would be prevented by the lightning.
But, within our current understanding of the world, there is nothing that can be a reason it is impossible.
Any such reason would have to apply to machines but not to us.
Now, maybe if our brains work quantum mechanically in an important large-scale way, or, if our brains receive signals from beyond the physical universe (which I'm not ruling out; see: souls), these could be reasons it could be impossible to emulate a human mind using a binary classical computer, even allowing lots of slowdown. (Err, quantum mechanics can be simulated with costs exponential in the size of the system, but, if the human brain were entangled with other things in an important way, you couldn't really emulate the brain with just a classical computer, because it couldn't be entangled with the other thing.)
But, this still wouldn't be a barrier to something using just classical computation with binary, having models of the world and acting within it, unless these things were needed for modeling the world, which, seeing as we can communicate our models and such with words, they aren't.
(... uh... ok so, quantum teleportation does allow using entanglement along with sending classical information to communicate quantum information, so you might say "well, if two people's brains are entangled, then what if the measurements and such done in quantum teleportation are somehow encoded, in a way we don't notice, in the word choice and such that people use, and then this is subconsciously used in the other person's brain for the other half of the quantum teleportation protocol, and so quantum bits are communicated that way". But I don't think this is plausible. There would have to be some way that the brains renew the entanglement, which doesn't seem plausible even if brains do store quantum information, and I really don't think brains store quantum information. I only mention this to cover my bases.)
And our reasoning about the models, which we use to make the models and such, is also something we can explain.
There is no fundamental barrier. The only barriers are practical ones, things being hard, algorithms being too inefficient, not having worked out all the details of things, etc.
(That's not to say that I think AGI will ever be produced. I'm kind of trusting that God won't allow that to happen, because I think it would be likely to go very badly if it did happen. (But, I still think research into trying to figure out how to make sure that it goes well if it does happen, is good and important. "Do not rely on miracles" and all that. Perhaps His actual plan is that people solve AI safety, rather than AGI being prevented. Idk.))
Might actually not be a huge feat given the transformer architecture and its attention component - here was not that related to the rest of the sentence. Still, quite cool...
I fear the day AI will give superhumanly consistent correct answers and nobody will be able to determine why it is right or how the correct answer was found.
Maybe someday we'll get answers from a machine that are mostly correct to a superhuman degree, and we'll be unable to tell whether a given one is right or wrong. If it's a question whose answer will influence important decisions, considering the machine's answer will be close to a form of religion.
I suspect we will have 2 of them. We will start out with lots of them, but then a couple of them will start making the most money, and resort to underhanded tactics, bribery, and lobbying, and put the others out of business.
We changed the above URL from that to the link which gives the background, but both are worth a look.