The median rating is not descriptive of median ability, because a large number of Codeforces competitors only do one or a few competitions. A very small number of competitors hone their skills over multiple competitions. If we were to restrict our sample to competitors with more than 20 competitions, the median rating would be much higher than 1300. It's amazing that Alphacode achieved a 1300 rating, but compared to humans who actually practice competitive coding, this is a low rating.
To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.
>> To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.
Sorry, but it's nothing of the sort. The approach is primitive, obsolete, and its
results are very poor.
I've posted this three times already, but the arXiv preprint includes an
evaluation against a formal benchmark dataset, APPS. On that more objective
measure of performance, the best-performing variant of AlphaCode tested solved
25% of the easiest tasks ("introductory") and less than 10% of the intermediate
("interview") and advanced ("competition") tasks.
What's more, the approach that AlphaCode takes to program generation is
primitive. It generates millions of candidate programs and then it "filters"
them by running them against input-output examples of the target programs taken
from the problem descriptions. The filtering still leaves thousands of candidate
programs (because there are very few I/O examples and the almost random
generation can generate too many programs that pass the tests, but still don't
solve the problem) so there's an additional step of clustering applied to pare
this down to 10 programs that are finally submitted. Overall, that's a brute-force,
almost random approach that is ignoring entire decades of program synthesis
work.
To make an analogy, it's as if DeepMind had just published an article boasting
of its invention of a new sorting algorithm... bubblesort.
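Concretely, the "filter" step described above boils down to something like the sketch below. This is my own rough reading of the paper, not their code; `passes_examples`, the data layout and the use of subprocess are invented for illustration:

    # Sketch of the filter stage: run each generated candidate against the few
    # example (input, expected_output) pairs from the problem statement and
    # keep only the candidates that agree with them.
    import subprocess

    def passes_examples(source, examples, timeout=2.0):
        # Run the candidate program on each example input and compare stdout.
        for stdin_text, expected in examples:
            try:
                result = subprocess.run(
                    ["python3", "-c", source],
                    input=stdin_text, capture_output=True,
                    text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True

    def filter_candidates(candidates, examples):
        # With millions of samples and only a handful of example tests,
        # this can still leave thousands of survivors.
        return [c for c in candidates if passes_examples(c, examples)]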
The APPS benchmark was for a “small 1B parameter model”, fine-tuned “without using clustering, tags, ratings, value conditioning, or prediction”.
> Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.
You don't get answers to these questions by random search. Not even close. I have looked at non-neural program synthesis papers. It is not remotely competitive.
Yes, apparently they couldn't use their full approach because of "missing information in the dataset". That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
Btw, APPS is not much of a benchmark. It evaluates code generation according to how closely it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation against some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).
But the comparison against the average competitor on Codeforces is even more meaningless because we have no way to know what is the true coding ability of that average competitor.
> Btw, APPS is not much of a benchmark. It evaluates code generation according to how close it resembles code written by humans.
No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
> That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
This does not matter typically since you'd just pretrain on the data that works. However, “[t]he CodeContests training set has a non-empty intersection with the APPS test set, and therefore CodeContests cannot be used during training when evaluating on the APPS benchmark.” This is purely an evaluation issue; leakage doesn't matter so much in production.
>> No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
Right, that's my mistake. The APPS dataset has natural language specifications and test cases for evaluation. It actually includes Codeforces problems.
The explanation quoted in the second part of your comment is just an excuse. If a large language model can complete a code generation task, that's because it has seen an example of the code it's asked to generate before. Any claims to the contrary need very strong evidence to support them, and there's typically no such thing in papers like the AlphaCode one.
Your comment is an excuse, not mine: “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
> Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
This is the opposite of how burden of proof works. You are the one making a claim with certainty based off of guesswork, not me. And the paper does actually have a section on this, and finds copying isn't pervasive outside of utility snippets and functions, which also occur in human solutions. It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
>> “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
Where is that quote from? Why are you quoting it? Am I supposed to reply to it?
>> You are the one making a claim with certainty based off of guesswork, not me.
I'm not talking about you. I'm talking about the paper, the team behind it and work on large language models trained on online data, in general.
The paper indeed makes a vague claim of "guarding" against data leakage by a "strict temporal split" which means they ensured that the validation and test data used for fine-tuning was not available to the model. That of course doesn't mean much. What matters is if the data on which the model was trained included programs like the ones the model was asked to generate. Clearly, it did, otherwise the model would not have been able to generate any programs that could be used as solutions to the test problems.
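For what it's worth, a "strict temporal split" amounts to little more than something like this (a sketch under my own assumptions about the data layout; the cutoff date and field names are placeholders, not the paper's):

    from datetime import date

    def temporal_split(problems, cutoff=date(2021, 7, 14)):
        # `problems` is assumed to be a list of dicts with a `published` date;
        # only problems published before the earliest held-out contest date
        # are allowed into the training set.
        train = [p for p in problems if p["published"] < cutoff]
        held_out = [p for p in problems if p["published"] >= cutoff]
        return train, held_out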
And I think you have the rule on the burden of proof a bit wrong. I don't have to prove anything that is already well-known. For instance, if I said that gravity makes things fall down, I wouldn't bear any burden of proof. Accordingly, there is no doubt that neural nets can only represent what is in their training set. That's how neural nets work: they model their training data. They can't model data that is not in their training set. It wouldn't even be fair to expect a neural net to learn to represent data that it wasn't trained on, and just to be clear, I'm not saying that there should be such an expectation, or that it is even desirable. This modelling ability of neural nets is useful. In fact, this is the real strength of neural nets, they are extremely good at modelling. I mean, duh! Why are we even discussing this?
But this is something that the deep learning community is trying to deny, to itself primarily, it seems. Which is exceedingly strange. Work like the one linked above prefers to make bizarre claims about reasoning ability that we are, presumably, expected to believe arises magically just by training on lots of data, as if there's a threshold of volume above which data is miraculously transubstantiated into an element with quite different properties, from which reasoning or "critical thinking" (dear god) emerges even in the complete absence of anything remotely like a reasoning mechanism. This is nonsense. Why not admit that in order for a large language model to be able to generate code, it must see code "like" the one it's asked to generate? Then we can talk about what "like" means, which is the interesting question. All this attempt to pussyfoot around what those systems are really doing is so counter-productive.
Again, this is not about anything you specifically say, but a criticism of deep learning research in general. I don't presume you're a deep learning researcher.
>> It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
Not as impressive as you think. The problem descriptions used on Codeforces etc. are not arbitrary English prose. They don't ask participants to write a poem about Spring (and I don't mean the old Java library). So it's not "prose" but very precise specifications. They could be represented as a Controlled Natural Language, so something much easier to model than arbitrary English.
And, yet again, the performance of the model is crap.
I was paraphrasing your argument. AFAICT, it remains faithful to what you are saying.
> Accordingly, there is no doubt that neural nets can only represent what is in their training set.
This is not true. If it were true that there was no doubt, the paper wouldn't have challenged it and claimed it false. If you assume your conclusion well obviously your conclusion follows trivially.
> And, yet again, the performance of the model is crap.
> It generates millions of candidate programs and then it "filters" them by running them against input-output examples of the target programs taken from the problem descriptions. The filtering still leaves thousands of candidate programs (because there are very few I/O examples and the almost random generation can generate too many programs that pass the tests, but still don't solve the problem) so there's an additional step of clustering applied to pare this down to 10 programs that are finally submitted. Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work. To make an analogy, it's as if DeepMind had just published an article boasting of its invention of a new sorting algorithm... bubblesort.
This is almost certainly untrue, and if this position were true, it would be extremely easy for you to prove it: just write a program-generating algorithm that solves even some of the easiest Codeforces problems. Since you're claiming this feat by AlphaCode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable. Just link your program-generating algorithm here with something like instructions on how to use it, and link a few Codeforces submissions where it got an ACC result.
To clarify, which part of my comment do you think is "almost certainly untrue"? What I describe above is how their approach works. It's summarised in the paper. See Section 4. ("Approach"):
1. Pre-train a transformer-based language model on GitHub code with standard
language modelling objectives. This model can reasonably represent the space of
human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using
GOLD (Pang and He, 2020) with tempering (Dabre and Fujita, 2020) as the training
objective. This further reduces the search space, and compensates for the small
amount of competitive programming data by leveraging pre-training.
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most
10), to be evaluated on the hidden test cases, by using the example tests and
clustering to pick samples based on program behaviour.
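Step 4's behavioural clustering, as I read it, amounts to something like the sketch below. This is not their implementation; `survivors`, `extra_inputs` and `run_program` (an assumed helper that executes a candidate and returns its stdout, e.g. via subprocess as in the earlier filtering sketch) are all invented for illustration:

    # Group surviving candidates by the outputs they produce on a set of
    # generated inputs, then submit one representative from each of the
    # largest clusters, up to the 10-submission budget.
    from collections import defaultdict

    def cluster_and_pick(survivors, extra_inputs, run_program, limit=10):
        clusters = defaultdict(list)
        for source in survivors:
            behaviour = tuple(run_program(source, x) for x in extra_inputs)
            clusters[behaviour].append(source)
        # Identical behaviour on many inputs suggests semantically equivalent
        # programs, so take one representative per cluster, largest first.
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [group[0] for group in ranked[:limit]]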
>> Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable.
What I meant was that the way they announced AlphaCode is like claiming that bubblesort is a novel approach to sorting lists. Not that the effort needed to create their system is comparable to bubblesort. I think if you read my comment again more carefully you will find that this is the first interpretation that comes to mind. Otherwise, I apologise if my comment was unclear.
Aren't you missing the point that even though the success percentage is low, it is still the same as the estimated average human performance? So it is impressive that the overall system (however clunky it is) is indeed able to match human performance. If you don't find this impressive, do you have any other example of a system that exceeds this performance?
To clarify, the low percentage of 25% correct solutions is on the APPS dataset, not against human coders. See Table 10 (page 21 of the PDF) in the arXiv paper if you are unsure about the difference; the caption is quoted further down.
Evaluation against the average competitor on Codeforces is not the "estimated average human performance"; it's only the average of the coders on Codeforces, who are an unknown proportion of all human coders with an unknowable level of coding ability. So evaluating against that is actually a pretty meaningless metric.
The benchmarking against APPS is much more meaningful but the results are pretty poor and so they are omitted from the article above.
So, no. I'm not missing the point. Rather, the article above is eliding the point: which is that on the one meaningful evaluation they attempted, their system sucks.
Table 10 | n@k results on APPS. If there is no filtering, then n = k and the
metric is pass@k. Finetuned GPT-Neo numbers reported from Hendrycks et al.
(2021), Codex numbers from Chen et al. (2021). We used a time limit of 3
seconds per test to match Codex 12B, and report average numbers over 3 different
fine-tuning runs for AlphaCode.
Edit 2: And now that I posted this, I note that the 25% solutions are from Codex. AlphaCode's best result was 20%.
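For reference, the pass@k in that caption is the fraction of problems where at least one of k sampled programs passes the hidden tests. Chen et al. (2021) estimate it from n samples per problem, of which c pass, roughly like this (a sketch; the example numbers are made up):

    # Unbiased pass@k estimator: probability that at least one of k draws
    # (without replacement) from n samples hits one of the c correct ones.
    from math import comb

    def pass_at_k(n, c, k):
        if n - c < k:
            return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 1000 samples, 35 of them correct, k = 5:
    # pass_at_k(1000, 35, 5) is roughly 0.16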
I am rated at 2100+ so I do agree that a 1300 rating is low. But at the same time it solved https://codeforces.com/contest/1553/problem/D which is rated at 1500 and which was actually non-trivial for me already. I had one wrong submission before getting that problem correct, and I estimate that 50% of the regular competitors (and probably the vast majority of the programmers commenting in this thread right now) would not be able to solve it within 2 hours.
On the AlphaCode Attention Visualization website [1], the Accepted code shown for 1553D is a O(n^2) Python one, which is supposed to be TLE. It correctly implements a two-pointer solution, but failed to "realize" that list.pop(0) is O(n) in Python. I'm not sure how it passed.
> In AlphaCode's solution the "inner O(n) loop" is actually a memmove(), which is optimized to be insanely fast.
Again, it is not. CPython does not do these things.
The web page says, and this is corroborated in the paper,
> Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect sample per problem and language. Note that since our dataset only has a limited number of test cases, passing all tests we have cannot completely rule out false positives (~4%), or solutions that are correct but inefficient (~46%).
The “54th percentile” measure did use estimated time penalties, which you can see discussed in Table 4 in the paper, but 1553D was not part of that.
Apologies, I thought you meant ‘optimized’ in a different sense, not in terms of how list.pop is implemented, as AlphaCode wasn't involved in that. You are entirely correct that list.pop uses memmove.
> Someone submitted this 1553D code to Codeforces and it passed
Ah, well, that shows the problem has a 2 second time limit, which is quite a lot of time! Not quite enough to empty a 200k element list with list.pop(0)s, but not far off; a 140k element list squeaks in under the time limit for me.
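If you want to see the effect yourself, a quick timing sketch like the one below (numbers obviously vary by machine) shows why pop(0) on a long list is the bottleneck while a deque is not:

    # Emptying a list with pop(0) is quadratic overall (each pop memmoves the
    # remaining tail), while collections.deque.popleft() is O(1) per element.
    from collections import deque
    from timeit import timeit

    N = 140_000

    def drain_list():
        xs = list(range(N))
        while xs:
            xs.pop(0)

    def drain_deque():
        xs = deque(range(N))
        while xs:
            xs.popleft()

    print("list.pop(0):     %.2fs" % timeit(drain_list, number=1))
    print("deque.popleft(): %.2fs" % timeit(drain_deque, number=1))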
The proposed O(N²) solution contains many unnecessary operations, e.g. the creation of list c or reversal of the input strings. Maybe it has been copied from a related problem? You can easily solve the task with half as many lines in O(N).
    # Greedy from the end: if the last characters match, consume one from each;
    # otherwise a's last character must be replaced by a backspace, which also
    # cancels the previously typed character, so two characters of a are
    # consumed without producing output. If b is fully matched, print YES.
    for _ in range(int(input())):
        a = list(input())
        b = list(input())
        while a and b:
            if a[-1] == b[-1]:
                a.pop()
                b.pop()
            else:
                a.pop()
                if a: a.pop()
        print("NO" if b else "YES")
I'm trying to solve this for fun, but I'm stuck! I've got a recursive definition that solves the problem by building a result string. I think it's a dynamic programming problem, but right now I can't see the shared sub-problems so :). Some real sour cherries being experienced from not getting this one!
    from collections import defaultdict
    from random import randint  # randint was used below but never imported

    def backspace(s1, s2):
        # Guard: an empty target string would crash the s2[j] lookup below.
        if not s2:
            return True
        # Count how many of each character s1 has spare relative to s2 ...
        h = defaultdict(lambda: 0)
        for x in s1:
            h[x] = h[x] + 1
        for x in s2:
            h[x] = h[x] - 1
        # ... then greedily scan s1, matching s2 in order and "deleting"
        # (decrementing) anything that doesn't match the next needed character.
        j = 0
        maxj = len(s2) - 1
        for x in s1:
            if x != s2[j]:
                h[x] -= 1
            elif j < maxj:
                j += 1
            else:
                break
        return j == maxj and all(y >= 0 for y in h.values())

    def random_backspace(s1):
        # Build a random candidate target by keeping each character with p=0.5.
        res = []
        for x in s1:
            if randint(0, 1) == 0:
                res.append(x)
        return "".join(res)

    def backspaceTest(s1):
        # Property test: backspace() should accept 100 random candidates.
        return all(backspace(s1, random_backspace(s1)) for _ in range(100))