The median rating is not descriptive of median ability, because a large number of Codeforces competitors only do one or a few competitions. A very small number of competitors hone their skills over multiple competitions. If we were to restrict our sample to competitors with more than 20 competitions, the median rating would be much higher than 1300. It's amazing that Alphacode achieved a 1300 rating, but compared to humans who actually practice competitive coding, this is a low rating.
To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.
>> To clarify, this is a HUGE leap in AI and computing in general. I don't mean to play it down.
Sorry, but it's nothing of the sort. The approach is primitive, obsolete, and its
results are very poor.
I've posted this three times already, but the arXiv preprint includes an
evaluation against a formal benchmark dataset, APPS. On that more objective
measure of performance, the best-performing variant of AlphaCode tested solved
25% of the easiest tasks ("introductory") and less than 10% of the intermediate
("interview") and advanced ("competition") tasks.
What's more, the approach that AlphaCode takes to program generation is
primitive. It generates millions of candidate programs and then it "filters"
them by running them against input-output examples of the target programs taken
from the problem descriptions. The filtering still leaves thousands of candidate
programs (because there are very few I/O examples and the almost random
generation can generate too many programs that pass the tests, but still don't
solve the problem) so there's an additional step of clustering applied to pare
this down to 10 programs that are finally submitted. Overall, that's a brute-force,
almost random approach that is ignoring entire decades of program synthesis
work.
To make an analogy, it's as if DeepMind had just published an article boasting
of its invention of a new sorting algorithm... bubblesort.
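Concretely, the "filter" step described above boils down to something like the sketch below. This is my own rough reading of the paper, not their code; `passes_examples`, the data layout and the use of subprocess are invented for illustration:

    # Sketch of the filter stage: run each generated candidate against the few
    # example (input, expected_output) pairs from the problem statement and
    # keep only the candidates that agree with them.
    import subprocess

    def passes_examples(source, examples, timeout=2.0):
        # Run the candidate program on each example input and compare stdout.
        for stdin_text, expected in examples:
            try:
                result = subprocess.run(
                    ["python3", "-c", source],
                    input=stdin_text, capture_output=True,
                    text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True

    def filter_candidates(candidates, examples):
        # With millions of samples and only a handful of example tests,
        # this can still leave thousands of survivors.
        return [c for c in candidates if passes_examples(c, examples)]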
The APPS benchmark was for a “small 1B parameter model”, fine-tuned “without using clustering, tags, ratings, value conditioning, or prediction”.
> Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work.
You don't get answers to these questions by random search. Not even close. I have looked at non-neural program synthesis papers. It is not remotely competitive.
Yes, apparently they couldn't use their full approach because of "missing information in the dataset". That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
Btw, APPS is not much of a benchmark. It evaluates code generation according to how closely it resembles code written by humans. That's standard fare for text generation benchmarks, like evaluating machine translation against some arbitrary set of human translations. There are no good benchmarks for text generation (and there are no good metrics either).
But the comparison against the average competitor on Codeforces is even more meaningless because we have no way to know what is the true coding ability of that average competitor.
> Btw, APPS is not much of a benchmark. It evaluates code generation according to how close it resembles code written by humans.
No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
> That points to a further limitation of the approach: it works for Codeforces problems but not for APPS problems (so it's very purpose-specific).
This does not matter typically since you'd just pretrain on the data that works. However, “[t]he CodeContests training set has a non-empty intersection with the APPS test set, and therefore CodeContests cannot be used during training when evaluating on the APPS benchmark.” This is purely an evaluation issue; leakage doesn't matter so much in production.
>> No, the metric used in this paper was the percentage of questions it could solve against the hidden tests.
Right, that's my mistake. The APPS dataset has natural language specifications and test cases for evaluation. It actually includes Codeforces problems.
The explanation quoted in the second part of your comment is just an excuse. If a large language model can complete a code generation task, that's because it has seen an example of the code it's asked to generate before. Any claims to the contrary need very strong evidence to support them, and there's typically no such thing in papers like the AlphaCode one.
Your comment is an excuse, not mine: “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
> Any claims to the contrary need very strong evidence to support them and there's typically no such thing in papers like the AlphaCode one.
This is the opposite of how burden of proof works. You are the one making a claim with certainty based off of guesswork, not me. And the paper does actually have a section on this, and finds copying isn't pervasive outside of utility snippets and functions, which also occur in human solutions. It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
>> “ignore how good the advertised model is, because a different, much smaller version without all the techniques merely does pretty well on an extremely tough problem set.”
Where is that quote from? Why are you quoting it? Am I supposed to reply to it?
>> You are the one making a claim with certainty based off of guesswork, not me.
I'm not talking about you. I'm talking about the paper, the team behind it and work on large language models trained on online data, in general.
The paper indeed makes a vague claim of "guarding" against data leakage by a "strict temporal split" which means they ensured that the validation and test data used for fine-tuning was not available to the model. That of course doesn't mean much. What matters is if the data on which the model was trained included programs like the ones the model was asked to generate. Clearly, it did, otherwise the model would not have been able to generate any programs that could be used as solutions to the test problems.
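For what it's worth, a "strict temporal split" amounts to little more than something like this (a sketch under my own assumptions about the data layout; the cutoff date and field names are placeholders, not the paper's):

    from datetime import date

    def temporal_split(problems, cutoff=date(2021, 7, 14)):
        # `problems` is assumed to be a list of dicts with a `published` date;
        # only problems published before the earliest held-out contest date
        # are allowed into the training set.
        train = [p for p in problems if p["published"] < cutoff]
        held_out = [p for p in problems if p["published"] >= cutoff]
        return train, held_out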
And I think you have the rule on the burden of proof a bit wrong. I don't have to prove anything that is already well-known. For instance, if I said that gravity makes things fall down, I wouldn't bear any burden of proof. Accordingly, there is no doubt that neural nets can only represent what is in their training set. That's how neural nets work: they model their training data. They can't model data that is not in their training set. It wouldn't even be fair to expect a neural net to learn to represent data that it wasn't trained on, and just to be clear, I'm not saying that there should be such an expectation, or that it is even desirable. This modelling ability of neural nets is useful. In fact, this is the real strength of neural nets, they are extremely good at modelling. I mean, duh! Why are we even discussing this?
But this is something that the deep learning community is trying to deny, to itself primarily, it seems. Which is exceedingly strange. Work like the one linked above prefers to make bizarre claims about reasoning ability that we are, presumably, expected to believe arises magically just by training on lots of data, as if there's a threshold of volume above which data is miraculously transubstantiated into an element with quite different properties, from which reasoning or "critical thinking" (dear god) emerges even in the complete absence of anything remotely like a reasoning mechanism. This is nonsense. Why not admit that in order for a large language model to be able to generate code, it must see code "like" the one it's asked to generate? Then we can talk about what "like" means, which is the interesting question. All this attempt to pussyfoot around what those systems are really doing is so counter-productive.
Again, this is not about anything you specifically say, but a criticism of deep learning research in general. I don't presume you're a deep learning researcher.
>> It's a weak objection anyway; just the task of translating english prose into the general algorithm you want to apply is already an impressive feat.
Not as impressive as you think. The problem descriptions used on Codeforces etc. are not arbitrary English prose. They don't ask participants to write a poem about Spring (and I don't mean the old Java library). So it's not "prose" but very precise specifications. They could be represented as a Controlled Natural Language, so something much easier to model than arbitrary English.
And, yet again, the performance of the model is crap.
I was paraphrasing your argument. AFAICT, it remains faithful to what you are saying.
> Accordingly, there is no doubt that neural nets can only represent what is in their training set.
This is not true. If it were true that there was no doubt, the paper wouldn't have challenged it and claimed it false. If you assume your conclusion well obviously your conclusion follows trivially.
> And, yet again, the performance of the model is crap.
> It generates millions of candidate programs and then it "filters" them by running them against input-output examples of the target programs taken from the problem descriptions. The filtering still leaves thousands of candidate programs (because there are very few I/O examples and the almost random generation can generate too many programs that pass the tests, but still don't solve the problem) so there's an additional step of clustering applied to pare this down to 10 programs that are finally submitted. Overall, that's a brute-force, almost random approach that is ignoring entire decades of program synthesis work. To make an analogy, it's as if DeepMind had just published an article boasting of its invention of a new sorting algorithm... bubblesort.
This is almost certainly untrue, and if this position were true, it would be extremely easy for you to prove it: just write a program-generating algorithm that solves even some of the easiest Codeforces problems. Since you're claiming this feat by AlphaCode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable. Just link your program-generating algorithm here with something like instructions on how to use it, and link a few Codeforces submissions where it got an ACC result.
To clarify, which part of my comment do you think is "almost certainly untrue"? What I describe above is how their approach works. It's summarised in the paper. See Section 4. ("Approach"):
1. Pre-train a transformer-based language model on GitHub code with standard
language modelling objectives. This model can reasonably represent the space of
human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using
GOLD (Pang and He, 2020) with tempering (Dabre and Fujita, 2020) as the training
objective. This further reduces the search space, and compensates for the small
amount of competitive programming data by leveraging pre-training.
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most
10), to be evaluated on the hidden test cases, by using the example tests and
clustering to pick samples based on program behaviour.
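Step 4's behavioural clustering, as I read it, amounts to something like the sketch below. This is not their implementation; `survivors`, `extra_inputs` and `run_program` (an assumed helper that executes a candidate and returns its stdout, e.g. via subprocess as in the earlier filtering sketch) are all invented for illustration:

    # Group surviving candidates by the outputs they produce on a set of
    # generated inputs, then submit one representative from each of the
    # largest clusters, up to the 10-submission budget.
    from collections import defaultdict

    def cluster_and_pick(survivors, extra_inputs, run_program, limit=10):
        clusters = defaultdict(list)
        for source in survivors:
            behaviour = tuple(run_program(source, x) for x in extra_inputs)
            clusters[behaviour].append(source)
        # Identical behaviour on many inputs suggests semantically equivalent
        # programs, so take one representative per cluster, largest first.
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [group[0] for group in ranked[:limit]]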
>> Since you're claiming this feat by Alphacode is comparable in difficulty to writing bubblesort (which you could write in 5 minutes), it shouldn't take you a lot of effort to produce something comparable.
What I meant was that the way they announced AlphaCode is like claiming that bubblesort is a novel approach to sorting lists. Not that the effort needed to create their system is comparable to bubblesort. I think if you read my comment again more carefully you will find that this is the first interpretation that comes to mind. Otherwise, I apologise if my comment was unclear.
Aren't you missing the point that even though the success percentage is low, it is still the same as the estimated average human performance? So it is impressive that the overall system (however clunky it is) is indeed able to match human performance. If you don't find this impressive, do you have any other example of a system that exceeds this performance?
To clarify, the low percentage of 25% correct solutions is on the APPS dataset, not against human coders. See Table 10 (page 21 of the PDF) in the arXiv paper if you are unsure about the difference; the caption is quoted further down.
Evaluation against the average competitor on Codeforces is not the "estimated average human performance"; it's only the average of the coders on Codeforces, who are an unknown proportion of all human coders with an unknowable level of coding ability. So evaluating against that is actually a pretty meaningless metric.
The benchmarking against APPS is much more meaningful but the results are pretty poor and so they are omitted from the article above.
So, no. I'm not missing the point. Rather, the article above is eliding the point: which is that on the one meaningful evaluation they attempted, their system sucks.
Table 10 | n@k results on APPS. If there is no filtering, then n = k and the
metric is pass@k. Finetuned GPT-Neo numbers reported from Hendrycks et al.
(2021), Codex numbers from Chen et al. (2021). We used a time limit of 3
seconds per test to match Codex 12B, and report average numbers over 3 different
fine-tuning runs for AlphaCode.
Edit 2: And now that I posted this, I note that the 25% solutions are from Codex. AlphaCode's best result was 20%.
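For reference, the pass@k in that caption is the fraction of problems where at least one of k sampled programs passes the hidden tests. Chen et al. (2021) estimate it from n samples per problem, of which c pass, roughly like this (a sketch; the example numbers are made up):

    # Unbiased pass@k estimator: probability that at least one of k draws
    # (without replacement) from n samples hits one of the c correct ones.
    from math import comb

    def pass_at_k(n, c, k):
        if n - c < k:
            return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 1000 samples, 35 of them correct, k = 5:
    # pass_at_k(1000, 35, 5) is roughly 0.16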
I am rated at 2100+ so I do agree that a 1300 rating is low. But at the same time it solved https://codeforces.com/contest/1553/problem/D which is rated at 1500 and which was actually non-trivial for me already. I had one wrong submission before getting that problem correct, and I estimate that 50% of the regular competitors (and probably the vast majority of the programmers commenting in this thread right now) would not be able to solve it within 2 hours.
On the AlphaCode Attention Visualization website [1], the Accepted code shown for 1553D is a O(n^2) Python one, which is supposed to be TLE. It correctly implements a two-pointer solution, but failed to "realize" that list.pop(0) is O(n) in Python. I'm not sure how it passed.
> In AlphaCode's solution the "inner O(n) loop" is actually a memmove(), which is optimized to be insanely fast.
Again, it is not. CPython does not do these things.
The web page says, and this is corroborated in the paper,
> Solutions were selected randomly, keeping at most one correct (passes all test cases in our dataset) and one incorrect sample per problem and language. Note that since our dataset only has a limited number of test cases, passing all tests we have cannot completely rule out false positives (~4%), or solutions that are correct but inefficient (~46%).
The “54th percentile” measure did use estimated time penalties, which you can see discussed in Table 4 in the paper, but 1553D was not part of that.
Apologies, I thought you meant ‘optimized’ in a different sense, not in terms of how list.pop is implemented, as AlphaCode wasn't involved in that. You are entirely correct that list.pop uses memmove.
> Someone submitted this 1553D code to Codeforces and it passed
Ah, well, that shows the problem has a 2 second time limit, which is quite a lot of time! Not quite enough to empty a 200k element list with list.pop(0)s, but not far off; a 140k element list squeaks in under the time limit for me.
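If you want to see the effect yourself, a quick timing sketch like the one below (numbers obviously vary by machine) shows why pop(0) on a long list is the bottleneck while a deque is not:

    # Emptying a list with pop(0) is quadratic overall (each pop memmoves the
    # remaining tail), while collections.deque.popleft() is O(1) per element.
    from collections import deque
    from timeit import timeit

    N = 140_000

    def drain_list():
        xs = list(range(N))
        while xs:
            xs.pop(0)

    def drain_deque():
        xs = deque(range(N))
        while xs:
            xs.popleft()

    print("list.pop(0):     %.2fs" % timeit(drain_list, number=1))
    print("deque.popleft(): %.2fs" % timeit(drain_deque, number=1))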
The proposed O(N²) solution contains many unnecessary operations, e.g. the creation of list c or reversal of the input strings. Maybe it has been copied from a related problem? You can easily solve the task with half as many lines in O(N).
    # Greedy from the end: if the last characters match, consume one from each;
    # otherwise a's last character must be replaced by a backspace, which also
    # cancels the previously typed character, so two characters of a are
    # consumed without producing output. If b is fully matched, print YES.
    for _ in range(int(input())):
        a = list(input())
        b = list(input())
        while a and b:
            if a[-1] == b[-1]:
                a.pop()
                b.pop()
            else:
                a.pop()
                if a: a.pop()
        print("NO" if b else "YES")
I'm trying to solve this for fun, but I'm stuck! I've got a recursive definition that solves the problem by building a result string. I think it's a dynamic programming problem, but right now I can't see the shared sub-problems so :). Some real sour cherries being experienced from not getting this one!
    from collections import defaultdict
    from random import randint  # randint was used below but never imported

    def backspace(s1, s2):
        # Guard: an empty target string would crash the s2[j] lookup below.
        if not s2:
            return True
        # Count how many of each character s1 has spare relative to s2 ...
        h = defaultdict(lambda: 0)
        for x in s1:
            h[x] = h[x] + 1
        for x in s2:
            h[x] = h[x] - 1
        # ... then greedily scan s1, matching s2 in order and "deleting"
        # (decrementing) anything that doesn't match the next needed character.
        j = 0
        maxj = len(s2) - 1
        for x in s1:
            if x != s2[j]:
                h[x] -= 1
            elif j < maxj:
                j += 1
            else:
                break
        return j == maxj and all(y >= 0 for y in h.values())

    def random_backspace(s1):
        # Build a random candidate target by keeping each character with p=0.5.
        res = []
        for x in s1:
            if randint(0, 1) == 0:
                res.append(x)
        return "".join(res)

    def backspaceTest(s1):
        # Property test: backspace() should accept 100 random candidates.
        return all(backspace(s1, random_backspace(s1)) for _ in range(100))