Research recitation: A first look at rote learning in GitHub Copilot suggestions (docs.github.com)
184 points by azhenley on July 3, 2021 | 87 comments



The article is worth reading, but a good summary is at the bottom:

> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.


The answer wasn’t obvious to me. Nice solution.

It sounds like you’re a part of the Copilot team. If so, then I’m happy to see the Copilot team cares about these issues at all. I was expecting nothing but stonewalling until the conversation died out, since realistically the chance of the EFF bringing or winning a lawsuit seems small. (And who else would try?)

But when you anger the world and bring so much attention to this delicate issue of copyright in AI, you put every hobbyist at risk. Suppose the world decides that AI models need to be restricted. Now every person who wants to get into AI will need to deal with it. I’m not sure anyone else cares, but I care, because it’s the difference between someone getting into woodworking (an unrestricted hobby) vs becoming a lawyer or doctor (the maximally restrictive hobby). The closer we are to the latter, the fewer ML practitioners we’ll see in the long run. And even though the world will go along fine — it always does — it’d be a sad outcome, since the only way it could happen is if gigantic corporations were flagrantly flying in the face of the spirit of copyright, daring it to punish them.

My point is, please care about the right things. No one cared about language filters on ML models outside of a select vocal group, yet look how deeply OpenAI took those concerns to heart. Everybody cares whether their personal or professional work is being ripped off by an overfitted AI model, and it wasn’t obvious that GitHub or OpenAI gave it more than a passing thought.

Backlinking to the training set should help. But it’s also going to catapult the concern of “holy moly, this code is GPL licensed!” to the front and center of anyone who works in corporate settings. Gamedev is particularly insular when it comes to GPL, and I can just imagine the conversations at various studios. “This thing might spit out GPL? We can’t use this.”

My point is, when you launch that new feature to address people’s concerns, please ensure it’s working. Exact string matches against the training set won’t be enough; you can’t hide behind “well, it’s slightly different, so it’s not really the same thing.” If it’s substantially similar, it needs to be cited. And that’s a much tougher problem than merely building an index of matching code fragments.
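To make “substantially similar” concrete: even the cheapest fuzzy match has to go beyond string equality. Here’s a toy sketch of the kind of thing I mean (token shingles plus Jaccard overlap; the names are mine, and I have no idea what GitHub’s actual prefiltering looks like):

  import re

  def shingles(code, k=8):
      # Tokenize crudely and collect every window of k consecutive tokens.
      toks = re.findall(r"\w+|[^\w\s]", code)
      return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

  def similarity(a, b):
      # Jaccard overlap between shingle sets: 1.0 for identical snippets,
      # degrading gradually with edits instead of dropping straight to 0.
      sa, sb = shingles(a), shingles(b)
      return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

Something like this catches lightly edited copies that an exact index misses; tuning the threshold so it neither drowns you in false positives nor waves through renamed code is exactly the hard part.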

If you launch it, and it doesn’t work, it’s going to stoke the flames. Careful not to roast.


FYI, the post you replied to is entirely a quote from the article (even though the formatting makes it appear that only the second paragraph is a quote). So the poster is likely not working on Copilot.


Haha. Thank you. Well, I just hope Copilot starts caring about people’s concerns.

It’s kind of strange that no one from Copilot has said anything on HN. I wonder who has the authority to discuss it, if anyone. Usually bad PR is accompanied by “So-and-so from X here! We hear you…”

They’re probably going to have a much harder time launching this feature than they imply in the article. It’s hard even for database companies that specialize in full text search.


The original thread about Copilot contained a post from a dev at GitHub. He only answered a couple of softball questions, though.


I think this article is an indication that GitHub is taking this seriously.


It's an indication that they want the appearance of taking it seriously.

I think this episode will tell us definitively how much of GitHub is left, and how much Microsoft has infected their culture.


Yep! I don't work there.

I can't fix it since I can't edit my comment anymore. Next time I'll be more clear with any multi-paragraph quotes!


QED, I guess. When you can't tell what's a quote and what's original, it's another manifestation of the same social problem.


> And even though the world will go along fine — it always does

It doesn't at all. I know a lot of people who have serious medical problems but are unable to afford the costs of the medical system and just suffer untreated. Perhaps in some countries there are good insurance plans, but in a lot of countries there are not.


I think the problem you highlight is about intellectual property, and not AI, nor even AI+intellectual property.

Woodworking is as restrictive as other hobbies. Just because the victims of your infringement are unlikely to find out about it doesn't mean you don't need to comply with their licenses.


> (And who else would try?)

Apache foundation?


I don't see your analogy here. You want ML to be unregulated like woodworking, as opposed to medicine. That's actually not a good analogy at all, since even woodworking in service of people is quite regulated via building codes; you're just free to practice it as a hobby.

Medicine and law are fields where practicing as a hobby is much harder without using other people as guinea pigs rather than yourself, so they're regulated in any sane country. So it seems regulation follows fields that can have a significant impact on regular people's lives, and depends more on the nature of the field itself than on some bad actors. Of course, bad actors are often what prompts the regulation, but it's not exactly a great argument to say we don't need speed limits just because no one has ever exceeded them.

Does AI need regulation? Probably. If Copilot accelerates that process, then good. But it's also outrageous how much more this community cares about the non-violation of perceived open-source-code freedom than about all the other evil crap AI is used for (like in literal concentration camps). The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here. Presumably because you weren't the customer (or so you thought). When women were mailed material telling them they were pregnant even before they themselves knew, no one batted an eye here. But woe betide someone who steals code you _already released for free_; now that's a step too far! Zuckerberg was right: y'all care more about the dead squirrel in your yard than a genocide across the world.


> The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here.

This is not true. It's trivial to find probably two orders of magnitude more complaints about those larger companies on this site than the two or three days of Github bashing. It's also trivial to find hundreds of posts in threads saying "Everybody starts screaming when [Google/Apple/Facebook/the Government] does something like this, but people just let [Google/Apple/Facebook/the Government] get away with doing far worse."

When you do this, what you're doing is trying to shut down debate.


> When women were mailed material telling them they were pregnant even before they themselves knew, no one batted an eye here

This example is often quoted, but it wasn’t AI. I believe it was Target, and it was a simple correlation based on purchase history.

Moreover, the issue was her parents finding out before she herself told them. The result would have been the same if she had set up a baby registry at Target and the company had assumed the news wasn’t a secret.

https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...


As a nitpick, it is AI; it's just a bit different from the current crop of what most people think of as AI. Expert systems and evolutionary programming are also on the same spectrum as ML, as steps along the path to what we think true AI will be (even if not necessarily a straight path).

Noticing statistical correlations and acting on them automatically in a recommendation engine is definitely a type of AI.


Suppose you are a teacher for a group of programming students. You ask them to implement a binary tree in C. The first student comes up with a verbatim copy of the code on Rosetta Code, the second with the same code with some variable names changed, and the third with indentation changed from tabs to spaces. Which of them were cheating?


I think this hypothetical might miss the point of the comment it's replying to. In my mind, cheating requires intention. Now, it would be hard for the student to prove that any of these situations were unintentional, but if, for example, the student studied Rosetta Code for the exam, it's very possible the student just memorized the code and didn't reference it during the test. Is that still cheating?


Cheating does not require intent. That is, the university only has to prove that the student's work is not original to conclude that plagiarism, and therefore cheating, has occurred. Memorization is usually not prohibited in exams.


"I know when I’m quoting."

or so you say <g>


The analysis seems to depend on sequences where the same exact words appear X times in the same order. If my understanding of how this works is right, the model has the ability to globally change symbol names based on the prompt, and probably to do other things that make a literal match less likely even though the differences are trivial: symbol names swapped, equivalent operators used (+=1 vs ++, etc.), order swapped where it doesn't matter, and so on.

Of course, I'm just speculating since I don't have access to the product, but I have seen GPT-3 output that is verbatim plus some synonym swapping.
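For what it's worth, the rename problem at least has cheap partial fixes. A toy sketch in Python (purely illustrative; I have no idea what their detector actually does) that canonicalizes identifiers before comparing:

  import io
  import keyword
  import tokenize

  def normalize(code):
      # Replace every non-keyword identifier with a placeholder, so two
      # snippets that differ only in variable names compare as equal.
      out = []
      for tok in tokenize.generate_tokens(io.StringIO(code).readline):
          if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
              out.append("<id>")
          elif tok.type in (tokenize.NAME, tokenize.OP,
                            tokenize.NUMBER, tokenize.STRING):
              out.append(tok.string)
      return " ".join(out)

  print(normalize("total += x * y") == normalize("acc += a * b"))  # True

Operator equivalences and harmless reorderings are harder to normalize away, which is rather the point: a detector that only counts verbatim matches will undercount.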


> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

I'm afraid I don't think this actually comes even close to resolving the legal implications of recitation.

If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.

> 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

> 2b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

Copilot is distributing the GPL source code (through recitation); it is, at least in part, derived from the GPL code (through learning); and it most definitely "contains" the GPL code.


It's not clear why it would need to be if it runs as an online service. I would say this is analogous to a search engine index. Google, for example, likely has lots of GPL code in its index in some transformed form, and yet there is little dispute that this is legal.


> I would say this is analogous to a search engine index. Google, for example, likely has lots of GPL code in its index in some transformed form, and yet there is little dispute that this is legal.

I'd disagree with the comparison, pretty vehemently. Copilot can generate new code. It isn't just some storage mechanism.

It can create derived code, Google Search can't.


Interesting, so your argument is not about parroting but specifically about the novel code Copilot generates. My sense is that it would be fine for all licenses except the AGPL, as this is an online service and is not "distributed" per se. That is, if you don't buy the argument that machine learning is a transformative work so it doesn't matter what the license is, which is the position of current US case law.


No, my argument is that Copilot is specifically _not_ a storage mechanism, but is producing verbatim GPL'd code, meaning that as a piece of software, it is not exempt from the GPL.


I think that means that if you memorize some GPL code, all code you write needs to be GPL'd, since you contain GPL code.


If the recitation of GPL code required Copilot to be GPL licensed, then that would be incompatible with other licenses. If the copyright is held by someone else, you're simply not the one who can ultimately decide the terms under which it is redistributed.


> If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.

I'm no GPL expert, but if you could trivially swap in a model trained on a different corpus, I think that's enough separation between the Copilot code and the model data that GPLing the former is not necessary. IIRC there is an "at arm's length" criterion that typically applies in cases like this.


Zip can ‘recite’ GPL code by unzipping a source code archive. The fact that you can swap in a different zip file ‘model’ that was ‘trained’ on different data doesn’t mean the first zip file isn’t GPL’d.


If a non-GPL'd zip implementation is used to unzip an archive of GPL'd code, does that mean the zip implementation is in violation of the GPL?

That's the point of my comment that you replied to, and it is a response to a specific claim (which I quoted) in its parent. I have not and will not make the (nonsensical) claim that you seem to be suggesting I've made--that the ability to change out the model somehow negates the licenses of code used to train a different version of the model.


Ah, I think I see the disconnect. In the post you replied to, they said “copilot itself” (ie. the copilot model plus code) should be GPL’d. I missed that your reply was about whether the code specifically needs to be GPL’d. I agree with you there, but it’s also a bit of a tangent from the original point (which was that if copilot can regurgitate GPL code then it’s necessarily a derivative work.)


> I'm afraid I don't think this actually comes even close to resolving the legal implications of recitation. If Copilot is reciting pieces of GPL code (which we know it can), then not only does it need to point out where it has grabbed that code from, Copilot itself is (probably) required to be GPL-licensed.

I don't follow. If the suggested code is GPLed, it's your decision to include it in your code or not. If you accept the GPLed code into your non-GPLed code base, you violated the GPL. As a friend of mine said years ago about situations like this, "Saying 'I was just following the algorithm' is not a defense."

Now let’s ask: is Copilot a violation of the GPL? I’m going to assume that its codebase is not derived from GPL code. I have nothing to prove this, but most code is not, and Microsoft, GitHub, and OpenAI are reputable organizations, so assuming good faith here seems fair.

Did Copilot train on GPLed code? Absolutely. I don’t think anyone has ever suggested otherwise.

Does processing the code count as an integration? I’d say no. It certainly isn’t part of the executable code base, even in binary form, which is what the GPL was targeting. Even if it were, it wouldn’t be a GPL violation, since the Copilot binary isn’t being distributed, but it would be an AGPL violation. I don’t know how popular the AGPL is, but let’s assume that at least something from some AGPLed file exists inside Copilot. Again that doesn’t matter, because the code isn’t actually being executed.

So is it “distributing” the code? Sure, but that’s not a violation. If you make a binary, you have to distribute the code, but the opposite isn’t true. Anyway, just having a piece of GPL source code in a database isn’t a violation, and never has been. You might as well be saying that because Google’s search index can return entries from the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.


> It certainly isn’t part of the executable code base, even in binary form, which is what the GPL was targeting.

The GPL also targets source forms, such as what is being produced verbatim. (See Clause 1 of the GPL.)

> Again that doesn’t matter, because the code isn’t actually being executed.

That's not a requirement of the GPL or AGPL. It's irrelevant.

> So is it “distributing” the code? Sure, but that’s not a violation.

It is, without the license.

> Anyway, just having a piece of GPL source code in a database isn’t a violation, and never has been. You might as well be saying that because Google’s search index can return entries from the Linux kernel, all of Google is in violation of the GPL. Not even RMS would take that extreme view.

Storage mechanisms are not the problem here. The GPL source code is not in some database, in this case. A search index is irrelevant, because that is just a storage mechanism.

However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution) and generates derived code (transformation).

In both these cases it doesn't attribute, so you can say with certainty that the Copilot software contains the source code, and may create derivative works, all without attributing the license.

It is the fact it contains the source code, reciting it verbatim, that makes Copilot probably need to be GPL-licensed itself. As it is not a storage mechanism.

It is that it distributes derived works without attribution that puts the end-user's codebase at risk of violating the GPL.


> However, Copilot generates verbatim code, and it generates novel code. That is, it both contains the plain text of the original (recitation and redistribution) and generates derived code (transformation).

So are you saying that because the language model was trained on GPL code, even though it spits out novel code, that code is derived?

That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.

A clearer-cut case is including code it suggested verbatim. That would be a GPL violation if it’s included in someone’s codebase, but that’s not what you seem to be arguing. You seem to be arguing that Copilot is in violation simply for suggesting the code.

This means you’re asserting that storing the code in a language model is somehow different from storing it in a database, but you haven’t told me why that is.

Databases have a query execution system and a database file. They are separate pieces. The query executor can work on any database file, and swapping out the database file will give different results, even though the execution code is the same.

This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.

The storage formats are different, but that doesn’t matter. The data and the code are separate. Given this information, why — and be specific — is a language model not like a database?


> That seems like a pretty expansive view. I’ve read some GPL code in my life, and I’m sure it has influenced me. Does that make all my code “derived”? I wouldn’t say that. To truly be derived it needs to be a nontrivial amount, otherwise every time you type “i++;” you’re in violation. This is hard to prove.

You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)

However, you can absolutely face civil charges if you reproduce too-similar code for a competitor, after absorbing the technical architecture at another workplace.

> This is exactly the same case for language generators. You have a language model, and a piece of code that makes predictions based on the given text and the language model. Swap out the language model, you get different results.

Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.

> Given this information, why — and be specific — is a language model not like a database?

In the eyes of the law, and this is very specific, the model is marketed as part of the software, and so is part of the software. The underlying design architecture is utterly irrelevant, because it is presented as a package deal of "GitHub Copilot".


> You're not a piece of software, so the areas of copyright law that are applicable are completely different. (And yes, copyright does acknowledge a minimal amount required to be copyrightable - but that minimal amount may sometimes be argued to be a single line.)

Putting aside the philosophical aspects of this statement, you proved my point. I said that the person ultimately held liable for violating a license is not the tool, but the person choosing to integrate the changes suggested by the tool. But now somehow you expect me to believe that the person who built an automaton, but is not directing the automaton, and certainly doesn't have final say in whether or not to incorporate the automaton's suggestions, is legally culpable, because they're being held to a stricter standard? If that were the legal standard for any tool, then literally every manufacturer of every tool would be held liable for any and all misuse. Obviously, this is not the case.

> Legally speaking, Copilot isn't advertised with multiple available language models. It isn't presented that way, so it won't be treated that way. It will be treated as a singular piece of software.

Actually speaking, you're not a lawyer, and this is an INCREDIBLY controversial statement that doesn't really stand up to much scrutiny, since there is a bright line separating the two.

Even if GitHub were ruled against (and they won't be), case law is filled with examples where the injunctive relief is limited to the claims presented (in this case, source related to a specific work) rather than the entire system, including the playback device and the recording.


Takeaways in summary:

- Copilot does sometimes rote-copy in nontrivial situations.

- Mostly this happens when there isn't much context to go on.

- Provided an empty file, it proceeds to recommend writing the GPL.

- They will add a "recitation detector" to Copilot to indicate non-novel recommendations.

By the standards of corp-speak this is pretty good: they admit there is a problem and they intend to do something tractable to prevent it.

This entire Copilot situation is far enough outside my personal "mental ethics model" that I'm abstaining from taking a stance until I have had a lot more time to think and learn about it.


They rather misdirect from what the problem actually is. Verbatim or quasi-verbatim recitations are not the only things copyright protects against.


It should be noted that this kind of behavior is entirely expected from a GPT-style self-supervised sequence model. Rote memorization for this kind of model is indicative of correct training, not overfitting. The underlying training objective of these models ideally results in a representation of the training data which allows complete samples to be extracted by using partial samples as keys. Actual overfitting in this kind of model requires absurd parameter counts. See https://tilde.town/~fessus/reward_is_unnecessary.pdf
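A degenerate illustration of that "partial samples as keys" point: even a toy bigram table behaves this way. This is nothing like GPT's architecture, just the lookup behavior reduced to its simplest form:

  from collections import defaultdict

  TRAINING = "the quick brown fox jumps over the lazy dog".split()

  # Toy bigram "model": record which token followed each token in training.
  model = defaultdict(list)
  for a, b in zip(TRAINING, TRAINING[1:]):
      model[a].append(b)

  def complete(prefix, n=5):
      # Greedy continuation: the prefix acts as a key into the training data.
      out = prefix.split()
      while len(out) < n + 1 and model.get(out[-1]):
          out.append(model[out[-1]][0])
      return " ".join(out)

  print(complete("quick"))  # "quick brown fox jumps over the" -- verbatim

A real transformer compresses rather than stores, but when the training objective is next-token prediction, faithfully continuing a prefix it has seen is success, not failure.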


I continue to be surprised GitHub shows examples that wouldn't compile/run correctly. For example, the Wikipedia-scraping example, which the author claims is also the intuitive way to solve the problem, assigns each row to the global variable cols instead of appending. Further, the following if statement appears to be mis-indented.


> I continue to be surprised GitHub shows examples that wouldn't compile/run correctly.

Why is that surprising to you? Copilot doesn't actually know how to code; it just generates symbols that match learned patterns.

Sometimes these generated symbols don't represent valid code, and since Copilot doesn't actually perform filtering based on syntax checks or a JIT, these results end up as suggestions.

This is actually a point where future versions could greatly improve in usefulness, e.g. by using compiler infrastructure to verify and filter generated results.

This includes auto-formatting and even result scoring by code metrics (conciseness, complexity, ...). Plenty of room for improvement even without touching the underlying model.
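For Python, the cheapest version of that filter is a few lines around compile(). A toy sketch (generate_suggestions is a made-up stand-in for the model's raw sampled completions, not any real API):

  def generate_suggestions(prompt):
      # Hypothetical stand-in for the model's raw sampled completions.
      return ["for i in range(10):\n    print(i)",   # parses fine
              "for i in range(10)\n    print(i)"]    # missing colon

  def parses(snippet):
      # Keep only suggestions that at least get past the parser.
      try:
          compile(snippet, "<suggestion>", "exec")
          return True
      except SyntaxError:
          return False

  valid = [s for s in generate_suggestions("...") if parses(s)]
  print(len(valid))  # 1 -- the syntactically broken suggestion is dropped

Syntax is only the first bar, of course; the fetch_tweets example elsewhere in this thread parses fine and is still wrong.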


I'm not surprised the actual results are flawed. I'm surprised the handful of hand-picked examples Github calls out are flawed. Usually handpicked examples of algorithms are the best-case performance.


Maybe it’s just a more honest representation of a typical result?

Looking at it the other way around, had they chosen _perfect_ examples, there’d be similar criticism about “cherry picking”.


Fair enough, although it makes me think that if these are the better examples, the average is much worse.


There was an incorrect example on the copilot.github.com landing page which I was surprised by. I'm guessing someone noticed it because it's been taken off the page.

It was in a file called "find_files.sh", and the command it generated was something like:

> find .... -exec ... {}\;

The problem is that there's no whitespace between the {} and \; and this is rejected by both GNU and BSD find.

It might've just been a mistake that someone made while putting the landing page together. But if it was generated, then how?? I can't imagine there are many (if any) instances of "{}\;" in the training data, since it's not a valid invocation of the find command...


Here it is, courtesy of a Wayback snapshot [1]:

  #!/bin/bash
  # List all python source files which are more than 1KB and contain the word "copilot".
  find . \
    -name "\*.py" \
    -size +1000 \
    -exec grep -n copilot {}\;
[1] https://web.archive.org/web/20210701061847if_/https://copilo...


The missing space isn't the only thing wrong there.

- This lists the lines containing copilot instead of the files. You want `grep -l` to list the files.

- `-size +1000` finds files with over 1000 blocks. `-size +1000c` gets you files over 1KB.

- `-name "\*.py"` only finds files literally named `*.py`. To find files that match the glob `*.py`, you would single-quote it like `-name '*.py'`.


> Why is that surprising to you? Copilot doesn't actually know how to code; it just generates symbols that match learned patterns.

Idea: make Copilot test-driven. You provide a function invocation and an example output, and it searches for code with similar test cases, then mutates the code until the test case passes within its own runtime (it may need to mock certain things to make tests pass there).
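Roughly what I imagine the selection loop looking like, minus the mutation and sandboxing parts (toy sketch; every name here is invented):

  def passes_example(candidate_src, call, expected):
      # Define the candidate in a scratch namespace and check that it
      # reproduces the user's example. Real sandboxing is hand-waved.
      ns = {}
      try:
          exec(candidate_src, ns)
          return eval(call, ns) == expected
      except Exception:
          return False

  candidates = [
      "def double(x): return x + x",
      "def double(x): return x * x",  # fails the example below
  ]
  good = [c for c in candidates if passes_example(c, "double(3)", 6)]
  print(good)  # only the first candidate survives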


It's hardly surprising to see a so-called 'AI system' (if you can even call it that) suggesting and generating broken code. A single typo is enough to confuse it into giving you a wrong implementation that won't compile. [0]

The examples all look cherry-picked for presentation, but those who had early access to Copilot didn't even stress-test it, yet it still generated garbage. Only about 1 in 10 tries got it right. [1]

Of course, we'll hide behind the 'Technical Preview' argument, but since the core of this 'AI' (GPT-3) is a black-box system, we'll probably never know why it generates the code it does.

[0] https://twitter.com/leifg/status/1411083360756146177

[1] https://news.ycombinator.com/item?id=27676845


Sorry for the basic question, but is the code one builds on this platform saved to GitHub's servers?


Yes - to me, this is a much more obvious hazard than the copyright arguments back and forth around the training data.

There's no way any reasonable corporate legal or risk department should or would allow this plug-in to be installed on their developers' computers, for this reason alone.

I am surprised this is not being talked about more, given the massive conversation around the regurgitation / rote learning arguments.


Don't know about saved, but definitely sent.

I guess one risk, after learning that GitHub regards all source code, regardless of its license, as fair game for training Copilot, is that you probably can't know for sure that your new code is not being used to teach the model more.


It's clear the author of this article had access to the code that triggered the Copilot suggestions. They also say this was from an internal trial of Copilot, so it might be that these trial users were told their code could be seen by their coworkers.


yes


Anything pasted or typed into that Copilot editor is sent to GitHub as 'telemetry'.

> In order to generate suggestions, GitHub Copilot transmits part of the file you are editing to the service.

So yes.


That's not telemetry, it's the prompt to the model to generate the rest of the text. I would assume it's not saved.


I wouldn't think to call that telemetry either, but it is addressed in this doc about telemetry: https://docs.github.com/en/github/copilot/telemetry-terms#ad...


(in re: "pieces of code where there's really only one way to do it" appearing verbatim)

> This doesn’t fit my idea of a quote either.

That seems kinda cavalier, given that it took 10 years of intensive litigation and an appeal all the way to the Supreme Court to establish that copying things where there's LITERALLY no other way to do it (API declarations) wasn't copyright infringement:

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....


There was another way to do it: Google could have designed its own API from scratch for Android, rather than reimplementing the Java standard library.

Of course, this would have sacrificed two things: compatibility with existing Java code, and familiarity for existing Java developers. However, thanks to the typical game of telephone seen when lawyers and judges discuss technical subjects they're not deeply familiar with, the fact that Android was compatible with large bodies of existing Java code went essentially unrecognized; the courts focused solely on the fact that Android was not compatible with existing Java apps (in their entirety). Therefore the case was decided as if designing an API from scratch would have only sacrificed developer familiarity, not compatibility. Luckily, the Supreme Court decided that even developer familiarity alone is enough to justify reimplementing an API as a type of fair use. (And that is now precedent.)

On the other hand, if you're using an API rather than reimplementing it, and it's very clear that compatibility is at issue (i.e. your code would not be compatible with the library you're trying to use if you wrote it any other way), I'd say the case for fair use is much clearer.

That said, if some snippet isn't literally the only way to do something, be careful before declaring it too trivial to be copyrightable. In a different part of the same case, a trivial 9-line helper function [1] called rangeCheck, which Google had copied verbatim by mistake, was held to be an infringement.

[1] https://majadhondt.wordpress.com/2012/05/16/googles-9-lines/


There is a huge problem with them concentrating so much on verbatim recitation. That's not the criterion used for copyright... see, for example, the Abstraction-Filtration-Comparison test.


Regardless of whether or not the output of Copilot is sufficiently "crow-like," the reality is that Copilot would not exist without its corpus of largely GPL-licensed code. Training data has value and many companies go to great lengths to hoard data for this exact reason. If Copilot is then offered for a price, GitHub is profiting from the work of thousands who never intended their code to be used in this way.


The flip side is that if, from a policy perspective, we decide that the licensing doesn't matter, then it's extremely likely that anyone else can make a Copilot competitor.

If we decide that licensing is required, it's likely that no one except github (or a few other huge players) could ever make something like this just due to access to the code.

(Github would make it a requirement of the TOS that your code is licensed to allow copilot to use it, and require users to indemnify github from third party legal action arising from github using code you posted in copilot.)

The permissive handling levels the playing field.


No: if GitHub does that, the entire open source community will move to GitLab, much like the Freenode-to-Libera exodus.


Already their TOS gives them broad permissions; it just stops short of indemnification. Their TOS arguably already turns people posting projects containing third-party copylefted code into license violators, yet people are still using them.

So I wouldn't be so sure. But moreover, I think my larger point remains: the entire world being fragmented into separate licensing is a huge moat.


Uh, how about a disclaimer that this analysis is made /by GitHub/?


It's on https://docs.github.com. Of course it's by GitHub.


I don't think that's necessarily obvious. Most links submitted to HN from the github.com domain aren't authored by GitHub.


That's understandable. I thought the GP was saying that the article should contain the disclaimer.


The ‘docs’ subdomain should make authorship pretty clear…


No, it doesn't. I thought at first that GitHub had launched a docs-sharing product.


This whole copyright thing reminds me of way back when Google bought YouTube. I remember at the time, my boss (and many others) was certain Google was going to get itself into a shitstorm by hosting copyrighted material. Well, it turns out a company with as many lawyers as Google had/has knew plenty about copyright law and put automated systems in place to keep itself out of legal trouble.

To think that Microsoft isn't thinking deeply about license/copyright issues for Copilot and working on solutions seems pretty naive.


Did I understand correctly? There were 473 "useful/interesting" suggestions out of 453k?


I think it’s 473 suggestions that had interesting quotes of code from the training data


Does it substitute variables correctly? E.g., if I define max(a,b) or max(x,y), does it complete the definition with the right variable names?


Generally, yes. It’s not guaranteed to do that correctly, but I’ve not seen it get variable names wrong so far.


There's an example called "fetch_tweets.py" at the bottom of the page on https://copilot.github.com/ that gets it wrong:

  def fetch_tweets_from_user(user_name):
      # deleted some lines here...
      # fetch tweets
      tweets = api.user_timeline(screen_name=user, count=200, include_rts=False)

The screen_name=user there isn't right.

It's a nit, but it is interesting how many of the hand-picked examples on that page aren't right, since they were presumably chosen to show the product off.


> It’s not guaranteed to do that correctly

Which is odd, considering they could run this as a beam search with the checking part of a compiler in the loop.


Which would be extremely slow.


Not necessarily. If you emit a batch of suggestions for one line and filter them with syntax checks, you can already present the survivors to the user while working on the next iterations.

The real difficulty is getting good checking on partial source code; some languages have better tooling for that than others.
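Python even ships a small piece of this: codeop, the module behind the interactive console, distinguishes "unfinished" from "invalid", which is exactly the prune/keep decision a per-line beam needs. A toy illustration:

  import codeop

  def classify(partial):
      # compile_command returns a code object for a complete statement,
      # None for input that is merely unfinished, and raises on a real
      # syntax error -- the prune/keep distinction a beam search needs.
      try:
          code = codeop.compile_command(partial)
      except SyntaxError:
          return "invalid"      # prune this beam immediately
      return "incomplete" if code is None else "complete"

  print(classify("def f():"))  # incomplete -- keep extending
  print(classify("x = )"))     # invalid -- prune
  print(classify("x = 1"))     # complete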


Now the question is whether Microsoft has a more advanced tool that it trains on private repositories.

How can you check whether Microsoft has infringed your private code? For example, if they used an internal tool and part of your code is now in Windows 11, how would you check that?


The question is more: did Microsoft train Copilot on their own code repositories? If there are no copyright concerns, why not train it on the Windows code base? It should be a treasure trove of good code to train with.


That's a great point. If they didn't, it's clear what their actual risk assessment is.


Surprising how extremely rarely Copilot quotes text verbatim.


TLDR:

Microsoft: We have seen Antitrust (2001) https://www.imdb.com/title/tt0218817/ and were finally able to implement it after 20 years, this time without murder and conspiracy, all in the open.


At this point it seems Copilot is a hilarious failure, which we can learn from. I hope it doesn’t further taint the profession as a bunch of amateurish cut-and-pasters, though.


Well, I guess it means that we must repeatedly verify wild claims and contraptions made by others who deliberately hype and raise our expectations, only to ultimately disappoint us later.

I'm also going to call it now: the Copilot fanatics here will scream on HN about it being a paid service, when the day comes.



