GitHub scraped your code. And they plan to charge you (twitter.com/bphogan)
277 points by tower-shield on July 3, 2021 | hide | past | favorite | 198 comments



If you open-sourced code and allowed it to be used for commercial purposes, I don't see the point of being pissy about GitHub using it. I'm saying this as someone who's written quite a lot of MIT code.

(And charging for a product which adds value to your developer experience and needs money to be run is not a bad thing)


> If you open-sourced code and allowed it to be used for commercial purposes

Uploading it to Github does not transfer ownership or imply allowances for any use. If you upload it without a license it is a copyright violation to copy the code. Even with an MIT license it is a copyright violation to copy the code without attribution.

> I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.

People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.


> People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.

it's worse than that: it's Microsoft trying to completely undermine the concept of open source

meanwhile: they're unaffected as their high-value proprietary code remains private and doesn't train the model


Finally the angle for Microsoft buying GitHub becomes clear.


There is no reason Microsoft couldn't have done this without buying GitHub. Hell, they could have trained the model on GitLab or Bitbucket public repos.

They need developer mindshare, and they lost a lot of developers in the 2000s. Buying GitHub (and being involved in hundreds of projects) brings the image of Microsoft as a developer company back to developers. It works as an advertising and branding platform for novel technology like Copilot.


It was a bit of a tongue-in-cheek comment. It’s very unlikely they were even planning this when they bought GitHub. While you’re completely right about their rationale, a lot of people at the time were concerned about the acquisition, and it now feels a bit short-sighted to burn all the goodwill they have generated with an act like this, which undermines open source.


I don’t even think most people care about open source code. What if they’re reading private repos as well? The only option to escape this is to leave github. Which I’m kinda considering even with all the work involved.


I think the rationale for buying it was that they were the single largest user of it, and were concerned about funding stability.

Does anyone have the numbers from before they bought GitHub? I saw them once; I don't remember the values, just being shocked at what percentage of the total GitHub code base was Microsoft's. I had no idea they were using it at all.


And once again RMS was proven right.


How would the GPL have prevented this, exactly?

More explicitly, how would a license that gives everyone the right to copy, modify and redistribute source code for any purpose without compensation or attribution have prevented Github from building a tool that copies, modifies and redistributes source code without compensation or attribution?


Because if you use GPL code, you must release your use/changes under the GPL. That's why the GPL is often called infectious. If copilot spits out GPL code verbatim, the user of the code must respect the license.


You don't know how the GPL works. The GPL explicitly requires attribution, and it places strict requirements on the licensing of derivative works. I don't understand what you thought you were saying.


And if they provide attribution?


AFAIK GitHub Inc. is still an independent company, operating separately from Microsoft.


> you grant each User of GitHub a nonexclusive, worldwide license to […] reproduce Your Content […] as permitted through GitHub's functionality

If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.

This ToS change happened in 2017, and I actually had to get approval for the changed ToS from all contributors to my projects: https://github.com/justjanne/QuasselDroid-ng/issues/5

What GitHub’s doing is shady, but it’s been obvious it was going to happen for years.


Yeah, and https://docs.github.com/en/github/site-policy/github-terms-o... has this language:

This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. [...]

So, it's not clear if they could charge for Copilot, given that last line, but indexing, analysis, and sharing with other users are all included.


Reproducing copylefted code is neither indexing nor analysis.

To be clear, a piece of code that creates a copy of some GPL code is not a problem by itself. However, it's misrepresented as something the tool "generated" rather than code belonging to an existing human actor, without appropriate attribution and licensing info.


I think we're assuming that they used GPT-3 (or whatever) on code that was stored on github.com in accordance with their TOS, which means (if I understand correctly, as a non-lawyer) that they didn't share it under the terms of the GPL, since they can't; but if whatever licenses the code already had attached were not sufficient, that code was also licensed to github for the purposes of indexing, analysis, and sharing with other github service users.

Does this mean that if you picked up GPL code from gnu.org and put it in your public repo, you're in violation of the GPL for illegitimately licensing that code to github under a non-GPL-compatible license? Maybe!


> that code was also licensed to github for the purposes of indexing, analysis, and sharing with other github service users

Even if said code was licensed to Github for sharing with other users (e.g. displaying it when another user browses the page), that doesn't mean that the same permission is given to all other users under the same conditions. Or allows them extra liberties, like using said code for purposes other than "sharing with other github service users".


No argument, there! I'm pretty concerned about the consequences of using Copilot, legally. It all seems to turn on whether a judge or judges believe that a long sequence can be produced by "learning" rather than "copying". ¯\_(ツ)_/¯


It's irrelevant. There is no guarantee that the people uploading code even have the ability to grant github such a licence if they aren't the copyright holders.


> If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.

You can upload code to GitHub without the ability to grant such a license, or is github now only for primary copyright holders?


> or is github now only for primary copyright holders?

That is how I interpreted the 2017 ToS change: You can only upload code to GitHub if you’re the primary copyright holder, or the code was already on GitHub to begin with (which is only an issue if the primary repository of some project is another git host, e.g. gitlab).


Well, a public repo can be cloned, and needs to be replicated in CDNs etc.

Isn't that reproduction?


Good point; if the code does not have a specified license then standard copyright terms apply, and inclusion for commercial use should be verboten. If it's actually open source without commercial restrictions, though, I don't see an ethical difference between using the code directly and using it in a meta-analysis driving ML for enhanced code completion.


> If it's actually open source without commercial restrictions, though, I don't see an ethical difference between using the code directly and using it in a meta-analysis driving ML for enhanced code completion.

Most Open Source code comes with a requirement to carry over the license notice, which Copilot does not do. Additionally, ethics dictate you attribute the source when copying directly, something Copilot also doesn't do.


Are they actually using the source to make a derivative/fork though? If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area. Though I would still agree that the right thing to do would be to have an attribution area, even if it was thousands of entries long. Whether technically required by the license or not, the spirit of these licenses I think would come down on pushing for attribution regardless of the nature of the re-use.


> If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area.

It's already been demonstrated that Copilot - like all tools in the GPT family - frequently outputs large chunks of its training dataset verbatim. It's not hard to trigger this behavior, even unintentionally. To me, this is much closer to "reusing". But I'm not a lawyer.

It's also worth remembering that there are two parties potentially open to liability here - GitHub, for the way the code was used to build Copilot, and the user, who may be unwittingly including licensed code in their codebase. Given the well-known behavior of the GPT family I mentioned above, it might be hard to argue that Copilot "just chanced" into generating code that's identical to existing, non-public-domain code.


> frequently output large chunks of their training dataset verbatim

Ah, that then is extremely problematic.

I really like the idea of Copilot to speed development-- basically code completion taken to an extreme-- but this seems like a very bad way to go about it.


Don't forget the usage restrictions, as specified in each individual license.


And of course the GPL would attach where applicable...


> Even with an MIT license it is a copyright violation to copy the code without attribution

Probably a copyright violation. There are surely circumstances in which copying a small portion would either fall under fair use, or for other reasons not constitute a violation. The question then is whether or not Copilot is causing a violation. I don't think it's as clear-cut as most commenters are making out.

All in all, though, it's probably going to take a few court cases to figure out. In the meantime, I'd expect most companies to steer clear of Copilot.


Snyk did the same with Snyk Code to build their “ML driven SAST” offering.

Pretty much anyone can scrape GitHub and train their model.

What exactly the legal implications of this are has yet to be tested.

Pretty much every model is susceptible to some sort of model inversion or set inclusion attack.

By their own admission, Copilot sometimes outputs PII that was part of the training code, as well as code snippets verbatim. Even if it’s rare (IIRC around 0.1%), it’s still a huge legal liability for anyone who uses the tool, especially since it’s unclear how these inclusions are distributed and what triggers them. For example, it could be that a particular coding style, a particular usage of Copilot, or working on a specific subset of problems increases the likelihood of this occurring.

ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and then receive a GDPR deletion request, do you need to throw away and retrain your model?

I don’t think people should be angry; however, I also think this needs to be tested in court, multiple times, before it can be considered “safe to use”.

But I also don’t think that the ML model is necessarily a derivative work.

For example, if you use copyleft material to construct a CS course, someone would be hard pressed to argue that the course now needs to be released freely, let alone that anything the students write after attending the course would fall under derivative work too.


If I fed the entirety of GitHub (sans licenses) into a Java HashMap and provided an interface to query it, I very much doubt that its output would qualify as "fair use".

why is it different if a slightly more complicated data structure is used?
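To make the thought experiment concrete, here is a minimal sketch (all names hypothetical, with a Python dict standing in for the Java HashMap): "training" is literal storage of the input, and "generation" is literal retrieval of it, so every output is a verbatim copy of an input.

```python
# A hypothetical lookup-table "code generator". Its "training" memorizes
# every (prompt, code) pair as-is, and its "generation" retrieves the
# stored code verbatim, license stripped.

class LookupCodegen:
    def __init__(self):
        self.memory = {}  # prompt -> memorized source code

    def train(self, prompt, source_code):
        # Ingest a (prompt, code) pair by storing it unchanged.
        self.memory[prompt] = source_code

    def complete(self, prompt):
        # "Generate" a completion: return whatever was ingested, verbatim.
        return self.memory.get(prompt, "")


gen = LookupCodegen()
gen.train("inverse square root", "float Q_rsqrt(float number) { /* ... */ }")

# The "generated" output is byte-for-byte identical to the training input.
assert gen.complete("inverse square root") == "float Q_rsqrt(float number) { /* ... */ }"
```

Whatever one calls the intermediate data structure, this output is plainly a copy; the open question the thread is debating is at what point a more complicated structure (say, a trained transformer) stops being this and starts being "learning".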


I’m not saying it is, the courts should really decide on what exactly counts as fair use in this case.

All I’m saying is that you don’t need to be a huge corporation to do it, and that others are doing similar things as well.

I passed on Snyk Code due to similar concerns, especially since they pull examples from FOSS projects directly and even had a “fix me” option that pushes pull requests with fixes into your repo.

On ML in general, the current policy I’m working on for my org is that we do not use any pre-trained models trained on public data, and I’ve pushed the legal team to start figuring out how we should deal with these issues properly in the future.

ML currently is a Wild West; it’s too new to have been tested and defended in court, regardless of how the chips would or should fall.

As far as your specific example it would really depend on what data is actually preserved.

Since they do parrot whole code snippets, comments and all, it seems they don’t have a generalized model, at least not for every problem.

However, it’s also my personal legal opinion (IANAL) that if you can prove the model holds nothing but a generalized solution for a given problem, the code it outputs isn’t a derivative work any more than the code of a person learning from copyleft code is.

However, then there is the whole issue of “allowed use”: none of the existing licenses specify whether the code can be used to train a model, which means we probably need to update all existing licenses to include a clause that explicitly states the limitations for this use case.

For code under existing licenses, fair use needs a proper judgement.

My gut feeling is that it would count as fair use, just as using code in a course or a book would. GitHub definitely needs to make a page with attributions for that to happen, though, and make sure their model doesn’t output anything but a generalized solution.


Isn't copying allowed by their terms of service? I read something about it being needed for e.g. the forking feature. But that was years ago, when I created my account.


> Even with an MIT license it is a copyright violation to copy the code without attribution.

That was my take originally, but apparently this is not as cut and dried as you may think:

https://www.technollama.co.uk/is-githubs-copilot-potentially...


> People share code for the betterment of society, and while copyleft used to be popular early on, the software industry has been moving towards less problematic and more permissive licences. The spirit of open source is to share code, and make it possible to use it to train machine learning.

It sounds like this writer doesn't understand the point of the GPL or the distinction between free software and open source. It's also quite a crude historical perspective, given the increasing number of major projects licensed virally. Claiming that the MIT license's popularity justifies pirating GPL code makes no sense.


> the software industry has been moving towards less problematic and more permissive licences [sic]

... less "problematic" for who, exactly?

That is conveniently left out.


not only is "problematic" problematic, so is "permissive".

more permissive for whom? because the GPL is the most permissive license for users of the code: the GPL guarantees they're permitted to see the source.



All of my open source licenses require attribution, but Copilot does not give that attribution. So while my code is open source, Copilot is still violating the open source license. Just because it's open source doesn't mean there are not any terms that must be abided by.

I believe that gives me the right to be mad and to demand they fix their violations, one way or another.


If you write MIT code, you expect them not to strip your license out in derivative works. This is exactly what licenses are for, and GitHub is blatantly violating it while people applaud.


Maybe read the MIT license before you grab the pitchforks:

"The above copyright notice and this permission notice shall be included in all COPIES OR SUBSTANTIAL PORTIONS of the Software."

Reusing a snippet doesn't require reproducing the MIT license. People who publish MIT software know they're giving their code out with basically no strings attached.

However, GitHub should be careful with the GPL variety.


The problem here is that a person using this automated tool is not being given the required information in order to decide whether the code they are re-using is a "substantial portion" of the software it's taken from; in fact, they aren't even being told they're re-using existing code at all.

This does not relieve the person using the code of the responsibility to make that determination, so anyone who is re-using code shown to them by this automated tool is doing it without having fulfilled their responsibility under the license. The fact that they don't know they are doing this, because the tool is not telling them, doesn't change that.

> People who publish MIT software know they're basically giving their code out with basically no strings attached.

No, they aren't. The license has terms. Using an automated tool that doesn't tell you when you are re-using existing licensed code, or whether your re-use is within the terms of the license, doesn't mean you can just ignore the license. It means you're re-using code without knowing whether or not you're violating a license.


> Reusing a snippet doesn't require reproducing the MIT license.

In light of Google v. Oracle going as far as the Supreme Court I find your confidence in this quite astonishing.


SUBSTANTIAL PORTIONS can mean the core five lines of some key algorithm buried deep in a 1000 line wrapper library with a bunch of language wrappers.


I don't interpret "substantial" here as affecting the 'de minimis' requirement for infringement, as that would require other language.

I'm pretty sure every court will instead interpret the word "substantial" in the MIT license as referring to the concept of "substantial similarity", which must be demonstrated in any copyright infringement case - https://en.wikipedia.org/wiki/Substantial_similarity

I distribute software under the MIT. "Basically no strings" != "no strings". I expect attribution for any derived program which has substantial similarity to my software.

I regard the presence of the word 'substantial' to indicate that the license applies to even modified forms of the software. As that Wikipedia link points out:

> Under the doctrine of substantial similarity, a work can be found to infringe copyright even if the wording of text has been changed or visual or audible elements are altered


What counts as a "substantial portion"? Personally I'd say that a function is substantial, whereas one or two lines would not be.


We can argue that point, but it seems GitHub isn't even aware of how substantial its copying is.


I write a lot of one or two line functions though.


Sure, but there's still the license at play here. It's not like they trained it only on public domain/CC0 code. What happens when it verbatim outputs a significant amount of code that was originally MIT, or BSD, or GPL licensed, without the appropriate attribution? It can create unintended copyright violations and potentially open people using it up to liability.


>What happens when it verbatim outputs a significant amount of code that was originally MIT, or BSD, or GPL licensed without the appropriate attribution

You would sue.

And then Github would argue that their algorithms did not spit out the code verbatim by copying, but rather generated code that looked exactly like the other code, based on learning from millions of codebases.

And then there would be lots of lawyers.

And then a judge would have to decide.


And the judge would really not care about the "we did not copy it, we made an algorithm that created the exact code" technicality. It's their job to see through such things and consider the case at a higher level.

So the judge would look at two pages of exactly the same code and then decide whether the "not really copied" part is big enough to be considered an original work or not. If it is big enough, it is a copyright violation. Nobody cares that you used an algorithm in between: you took the original as an input and ended up with exactly the same thing as an output. Copyright violation, case closed...


But it would still potentially have used code not licensed for commercial use as a data set for a commercial product, which is problematic.

GitHub really needs to clarify which code was allowed for inclusion here. Until then, we can only speculate and enumerate potential scenarios.


I suppose it really depends on whether they spit out verbatim reproductions of code, or whether it is the equivalent of a programmer with 10 years of experience who has just seen a lot of code but isn't reproducing anything verbatim.

We shall see, by Googling some of the code it spits out.

FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.


Copilot spitting out fast inverse square root, verbatim: https://twitter.com/mitsuhiko/status/1410886329924194309

HN discussion: https://news.ycombinator.com/item?id=27710287


I have worked a bit with transformers, the model underlying GPT. They absolutely learn to copy training data, and that’s perfectly normal.

What is happening here is we’re running into exactly what modern ML is NOT capable of: deductive reasoning. It does not think “I need to query the Twitter API for some posts, then filter them. Right, the API works like this…” No. It doesn’t think at all. It is a regression machine. “This sequence begins/looks like something I have seen before, here’s the corresponding output modulo adaptations.”

ML does not self-reflect, question motives and analyse causes. It’s just a complete lie to suggest otherwise, and to call this “pair programming”? What an absolute joke. It’s a lot like Tesla calling its glorified lane keeping an autopilot.


> FWIW GPT-3 doesn't really tend to spit out verbatim reproductions of copyrighted books.

But it does spit out whole paragraphs at a time. This is easy to test by going to any of the GPT-2/3 playgrounds on-line (e.g. AI Dungeon), and playing with prompts. Very specific prompts work best, but sometimes even with a generic prompt, if you let the model continue on its own past the first output, it might just shunt itself into a path where following the most probable continuations happens to reproduce a substantial portion of some work verbatim.



Maybe they should train the ML to read the license? If the ML can understand the license, then we'll have to bow down to its superiority. However, if it did understand the license, then it would do the right thing.


So sue them and a court opinion can demonstrate where the line is and how much code can be replicated before attribution is required (and the product can be refined to ensure compliance).

Innovation should push boundaries.


They could push boundaries and publish one trained on all of Microsoft's internal source code. That would, for me, be a great demonstration that they believe the "it's fair use and not violating copyright on the training data" argument.


It's more likely they'd sue someone who used it to develop something that ate into their lunch by saying it infringed on one of their 'secret' Linux patents they sabre rattle about every now and then.


Are you equipped to fight a protracted legal battle with Microsoft? Neither is anyone else.


The product is already dead. It's not just Microsoft that would be violating the license but any company using the application and Microsoft can't shield them.


Open source code still has a license. That license may or may not require distributing the license along with the code. MIT may allow distribution without the license unless the code shared is significant, but reusing GPL3 is a no-go for commercial companies.

The Apache 2 license allows for commercial use, but has implications for the way you can enforce your software patents. It also requires distributing the license file along with your application.

Complaining that companies use the software you told the world was free to use without restriction is dumb. However, not everyone gives away their software for free without restrictions. The fact that Github isn't respecting those licenses is a much bigger problem.

The tool autocompleting some random guy's personal information because he uploaded his blog to Github is highly problematic. The idea of using permissively licensed code to train an AI is not bad, but some human with knowledge of software licenses would need to pre-select those projects.

If all code came from one of those "do whatever the fuck you want" licenses, then there wouldn't be a problem. I'd consider it to be a great product and have no issue paying a fee. There's a huge market for a Copilot product, but this iteration just.. isn't it.


> reusing GPL3 is a no-go for commercial companies

The GPL is completely compatible with commercial use. You just need to share modifications to the source with anyone you share the binary with. Many tech companies make extensive use of GPL software, and since they are not providing binaries to their end users they don't even have to share their changes to the source.

Even the AGPL, which does require you to share the source with users, still completely allows commercial use (though not compatible with as many business models).


If you include GPL3 code, your code effectively also becomes GPL3-licensed. This isn't a problem for companies producing GPL3 software, of course, but most companies keep their source code to themselves.

True, the web loophole is a way around this, assuming the GPL'd code isn't part of the compiled/minified/integrated frontend Javascript libraries that gets sent to the client (because then you have the exact same problem).

It's not that commercial use is prohibited by GPL3, it's more that most businesses do not want to risk accidentally releasing a version of their software that requires them to share all source code to that version. And even if you're producing open source software, GPL3 might still be a problem because of incompatible licenses (see ZFS for an example) if you don't own _all_ the copyright so you can dual-license your software.


I think you're missing my point. I have tons of MIT code out there, including a node.js project used by lots of companies. I don't care about people using my code for money because I open sourced it under a permissive license. So I'm not really objecting to that.

But what's bothering me about this is that it's not a small company doing this. It's a company that's got crazy amounts of cash, who has been trying to trade on a "we're nice now and we love open source" image in the last few years, now taking all the open source code and balling it up in a closed-source app they will charge us for.

I'd be fine if I got to use it for free, extend it to whatever editing platform I like through its open API, and it was a part of an open project.

But right now it looks like they'll charge, and that bugs me.


“Free and open source, assuming I approve of the usage” is a common sentiment among people who paste Apache or MIT and don’t think about the ramifications. It’s increasingly common.

I think this situation is slightly more complex but that sentiment is at the heart of a lot of pushback against things like this.


Dev: "Anyone can use my code for any purpose including commercial purposes."

$BigCorp: "I want to use Dev's code for commercial purposes, as he has explicitly granted me the right to do so."

Dev: "Wait, no not like that."

As much as I am a proponent of permissive licenses (my favorite is the wtfpl), you have to pick your license wisely especially if you're going to be picky about usage (Be it by $BigCorp, government agencies, or other companies that you might not be fond of).

If you really want "full control" over your code you have to make it proprietary.


This is why I think AGPL is a reasonable default for personal projects where the dev doesn't want to fuss over licenses or sue anyone, but would be uncomfortable with $BigCorp exploiting their work. Even though it doesn't explicitly prohibit them from using it, it tends (or tended) to have that effect.


All the people in this thread angry that GitHub is using MIT software in a way permitted by its license... depressing.

The MIT license doesn't require attribution for small snippets, only for full copies or substantial portions.


As others have already told you in this thread, "substantial portions" can mean the few lines of significant algorithm wrapped in boilerplate. So either way, Microsoft would have to prove the MIT code was not substantial.


MIT licensed code must still be distributed with a copy of the license.


Once again, Stallman (will have been) right.

Github/Microsoft is going to take your code, and then cut off your access to it. This is what the GPL was designed to fight, so they're going to try it this way instead.

Those who do not learn history yadda yadda.


How does this cut off our access to the code?


Someone's GPL'd code will "strongly inspire" someone else's proprietary (or otherwise "controlled") code down the line, and there won't be much recourse.

Remember, the GPL isn't fundamentally about "enforcing licenses," it's a tool that reverses the usual power of copyright for a higher goal of "software freedom."


Let's say your code on GitHub is not open-sourced. Do we know it wasn't used for training?


Has GitHub confirmed that it scrapes based on the license of the repository?

If so, then what about private repositories with a permissive license that haven't been made public for whatever reason?

What about projects whose dependencies have permissive licenses but whose main repo doesn't? Can GitHub just go "oops"?

I think the fact that so much confusion exists regarding their product and the possible violation of users' trust is a valid reason to be pissy.


> If you..

But we didn't.


> quite a lot of MIT code

Where is this MIT-licensed code of yours? Because it definitely is not on your github.


Frankly, I think the reason people are upset is that a tool that once revolved around sharing work with others has been bought by a giant corporation, and all of that sharing is being turned into a means of putting the people who shared out of work. Or at the very least, cutting their salaries dramatically.


How do you see this technology putting people out of work or having their salaries cut dramatically? I do not write any code that could be found and copy/pasted from somewhere online.


Downvoted. There's a difference between "commercial purpose" and "global player", and Microsoft crossed another line. One of many.


So... I can see that this ML model is generating some code exactly the same as the original dataset, which is definitely a problem. A defective model, sure. Beside that, I cannot understand why the overall idea of using open-source projects to train an ML model that generates code would ever be a problem. We human beings learn like the model: we read others' code, books, articles, design patterns... and it becomes part of us. Even private code: when you join a company, you read their codebase and methodology, and it becomes something of yours. Copyright generally does not allow you to "copy" the original, but you can still synthesize your own code by cutting, combining, and creating based on whatever you have learnt. The way an ML model works differs from a human brain, for sure, but I cannot see why that would be a problem, or why an organism should be considered superior, such that what it does is creation while an ML model is "scraping your code". What is the difference here?

And also recently we saw GPT generating articles, waifulabs generating... waifus... To be honest, I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.


The difference is that it's a judgement call when to include attribution, whom to attribute and how much, and overall whether something is too close to the original to count as a copyright or other license violation. Intelligent humans sometimes, or even often, have a hard time making this judgement call. An artificial intelligence would too, and a somewhat simple ML model (no offense) certainly does.

I'm really waiting for this to blow up from the open source license angle. Freely combining code under different licenses is a hellish undertaking on its own. But even just re-using some, say, GPL code, while staying under the same license but without proper attribution, is Forbidden with a capital F.


> A defect model, sure

More like a defect approach, behavior like that is well known(1) to be basically guaranteed to happen with GPT-3 and similar approaches.

(1): By people involved in the respective science categories (Representation Learning/Deep Learning, NLP, etc.).


> Beside that, I cannot understand why the overall idea, using open-source project to train a ML model that generates code would ever be a problem. We human beings are learning as the model, we read others code, books, articles, design patterns... and it becomes part of us.

It's an interesting question.

1) When a human being reads code or a CS textbook, we think of them as extracting general principles from the code, so they don't have to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on. That seems to leave them permanently dependent on the training corpus.

2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if GitHub were extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate to be; why should anyone help them without similar compensation? That this is being done by "OpenAI", a company which went from open non-profit to closed for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.


> I cannot see why this would be a problem, or why an organic would become something superior that what they do is a creation and a ML mode is scraping your code. What is the difference here????

TL;DR: The AI doesn't know it can't just copy/paste (from perfect memory), and as such it learned to sometimes just copy/paste things.

The GPT model doesn't: "learn to understand the code and reproduce code based on that knowledge".

What it learns is a bit of understanding, but it's more like recombining and tweaking verbatim text snippets it has seen before, without really understanding them or grasping the concept of "not just copy/pasting code" (while knowing which patterns "fit together").

This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code which just happens to be the same or similar. It's like a person with perfect memory sometimes copy/pasting code they had seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if not pre-filtered out before training).

I.e. there is a difference between "having a different kind of understanding" and "vastly missing understanding but compensating it by copying remembered code snippets from memory".

Theoretically it could be possible to create a GPT model which is forced to understand programming (somewhat) but not to memorize text snippets, but practically I think we are still far away from this, as it's really hard to tell whether a model has memorized copyright-protected code.
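The memorization failure mode described above can be illustrated with a deliberately oversimplified toy (this is not how GPT works internally; it is just a sketch of the analogous behavior): a bigram "model" that maps each token to the tokens that followed it in training. When a given context was only ever seen in one source, the model has exactly one continuation to offer, so "generation" is verbatim regurgitation.

```python
import random
from collections import defaultdict

# Toy bigram "language model" (an illustration, not GPT): map each token
# to the list of tokens that followed it in the training data.
def train(tokens):
    model = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        model[a].append(b)
    return model

def generate(model, start, length):
    out = [start]
    for _ in range(length):
        choices = model.get(out[-1])
        if not choices:
            break
        out.append(random.choice(choices))
    return out

# Trained on a single source where every context has exactly one
# continuation, sampling can only reproduce the training text verbatim.
model = train(list("verbatim"))
print("".join(generate(model, "v", 7)))  # prints "verbatim"
```

With a large and diverse corpus the continuations mix and the output looks novel; with a rare context (an unusual function signature, a distinctive comment) the toy model, like the real one, falls back to the one thing it has seen.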


I have a genuine question about this whole thing with Copilot:

A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it’s trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It’s a really cool product btw and I’ve been using and loving it for years). According to their website they have over 1M active users.

Why is this suddenly a huge big deal and why is everyone suddenly freaking out about Copilot? Is it because it’s GitHub and Microsoft and OpenAI behind Copilot vs some small startup you’ve never heard of? Is it just that the people freaking out weren’t paying attention and didn’t realize this service already existed?


The feature of TabNine that uses the "public" dataset is optional. It can also provide completions only based on local code. That optionality is important.

Also, tabnine has a smaller scope; you type "var " and it suggests a variable name and possibly the rest of the line, like autocomplete has been doing for decades. Perfectly normal.

My understanding of copilot is that you can type "// here's a high-level description of my problem" and it'll fill out entire functions, dozens of lines. The scope is much grander.


> It can also provide completions only based on local code. That optionality is important.

I don’t see how? The question is about the ethics of building such a tool, not whether anyone is forced to use it.


For many, the question is about the code quality as well. Having an AI write substantial chunks of code based on the work of "the average github committer" is being criticized as a problem for security, correctness, and understanding.

I think such arguments are a little overheated, but I do have my copy of tabnine configured to use only local code, because depending on the full dataset (which is only available over the cloud, IIRC) seemed like it was going to be more work than it saved.


Yes, and it is also not an OK thing to do for the start-up. They were just lucky that nobody noticed their licence violations.


Because the repository trusted by millions is starting to do things we never anticipated. It's growing in ways that are a touch uncomfortable for some.

I think some are also beginning to feel an Amazonification happening. We built all the stuff and made it free, but now a company is going to own it and profit off of it.

Edit: If we want to prevent this, we need a new license that states our code may not be included in deep learning training sets.

Edit 2: if private repository code is in this training set, it may be possible to leak details of private company infrastructure. Models can leak training data.


I personally have never heard of TabNine until now. Now that I have, I don't want my code to be part of that.


GitHub has more visibility and yes, more scrutiny. But that doesn’t mean TabNine would’ve survived without scrutiny, especially after an acquisition. The fact is, size matters.


it's just what the community does these days, bored and have to be upset about something, and being upset at big companies is trendy


Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped these projects then it's a problem.

Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:

If the code was not encumbered by restrictions, I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta-analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.

If I scraped GitHub code and then wrote a book about common coding patterns & practices I don't think that would be wrong.

I used the Brown corpus and multiple other written-word corpora, along with WordNet and other sources, to write my thesis in Computational Linguistics on Word Sense Disambiguation, later applying it to my job, which earns me money. Is this wrong?

Public datasets have been used extensively for ML already. I don't see this as much different.


> Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped these projects then it's a problem.

It did. It's spitting out the AGPL into empty files, and AGPL'd code isn't freely usable for commercial purposes. It requires people who use it to make changes available under the same license.


My non-expert reading of the AGPL seems to indicate that using AGPL code in a commercial project is fine as long as you don't change it. GitHub wouldn't necessarily have needed to change it to include it in a dataset, so I'm not sure there's a license violation.

However, the gray area is that the massive dataset of which it is a part will spit out new code that has, in some way big or small, been influenced by the AGPL code, which... well, I don't think that sort of use was anticipated by the terms of the AGPL. I can see reasonable arguments in both directions. Personally, though, I would favor an interpretation that limits GitHub's use for commercial purposes, if not for strict licensing restrictions then at least for the spirit of these licenses.

In truth, I would very much have liked GitHub to have gone out big & loud with an aggressive awareness campaign asking repo owners to opt in to the use of their code for this. Again, for purely open source licenses I don't think that would be required, but I still think it would be the right thing to do. And certainly less damaging to their reputation, and to project maintainers' future willingness to trust GitHub with their code.

I don't think this will be a tipping point by itself, but if this behavioral pattern continues I could imagine devs big & small shifting to hosted or on-prem instances of things like GitLab.


> My non-expert reading of the AGPL seems to indicate that using AGPL code in a commercial project is fine as long as you don't change it

Any restriction imposed by the GPL is also true for the AGPL. In particular, if you reuse some code from a (A)GPL project, even if you don't change the code, you must release your whole project under this license too, and give attribution to the author.

For the LGPL, the same thing applies, except you can reuse LGPL code without releasing your own code under a *GPL license, provided someone can replace the LGPL code at execution time with other code (static linking does not allow this).


But the AGPL and LGPL still don't prohibit commercial use, do they? They impose obligations on those using the code to (potentially) release their own code, but commercial use in and of itself isn't prohibited.

Again, I think GitHub is going against the spirit of these licenses and should behave differently. What I'm trying to do in this discussion is simply explore the limitations within the letter of the license.

Basically the boundary between being legal, and being legal but still wrong in some other sense. Consider it my exploration of the ethical constraints vs. legal constraints.


You can use GPL code commercially, as long as you release what needs to be released as GPL; for the AGPL, that's probably your entire codebase.

You cannot use AGPL code in a non-AGPL codebase, or as a library in a non-AGPL one. Same for the LGPL, except you can use LGPL code as a dynamic library in a non-GPL codebase, as long as you redistribute modifications to the LGPL code.

You can sell *GPL software or services around it.


> My non-expert reading of the AGPL seems to indicate that using AGPL code in a commercial project is fine as long as you don't change it.

When you combine some [A]GPL code with said commercial project, you create a derivative work. And that work has to be compatible with [A]GPL's conditions as well. Which is usually not great for a commercial (closed-source) project.


That seems really dumb, since they have a well-formalised system for people to declare their licenses. Even though people have shown isolated incidents of it, I'm still sceptical (for example, maybe someone put AGPL headers into a project they explicitly licensed MIT in their main license file, etc.).


If an individual hypothetically painstakingly searched through GitHub to see how others wrote an API call and copy-pasted it, almost no one would have a problem with that, even if they didn't attribute every little code snippet. But some are bringing out the pitchforks because ML can do that painstaking search (yes, I know it's not literally a search) so efficiently that it's actually (maybe) useful as a tool. But it's not fundamentally different from what many programmers do all the time.


Scale, intent and impact do actually matter in copyright. Copying a single API call can be OK, while copying a million of them might not be. There's a reason for the word "fair" in "fair use".

If a whole project was copied verbatim and the license violated I think everyone would agree that was wrong. So then is copying the same quantity of code across 1000 projects wrong?

Is setting up a process and a system that does that systematically, at scale with intent and then commercialises the result wrong?


Don't "fair use" limitations go out the window if the license is permissive enough to allow commercial use of the source code? As I've said in other comments, I still think this was the wrong way for GitHub to do this (opt-in would have been much better), but outside of licenses with commercial restrictions I'm not sure there's a license violation here, any more than scraping hundreds of public domain books to create paid reprints would be a problem, or using the same to create a corpus for NLP machine learning.


Fair use wouldn't matter if you were distributing the code in a way that followed the license, but the vast majority of permissive licenses require including the copyright notice, which they don't do. Only niche licenses like WTFPL would be permissible to use like this without invoking fair use.


If a large company violated my open source license, I'd certainly have a problem with it. Use my code all you want, but follow the basic rules.

There are entire websites dedicated to GPL violations; people do care.


If the license allows commercial use, what would the GPL violation be in this case?


The GPL allows commercial use, provided that you ship your customers the source code of the software that bundles the GPL3 code you use. Publish a mobile app on any app store with a GPL3 snippet, and suddenly you must provide the source code of your entire app under GPL3 terms. That in turn might violate the licenses of the other libraries you use, so mixing libraries might permanently leave you open to lawsuits from either copyright owner.

The GPL also inherently violates the Apple App store guidelines, so you cannot use any GPL snippets for iOS applications you plan to publish there.

There are workarounds for GPL requirements (notably the web loophole: writing applications that don't distribute binaries but generate web content instead, thereby not violating the GPL by not providing the backend source code), but even that has dangers (for example, when you include a snippet of GPL3 code in your compiled JavaScript).


> what would the GPL violation be in this case?

Since it incorporates GPL source code, the entire Copilot neural network needs to be distributed under the GPL, and, since it is a non-source form of the software, its entire corresponding source (namely, the entire training dataset it was generated from) needs to be made available, also under the GPL, to anyone it's distributed to. (The AGPL is a bit^Wlot more clear about the fact that remote interaction over a network is a kind of distribution, though.)


GPL allows commercial use. Under certain conditions.


Unless it is searching through repositories which someone paid to keep private.

What if that private key you accidentally committed, pushed, removed and pushed last week to your private repo is now showing up in everybody's Copilot?


I would hope they're not using private repos.


The difference appears when copyrighted material is repeated verbatim. And it's obvious GitHub has no control over how much copyrighted material is being repeated verbatim. And that copyrighted material is intended to be used by commercial companies who copyright their own material and don't want to have their copyright challenged.


Under US rules at least, everything automatically falls under copyright. But if the license allows full use, even for commercial purposes, verbatim repetition is no different in this context than if I included the code in my own new piece of software. (Of course, attribution and distribution of the code and modifications would frequently be required. With Copilot it's potentially a gray area whether code lumped together with countless other software counts as "modified" in the traditional sense of the term, but attribution should still be an obvious requirement. And whether or not the other license requirements strictly apply, I still think it would be the right thing for GitHub to honor the spirit of those strictures, especially considering their entire business model is based on a majority of users entrusting their code to them.)

If they're not going to act accordingly, there's no reason someone couldn't roll their own GitLab instance, or a competitor with more respect enter the marketplace.


OpenAI's argument is that this is fair use, in which case the license does not apply at all (though if the court's decision hangs on certain parts of the fair use tests, especially the fourth part, what was contained in the license may have some relevance).


> Hi. I know you’re excited about copilot.

> ...

> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.

Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?


Well, for whatever reason, the HN thread for it is in the top 30 most upvoted threads of all time. That probably counts for something.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


The comments on the original announcement here were pretty positive: https://news.ycombinator.com/item?id=27676266


How amazing would it be if you could ask a search engine for a piece of working code based on a short description?

That would be exciting tech for me.


There have been a lot of tweets from developers with access raving about how cool it is.


> But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it

What you just saw three days ago was a hype-driven unveiling of a cherry-picked contraption by GitHub, OpenAI and Microsoft. Open source became the loser once again, got taken advantage of by this clever trick, and will soon be turned into a paid service. (With lots of code that is under copyright of various authors.)

Anyone who critiqued the announcement three days ago was drowned out, downvoted and stamped on by the fanatics.

I wanted to see those who had access to it (not GitHub or Microsoft fans) demystify and VERIFY the claims rather than blindly trust it. The suspicions of the skeptics were right, and lots of questions still remain unanswered.

Well done for re-centralising everything to GitHub. Again.


Here's the brutal and ugly truth: why isn't our personal data treated as private property? It's because those who write the laws governing its status either lack the requisite understanding or else practice a form of, to put it mildly, motivated reasoning.


Your public source code is not "personal data".


Well, for the most part my code isn't going to do anyone a ton of good. I don't use much in the way of popular frameworks, but I also guess this means I'm gonna be out of a job for not writing "normal" enough code at some point.

Time to move on to the carbon age I suppose.


I don't know - I think a variety of coding approaches may make for some interesting AI code suggestions in the next year or two.

I do pity the poor algorithm that has to parse sense into my coding idiosyncrasies.


Perhaps. My experience with ML is that it's more like:

* a person that follows popular trends than

* a person who finds/dissects clever/unique solutions to add to their tool belt

Honestly, I'm pretty sure ML hates my guts. Anything I've ever used involving it ends up burying my voice and slowly trying to etch away the parts of me that aren't normal enough.


Is there no licence with any sort of model-training clause: "If this licence or the source code it covers is used to train a statistical model, then the model and the code used to create the model are covered by this licence (which has terms like the AGPL)"?

If not, will anybody quietly slip something like this into Copilot's training data?


I'm a big proponent of open source and I'm usually not kind about GitHub's bad moves. For example, I find it stupid to use VSCode and believe that it is open source, when that is a lie.

But in this case, I think the charges being levelled against GitHub are not right.

I think the idea is nice and it is a fair use of open source code. Anyone is free to download free software and do something similar, and that is fine.

I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing that there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back by claiming that they expected the code generated by GitHub to be safe for commercial use.

Also, I would think GitHub would have been abusive if they had used private-repo code to train their model without permission.


Unfortunately, just because code is open source doesn't mean that there aren't terms of use attached with it. One of the simplest and most widely used terms is attribution.

This means that if Copilot does not attribute code when it copies and modifies it, then it is violating most open source licenses. Full stop.


My point is that somehow it is not Copilot that is violating the licenses; the generated code is.

So, if you just use Copilot to generate random things, you are OK. But if you try to use the generated code for anything (distribution, selling, eventually even plain usage), then you are violating the licenses in the same way as if you had taken the parts of code to reuse yourself.

It is the users of Copilot who should avoid that, or be very careful to check every line produced (which is almost impossible).

Also, by itself, copying one or two lines of code can hardly be limited by copyright. But, as we saw, Copilot can spit out big, full blocks of code from existing projects.


You have good points, but I would argue that Copilot itself is the entity distributing copyrighted code since there are times when it copies code verbatim. That puts the legal onus on Copilot itself.


What's hilarious about auto-generating the GPL license text is that it proves Copilot was trained on GPL code, yet it's essentially impossible to tell which code it came from. Any legal battle will be strange... Is it enough for Copilot not to regurgitate GPL-licensed code exactly? Is it enough for Copilot to create a slightly modified version? Laughably, as soon as slight variation is added, there is so much code in the world that it'll be impossible to prove wrongdoing for HTML or JavaScript synthesis. A model trained on all permissively licensed code on GitHub produces something that looks a lot like your own GPL code? Are you sure your code is so unique?

Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?


GitHub should list all the projects they scraped the code from to make Copilot.


If you're hosting at the free GitHub service, or even the paid one, GitHub did not scrape your code. They just accessed the information on hardware they own. HTTP wouldn't have to be involved at all; they could just look at the disks.

Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""

The above isn't to say I agree with this but just to highlight the dangers of outsourcing and the cloud.


believe it or not, there are more countries in the world than the United States

> "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy."

this is definitely not the case for 100% of the rest of the world


It's not even true in the US. It's not a specific law, but rather a doctrine that is more of a vague legal idea as I understand it. There are always laws that don't follow it, such as the CCPA in California or several (narrower) approaches in other states, or even cases such as Carpenter v. US that rejected a possible application of it. This is without even considering more obvious holes in this concept including IP law, healthcare data, etc.


GitHub is incorporated in, and mostly centered in, the US, not the rest of the world.


The good news is that GitHub then also has no reasonable expectation that I'll use their service. Most developers can just as easily set up a GitLab or self-hosted alternative with zero friction.


That's why the GDPR is right. Like you mentioned, there is no cloud, just other people's hardware.

They will do whatever they want with your code.

MS didn't change a bit.


see some analysis of the scope of this issue here: https://docs.github.com/en/github/copilot/research-recitatio...

especially: Conclusion and Next Steps.

This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
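GitHub hasn't published that prefiltering implementation, but the core of such a duplication search can be sketched as exact token n-gram matching against an index built from the training files. Everything below, including the function names and file paths, is a hypothetical illustration, not GitHub's actual code:

```python
# Hypothetical sketch of a training-set duplication search: index every
# n-gram of tokens in the training files, then flag any suggestion that
# shares an n-gram with the index and report which files it matches.

def ngrams(tokens, n=6):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_files, n=6):
    # Map each token n-gram to the set of files it appears in.
    index = {}
    for path, text in training_files.items():
        for g in ngrams(text.split(), n):
            index.setdefault(g, set()).add(path)
    return index

def find_quotes(suggestion, index, n=6):
    # Return the training files sharing any n-gram with the suggestion,
    # i.e. candidate sources the model may be quoting from.
    sources = set()
    for g in ngrams(suggestion.split(), n):
        sources |= index.get(g, set())
    return sources

training = {"quake/q_math.c": "i = 0x5f3759df - ( i >> 1 ); // what the fuck?"}
index = build_index(training)
print(find_quotes("x = 0x5f3759df - ( i >> 1 ); // what the fuck?", index))
```

A real system would need tokenization smarter than `str.split`, fuzzy matching to catch renamed identifiers, and an index that scales to billions of lines, but the UI behavior described above (telling you where a suggestion is quoted from) only needs this kind of reverse lookup.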


The way "AI" works for now, Copilot never comes up with its own ideas, as it is incapable of deductive reasoning. It basically detects patterns in the context, then mixes variations of things it learned. If there is nothing to mix (that is, if there is a single source), the risk of spitting out verbatim copies is high. But if there are multiple sources, some mixing, and some amount of tiny differences, those differences had better not be too trivial, because I don't see why we would suddenly drop Abstraction-Filtration-Comparison approaches...

So their defense along the lines of "oh, it's fine, it very rarely emits verbatim things" is bullshit anyway. It's an answer to the wrong question. (If there were a ton of verbatim recitation, they obviously would not try to wave the problem away like that; but we cannot conclude anything from verbatim output being rare, despite them presenting that as if it were a quite central and strong argument.)


Hi HN. Didn't really expect to see this tweet make it here of all places. But that's cool.


Here is the relevant portion of GitHub's terms of service (section D.4) [0]:

"""

4. License Grant to Us

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

...

"""

Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.

[0] https://docs.github.com/en/github/site-policy/github-terms-o...


This is my take-home:

> We are obsessed with shiny without considering that it might be sharp.


To me it seems the whole subject requires additional consideration in licensing. It is a little like applying telephone-based law to the internet: it will not fit 100%.

If the creators' interests are no longer clearly expressed by a license, we need updates to the license texts.

Let's look at MIT:

____________________

"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. [...]

____________________

From the license text alone, it would not be clear to me why anyone could claim that the OpenAI Codex or GitHub Copilot would require attribution to any of the MIT source code used to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof. It is essentially a mathematical/statistical analysis of it.

Now what about any generated new source? How similar does it need to be to any source to count as a copy? At what size does generated code qualify as a copy instead of a snippet of industry best practice?

Where does the responsibility for attribution lie? Should we treat AI code-generation models like a copy & paste program? Usually you cannot really say where a copy came from 100%; how do you know what factors influenced it?


> Now what about any generated new source? How similar does it need to be to any source to be a copy? At what size of the generated code it qualifies to be a copy instead of a snippet of industry best practice?

Let's handle the simplest case first: Copilot can and does regurgitate large pieces of its training dataset verbatim. This is a well-known and trivially demonstrable property of all ML models in this family. Would such an exact copy fall under the license of the code being copied? This of course needs to be tested in court, but my gut says "yes". The problem now is, if you're using Copilot, you may end up with such copied code in your codebase without ever knowing, and this might open you to liability.
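For anyone curious what "regurgitates verbatim" means mechanically, here is a minimal sketch of flagging token-level overlap between a generated snippet and a known corpus. The 10-token window and the crude tokenizer are arbitrary illustration choices on my part, not anything GitHub has published:

```python
import re

def token_ngrams(code, n=10):
    """Tokenize code crudely and collect all n-token windows."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, corpus_snippets, n=10):
    """True if any n-token window of `generated` also occurs in the corpus."""
    corpus = set()
    for snippet in corpus_snippets:
        corpus |= token_ngrams(snippet, n)
    return bool(token_ngrams(generated, n) & corpus)
```

Real plagiarism detection normalizes whitespace, comments, and identifiers on top of this, but even this toy version would catch an exact paste of ten or more tokens.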


How does Copilot avoid training on malicious code? There might be bad actors who would love to have their code scraped for this...


I bet there are bad actors already starting to spam GitHub with sensible-looking projects that have hidden vulnerabilities, in the hope that the next retraining of Copilot will pick them up.


Such a great question. Remember Tay?


Newsflash: all open source means that you're already doing free work for the largest corporations in the world! It seems like developers, as a group, decided that it would be better to spend their nights writing free code for FAANG, so they would be able to keep their day jobs. Bezos and friends thank you all. #genius


Source Hut[0] is getting more attractive with each passing day, but I'm not sure I can adapt to its weird e-mail-centric pull requests (and I know that this is a standard Git feature, but the UX seems bad).

[0] https://sourcehut.org/


The other thing is that Source Hut is immensely complicated to set up. I wish I could love it, but Gitea was just so much easier to get going.


A colleague of mine: "I either remove all my (useless) repositories from GitHub or I ask GitHub to pay me if they want to use my code in Copilot".

It's not that crazy.


> It’s truly disappointing to watch people cheer at having their work and time exploited

Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will go like Tesla's Autopilot for years.


Is this analogous to GPT-3 reading every sentence mankind has ever written, without attribution?


A little. But all works output by GPT-3 are provided in “source form” to everyone who uses them – whereas lots of the output of Co-Pilot (trained on copyleft code, among other things) is going into proprietary software projects.

(Also, GPT-3 wasn't trained on nearly as much writing as that. Even if you ignore lost writing, GPT-3 was trained on a small subset of the 'net.)


I don’t understand this mentality. The AI is trained (or at least supposed to be - that’s fixable) on code that was published under open licenses. The “exploited by the man” trope after publishing OSS feels entirely backwards.


Nothing is free, people! ... People are outraged at GitHub, but nobody is going after Facebook or Google for training their AIs on your personal data. Facebook used your face to train some algos, Google your personal emails, etc.


> nobody is going after Facebook or Google

A lot of people dislike them and minimize their use.

More importantly, we are seeing a bait-and-switch. People agreed to GitHub storing, showing and indexing their code and issues, not to the code being used for Copilot, regardless of what the fine print in the usage agreement says.


It's not about it being free, it's about GitHub taking something you licensed with conditions (ie. attribution or keep this copyright notice and license file, etc.) and blatantly ignoring your license because they know you probably can't afford to sue them for copyright infringement. Open Source doesn't mean you can reproduce and copy the code freely, licenses exist for a reason. Also: of course people care about Facebook et al. (not enough, I'll grant you). Plenty of people complain about Facebook violating privacy every single day.


Your license means nothing if you can't defend it. If you paid for the service, they wouldn't dare use your code regardless of the license.


> Your license means nothing if you can't defend it.

It's okay to commit crimes if the victims are poor enough? (And yet what you've said is still true.)


I'm not saying it's OK, I'm saying nagging on HN, Twitter, etc. won't solve any problems.


And yet what problem was solved without people talking about it?


ok have at it :)


Well, "nothing is free" doesn't mean someone can blatantly violate a license 'cause they have rent to pay.

Maybe people should be mad about what Facebook or Google do but that stuff doesn't involve taking stuff outside their terms of use.

Maybe Github could try attaching a "we can relicense all your code whenever we want" condition to their hosting but they'd lose all their business.


I mean.. I am? I care much more about Google or Facebook profiling and profiting off of my data (especially when I don’t consent to giving it to them in any meaningful way) than I do letting GitHub do things with code I knew was freely available and that other entities could use in profitable ways.


> People are outraged by GitHub but nobody is going after Facebook or Google for training their AIs on your personal data.

...what?


Because people accept a EULA when they give their data to FB or Google. GitHub is exploiting a grey area to leverage a big fat chunk of GPL (or otherwise) licensed code, under a reading of those licenses that is probably technically legal but morally very ambiguous.


I wish that I could somehow turn my personal data into IP, but I’m not sure how to accomplish that.


Well, some of us try to keep Facebook and Google out of our private lives, but our faces aren't copyrightable.


What if you paid to keep it private?


There may be discussions to be made about licenses, but "to watch people cheer at having their work and time exploited by a company worth billions" is a disappointingly myopic take, especially from a developer.

Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.

We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.


We would, if they weren't required to pay for potatoes.

Are you making a claim that Netflix shouldn't be required to pay for individual movies because they sell a collection of movies?


On their website they say that "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set."

So it won't copy-paste your code. It has just read code from open sources and learned from it - similar to what humans do. So I don't see any problem with this.


First, it does copypaste code: https://docs.github.com/en/github/copilot/research-recitatio...

Second, we can't ignore that if someone deliberately tries to make it spit out copyrighted code, the chances are going to be much greater.

Why would anyone? Plausible deniability: "I didn't copy this GPL procedure, the copilot gave it to me!"


GitHub has 56 million users as of September 2020 (according to Wikipedia). Let's assume that only 1 million of them use Copilot at an average of once a week.

That means that every week, there will be 1,000 verbatim copy-pastes of code by Copilot. Then multiply that by a year or more as Copilot gets older.

0.1% may not seem like a lot, but at the scale of Internet companies, it always is.
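Spelled out (the 1 million weekly uses are my assumption from above; the 0.1% rate is GitHub's own figure):

```python
weekly_uses = 1_000_000    # assumption: 1M Copilot users, once a week each
verbatim_rate = 0.001      # GitHub's stated 0.1% figure

per_week = weekly_uses * verbatim_rate
per_year = per_week * 52
print(per_week)   # 1000.0
print(per_year)   # 52000.0
```

And "once a week" is almost certainly a lowball for an autocomplete tool, so the real number of verbatim suggestions would be far higher.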


> So it won't copypaste your code.

You might want to check out this video...

https://twitter.com/mitsuhiko/status/1410886329924194309


> the vast majority of the code that it suggests is uniquely generated and has never been seen before.

Original code in somebody's GitHub repo:

  int x = y + z;
Copilot code:

  int Eisaa7ha = Wu8iazo7 + Roh0Eesh;
Not copy pasted! Uniquely generated! Never before seen!
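The joke has a serious core: renaming identifiers doesn't change a program's structure, so even a trivially normalized comparison still catches the "copy". A sketch of the idea in Python, using the stdlib ast module on the Python analog of the snippets above:

```python
import ast

def normalize(source):
    """Dump the AST with every variable name replaced by a placeholder,
    so structurally identical code compares equal regardless of naming."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
    return ast.dump(tree)

original  = "x = y + z"
generated = "Eisaa7ha = Wu8iazo7 + Roh0Eesh"
print(normalize(original) == normalize(generated))  # True
```

So if anyone wanted to argue in court that such output is "uniquely generated", identifier renaming would be a very thin defense.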


I think the product is pretty cool, but I wish it had been announced by GitLab instead so there would be less of this brouhaha.


Isn't Google's BigQuery also scraping GitHub and making it accessible/available for commercial use?


God forbid somebody profits from the code that you've posted publicly.

It's a NET POSITIVE FOR EVERYBODY.


To the extent Copilot is doing something illegal, or making its users inadvertently engage in illegal behavior, it is copyright infringement, as (most) license violations are copyright violations.

Copyright cuts both ways. Free Software and Open Software exist in context, and because of, copyright laws. This means that a person or a company using output from Copilot may be engaging in copyright infringement. In other words, Copilot is enabling software piracy.

I might be sympathetic to it, and even consider it mostly positive, but then if companies can use my code ignoring the license, I want to be able to Torrent their products in peace too.


Copyleft exists for a reason.


> I have a SoundCloud and books and whatnot I could promote here.

You just did.


Github is (owned by) Microsoft. This is just an appetizer.


What if I don't pay?


Then you can't use the code completion tool made by Github using open-source code.


[flagged]


For the record, I moved all of my code away from GitHub because I didn't like them. Now I self-host, with all of the work that entails.

So I think I have a right to be mad when they do something like this to code I previously stored on GitHub.


What about the "super entitled" companies that take code and violate its license?


Hey, how come what is essentially the equivalent of an HN comment, only made on Twitter, gets to be an HN story? :-(



