Hacker News

> If you open-sourced code and allowed it to be used for commercial purposes

Uploading it to Github does not transfer ownership or imply allowances for any use. If you upload it without a license it is a copyright violation to copy the code. Even with an MIT license it is a copyright violation to copy the code without attribution.

> I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.

People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.




> People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.

it's worse than that: it's Microsoft trying to completely undermine the concept of open source

meanwhile: they're unaffected as their high-value proprietary code remains private and doesn't train the model


Finally the angle for Microsoft buying GitHub becomes clear.


There is zero rationale why Microsoft could not do it without buying GitHub. Hell, they could have trained the model from GitLab or Bitbucket public repos.

They need developer mindshare and they lost a lot of developers in the 2000s. Buying GitHub (and being involved in 100s of projects) brings this mindset of Microsoft as a developer company back to developers. It works as an advertisement platform/branding for some novel technology like this Copilot.


It was a bit of a tongue in cheek comment. It’s very unlikely they were even planning this when they bought GitHub. While you’re completely right about their rationale, a lot of people were at the time concerned about the acquisition, and it feels now a bit short sighted to burn all the good will they have generated by an act like this which undermines open source.


I don’t even think most people care about open source code. What if they’re reading private repos as well? The only option to escape this is to leave github. Which I’m kinda considering even with all the work involved.


I think the rationale for buying it was that they were the single largest user of it, and were concerned about funding stability.

Does anyone have the numbers from before they bought GitHub? I saw them once; I don't remember the values, I just remember being shocked at what percentage of the total GitHub code base was Microsoft. I had no idea they were using it at all.


And once again RMS was proven right.


How would the GPL have prevented this, exactly?

More explicitly, how would a license that gives everyone the right to copy, modify and redistribute source code for any purpose without compensation or attribution have prevented Github from building a tool that copies, modifies and redistributes source code without compensation or attribution?


Because if you use GPL code, you must release your use/changes under the GPL. That's why the GPL is often called infectious. If copilot spits out GPL code verbatim, the user of the code must respect the license.


You don't know how the GPL works. The GPL explicitly requires attribution. I don't understand what you thought you were saying. The GPL places strict requirements on the licensing of derivative works. You don't know how the GPL works.


And if they provide attribution?


afaik GitHub Inc is still an independent company and independent from Microsoft


> you grant each User of GitHub a nonexclusive, worldwide license to […] reproduce Your Content […] as permitted through GitHub's functionality

If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.

This ToS change happened in 2017, and I actually had to get approval from all contributors of my projects to accept the changed ToS: https://github.com/justjanne/QuasselDroid-ng/issues/5

What GitHub’s doing is shady, but it’s been obvious it was going to happen for years.


Yeah, and https://docs.github.com/en/github/site-policy/github-terms-o... has this language:

This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. [...]

So, it's not clear if they could charge for Copilot, given that last line, but indexing, analysis, and sharing with other users are all included.


Reproducing copylefted code is neither indexing nor analysis.

To be clear, a piece of code that creates a copy of some GPL code is not a problem by itself. However, it's misrepresented as something the tool "generated" rather than code belonging to an existing human actor, without appropriate attribution and licensing info.


I think we're assuming that they used GPT3 (or whatever) on code that was stored on github.com in accordance with their TOS, which means (if I understand correctly as a non-lawyer) that they didn't share it under the terms of the GPL, since they can't, but if whatever licenses the code already had attached were not sufficient, that code was also licensed to github for the purposes of indexing, analysis, and sharing with other github service users.

Does this mean that if you picked up GPL code from gnu.org and put it in your public repo, you're in violation of the GPL for illegitimately licensing that code to github under a non-GPL-compatible license? Maybe!


> that code was also licensed to github for the purposes of indexing, analysis, and sharing with other github service users

Even if said code was licensed to Github for sharing with other users (e.g. displaying it when another user browses the page), that doesn't mean that the same permission is given to all other users under the same conditions. Or allows them extra liberties, like using said code for purposes other than "sharing with other github service users".


No argument, there! I'm pretty concerned about the consequences of using Copilot, legally. It all seems to turn on whether a judge or judges believe that a long sequence can be produced by "learning" rather than "copying". ¯\_(ツ)_/¯


It's irrelevant. There is no guarantee that the people uploading code even have the ability to grant github such a licence if they aren't the copyright holders.


> If you upload code to GitHub, you grant them (and every GitHub user) a license to do exactly what Copilot does.

You can upload code to GitHub without the ability to grant such a license, or is github now only for primary copyright holders?


> or is github now only for primary copyright holders?

That is how I interpreted the 2017 ToS change: You can only upload code to GitHub if you’re the primary copyright holder, or the code was already on GitHub to begin with (which is only an issue if the primary repository of some project is another git host, e.g. gitlab).


Well a public repo can be cloned. And needs to be replicated in CDNs etc.

Is that not reproduction?


Good point; if the code does not have a specified license then standard copyright terms apply and inclusion for commercial use should be verboten. If it's actually open source without commercial restrictions though, I don't see an ethical difference between using the code directly and using it for a meta-analysis driving ML for enhanced code completion.


> If it's actually open source without commercial restrictions though, I don't see an ethical difference in using the code directly or for a meta analysis driving ML for enhanced code completion.

Most Open Source code comes with a requirement to carry over the license notice, which Copilot does not do. Additionally, ethics dictate you attribute the source when copying directly, something Copilot also doesn't do.


Are they actually using the source to make a derivative/fork though? If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area. Though I would still agree that the right thing to do would be to have an attribution area, even if it was thousands of entries long. Whether technically required by the license or not, the spirit of these licenses I think would come down on pushing for attribution regardless of the nature of the re-use.


> If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area.

It's already been demonstrated that Copilot - like all tools in the GPT family - frequently output large chunks of their training dataset verbatim. It's not hard to trigger this behavior, even unintentionally. To me, this is much closer to "reusing". But I'm not a lawyer.

It's also worth remembering that there are two parties potentially open to liability here - GitHub, with the way the code was used with the Copilot, and the user, who may be unwittingly including licensed code in their codebase. Given the well-known behavior of the GPT family I mentioned above, it might be hard to argue that Copilot "just chanced" into generating code that's identical to existing, non-public-domain code.
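
For anyone who wants to check a suggestion against a file they suspect it came from, here is a rough sketch of measuring verbatim overlap. It is only an illustration: the function name and the whitespace tokenization are my own assumptions, and it says nothing about what threshold a court would care about. It uses `difflib.SequenceMatcher` from the Python standard library to find the longest run of identical tokens.

```python
from difflib import SequenceMatcher
from typing import List


def longest_shared_run(suggestion: str, known_source: str) -> List[str]:
    """Return the longest run of whitespace-separated tokens that appears
    verbatim in both texts (hypothetical helper, for illustration only)."""
    a = suggestion.split()
    b = known_source.split()
    # autojunk=False so frequent tokens are not silently ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


known = "def add(a, b):\n    # add two numbers\n    return a + b"
suggestion = "def total(a, b):\n    # add two numbers\n    return a + b"
print(len(longest_shared_run(suggestion, known)), "tokens shared verbatim")
```

A long shared run that includes comments is a hint of copying rather than independent reinvention, but that is a heuristic, not a legal test.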


> frequently output large chunks of their training dataset verbatim

Ah, that then is extremely problematic.

I really like the idea of Copilot to speed development-- basically code completion taken to an extreme-- but this seems like a very bad way to go about it.


Don't forget the usage restrictions, as specified in each individual license.


And of course the GPL would attach where applicable .....


> Even with an MIT license it is a copyright violation to copy the code without attribution

Probably a copyright violation. There are surely circumstances in which copying a small portion would either fall under fair use, or for other reasons not constitute a violation. The question then is whether or not Copilot is causing a violation. I don't think it's as clear cut as most commenters are making out.

All in all though, it's probably going to take a few court cases to figure out. In the meantime, I'd expect most companies to steer clear of Copilot.


Snyk did the same with Snyk code to build their “ML driven SAST” offering.

Pretty much anyone can scrape GitHub and train their model.

What exactly the legal implications of this are has yet to be tested.

Pretty much every model is susceptible to some sort of model inversion or set inclusion attack.

By their own admission, Copilot sometimes outputs PII that is part of the code, and code snippets, verbatim. Even if it's rare (IIRC around 0.1%), it's still a huge legal liability for anyone who uses the tool, especially since it's unclear how these inclusions are spread out and what triggers them. For example, it could be that a particular coding style, a particular way of using Copilot, or working on a specific subset of problems increases the likelihood of this occurring.

ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and receive a GDPR deletion request, do you need to throw away and retrain your model?

I don’t think people should be angry; however, I also think that this needs to be tested in court, multiple times, before this can be “safe to use”.

But I also don’t think that the ML model is necessarily a derivative work.

For example, if you use copyleft material to construct a CS course, someone would be hard pressed to argue that the course now needs to be released freely, let alone that anything the students write after attending the course would be a derivative work too.


if I feed the entirety of github (sans licenses) into a java HashMap and provided an interface to query that I very much doubt that its output would qualify as "fair use"

why is it different if a slightly more complicated data structure is used?
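
To make the HashMap thought experiment concrete, here is a minimal sketch in Python, with a dict standing in for the Java HashMap. The class name, method names and the stored snippet are all hypothetical; this says nothing about how Copilot is actually built. The point is only that whatever comes out of the query interface is the stored text, byte for byte, with no license or attribution attached.

```python
class SnippetTable:
    """Toy 'code completion' backed by a plain lookup table (illustrative only)."""

    def __init__(self):
        # Maps a prompt string to a stored snippet; license text is not kept.
        self._table = {}

    def add(self, prompt, snippet):
        """Store a snippet under the prompt that should retrieve it."""
        self._table[prompt] = snippet

    def complete(self, prompt):
        """Return the stored snippet unchanged, or None if the prompt is unknown."""
        return self._table.get(prompt)


table = SnippetTable()
original = "int answer() {\n    return 42; // original author's comment\n}"
table.add("int answer()", original)

# The "completion" is identical to the stored source, comments and all.
print(table.complete("int answer()") == original)
```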


I’m not saying it is, the courts should really decide on what exactly counts as fair use in this case.

All I’m saying that you don’t need to be a huge corporation to do it, and that others are doing similar things as well.

I passed on Snyk code due to similar concerns especially since they pull out examples from FOSS projects directly and even had a “fix me” option where they push pull requests into your repo with fixes.

On ML in general the current policy I’m working on for my org is that we do not use any pre-trained models trained on public data and pushed the legal team to actually start figuring out how we should deal with these issues properly in the future.

ML is currently a Wild West; it’s too new to have been tested and defended in court, regardless of how the chips would or should fall.

As far as your specific example it would really depend on what data is actually preserved.

Since they do parrot whole code snippets, comments and all, it seems that they don’t have a generalized model, at least not for every problem.

However, it’s also my personal legal opinion (IANAL) that if you can prove that the model holds nothing but a generalized solution for a given problem, the code it outputs isn’t a derivative work any more than the code of a person learning from copyleft code is.

However, then there is the whole issue of “allowed use”: none of the existing licenses specify whether the code can be used to train a model. This also means that we probably need to update all existing licenses to include a clause that explicitly states the limitations for this use case.

For code under existing licenses the fair use needs a proper judgement.

My gut feeling would be that it would count as fair use just as using code in a course or a book would be. GitHub definitely needs to make a page with attributions tho for that to happen and make sure their model doesn’t output anything but a generalized solution.


Is copying not allowed by their terms of service? I read something about it being needed for, e.g., the forking feature. But that was years ago, when I created the account.


> Even with an MIT license it is a copyright violation to copy the code without attribution.

That was my take originally, but apparently this is not as cut and dry as you may think:

https://www.technollama.co.uk/is-githubs-copilot-potentially...


> People share code for the betterment of society, and while copyleft used to be popular early on, the software industry has been moving towards less problematic and more permissive licences. The spirit of open source is to share code, and make it possible to use it to train machine learning.

It sounds like this writer doesn't understand the point of the GPL or the distinction between free software and open source. It's also quite a crude historical perspective given the increasing number of major projects licensed virally. Claiming that the MIT license's popularity justifies pirating GPL code makes no sense.


> the software industry has been moving towards less problematic and more permissive licences [sic]

... less "problematic" for who, exactly?

That is conveniently left out.


not only is "problematic" problematic, so is "permissive".

more permissive for whom? because the GPL is the most permissive license for users of the code: the GPL guarantees they're permitted to see the source.




