
Do we have any evidence that copilot doesn't check/filter by license?



One of the (ex?) programmers from Valve managed to get it to spit out parts of the Source engine verbatim. He posted a Twitter thread yesterday I believe.


Does that prove it ignores licenses, or does that imply the Source engine exists verbatim (minus licenses) multiple times on GitHub?


If it's minus a license then it should be assumed that rights are retained (in the same way you can't just take ownership of an image you find on the internet), so if it were filtering, it shouldn't take code from repos without explicit and favorable licenses. If it is taking code only from repos with permissive licenses (e.g. MIT), then why aren't they following the attribution requirements?

I don't think you can have your cake and eat it on this one.


If I steal some code and put it on GitHub under MIT, that doesn't really make it MIT; I'm just lying that it is. If Copilot then uses that, it's still in violation of the law, I'd assume (ignorance doesn't exonerate you, etc.). So they'd have to verify on a case-by-case basis, which they obviously haven't, given the volume of data they had to feed the thing.

It's kinda shocking that they think they can sell this. Even providing it for free is extremely sketchy, but at least that complies with BSD/GNU/CC licensed stuff, I guess.


And especially with such blanket statements as "the code you write with GitHub Copilot’s help belongs to you".


Why do you think that the recipient is responsible for verifying that no one else holds copyright on code they received under license?

Is every product user liable when a vendor ships some stolen code?


> Is every product user liable when a vendor ships some stolen code?

The user would be unlicensed, and unless the vendor resolves this, the user would need to purchase licences to continue using the software legally (i.e. if a vendor gives you a pirated version of Photoshop, you can’t just use it forever just because someone sold it to you).

There are usually clauses in enterprise software agreements that attribute liability for unlicenced components to the vendor for this reason. But ultimately if there isn’t a contract or the vendor vanishes, the user will need to go get a licence.

If you want to test the theory, I’ll send you a few images to put on your website, and when you get a claim through from the copyright owner you can try to argue that I sent it across without a copyright notice so I am liable ;)


> Is every product user liable when a vendor ships some stolen code?

No, but the difference is the users of a product are typically not making and distributing copies. That’s not the case if you use someone else’s code in your project.


It would prove that it doesn't honour all licences - just because the source code exists on GitHub without a licence doesn't automatically grant a licence to Copilot from a legal perspective.


Just because someone else ignored the license doesn’t mean GitHub is free to blindly vacuum that up.


Can you post the link to that?



3 lines of fairly generic code?

That's not what copyright is protecting.


Just for the record I was providing some evidence to support this question: "Do we have any evidence that copilot doesn't check/filter by license?"


I mean, even if a license was placed on the code, that doesn't mean the code is protected; if it's not protected by copyright, then it's fair game for Copilot to scrape, learn from, and emit variations of.

I believe GitHub's lawyers would have had hundreds of hours of discussion about this and at this point, they believe they are in the right, and anybody who disagrees should use the legal system to resolve the matter.

In the meantime, what it is and isn't doing wrt licenses seems to be poorly understood externally.


This is in fact impossible.

All they could do is filter by the LICENSE file in the repo.

Unfortunately for them, by law copyright and license are determined by the authors and merely represented by a LICENSE file, which could be lying about both.

The court isn't going to accept that excuse when this goes to trial.


And you can have multiple licenses in the same repository, folders with copyright exceptions, etc.

It's hard enough for us humans to find our way in this mess; I've little hope for an AI.

But maybe it's just the first step, the final step being to sell an AI that understands copyright management. I'm sure there is a big market for that.


I feel like a few guidelines and standards could help simplify a baseline process:

1) Require each repository to opt-in to be learned from.

2) Require any source file used for learning to have an SPDX license heading.

3) Have a list of approved permissive licenses to avoid any proprietary or copyleft arguments.

Using SPDX headings as the explicit guide would solve the problem of different code content using a different license within a project. An example being QtWayland: the client pieces are Proprietary/LGPL/GPL, whereas the compositor parts are Proprietary/GPL. That's not something you'd know from the license files at the root of the project (and post-6.3 they use SPDX instead of the prior license template heading).

Granted, this doesn't solve the problem of the chain of trust (is the individual publishing the code truly the copyright owner), but I think it would be a basic start for a program like this. The opt-in nature would make things... difficult, but I think that's a fair trade-off for something like this.
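
To make the three guidelines above concrete, here is a rough Python sketch of what such a per-file filter might look like. Everything in it is an assumption for illustration: the allowlist contents, the function names, and the choice to reject compound SPDX expressions outright are mine, not anything GitHub is known to do.

```python
import re
from typing import Optional

# Hypothetical allowlist of permissive licenses (guideline 3).
# The exact set is an assumption; any real list would need legal review.
APPROVED_LICENSES = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}

# Matches a header such as "// SPDX-License-Identifier: MIT".
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def spdx_expression(source: str, max_lines: int = 10) -> Optional[str]:
    """Return the SPDX expression found in the first few lines, or None."""
    for line in source.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            # Drop a trailing block-comment closer such as "*/".
            return match.group(1).strip().rstrip("*/").strip()
    return None


def may_train_on(source: str, repo_opted_in: bool) -> bool:
    """Apply the three guidelines to a single source file."""
    if not repo_opted_in:  # guideline 1: repository opt-in
        return False
    expression = spdx_expression(source)
    if expression is None:  # guideline 2: SPDX heading required
        return False
    # Guideline 3, simplified: accept only a single allowlisted identifier.
    # Compound expressions (OR / AND / WITH) are rejected rather than parsed.
    return expression in APPROVED_LICENSES
```

With this, a file headed `// SPDX-License-Identifier: MIT` in an opted-in repo passes, while a file with a compound heading such as `LGPL-3.0-only OR GPL-3.0-only` is skipped, as is any file with no heading at all. It still does nothing about the chain-of-trust problem, since it takes every heading at face value.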


Yes, a standard would probably solve the issue.

But until lawyers push for a standard that would make this part of their work irrelevant, I can't see how it could happen :)


And that is why this project should never have made it past the brainstorming session.


There was a tweet by Nora Tindall (since deleted) with a screenshot of an email directly from GitHub stating that GPL code is included in Copilot's training data and that Copilot will indeed use it.



