Do we have any evidence that copilot *doesn't* check/filter by license?

meheleventyone · on June 23, 2022

One of the (ex?) programmers from Valve managed to get it to spit out parts of the Source engine verbatim. He posted a Twitter thread yesterday I believe.

mustyoshi · on June 23, 2022

Does that prove it ignores licenses, or does that imply the Source engine exists verbatim (minus licenses) multiple times on GitHub?

meheleventyone · on June 23, 2022

If it's minus a license then it should be assumed that rights are retained (in the same way you can't just take ownership of an image you find on the internet), so if it were filtering, it shouldn't take code from repos without explicit and favorable licenses. If it is taking code only from repos with permissive licenses (e.g. MIT), then why aren't they following the attribution requirements?

I don't think you can have your cake and eat it on this one.

moffkalast · on June 23, 2022

If I steal some code and put it on GitHub under MIT, that doesn't really make it MIT; I'm just lying that it is. If Copilot then uses that, it's still in violation of the law, I'd assume (ignorance doesn't exonerate you, etc.). So they'd have to verify on a case-by-case basis, which they obviously haven't, given the volume of data they had to feed the thing.

It's kinda shocking that they think they can sell this. Even providing it for free is extremely sketchy, but at least that complies with BSD/GNU/CC licensed stuff I guess.

Hamuko · on June 23, 2022

And especially with such blanket statements as "the code you write with GitHub Copilot’s help belongs to you".

lupire · on June 23, 2022

Why do you think that the recipient is responsible for verifying that no one else holds copyright on code they received under license?

Is every product user liable when a vendor ships some stolen code?

Closi · on June 23, 2022

> Is every product user liable when a vendor ships some stolen code?

The user would be unlicensed, and unless the vendor resolves this, the user would need to purchase licences to continue using the software legally (i.e. if a vendor gives you a pirated version of Photoshop, you can’t just use it forever just because someone sold it to you).

There are usually clauses in enterprise software agreements that assign liability for unlicensed components to the vendor for this reason. But ultimately, if there isn’t a contract or the vendor vanishes, the user will need to go get a licence.

If you want to test the theory, I’ll send you a few images to put on your website, and when you get a claim through from the copyright owner you can try to argue that I sent it across without a copyright notice so I am liable ;)

ryukafalz · on June 23, 2022

> Is every product user liable when a vendor ships some stolen code?

No, but the difference is the users of a product are typically not making and distributing copies. That’s not the case if you use someone else’s code in your project.

Closi · on June 23, 2022

It would prove that it doesn't honour all licences - just because the source code exists on Github without a licence doesn't automatically grant a licence to Copilot from a legal perspective.

micromacrofoot · on June 23, 2022

Just because someone else ignored the license doesn’t mean GitHub is free to blindly vacuum that up.

leakbang · on June 23, 2022

Can you post the link to that?

meheleventyone · on June 23, 2022

Sure: https://twitter.com/ChrisGr93091552/status/15397316329318031...

dekhn · on June 23, 2022

3 lines of fairly generic code?

That's not what copyright is protecting.

meheleventyone · on June 23, 2022

Just for the record I was providing some evidence to support this question: "Do we have any evidence that copilot doesn't check/filter by license?"

dekhn · on June 23, 2022

I mean, even if a license was placed on the code, that doesn't matter: if the code isn't protected by copyright, then it's fair game for Copilot to scrape, learn from, and emit variations of.

I believe GitHub's lawyers would have had hundreds of hours of discussion about this, and at this point they believe they are in the right and that anybody who disagrees should use the legal system to resolve the matter.

In the meantime, what it is and isn't doing wrt licenses seems to be poorly understood externally.

samatman · on June 23, 2022

This is in fact impossible.

All they could do is filter by the LICENSE file in the repo.

Unfortunately for them, by law copyright and license are determined by the authors and merely represented by a LICENSE file, which could be lying about both.

The court isn't going to accept that excuse when this goes to trial.

gjadi · on June 23, 2022

And you can have multiple licenses in the same repository, folders with copyright exceptions, etc.

It's hard enough for us humans to find our way in this mess; I've little hope for an AI.

But maybe it's just the first step, the final step being the ability to sell an AI that understands copyright management. I'm sure there is a big market for that.

mroche · on June 23, 2022

I feel like a few guidelines and standards could help simplify a baseline process:

1) Require each repository to opt-in to be learned from.

2) Require any source file used for learning to have an SPDX license heading.

3) Have a list of approved permissive licenses to avoid any proprietary or copyleft arguments.

Using SPDX headings as the explicit guide would solve the problem of different code content using a different license within a project. An example being QtWayland: the client pieces are Proprietary/LGPL/GPL, whereas the compositor parts are Proprietary/GPL. That's not something you'd know from the license files at the root of the project (and post-6.3 they use SPDX instead of the prior license template heading).

Granted, this doesn't solve the problem of the chain of trust (is the individual publishing the code truly the copyright owner), but I think it would be a basic start for a program like this. The opt-in nature would make things... difficult, but I think that's a fair trade-off for something like this.
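A rough sketch of what that baseline check could look like, in Python. Everything named here is made up for illustration (the allowlist contents, the function names, the 10-line header window, and the `repo_opted_in` flag), and nothing in it reflects how Copilot actually selects training data. It deliberately rejects any file whose SPDX expression mentions an identifier outside the allowlist, so dual-licensed files like the QtWayland ones are excluded rather than argued over.

```python
import re

# Hypothetical allowlist of permissive SPDX identifiers (a policy choice, not a standard).
APPROVED_LICENSES = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}

# Matches the conventional "SPDX-License-Identifier: <expression>" header tag.
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+)")

# Keywords that combine identifiers inside an SPDX license expression.
OPERATORS = {"AND", "OR", "WITH"}


def spdx_expression(source_text, max_header_lines=10):
    """Return the SPDX expression found near the top of a file, or None."""
    for line in source_text.splitlines()[:max_header_lines]:
        match = SPDX_RE.search(line)
        if match:
            # Drop a trailing comment terminator such as "*/" or "-->".
            return re.sub(r"\s*(\*/|-->)\s*$", "", match.group(1)).strip()
    return None


def license_ids(expression):
    """Split an expression into identifiers, ignoring operators and parentheses."""
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    return [token for token in tokens if token not in OPERATORS]


def eligible_for_training(repo_opted_in, source_text):
    """Apply the three rules: opt-in, SPDX header present, approved licenses only."""
    if not repo_opted_in:
        return False
    expression = spdx_expression(source_text)
    if expression is None:
        return False
    ids = license_ids(expression)
    # Conservative: every identifier mentioned must be on the allowlist.
    return bool(ids) and all(identifier in APPROVED_LICENSES for identifier in ids)
```

A file starting with `// SPDX-License-Identifier: MIT` in an opted-in repo passes; the same file in a repo that never opted in, a file with no header, or a file tagged `LicenseRef-Qt-Commercial OR GPL-3.0-only` does not.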

gjadi · on June 23, 2022

Yes a standard would probably solve the issue.

But until lawyers push for a standard that would make this part of their work irrelevant, I can't see how it could happen :)

mnd999 · on June 23, 2022

And that is why this project should never have made it past the brainstorming session.

bayindirh · on June 23, 2022

There was a tweet by Nora Tindall (since deleted) with a screenshot of an email directly from GitHub stating that GPL code is included in Copilot's training data and that Copilot will indeed use it.