One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation would be the answer to the following question: Was Copilot trained using Microsoft internal source code, or will it be in the future?
As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will it start spitting out Windows kernel code once in a while? :-)
And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on? IANAL...
Alternatively, wait for co-pilot to add support for C++, then start writing an operating system with a Win32-compatible API using co-pilot.
There is plenty of leaked Windows source code on Github, so chances are that co-pilot would give quite good suggestions for implementing a Win32-compatible kernel. Then watch and see if Microsoft will try to argue that you are violating their copyright using code generated by their AI.
For example, the AI tool that Microsoft's lawyers use ("Co-Counsel") will be filing the DMCA notices and subsequent lawsuits against Co-Pilot generated code.
This will result in a massive caseload for the courts, so naturally they'll turn to their AI tool ("DocketPlus Pro") to adjudicate all the cases.
Only thing left is to enter these AI-generated judgements into Ethereum smart contracts. Then it's just computers suing other computers, and being ordered to send the fruits of their hashing to one another.
Have you read Accelerando by 'cstross? It plays out kind of like this, only taken to a tangent. Notably, it was written before Ethereum or Bitcoin were conceived. Great storyline.
Don't forget settlements paid in AI-generated crypto-currencies backed by gold mined in a fully automated Australian mine. Run it all on solar and humans can just fuck right off.
Who could have predicted machines would be very good at multitasking? As of today they are STILL writing code AND creating more wealth through gold hoarding AND smart contracts at the same time!
Somehow, I find this a plausible and not entirely undesirable outcome for society.
The less time humans spend interfacing with machines, the more points humanity gets anyway.
The nice thing about co-pilot is that it will suggest the same mistakes as in other software. If you accept all autosuggestions in C++ you might end up with Windows.
This is such a ridiculous statement to me. If this were a real problem we would have noticed by now with stackoverflow. I truly believe the vast majority of capable developers read, understand and test code they copy from somewhere. This is even more obvious with an AI that will never suggest 100% correct code all the time.
Without weighing in on the overall question of “is this a license violation”, you’ve created a false dichotomy.
“GitHub included Microsoft proprietary code in the training set because they view the results as non-derivative” and “GitHub didn’t include Microsoft proprietary code because they view the results as derivative” are clearly not the only options. They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
Or: they used the entire open source corpus because they thought it was free for the taking, and when people point out that it is not (that there are licenses), they spin that (claiming that only 0.1% of output is directly copied, which would still mean 100 lines in a 100k-line program) and pass any risk onto the user (saying it is the user's responsibility to vet any code they produce). So they aren't saying that users are in the clear, just that it isn't their problem.
Use neural indexes to find the code that most closely matches the output. Explainable AI should be able to tell you where the autocompletion results came from, even if it is a weighted set of files.
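A minimal sketch of what such an index could look like, with a toy bag-of-tokens embedding standing in for a real learned code embedding (everything here is hypothetical, just to make the idea concrete):

    # Toy "neural index": embed training snippets, then rank them by cosine
    # similarity to a generated completion. A real system would use a learned
    # code embedding instead of this bag-of-tokens stand-in.
    import numpy as np

    def embed(snippet, vocab):
        vec = np.zeros(len(vocab))
        for tok in snippet.split():
            if tok in vocab:
                vec[vocab[tok]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    training_snippets = [
        "for i in range(n): total += values[i]",
        "while node is not None: node = node.next",
        "return sorted(items, key=lambda x: x.score)",
    ]
    tokens = sorted({t for s in training_snippets for t in s.split()})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    index = np.stack([embed(s, vocab) for s in training_snippets])

    completion = "for j in range(count): total += data[j]"
    scores = index @ embed(completion, vocab)
    best = int(np.argmax(scores))
    print("closest training snippet:", training_snippets[best])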
That's a good idea in theory, but the smarter the agent gets, the less direct the derivation and the harder to explain it (and to check the explanation). We're already a long way from a nearest-neighbor model.
Yet the equivalent problem for humans gets addressed by the clean-room approach. This seems unfair.
Yeah, also in principle. But the cleanroom approach isn't technically required for humans either -- it became standard because the legal notion of a derived work is very fuzzy and gradually changing, and lawsuits are expensive and chancy, so you want a process that's provably not infringing. "Yeah I learned some general ideas from this code, but I didn't derive any of my code from theirs" seems to be a logical rats-nest. With the explainable-AI approach to this particular problem, the more intelligent the AI, the more this solution is like analyzing brain scans of your engineers. If your engineers could have produced "derived work" without literal copying, why can't the AI?
I agree, but we aren't anywhere close to that level yet. For that to be true, I think the AI should possibly have the ability to explain the code it created. What we have now is basically a fancy markov adlib code completion tool.
A more intelligent agent should be able to tell you where it learned all of its knowledge from. I personally would like my AI to be above "gut level instincts" otherwise it reinforces blind trust.
>saying it is their responsibility to vet any code they produce
But, if some of the code produced is covered by copyright, isn't Microsoft in trouble for distributing software that distributes copyrighted code without a license? How would it be different from giving out bootleg DVDs and trying to avoid blame by reminding everyone that the recipients don't own the copyright?
this complicated copyright problem shows we're still using last century concepts on new and emerging technology that surpassed it; it's time to think hard about it because we need neural nets and they need training data
Some are more equal than others though, aren't they? I mean, if MS throws out licensed code from others, as if to say: "Ahh, software licensing, such an outdated concept ..." but then keeps its own code out of that loop. "Yeah, but that's our own code, no one is allowed to copy that!"
I doubt they will corner the market for AI code assistants. ML models are replicated or surpassed in a few months by the competition. We will all benefit from them, it won't remain concentrated in a few hands.
Also, "0.1% of output is directly copied" doesn't include the lines where the variable names were slightly changed, but the code was still copied.
If you got the Microsoft codebase and Ctrl+F'd all the variable names and renamed them, I bet they would still argue that the compiled program was still a copy.
How is designing a very large system even close to the same thing as writing a few small functions? That's like saying an architect designing a building is doing the same thing as a brick layer putting down cement.
“””Computers can already author documents at near human quality. Research is continuing to increase the accuracy and volume of these models.
Language processing research will not only help doctors, but will allow machine-based language translation, and eventually automated chat bots that can converse in our languages.
The next steps in human-machine collaboration are to allow people and machines to co-create. A recent Chinese report suggests that 50% of scientific papers in this field will be written without human intervention by 2033, compared with only 11% today.
One of the biggest challenges of machine learning is giving the machine what it lacks. This usually means gaining enough training data to teach the algorithm how to make inferences from data points it has never encountered before.
Many of the large organisations involved in advancing AI's ability to develop documents can improve how the algorithms learn by building on the knowledge and experience of human workers.”””
The above text was automatically written by
https://app.inferkit.com/demo . It uses a language model to predict the next word in a sequence. In other words, to use your example, it not only architects, but builds, the entire building simply by predicting where to put the next brick.
So to answer your question: Yes. That’s exactly how it’s done.
And such a thing has never been achieved with code. Besides, very often the texts such an AI creates are nonsensical. And they are very short. Writing a few pages of text would be equivalent to a small tool of a few hundred lines. Or about the same as building a wooden shed. You don't need much skill for that. Come back when an AI can write multiple internally consistent books such as LOTR and the Silmarillion or the Harry Potter series. That's the scale of architecting a system.
True, but I also think this is showing a lack of imagination about where things are going.
You're trying to say architecting is some big woo idea that's somehow different from writing code. Kind of, maybe. But I bet you could build a functional kernel without central design. Given that's how biological systems work, I'm sure it could be done. Then what say you?
> They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
They don't claim they used an “open source corpus” but “public code”, because such use is “fair use” and not subject to the exclusive rights under copyright.
> One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation
They don't claim it wouldn't be a license violation, they claim licensing is irrelevant because copyright protection doesn't apply.
> And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on?
No, that would just show that they don't want to expose their proprietary code. It doesn't prove anything about derivative works.
Also, their own claim is not that the results aren't a derivative work but that training an AI is fair use, which is an exception to the exclusive rights under copyright, including the exclusive right to create derivative works.
Late to the thread but: OpenAI has not been a non-profit since 2019 (technically they call it a capped-profit company [1], but until the singularity you can ignore the cap). I guess this does impact the dynamic with Microsoft.
That's an interesting employment detail, but what does it have to do with the other parts of the organization? I happen to know that they work together on security and contract areas, and it wouldn't surprise me if there were other similar arrangements in place.
It's not clear to me that verbatim would be the only issue. It might produce lines that are similar, but not identical.
The underlying question is whether the output is a derivative work of the training set. Sidestepping similar issues is why GCC and LLVM have compiler exemptions in their respective licenses.
If simple snippet similarity is enough to trigger the GPL copyright defense I think it goes too far. Seems like GPL has become an obstacle to invention. I learned to run away when I see it.
It's not limited to similar or identical code. The issue applies to anything 'derived' from copyrighted code. The issue is simply most visible with similar or identical code.
If you have code from an independent origin, this issue doesn't apply. That's how clean room designs bypass copyright. Similarly if the upstream code waives its copyright in certain types of derived works (compiler/runtime exemptions), it doesn't apply.
So if you work on an open source project and learn some techniques from it, and then in your day job you use a similar technique, is that a copyright violation?
Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
If so you should only ever read BSD code, not GPL.
> Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
It seems to me that some people believe it does. Some of the "clean room" projects specifically instructed developers to not even look at GPL code. Specific examples not at hand.
Microsoft appears to believe this (or maybe just MacBU) because I've met employees who tell me they're not allowed to read any public code including Stack Overflow answers.
If that's the case then GPL code should not have been used in the training set. OpenAI should have learned to run away when they saw it. The GPL is purposely designed to protect user freedom (it does not care about any special developer freedom), which is its biggest advantage.
It wasn't trained on internal Microsoft code because the training set is publicly available code. It has nothing to do with whether or not it suggests exactly identical, functionally identical, or similar code. MS internal isn't publicly available. Copilot is trained on publicly available code.
You stated a fact "Copilot is trained on publicly available code".
The question (and implication) is: why not train it on MS internal code, if the claim that the output isn't license-incompatible is true.
If the output doesn't conflict with any open-source license (i.e. it springs into existence from general principles, not from "copying" licensed code), then MS-internal (in fact, any closed-source) code should be open season.
I can imagine a few of the non-obvious segments of code I've written being "recognizable" methods to solve certain problems. And, they are certainly licensed (GPL + Commercial, in my case).
I think, at the very least, that a set of AIs should be trained on different compatible sets of code, eg. GPL, AGPL, BSD, etc. Then, you could select what amount of license-overlap is compatible with your project.
Honestly I think a large part of the value add of machine learning is going to be the ability for huge entities to launder intellectual property violations.
As an example, my grandfather (an old school EE who got his start on radar systems in the 50s, who then got his radiology MD when my jewish grandmother berated him enough with "engineer's not doctor though...") has some really cool patents around highlighting interesting parts of the frequency domain in MRIs that should make detection of cancer a whole lot easier. As an implementation he did a bunch of tensor calculus by hand to extract and highlight those features because he's an incredibly smart old school EE with 70 years experience cranking that kind of thing out with only his trusty slide rule. He hasn't gotten any uptake from MRI manufacturers, but they're all suddenly really into recurrent machine learning models to highlight the same sorts of stuff. Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
I'm personally pretty anti intellectual property (at least how it's implemented in the states), but a system where large entities that have the capital investment to compute the large ML models can launder IP violations, but little guys get stuck to the letter of the law certainly seems like the worst of both worlds to me.
I don't understand how your example relates to "launder intellectual property violations". What you're saying is that your grandfather hand wrote some feature extractors that look similar to the neurons that ML models have learned from backpropagation. There's no stealing of IP there at all.
He has a set of patents on certain types of highlighting frequency domain patterns in MRIs. In a lot of ways recurrent neural networks can be frequency domain feature extractors as the backwards data flows create sort of delay line memories tapped at interesting periods. The MRI manufacturers after refusing to license his patents, heavily invested in ML models that focus on using recurrent networks for frequency domain feature extraction. Patents aren't like copyright where independent reinvention is a way out; he has a monopoly on the concepts regardless of how someone came about them, even by growing them a bit organically like how ML works.
You can't patent a concept. You can only patent a process, a machine, an article of manufacture, or a composition of matter. And the invention must be described sufficiently such that a practitioner skilled in the relevant art can reproduce the subject matter.
You can patent a software process or machine running classes of software in the US as long as it doesn't conflict with the Alice Corp. test, which is analogous to what I mean by concept in this case. And his methods are extremely well documented in the patents so that they can be reproduced. I guess if someone manually cranked through the math for each video's set of pixels they wouldn't be infringing, but oncologists aren't really ok with waiting a year for results. Any practical implementation would be infringing.
And like I said, I'm pretty anti US structures around intellectual property (including software patents), but I'm not for the only ones being able to circumvent the legal process being entities with large banks of capital.
> Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
How many models are back-propagated first and then hand-tuned?
That's a great question. I had assumed that the workflow of an ML engineer consisted of managing the data and a relatively high level set of parameters around a search space of layers and connectivity, as the whole shtick of ML is that the hyperparameter space of the tensors themselves is too complex to grok or tweak when generated from training. But I only have a passing knowledge of the subject, pretty much just enough to get myself in trouble in these kinds of discussions.
Any chance some fantastic HNer could chime in there?
I'm no data scientist but many statistical methods rely on prior knowledge and even computed inputs.
Two examples I can think of: doing linear regression on the square of your input, and, for deep learning, improving visual representations by taking samples of the colors at various frequencies. [1]
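A toy sketch of the first example, just to make the engineered-input idea concrete (the data and coefficients are made up):

    # Linear regression on an engineered feature: we hand the model x**2
    # directly instead of expecting it to discover the quadratic shape.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 2.0 * x**2 + 1.0 + rng.normal(scale=0.5, size=x.size)  # quadratic ground truth

    X = np.column_stack([x**2, np.ones_like(x)])  # prior knowledge baked into the features
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"fitted y ~= {coef[0]:.2f} * x^2 + {coef[1]:.2f}")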
Yeah, that's a better way of saying what I meant by managing the data. Mentally projecting data through, massaging said data, and building reproducible pipelines rather than manually tweaking the learned weights after the fact.
I don't see the point of this tool, independent of the resulting code being derivative of GPL code or not.
being able to produce valid code is not the bottleneck of any developer effort. no projects fail because code can't be typed quickly enough.
the bottleneck is understanding how the code works, how to design things correctly, how to make changes in accordance with the existing design, how to troubleshoot existing code, etc.
this tool doesn't make anything any easier! it makes things harder, because now you have running software that was written by no one and is understood by no one.
It doesn't claim to solve the bottleneck either. On the contrary, it clearly states that its mission is to solve the easy parts better so developers can focus on the truly challenging engineering problems you mentioned.
This reminds me of a startup pitch where it’s always “oh we take care of x so you don’t have to,” but the problem is now I just have another thing to take care of. I cannot speak for people who use Copilot “fluently,” but I know for every chunk of code it spat out I would need to read every line and make sure “Is this right? Is the return type what I want? Will this loop terminate? Is ‘scan’ the right API? Is that string formatted properly? Can I optimize this?” etc. To me it’s hardly “solving the easy parts,” but rather putting the passenger’s hands on the wheel.
if it doesn't claim to help any code production bottlenecks, then what good is it? it's just piping in code that may or may not contain a subtle bug or three.
That doesn't help anyone!!
I am usually pretty pro-Microsoft, but this tool is a security nightmare and a bad idea all around. It will cause many (most? all?) who use it far more work than it saves them, long-term.
To me that - and really any form of common boilerplate - is just evidence that we're lacking abstractions. If your editor is generating code for you, that means that the 'real' programming language you're using 'in your head' has some metaprogramming facilities emulated by your IDE.
I think we should strive to improve our programming languages to make less of this boilerplate necessary, not to make generating boilerplate easier. The latter is just going to make software more and more unwieldy. Imagine the horror if, instead of (relatively) higher level programming languages like C, we were all just using assembly with code generation.
In a very real sense, we are all just using assembly with code generation.
I really like your point on symptoms of insufficient abstraction. I do worry that we always see abstraction as belonging in language. Which in turn we treat as a precious singleton, and fight about.
At least in my own hacking, I'm surprised how infrequently I see programmers write programs that write programs. I'm surprised how infrequently I see programmers programming their shell, editor, or IDE.
Completely agree. If anything, I see tools like this actually decreasing engineering speed. I don't see how it doesn't lead to shipping large quantities of code the team didn't vet carefully, which is a recipe for subtle and hard-to-find bugs. Those kinds of bugs are much more expensive to find and squash.
What we really need aren't tools that help us write code faster, but tools that help us understand the design of our systems and the interaction complexity of that design.
Have to fully agree; just seems like a "cool" tool where if you had to actually use it for real world projects, it's going to slow you down significantly, and you'll only admit it once the honeymoon period is over.
What happens when someone puts code up on GitHub with a license that says "This code may not be used for training a code generation model"?
- Is GitHub actually going to pay any attention to that, or are they just going to ingest the code and thus violate its license anyway?
- If they go ahead and violate the code's license, what are the legal repercussions for the resulting model? Can a model be "un-trained" from a particular piece of code, or would the whole thing need to be thrown out?
By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS[1].
That effectively “overrides” any license or term that you’ve specified for your repository, since you’ve already licensed the content to GitHub under different terms. Of course, people who are not GitHub are beholden to the terms you specify.
> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
But, it goes on to say:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
I'm not a lawyer, but it seems ambiguous to me if this ToS is sufficient to cover CoPilot's butt in corner cases; I bet at least one lawyer is going to make some money trying to answer the question.
IANAL, but I wouldn't read that as granting GitHub the right to do anything like this. There's definitely a reasonable argument to be had here, but I think limiting the grant of rights to incidental copies should trump "[...] or otherwise analyze it on our servers" and what they're allowed to do with the results of that analysis.
On the extreme end, "analysis" is so broad that it could arguably cover breaking down a file of code into its constituent methods and just saving the ASTs of those methods verbatim for Copilot to regurgitate. That's obviously not an acceptable outcome of these terms per se, but arguably isn't any different in principle from what they're already doing.
Ultimately, as I understand, courts tend to prefer a common sense outcome based on a reasonable human understanding of the law, rather than an outcome that may be defensible through some arcane technical logic but is absurd on its face and counter to the intent of the law. If a party were harmed by an instance of Copilot-generated copyright infringement, I don't see a court siding with this tenuous interpretation of the ToS over the explicit terms of the source code license. On the other hand, it would probably also be impossible to prove damages without something like a case of verbatim reproduction, similarly to how having a developer move from working on proprietary code for one company to another isn't automatically copyright infringement.
I doubt that GitHub is doing anything as blatantly malicious as copying snippets of (GPL or proprietary) code to explicitly reuse verbatim, but if they're learning from license-restricted code at all then I don't see how they wouldn't be subjecting themselves and/or consumers of Copilot to the same risk.
Why are developers so myopic around big tech? Of course they can. Facebook can use your private photos. It's in their terms of service. Cloud providers have more generous terms.
The response has always been they won't do that because they have a reputation to manage. The further they grow the further they control the narrative so the less this matters.
Wait until you find out they sell your data or use your data to sell products.
Why in 2021 are we giving Microsoft all of our code? It seems like the 90s, 2000s never happened and we all trust microsoft. They have a free editor and a free operating system that sends packets of activity the user does back to microsoft but that's okay.. we want to help improve their products? We trust them.
Of course. A "private" repo is still on their servers. It's only private from other GitHub users, not the actual site administrators. This is the same in any website, of course the admins can see everything. If you truly want privacy, use your own git servers.
Why do you think people care so much about end-to-end encrypted messaging?
Yes, the concept of a "private" repo is enforced only by GitHub's service. A bug in their auth code could lead to others having access. A warrant could lead to others having access. Etc.
yes, that's what that specific section means, but as always with these documents you can't just extract a single section, you need to take the document as a whole (and usually, more than one document - ToS privacy policy are usually different)
these documents are structured as granting the service provider extremely broad rights, and then the rest of the document takes away portions of those rights. so in this case they claim the right to share any code in any repo with anyone, and then somewhere else they specify which code they won't share, and with whom they won't share it.
Fun fact: Every major cloud provider has a similar blanket term. For example, Google doesn't need to license music to use for promotional content, because YouTube's terms grant them a worldwide license to use uploaded content for purposes including promoting their services, and music labels can't afford to not be on YouTube. (It's probable even uploading content to protect it, as in Content ID, would arguably cause this term to apply.)
It all comes down to the nuance of whether the usage counts as part of protecting or improving (or promoting) their services and what other terms are specified.
The use of the definition Your Content may make GitHub's own ToS legally invalid in a large number of cases as it implies that the uploader must be the sole author and "owner" of the code being uploaded.
From the definitions section in the same doc:
> "Your Content" is Content that you create or own.
That will definitely exclude any mirrored open-source projects, any open-source project that has ever migrated to Github from another platform, and also many forked projects.
How is this different from uploading a hollywood movie to youtube? Just because there is a passage in the terms that the uploader supposedly gave them those rights, this does not mean they actually have the power to do that.
You can't give Github or Youtube or anybody else copyright rights if you don't have them in the first place. This is what ultimately torpedoed "Happy Birthday" copyright claims: while it's pretty undisputed that the Hill sisters gave their copyright to (ultimately) Warner/Chapelle, it was the case that they actually didn't invent the lyrics, and thus Warner/Chapelle had no copyright over the lyrics.
So if someone uploads a Hollywood movie to Youtube, Youtube doesn't get the rights to play that movie from them because they didn't have the rights in the first place. Of course, if the actual copyright owner uploads it, it's now permissible for Youtube to play it, even if it's the copy that someone else provided. [This has torpedoed a few filesharing lawsuits.]
Not sure how much it would matter but the main difference I see is that if I upload my own code to GitHub I have the ability to give away the IP, but if I upload Avengers End Game to YouTube I don't have the right to give that away.
I wonder how it would work if we consider you flagged your code as GPL before it hits Github.
We could end up in the same situation as the Hollywood movie even if you are also the one setting the original license on the work. Basically you have a right to change the license, but it doesn’t mean you do.
> By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS.
That's nonsense because they could claim that for almost any reason.
E.g. assume Google put the source code of Google search in Github. Then Github copies that code and uses it in their own search, since that "improves the service". Would that be legal?
It's like selling a pen and claiming the rights to anything written with it.
If the pen was sold with a contract that said the seller has the rights to anything written with it, then yes. These types of contracts are actually quite common, for example an employment contract will almost certainly include an IP grant clause. Pretty much any website that hosts user-generated content as well. IANAL, but quite familiar with business law.
I rather suspect judges would not see "improving the Service over time" as permission to create derivative works without compensation.
The person uploading files to github is also not necessarily doing so with permission from the rights holder, which might be a violation of the terms of service, but would mean there's no agreement in place.
I sort of doubt that GitHub could include GPL code in a piece of closed-source program that they distribute that "improves the service" and claim that this gives them the right.
I would bet this is about as applicable as the Facebook posts of my parents' friends, something like: 'All my content on this page is mine alone and I expressly forbid Facebook Inc. usage of it for any purpose.'
It's not binding because the other party hasn't agreed. You agree to terms when you use the site. One party can't unilaterally change the agreement without consent of the other party.
I see where you're coming from but it's not quite the same thing; Facebook doesn't encourage people to choose a license for the content that they post there, so there's no expectation that there are any terms aside from those in Facebook's ToS. OTOH GitHub has historically very strongly encouraged users to add a LICENSE to their repositories, and also encouraged users to fork other people's code and and push it to GitHub. That GitHub would be exempt from the licensing terms of the code pushed to it, except for the obvious minimal extent they might need to be in order to provide their services, seems like an extremely surprising interpretation.
It has nothing to do with GitHub being exempt from anything. It's that users are bound by the terms they agreed to in a ToS. If there is a conflict between a user-created license and a site's ToS, the burden is on the user to resolve it.
To be clear, I'm not suggesting this is some kind of loophole GitHub is using to trample on users' licenses, even though maybe they could. It's probably completely legal for GitHub to use even the most super-extra-double-GPL-licensed code because copyright law allows it.
The author of the Twitter post's suggestion that Copilot's output must be a derivative work is based on a naive understanding of "derivative" as it's defined in copyright law. It's not hard to find clear explanations of how this stuff works, and it's obvious she didn't bother to do any homework. Several criteria would appear to rule out GitHub's use as infringement. e.g.:
'In essence, the comparison is an ad hoc determination of whether the protectable elements of the original program that are contained in the second work are significant or important parts of the original program.'
Why not? If it does not exist you treat it as proprietary (copyrighted by default) and if it does exist at least the author claims that the given license is an option (possibly their mistake, not mine)
Also, how would you know if your code was included in the training or not?
Then, let’s say the AI generates some new code for someone, and it is nearly identical to some bit of code that you wrote in your project.
If they didn’t use your code in the model, then the generated code is clearly not a copyright violation, since it was effectively a “clean room” recreation.
If your code was included in the model, is it therefore a violation?
But then again, it comes down to how can someone prove their code was included or not?
What if the creators don’t even know? If you wrote your model to say, randomly grab 50% of all public repos to use in the model, then no one would know if a specific repo was used in the training.
They "just" have to comply with all the licenses for all the code that the program was trained on?
I suppose that for most open source licences this at the very least involves attribution for all the people that produced the code that the program was trained on?
I post my code publicly, but with an "all rights reserved" licence. I don't mind everyone reading my code freely, but you can't use it for anything but learning. If I found out they were ingesting my code I would be angry. It's like training your replacement. I don't use GitHub, anyways, but now I'll definitely never even think about it.
Technically then I'm infringing as soon as I clone your repo, possibly even as soon as a webserver sends your files to my browser.
"All rights reserved" makes sense on final items, like books or physical records, that require no copy or change after owner-approved manufacturing has taken place. It doesn't really make sense on digital artefacts.
So don't clone it, read it online. I reserve all rights, but I do give license to my host to make a "copy" to let you view it. I do that specifically to prevent non-biological entities like corporations or AI from using my code. If you're a biological entity, I specify you can email me to get a license for my code for a specific, defined purpose. I have a conversation with that person, then I send them a record number and the terms of my license for them in which I grant some rights which I had reserved.
Also, in your example, the copyright for the book or dvd is for the content, not the physical item. You can do anything you want with that item but not the content. My code is similar, I'm licensing my provider to serve you a visual representation of the files so you can experience the content, not giving you a license to run that code or use it otherwise.
> possibly even as soon as a webserver sends your files to my browser.
Considering how it works for personal data with the GDPR, I doubt that this is even needed?
Also copyright is something you have by default, no licence terms necessary.
OTOH if they aren't a human, then copyright barely applies to them anyway (consider search engine crawlers indexing your website for instance), and I don't think that putting up a notice will legally change anything?
(You'll probably have better luck with robots.txt ...)
If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work, especially if that section of code was somehow novel or specific to a narrow domain, I suspect they'd have a case. I don't know much about OpenAI Codex, but if it's anything like GPT-3, or uses it under the hood, then it's very likely that certain sequences are simply memorized, which seems like the maximal case for claiming derivative works. On the other hand, if someone has GPL'd code that implements a simple counter, I doubt the courts would pay much attention.
I do wonder, though, if GPL owners worried about their code being shanghaied for this purpose could file arbitration claims and exploit some particularly consumer-friendly laws in California which force companies to pay fees like when free speech dissidents filed arbitrations against Patreon.[0] Patreon is being forced to arbitrate 72 claims individually (per its own terms) and pay all fees per JAMS rules. IANAL, so I don't know the exact contours of these rules, or if copyright claims could be raised in this way, or even if GitHub's agreements are vulnerable to this loophole, but it'd be interesting.
You don't need to have a winnable case, just enough of a case for a large company (hello Oracle) to sue a small one. Is any version of Oracle-owned Java in the corpus? Or any of the DBs they bought (MySQL)?
> If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work...
Under the right circumstances, Copilot will recite a GPL copyright header. It isn't a huge step from that to some other commonly repeated hunk of GPLed code -- I'd be particularly curious whether some protected portion of automake/autoconf code shows up often enough that it'd repeat that too.
But what would we think of the legal start-up that automatically checked all of GitHub to see whether the AI could be persuaded to spit out a significant amount of any project's code verbatim?
It doesn't matter, Copilot isn't human, therefore it isn't considered as an author, and therefore cannot do derivative works.
The issue is with the users of Copilot potentially violating copyright and licences (non-attribution for instance) and with Microsoft facilitating it. (See also : A&M Records, Inc. v. Napster, Inc.)
An interesting impact of this discussion is, for me: within my team at work, we're likely to forbid any use of Github co-pilot for our codebase, unless we can get a formal guarantee from Github that the generated code is actually valid for us to use.
By the way, code generated by Github co-pilot is likely incompatible with Microsoft's Contribution License Agreement [1]: "You represent that each of Your Submission is entirely Your original work".
This means that, for most open-source projects, code generated by Github co-pilot is, right now, NOT acceptable in the project.
I'd say that it depends on the license; for StackOverflow, it's CC-BY-SA 4.0 [1]. For sample code, that would depend on the license of the original documentation.
My point is: when I'm copying code from a source with an explicit license, I know whether I'm allowed to copy it. If I pick code from co-pilot, I have no idea (until tested by law in my jurisdiction) whether said code is public domain, AGPL, proprietary, infringing on some company's copyright.
> forbid any use of Github co-pilot for our codebase,
I have recommended as such to the CTO and other senior engineers at the startup I work at, pending some clear legal guidance about the specific licensing.
My casual read of Copilot suggests that certain outputs would be clear and visible derivatives of GPL code, which would be _very bad_ in court- probably? Some other company can have fun in court and make case law. We have stuff to build.
I'm not sure why I'm getting down voted?
"We'll forbid the use of copilot in our code base"
How????
How the fuck would anyone know how the code was written?
"We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"
If it's spitting out verbatim code 0.1% of the time, surely it's spitting out copied code where only trivial things are different at a much higher rate.
Trivial things meaning swapped order where order isn't important, variable/function names, equivalent ops like +=1 vs ++, etc.
Surely it's laundering some GPL code, for example, and effectively removing the license in a way that sounds fishy.
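A hypothetical before/after of the kind of "trivially different" copy meant here (both snippets are invented for illustration, not actual Copilot output):

    # "Original" (imagine it came from a licensed repository):
    def rolling_average(samples, window):
        total = 0.0
        for value in samples[-window:]:
            total += value
        return total / window

    # Near-verbatim variant: only identifier names and cosmetic details differ.
    def mean_of_tail(data, n):
        acc = 0.0
        for x in data[-n:]:
            acc = acc + x
        return acc / n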
You certainly did! But there are a lot of people who think "OSS license means there are no requirements" and think it's okay to do things like copy without attribution when the license requires attributions. I know you didn't say anything like that either, but some others might think it.
It seems to me an important question is, "is this like a human who learns from examples, or is this really a derivative work in the copyright sense?". I'm not sure how to answer that. I'm not a lawyer. I don't know if many lawyers can answer that question either!
Copilot isn't human and therefore what it does isn't a "work".
The usual issues still apply to users of Copilot - unwitting violations of license terms of the code it was trained on (like non-attribution) are still violations.
You could say a human is laundering GPL code if they learned programming from looking at Github repositories. Would you, though? The type of model they use isn't retrieving, it's having learned the syntax and the solutions that are used, just like a human would.
> You could say a human is laundering GPL code if they learned programming from looking at Github repositories.
I don't have photographic memory, so I largely don't memorize code. I learn general techniques, and memorize simple facts such as APIs. I can memorize some short snippets of code, but these probably aren't enough to be copyrightable anyway.
> The type of model they use isn't retrieving
How do we know? I think it's very likely that it is largely just retrieving code that it memorized, and doing minor adjustments to make the retrieved pieces fit the context. That wouldn't differ much from finding code that matches the problem (whether on SO or Github), copy-pasting the interesting bits, and fixing it until it satisfies the constraints of the surrounding code. It's impressive that AI can do that, but it doesn't sound like it's producing original code.
I think the alternative to retrieving would actually require a higher level understanding of the world, and the ability to reason from first principles; that would be much closer to AGI.
For example, if I want to implement a linked list, I'm not going to retrieve an implementation from memory (although given that linked lists are so simple, I probably could). I know what a linked list is and how it works, and therefore I can produce working code from scratch... for any programming language, even ones for which no prior implementations exist. I doubt co-pilot has anything remotely as advanced as this ability. No, it's fully reliant on just retrieving and reshaping pieces of memorized code; it needs a large corpus of code to memorize before it can do anything at all.
I don't need a large corpus of examples to copy, because I use my ability to reason in conjunction with some memorized general techniques and common APIs in order to produce original code.
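For what it's worth, here's roughly what I mean by producing a linked list from the concept alone rather than from a memorized corpus (a minimal sketch):

    # A linked list written from the concept alone: nodes that point at the
    # next node, with O(1) prepend.
    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    class LinkedList:
        def __init__(self):
            self.head = None

        def push(self, value):
            # Prepend by making the new node point at the old head.
            self.head = Node(value, self.head)

        def __iter__(self):
            node = self.head
            while node is not None:
                yield node.value
                node = node.next

    lst = LinkedList()
    for v in (3, 2, 1):
        lst.push(v)
    print(list(lst))  # [1, 2, 3]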
gonna develop my own linux-like kernel soon, with my own AI model trained on public repositories
wanna see the source code of my AI model? oh, it's closed source
it's just coincidence that nearly 100% of my future linux-like kernel code looks the same as linux the kernel, bear in mind that my closed-source AI model takes inspiration from GitHub Copilot, there is no way that it will copy any source code
It may be possible to use closed source code during training and delete it, leaving just a black box model that is hard to prove was derived from that closed source code.
You get to make changes without having to respect the GPL and thus no longer obligated to provide those changes to your end users, as you have effectively laundered the kernel source code by passing it through an "AI" and get to relicense the end result.
For years people have warned about hosting the majority of world's open source code in a proprietary platform that belongs to a for profit company. These people were called lunatics, fundamentalists, radicals, conspiracy theorists, and many other names.
Well, they were ignored and this is the result. A for-profit company built a proprietary system using all the code hosted on its platform without respecting the code licenses.
There will be a lot of people saying this is not a license violation but it is, and more, it is an exploitation of other people work.
Right now I'm asking myself when people will stop supporting this kind of company, which exploits people's work without giving anything in return to people and society while making a huge amount of profit.
If you read a book and use the instructions to build a bicycle you are learning a new skill and this is obviously not exploitation of people's work.
When you read a book and copy this book partially or entirely to create a new book or create a derivative work using this book without citation it's called plagiarism and copyright infringement. It is not only exploitation, it is against the law.
If you feed an entire library to an AI to generate new books without source citation and copyright agreements, it is not only exploitation, it is against the law. We can call this automated plagiarism and copyright infringement, and automated or not, it is against the law. Except if you use public domain books: that wouldn't be illegal, but it would be highly unethical, considering there are powerful companies with big pockets bending public-domain laws to keep their assets from becoming publicly available (I'm looking at you, Disney), but that is another story.
I think you are abstracting the matter by taking out the humanity. It's one thing to learn to do something by hand after purchasing the book. It's a totally different thing to read every single book in the world (humanly impossible) and then absorb some knowledge and train yourself to write exceptional books because you (the AI in this scenario) have learned that some words and sentence structures have led to books having higher ratings than others. It's not humanly possible.
Of course we generate the world around us and its rules, but I get angry every time we compare people to machines and say that it's the same thing. No it's not. We are constrained by time and space. I can't add more brain or more eyes to my body so that I can read more books, can I? Microsoft can have a small city of servers somewhere and that could replace lots of people's jobs.
People have trained ML models on code thats on Github before co-pilot. (lots of examples here: https://github.com/src-d/awesome-machine-learning-on-source-...) There's nothing proprietary that other interested people or companies couldn't easily replicate here.
It certainly seems to be a laundering enabler. Say that you want to un-GPL-ify some famous copylefted code that is in the training database. You type the first few innocuous characters of it, then the co-pilot keeps completing the rest of the same exact code, for it offers a perfect match. If the completion is not exact, you "twiddle" it a bit until it becomes exact. Bang! You have a non-GPL copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!
> Bang! you have a non-gpl copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!
Thinking that this would conveniently bypass the fact that your goal was to copy the code seems to be the most common legal fallacy amongst software developers. The law will see straight through you, and you will be found to have infringed copyright. The reason is well explained in "What Colour are your bits?" (https://ansuz.sooke.bc.ca/entry/23).
My message was sarcastic. I'm worried about accidental conversion of free software into proprietary. I mean, "accidental" locally, in each particular instance; but maybe non accidental in the grand scheme of things.
EDIT: I can write my worry, semi-jokingly, as a conspiracy theory: Microsoft is using thousands of unsuspecting (and unwilling) developers to turn a huge copylefted corpus of algorithms into non-copylefted implementations. Even assuming that developers that use the co-pilot use non-copyleft licenses only 50% of the time, there's still a constant trickle of un-copyleftization.
I suppose someone should make an OS-generating AI; conceptually it can just have Windows, OS X and some Linux distros in it and output one based on a question about favorite color or something.
You'd just have to wrap it in a nice complex model representation so it's a black box you fed example OS's with some meta-data into and it happens to output this very useful data.
After all, once you use something as input to a machine learning model apparently the license disappears. Sweet.
* Someone uses copilot to generate a Windows clone and starts selling it
I wonder how Microsoft would react to that. I wonder if they've manually blacklisted leaked source code from Windows (or other Microsoft products) so that it doesn't show up in Copilot's training data. If they have, that means Microsoft recognizes the IP risks of having your code in that data set, which would make this Copilot thing not just the result of poor planning/maybe a little incompetence, but something much more devious and malicious.
If Microsoft is going to defend this project, they should introduce all of their own source code into the training data.
why do you think it has to be source code? it could be the compiled code after all.
If what we're talking / fantasizing about here works in the way of `let x = 42`, it should work equally well with `loda 42` etc., so source code be damned. It was only ever an intermediate step, inserted between the idea and the working bits, to enable humans to helpfully interfere. Dispensable.
Come on, there is a huge gap between 1) writing a single function (potentially incorrectly) with a known prototype/interface and a description and 2) designing interfaces, datatypes and APIs themselves.
You probably won't get an entire operating system out of it, but I could totally see a project like Wine using it to implement missing parts of the Win32 API and improve their existing implementations.
That's what I was wondering. I've never been interested enough to steal anyone else's code, but with all the code transformers and processing tools nowadays, I imagine it's trivial to translate source code into a functionally equivalent but stylistically unique version?
Assuming ML models are causal, then bits of GPL code that fall out of the model have to have the color GPL, because the only way they could've gotten there was to train the ML using GPL-colored bits. It seems to me like the answer here is pretty obvious, it doesn't really matter how you copy a work.
I don’t think most of us are scared enough of being “tainted” by the sight of a GPL snippet that we’d bother. Besides, if you want to target a specific snippet so you can type the start to prime the recognition - you already saw it?
Why not just copy it and then edit it? If a snippet is changed both logically and syntactically to not resemble the original, then it’s no longer the original and you aren’t in any licensing trouble. There is no meaningful difference between that manual washing and a clean room implementation.
All the ML changes here is the accidental vs deliberate. But it will be a worse wash than your manual one.
Yes this is a concern, but I'm not sure if the AI is actually able to "generate" a non-trivial piece of code.
If you tell it to generate "a function for calculating the barycentric coordinates of a ray-triangle intersection", you might get a working implementation of a popular algorithm, adapted to your language and existing class/function/variable names.
But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.
And if that's true, it means that copilot isn't really generating anything. It's just a (high-tech) search engine that knows how to adapt the code it finds to fit your codebase. That's still a really cool technology and worth exploring, but it doesn't do enough to justify ignoring software licenses.
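For concreteness, the "popular algorithm" such a prompt would most likely surface for barycentric ray-triangle coordinates is something like Möller-Trumbore; a minimal sketch (written here for illustration, not taken from Copilot):

    # Möller-Trumbore: returns (t, u, v) where u and v are the barycentric
    # weights of the second and third vertices, or None if the ray misses.
    import numpy as np

    def ray_triangle_barycentric(origin, direction, v0, v1, v2, eps=1e-9):
        e1, e2 = v1 - v0, v2 - v0
        p = np.cross(direction, e2)
        det = np.dot(e1, p)
        if abs(det) < eps:           # ray is parallel to the triangle plane
            return None
        inv_det = 1.0 / det
        s = origin - v0
        u = np.dot(s, p) * inv_det
        if u < 0.0 or u > 1.0:
            return None
        q = np.cross(s, e1)
        v = np.dot(direction, q) * inv_det
        if v < 0.0 or u + v > 1.0:
            return None
        t = np.dot(e2, q) * inv_det  # distance along the ray to the hit point
        return t, u, v

    hit = ray_triangle_barycentric(
        np.array([0.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0]),
        np.array([-1.0, -1.0, 0.0]), np.array([1.0, -1.0, 0.0]), np.array([0.0, 1.0, 0.0]),
    )
    print(hit)  # t=1.0, u=0.25, v=0.5 for this ray and triangle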
>But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.
But since APIs are now unprotected, you could feed it all of the class structure and method signatures and have it fill in the blanks. I don't know if that gets you a working operating system, but it seems like it would get you quite a long way.
The second tweet in the thread seems badly off the mark in its understanding of copyright law.
> copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
Copyright law is very complicated (remember Google vs Oracle?) and involves a lot of balancing different factors [0]. Simply saying that something is a "derivative work" doesn't establish that it's copyright infringement. An important defense against infringement claims is arguing that the work is "transformative." Obviously "transformative" is a subjective term, but one example is the Supreme Court determining that Google copying Java's API's to a different platform is transformative [1]. There are a lot of other really interesting examples out there [2] involving things like if parodies are fair use (yes) or if satires are fair use (not necessarily). But one way or another, it's hard for me to believe that taking static code and using it to build a code-generating AI wouldn't meet that standard.
As I said, though, copyright law is really complicated, and I'm certainly not a lawyer. I'm sure someone out there could make an argument that Copilot is copyright infringement, but this thread isn't that argument.
Edit: Note that the other comments saying "I'm just going to wrap an entire operating system in 'AI' to do an end run around copyright" are proposing to do something that wouldn't be transformative and therefore probably wouldn't be fair use. Copyright law has a lot of shades of grey and balancing of factors that make it a lot less "hackable" than those of us who live in the world of code might imagine.
Google copied an interface (declarative), not code snippets/functions (implementation).
Copilot is capable of copying only implementation. IMO that is quite different, and easily a violation if it was copied verbatim.
I think the core argument has much more to do about plagiarism than learning.
Sure, if I use some code as inspiration for solving a problem at work, that seems fine.
But if I copy verbatim some licensed code then put it in my commercial product, that's the issue.
It's a lot easier to imagine for other applications like generating music. If I trained a music model on publicly available Youtube music videos, then my model generates music identical to Interstellar Love by The Avalanches and I use the "generated" music in my product, that's clearly a use that is against the intent of the law.
Many behaviors which are healthy and beneficial at human-level scale can easily become unhealthy and unethical at industrial automation scale. There's little universal harm in cutting down a tree for fire during the winter; there is significant harm in clear-cutting a forest to do the same for a thousand people.
Exactly. This comes up with personal data protection as well. There's no problem in me jotting down my acquaintances' names, phone numbers, and addresses and storing it in my computer. But a computer system that stores thousands of names, phone numbers, and addresses must get consent to do so.
The AI doesn't produce its own code or learn, it is just a search engine on existing code. Any result it gives exists in some form in the original dataset. That's why the original dataset needs to be massive in the first place, whereas actual learning uses very little data.
If I read something, "learn" it, and reproduce it word for word (or with trivial edits) even without referencing the original work at all, it is still copyright infringement.
As the original commenter said, you have the capability for abstract thought and generalized learning, which the "AI" lacks.
It is not uncommon to ask a person to "explain in your own words..." - as in, use your own abstract internal representation of the learned concepts to demonstrate that you have developed such an abstract internal concept of the topic, and are not merely regurgitating re-disorganized input snippets.
If you don't understand the difference...
edit: That said, if you can create a computer capable of such different abstract thought, congratulations, you've solved the problem of Artificial General Intelligence, and will be welcomed to the Trillionaires' Club
The AI most certainly does not lack the ability to generalize. Not as well as humans, but generalization is the key interesting result in deep learning, leading to papers like this one: https://arxiv.org/abs/1710.05468
The ability to generalize actually seems to keep increasing with the number of parameters, which is the key interesting result in the GPT-* line of work that Copilot is based on.
I've seen some very clever output from GPT-*, but zero indicating any kind of abstract generalized understanding of any topic in use.
Being able to predict the most likely succeeding string for a given input can be extremely useful. I've even used it with some success as a more sophisticated kind of search engine for some materials science questions.
But I'm under no illusions that it has the first shadow of a hint of minor understanding of the topics of materials science, nevermind any general understanding.
It seems we're discussing different meanings of the word "generalize".
I propose we as developers start a secret society where we let the AI write the code, but we still claim to write it manually. In combination with the new working-from-home policies, we can lie on the beach all day and still be as productive as before.
This would be the demise of the human race. I’m not entirely opposed to that, though. When AI inevitably outperforms humans on almost all tasks, who am I to say humans deserve to be given those tasks?
In this case we should be able to work less and enjoy the benefits of automation. We just need to live in an economic system where the economic value is captured by the people at large, and not a minority that owns capital.
> When AI inevitably outperforms humans on almost all tasks
Correct me if I'm wrong, but is that even possible? I kind of thought that AI is just a set of fancy statistical models that require some (preferably huge) data set in order to infer the best fit. These models can only outperform humans in scenarios where the parameters are well defined.
Many (most?) tasks humans regularly perform don't have clean and well-defined parameters, and there is no AI we can conceive of that is theoretically able to perform the task better than an average human with adequate training.
> Correct me if I’m wrong, but is that even possible?
It's not possible because of comparative advantage - someone being better than you at literally everything isn't enough to stop you from having a job, because they have better things to do than replace you. Plus "being a human" is a task that people can be employed at.
> Correct me if I’m wrong, but is that even possible?
Why should it be impossible? Arguing that it's impossible for an AI to outperform a human on almost all tasks is like arguing that it's impossible for flying machines to outperform birds.
There's nothing magical going on in our heads. It's just a set of chemical gradients and electrical signals that result in us doing or thinking particular things. Why can't we design a computer that does everything we do... only faster?
"Why can't we design a computer that does everything we do... only faster?"
I think the key word in that sentence might be "we". That is, you could hypothesize that while it's possible in principle for such a computer to exist, it might be beyond what humans and human civilization are capable of in this era. I don't know if this is true or not, but it's kind of intuitively plausible that it's difficult for a designer to design something as complex as the designer themselves, and the space of AI we can design is smaller than the space of theoretically conceivable AI.
> it's difficult for a designer to design something as complex as the designer themselves
AlphaGo ... hello? It beat its creators at Go, and a few months later the top players. I don't think supervised learning can ever surpass its creators in generalization capability, but RL can.
The key ingredient is learning in an environment, which is like a "dynamic dataset". Humans discovered science the same way - hypothesis, experiment, conclusion, rinse and repeat, all possible because we had access to the physical environment in all its glory.
It's like the difference between reading all books about swimming (supervised) and having a pool (RL). You learn to actually swim from the water, not the book.
A coding agent's environment is a compiler + cpu, pretty cheap and fast compared to robotics which require expensive hardware and dialogue agents which can't be evaluated outside their training data without humans in the loop. So I have high hopes for its future.
There might be a limit to how efficiently a general-purpose machine can perform a specific task, similar to the Heisenberg uncertainty principle in quantum physics. That is to say, there might be a natural law that dictates that the more generic a machine is, the more power it requires to perform specific tasks. Our brains are kind of specialized. If you want to build a machine that outperforms humans in a single task, no problem, we've done that many times over. But a machine that can outperform us in any task, that might just be impossible.
I'm not arguing that machines will be more efficient than human brains. An airplane isn't more efficient than a goose. But airplanes do fly faster, higher and with more cargo than any flock of geese could ever carry.
Similarly, there is no contradiction between AI being less efficient than a human brain, and AI being preferable to humans because it can deal with data sets that are two or three orders of magnitude too large for any human (or even team of humans).
Even so, such an AI doesn't exist. All the AIs that exist today operate by fitting data. And to be able to perform a useful task, an AI has to have well-defined parameters and fit the data according to them. I'm not sure an AI that operates outside of these confines has even been conceived of.
To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just the tasks which the airplane is optimized for.
I'm sorry, I confused the debate a little by talking about efficiency. My point was that there might be an inverse relation between the generality of a machine and its efficiency. This was my way of providing a mechanism by which building a machine that outperforms humans in any task could be impossible. This mechanism (if it exists) could be sufficient to prevent such machines from being theoretically possible, as at some point you would need all the energy in the universe to perform a task better than a specialized machine (such as an organism).
Perhaps this inverse relationship doesn't exist. The universe might conspire in a million other ways to make it impossible for us to build an AI that will outperform us in any task. The point is that "AI will outperform humans in any task" is far from inevitable.
> All the AIs that exist today operate by fitting data. And to be able to perform a useful task it has to have well defined parameters and fit the data according to them. I’m not sure an AI that operates outside of these confinements have even been conceived of.
Such an AI has absolutely been conceived of. In Superintelligence: Paths, Dangers, Strategies, Nick Bostrom goes over the ways such an AI could exist, and poses some scenarios about how a recursively self-improving AI could "take off" and exceed human intellectual capacity on its own.
Moreover, we're already building such AIs (in a limited fashion). Deepmind recently made an AI that can beat all Atari games [1]. The AI wasn't given "well defined parameters". It was just shown the game, and it figured out, on its own, how to map inputs to actions on the screen, and which actions resulted in progress towards winning the game. Then, the same AI went on to do this over and over again, eventually beating all 57 Atari games.
Yes, you can argue that this is still a limited example. However it is an example that shows that AIs are capable of generalized learning. There's nothing, in principle, that prevents a domain-specific AI from learning and improving at other problem domains. The AI that I'm conceiving of is a supersonic jet. This AI is closer to the Wright Flyer. However, once you have a Wright Flyer, supersonic jets aren't that far away.
> To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just the tasks which the airplane is optimized for.
That's fair, but besides the point. The AI doesn't have to be better than humans at everything that humans can do. The AI just has to beat humans at everything that's economically valuable. When all jobs get eaten by the AI, it's cold comfort to me that the AI is still worse than humans at, say, enjoying a nice cup of tea.
The second time around is easier. The hard part was evolution: it took billions of years and used huge resources and energy, but in a single run it produced nature and humans. AI agents can rely on humans to avoid the enormous costs of blind evolution, at least until they reach parity with us; then they have to pay the price and do extreme open-ended learning (solving all imaginable tasks, trying all strategies, giving up on simple objectives).
We know it's possible for a brain to outperform most other brains. Think Einstein et al. A smart AI can be replicated (unlike a super-smart human), so we could get it to outperform the human race, on average. That'd be enough to render people obsolete.
Hate to break it to you, but that wouldn’t lead to communism. The people it replaces are useless to the ruling class. At best we’d go back to feudalism, at worst we’d be deemed worthless and a drain on the planet.
I'm always confused when I see people talking about automated luxury communism. Whoever owns the "means of production" isn't going to obtain or develop them for free. Without some omnipotent benevolent world government to build it out for all, I just don't see it happening. It's a beautiful end goal for society, but I've never seen a remotely plausible set of intermediate steps to get there
The very concept of ownership is a social artifact, and as such, is not immutable. What does it mean for the 0.1% to own all the means of production? They can't physically possess them all. So what it means in practice is that our society recognizes the abstract notion of property ownership, distinct from physical possession or use - basically, the right to deny other people the use of that property, or allow it conditionally. This recognition is what reifies it - registries to keep track of owners, police and courts to enforce the right to exclude.
But, again, this is a construct. The only reason why it holds up is because most people support it. I very much doubt that's going to remain the case for long if we end up in a situation where the elites own all the (now automated) capital and don't need the workers to extract wealth from it anymore. The government doesn't even need to expropriate anything - just refuse to recognize such property rights, and withdraw its protection.
I hope that there are sufficiently many capitalists who are smart enough to understand this, and to manage a smooth transition. Because if they won't, it'll get to torches and pitchforks eventually, and there's always a lot of collateral damage from that. But, one way or another, things will change. You can't just tell several billion people that they're not needed anymore, and that they're welcome to starve to death.
The problem I see is that once the pitchforks come out, society will lose decades of progress. If we're somewhat close to the techno-utopia at the start, we won't be at the end. Who's going to rebuild on the promise that the next generation won't need to work?
Revolutions aren't great at building a sense of real community; there's a good reason that "successful" communist uprisings result in totalitarian monarchies.
What it means for the 0.01% to own the means of production is that they can offer access to privilege in a hierarchical manner. The same technology required for a techno-utopia can be used to implement a techno-dystopia which favors the 0.01% and their 0.1% cronies, and treats the rest of humanity as speedbumps.
There are already fully-automated murder drones, but my dishwasher still can't load or unload itself.
I suspect "the 0.01% own and run all production by themselves" isn't possible in the real world. My evidence is that this is the plot of Atlas Shrugged.
If they're not trading with the rest of the world, it doesn't mean they're the only ones with an economy. It means there's two different ones. And the one with the 99.9% is probably better, larger ones usually are.
Revolutions aren't great, period. But they happen when the system can no longer function, unless somebody carefully guides a transition to another stable state.
That said, wrt "communist" revolutions specifically - they result in totalitarian dictatorships because the Bolshevik/Marxist-Leninist ideology underpinning them is highly conducive to that: concepts like dictatorship of the proletariat (esp. in Lenin's interpretation of it), the vanguard party, and democratic centralism all combine to this inevitable end result.
But no other ideological strain of Marxism has ever carried out a successful revolution - perhaps because they simply weren't brutal enough. By means of example: the Bolsheviks violently suppressed the Russian Constituent Assembly within one day of its opening, as soon as they realized that they didn't have the majority there. In a similar way, despite all the talk of council democracy, they consistently suppressed councils controlled by their opposition (typically the peasant ones).
Bolsheviks were the first ones who succeeded, and thereafter, their support was crucial to the success of other revolutions - but that support came with ideological strings attached. So China, Korea, Vietnam, Cuba etc all hail from the same authoritarian tradition. Furthermore, where opposition leftist factions vied for dominance against Soviet-backed ones, Soviets actively suppressed them - the campaign against "social fascism" in 1930s, for example, or persecution of anarchists in Republican Spain.
Anyway, we don't really know what a revolution that would stick to democratic governance would look like, long term. There were some figures and factions in the revolutionary Marxist communist movement that were much more serious about democracy than Bolsheviks - e.g. Rosa Luxemburg. They just didn't survive for long.
idk. Countries used to build most of their infrastructure themselves. There are still countries in western Europe that run huge state-owned businesses, such as banks, oil companies, etc., that employ a bunch of people. The governments of these countries were (and still are) far from omnipotent. I personally don't see how building out automated production facilities is out of scope for the governments of the future when it hasn't been in the past.
Perhaps the only thing that is different today is the mentality. We take capitalism so much for granted that we cannot conceive of a world where collective funds are used to provide for the people (even though this world existed not too long ago). And today we see it as a natural law that the means of production must belong in private hands, that this is simply the order of things.
By submitting any textual content (GPL or otherwise) on the web, you are placing it in an environment where it will be consumed and digested (by human brains and machine learning algorithms alike). There is already legal precedent set for this which allows its use in training machine learning algorithms, specifically with heavily copyrighted material from books[1].
This does not mean that any GitHub Co-Pilot produced code is suddenly free of license or patent concerns. If the code produces something that matches too closely GPL or otherwise licensed code on a particularly notable algorithm (such as video encoder), you may still be in a difficult legal situation.
You are in essence using "not-your-own-code" by relying on CoPilot, which introduces a risk that the code may not be patent/license free, and you should be aware of the risk if you are using this tool to develop commercial software.
The main issue here is that many average developers may continue to stamp their libraries as MIT/BSD, even though the CoPilot-produced code may not adhere to that license. If the end result is that much of the OSS ecosystem becomes muddied and tainted, this could slowly erode trust in open licenses on GitHub (i.e. the implications would be that open source libraries could become less widely used in commercial applications).
I assume that google had legal access to those books. In the case of GPT-3 derived models, they contain common crawl and webtext2 corpuses, which may include large amounts of pirated content (books and magazines uploaded in random places, paywalled content that's been uploaded elsewhere).
Attempts to litigate any license violation are going to get precisely nowhere I bet, but I find the actual license violation argument persuasive.
This is an excellent example of how the AI singularity/revolution/whatever is a total distraction, and that a much bigger and more serious issue is how AI is becoming so effective at turning the output of cheap/free human mental labour into capital. If AI keeps getting better and better and the status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%.
It would be quite a turn-up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and it negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out too late.
I think programming is one of the many domains (including driving) that will never be totally solved by AI unless/until it's full AGI. The long tail of contextual understanding and messy edge-cases is intractable otherwise.
Will that happen one day? Maybe. Will some kinds of labor get fully automated before then? Probably. But I think the overall time-scale is longer than it seems.
64-bit floats should be fine; I think that tweet is only sort-of correct.
The problem with floats-storing-money is (a) you have to know how many digits of precision you want (e.g. cents, dollars, a tenth of a cent), and (b) you need to watch out if you're adding values together.
Even if certain values can't be represented exactly, that's ok, because you'd want to round to two decimal places before doing anything.
Is there a monetary value that you can't represent with a 64-bit float? E.g. some specific example where quantization ends up throwing off the value by at least 1/100th of whatever currency you're using?
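For what it's worth, a quick sketch (Python doubles, amounts assumed to be dollars and cents) of where that holds up and where quantization really does exceed a cent:

    # Storing an integer number of cents in a double is exact up to 2**53
    # cents (~$90 trillion); dollar amounts like 19.99 are approximate but
    # round back correctly at ordinary magnitudes:
    assert round(19.99 * 100) == 1999

    # (a) Repeated addition drifts unless you re-round to cents:
    total = sum(0.10 for _ in range(1000))
    print(total == 100.0)        # False - accumulated rounding error
    print(round(total, 2))       # 100.0 once rounded back to two places

    # (b) Near 2**53 cents, adjacent doubles are more than a cent apart,
    #     so a single cent can no longer be represented faithfully:
    big = 90_000_000_000_000.00  # $90 trillion
    print(big + 0.01 - big)      # 0.015625, not 0.01

So for everyday ledgers the rounding discipline is enough; the failure modes only show up with accumulation or absurdly large magnitudes.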
> "'Hey, that's our code you used to replace us!' we will cry out too late."
Are we in the software community not the ones who have frequently told other industries we have been disrupting to "adapt or die" along with smug remarks about others acting like buggy whip makers? Time to live up to our own words ... if we can.
>Are we in the software community not the ones who
No.
I'll politely clarify that for over a decade that I - and many others - have been asking not to be lumped in with the lukewarm takes of west coast software bubble asshats. We do not live there, we do not like them, I wish they would quit pretending to speak for us.
The idea that there is anything approaching a cohesive software "community" is a con people play on themselves.
To go on a bit of a tangent, I’m somewhat pessimistic that western societies will plateau and hit a “technofeudalism” in the next century or two. Combine what you mention with other aspects of capital efficiency. It’s not a unique idea, and is played out in a lot of “classic” sci-fi like Diamond Age.
Now it’s also not necessarily that bad of a state. That’s depending on ensuring a few ground elements are in place like people being able to grow their own food (or supplemental food) or still being free to design and build things on their own. If corporations restrict that then people will be at their mercy for all the essentials of life. My take from history is that I’d prefer to have been a peasant during much of the Middle Ages than a factory worker during the industrial revolution. [1] Then again Chinese people have been willing (seemingly) to leave farms in droves for the last decades to accept the modern version of factory life so perhaps farming peasant life isn’t as idyllic as it’d sound. [2]
But the rate at which machinery will produce products and services means that even a small tax on corporations producing everything autonomously will be enough to feed everyone and provide a decent quality of life through a UBI or part-time jobs.
You really want to push for high productivity across all industries, even if that means sacrificing jobs in the short term, because history demonstrates that, afterwards, new and more humane jobs emerge.
Every decade was supposed to see fewer hours working for higher pay and quality of life. It didn't happen, because business owners captured the gains (not just 1% fat cats; the owners of mom-and-pop shops are at least as guilty as anyone, they just sucked at scaling their avarice).
So the claim that this technological revolution will be different and that it will result in a broad social safety net, universal basic income, and substantive, well-paid part-time work is a joke but not a very good one. It will be more of the same - massive concentration of wealth among those who already hold enough capital to wield it effectively. A few lucky ones who manage to create their own wealth. And those left behind working more hours for less.
You are right that this won't happen by itself. We need another economic system, and not just hope that this time things will magically fix themselves.
I wasn't talking about the free market, but the state of the present economy. Unfortunately, those trillions of dollars aren't being distributed to the people, but are instead concentrated in the hands of the richest.
I'd agree that many business owners are blameworthy (specifically the ones who have sought monopolies for their product or monopsonies for their labour supply), but we shouldn't forget landlords. A huge fraction of people's income goes to paying rent, especially in urban areas, yet the property tax is relatively low. This leaves a fat profit margin for landlords, even subtracting off the capital cost of the building. The proliferation of "single family house" zoning hasn't helped either. Preventing the construction of high density housing drives up rents, and benefits landlords at the cost of everyone else.
Well as long as humans are more energy-efficient to deploy than robots you will always have a job. It might mean conditions for most humans will be like a century ago.
> as long as humans are more energy-efficient to deploy than robots
Energy efficiency isn't relevant. When switchboard operators were replaced by automatic telephone exchanges, it wasn't to reduce energy consumption.
The question is whether an automated solution can perform satisfactorily while offering upfront and ongoing costs that make them an economically viable replacement for human workers (i.e. paid employees).
Yeah, for sure, the corporations that already pay effectively $0 in tax today are going to suddenly decide in the future to be benevolent and usher in the era of UBI and prosperity for all of humankind. They definitely won't continue to accumulate capital at the expense of everything else and use that to solidify their grasp of the future.
It would be a lot easier if more people on this website would just be honest with themselves and everyone else and simply admit they think feudalism is good and that serfs shouldn't be so uppity. But not me, of course; I won't be a serf. Now if you'll excuse me, someone gave me a really good deal on a bridge that I'm going to go buy...
The problem with this is that you increasingly have to put your trust in the hands of a shrinking group of owners (people who have the rights to the automated productivity). At some point, those owners are just going to stop supporting everyone else (will probably happen when they have the ability to create everything they could ever want with automation - think robot farms, robot security forces, all encompassing automated monitoring, robot construction, etc.)
Just look at autocratic countries. That top 1% still needs something like 3-4% of the population to work in the bureaucracy and 3-5% in the armed and police forces. And there are always family connections and relatives of relatives who want better living. So fortunately no AI will ever replace corruption and other flaws of human society.
But yeah, the remaining 80-90% of the population will have some quality of life and bullshit jobs, because that's how the world is right now outside of the western-countries bubble.
If AI can replace us at difficult tasks, it can repress us. How are you going to agitate for a UBI when the AI has identified you as a likely agitator and sends in the robots to arrest you?
The current state of most wealthy countries does not show any hint of significant corporate taxation. Wealth will continue to accrue in the hands of the few.
In the current arrangement, capital by itself is useless - you need workers to utilize it to generate wealth. Owners of capital can then collect economic rent from that generated wealth, but they have to leave enough for the workers to sustain themselves. This is an unfair arrangement, obviously; but at least the workers get something out of it, so it can be fairly stable.
In the hypothetical fully-automated future, there's no need for workers anymore; automated capital can generate wealth directly, and its owners can trade the output between each other to fully satisfy all their needs. The only reason to give anything to the 99.99% at that point would be to keep them content enough to prevent a revolution, and that's less than you need to pay people to actually come and work for you.
I was debating bringing up disruptors when I made the grandparent comment. My 2 cents: they can shift the balance of power at the very small scale (e.g. "some random nobody" getting rich, or some rich person going bankrupt), but the large scale power structures almost always remain largely intact. For instance, that "random nobody" may well get rich through the sale of shares in their company - now the company is owned by the owner class, who were previously at the top of the power hierarchy.
Nothing new, certainly, but still worth examining. If we are not content with the current power structures, then we should be wary of changes that further intensify them.
We need not totally avoid such changes (i.e. shun technological advancements entirely because of their social ramifications), but we need to be mindful of their effects if we want to improve our current situation regarding the distribution/concentration of wealth and power in the world.
Exactly, in all cases the disruption was localized, and the broader power structures were largely unaffected. The richest among us - the owner class - were not significantly affected by all of these disruptions. They owned diversified portfolios, weathered the changes, and came out with an even greater share of wealth and power. Those who were most affected by the disruptions you listed were the employees of those companies/industries - not the owners/investors.
> If AI keeps getting better and better and status quo socio-economic structure don't change, trillions in capital will be captured by the 0.01%.
This is absolutely one of the things that keeps me up at night.
Much of the structure of the modern world hinges on the balance between forces towards consolidation and forces towards fragmentation. We need organizations (by this I mean corporations, governments, unions, etc.) big enough to do big things (like fix climate change) but small enough to not become totalitarian or decrepit.
The forces of consolidation have been winning basically since the 50s with the rise of the military-industrial complex, death of unions, unlimited corporate funding of elections (!), regulatory capture, etc. A short linear extrapolation of the current corporate/government environment in the US is pretty close to Demolition Man's dystopian, "After the franchise wars, all restaurants are Taco Bell."
Big data is a huge force towards consolidation. It's essentially a new form of real estate that can be farmed to grow useful information crops. But it's a strange form of soil that is only productive if you have enough acres of it and whose yield scales superlinearly with the size of your farm.
Imagine doing a self-funded AI startup with just you and a few friends. The idea is nearly unthinkable. How do you bootstrap a data corporation that needs terabytes of information to produce anything of value?
If we don't figure out a "data socialism" movement where people have ownership over the data derived from their life, we will keep careening towards an eventuality where a few giant corporations own the world.
> I would be quite a turn up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and it negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out too late.
And that's why I won't be using it, why give it intelligence so it can work me out of a job?
> This is an excellent example of how the AI singularity/revolution/whatever is a total distraction [...]
Umm, no it's not. It's possible we just have two problems - the economic problem you mention might be correct, but also that people who believe in the problems of the singularity are right as well. The existence of a certain problem doesn't negate the existence of the other problem.
The difference between this model and a human developer is quantitative rather than qualitative. Human developers also synthesize vast amounts of code and can't reference most of it when they use the derived knowledge. The scales are different, but it is the same principle.
> I find the actual license violation argument persuasive.
I'm curious as to why it seems persuasive. Open source licenses largely hinge on restrictions tied to distribution of the software, and training a model does not constitute distribution.
Unlikely. If this use counts as a derivative work, then it's already a violation, and no update is needed.
OTOH if laundering through machine learning is a fair use, then licenses can't do anything about this. Licenses can't override the copyright law, so the law would have to change.
Could this disincentivize open source? If I build black boxes that just work, no AI will "incorporate" my efforts into its repertoire, and I will still have made something valuable.
First it was land, then other means of production, and for the past 150 years capitalists have turned many types of intellectual creations into exclusively owned capital (art, inventions). Now some want to turn personal data into capital (the "right to monetize" personal data advertised by some is nothing else), and this aims to turn publicly available code into capital. This is simply the history of capitalism going on: the appropriation of the commons.
Google has the opposite problem. They make infinite money from ad platforms and hire people just for fun so nobody else can have them. They're working on AI because they need to stop them from getting bored.
As a human programmer, I've also been trained on thousands of lines of other people's code. Is there anything new here, from a code copying perspective? Aren't I liable if segments of my own code exactly match someone else's code, even if I didn't knowingly copy/paste it?
Well to me those are fundamental questions that need to be addressed one way or the other. Are systems like GPT-x basically plagiarising (doesn't matter the nature of the output, be it prose, code, or audio-visual) or are the results so transformative in nature that they can be considered to be "original work"?
In other words, are these systems to be treated like students that learned to perform the task they do from a collection of source material, or are they to be viewed as sophisticated databases that "just" perform context-sensitive retrieval?
These are interesting and important questions and I'm glad someone is publicly asking them and that many of us at least think about them.
I think the real issue being distracted from is how disconnected copyright/intellectual property regulations have become from reality.
It's still amazing to me that (US-centric context here), it's well established that instructions how to turn raw ingredients into a cake are not protectable but code that results in transforming one set of numbers into another are protectable.
AI is just making the silliness of that distinction more obvious.
Code is not the same as a recipe. Recipes are more like specifications. They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
>They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
That is really quite debatable in some contexts. Declarative languages like Prolog, SQL, etc. declare what they want and the system figures out how to produce it. Much like a recipe, really.
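To make the analogy concrete (made-up table and column names, not from any real schema): the declarative query states what result is wanted, while the imperative version spells out the structural details the query leaves to the engine.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

    # Declarative, recipe-like: state the desired result.
    per_customer = conn.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer").fetchall()

    # Imperative: spell out the accumulation the engine would otherwise choose for us.
    totals = {}
    for customer, total in conn.execute("SELECT customer, total FROM orders"):
        totals[customer] = totals.get(customer, 0.0) + total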
Humans are just sets of atoms, so protecting them is disconnected from reality?
These reductionist arguments lead nowhere. Fortunately, IP lawyers -- including Microsoft's who are fiercely pro IP when it suits them -- think in a more humanistic way and consider the years of work of the IP creator.
Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators. Not so in software.
Good idea, but if carved up into small enough chunks, it may be considered fair use.
What is confusing is that the neural net may take lots of small chunks and link them to one another, and then reproduce them in the same order verbatim.
With music sampling, copyright protects down to the sound of a kick drum. No doubt Microsoft has a good set of attorneys working on their arguments as we speak.
One of the examples pointed out in the reply threads was the suggestion in a new file to insert the GPL disclaimer header.
So, the length of the samples being drawn is not necessarily small: the chunk size is based on its commonality. It could easily be long enough to trigger a copyright violation.
That would be a legal no-op. Either their use is covered by copyright and they are violating your license, or it isn't covered by copyright and then any constraints that your license sets are meaningless.
Licenses hold no power outside of that granted to it by things being copyrighted by default.
I'd assume this: in the same way that you cannot forbid a human from learning concepts from your code, you cannot forbid an automated system from learning concepts from your code, regardless of the license. Also, if you did, it would make your code non-free.
At least as long as the system really learns concepts. If it just copy & pastes code, then that's a different story (same as with humans).
This feature is effectively impossible to replicate. Only Microsoft positioned itself to have:
- dataset (GitHub)
- tech (openai)
- training (azure)
- platform (vscode)
I'm impressed. They did an amazing job from a corporate strategy standpoint. Also, directionally, things are getting interesting.
I don't think that GH code is easily accessible, with rate limiting and TOS forbidding it. GPT is an open model (for the most part), but its training cost is in the order of tens of millions of $
I can think of no one but a handful of companies being able to compete there. And they won't be ok with extending a Microsoft IDE, nor breaking GitHub TOS.
When you start competing on R&D costs the game changes.
There's always the chance that training costs will significantly decrease. But even at orders of magnitude less (i.e. tens of thousands of dollars) it's still beyond reach for open projects and indie devs.
Is this really anything more than a curiosity toy and a marketing tool?
I took a look at their examples and they are not at all compelling. In one example it generated SQL and somehow knew the columns and tables in a database that it had no context on. So that's a lot of smoke and mirrors going on right there.
Do many developers actually want to work in this manner? That is, being interrupted every time they type with a robot interjection of some Frankenstein code that they now have to go through and review and understand. Personally, this is going to kick me out of the zone/flow too often to be useful. Coding isn't the hard part of my job. If this tool can somehow guess the business requirements of the task at hand, then I'll be impressed.
Even if the tool generates accurate code, if I don't fully understand what it wrote, then what? I'm still stuck digging through documentation and stackoverflow to verify that whatever is in my text editor is correct code. "Code confidently in unfamiliar territory" sounds like a Boeing 737 Max sized disaster in the making.
The dataset is all freely available open source code, right? Just because GH hosts it doesn’t mean the rest of the world can’t use it for the same purpose.
They'd find a way to keep it practically difficult to use, at the least, if that dataset is vital to the process. Hoarding datasets that should either be wholly public or unavailable for any kind of exploitation is the backbone of 21st century big tech. It's how they make money, and how they maintain (very, very deep) moats against competition.
[EDIT] actually, I suspect their play here will be to open up the public data but own the best and most low-friction implementation, then add terms that let them also feed their algo with proprietary code built using their editors. That part won't be freely available, and no free version will be able to provide that further-improved model, even assuming all the software to build it is open-source. Assuming using this thing ends up being a significant advantage (so, assuming this matters at all) your choice will be to either hamstring yourself in the market or to help Microsoft build their dataset.
I get the sense that GitHub wants this to be litigated so the case law can be established. Until then it’s just a bunch of internet lawyers arguing with each other.
In the discussion yesterday I pointed to the case of some students suing Turnitin for using their works in the Turnitin database, and the students lost [1]. I think an individual suing will not go anywhere. The way to create a precedent is someone feeding all the Harry Potter books and some additional popular books (Twilight?) to GPT-3 and letting it write about some kids at a sorcery school. The outcome of that case would look very different IMO.
Not a lawyer, but in that case it seemed to be a factor that turnitin was transformative, because it never sold the texts to others and thus didn't reduce the market value of them. But that wouldn't apply to copilot which might reduce the usage of libraries since you can "code" equivalent functionality with copilot now.
Would it be a stretch to assert that GPL'd libraries have a market value for their creator in terms of reputation etc.?
While we're worrying about ML learning to write our codes we should also break all the automated looms so people don't go without jobs. Do everything manually like God intended! /s
Maybe code that is easily recreated by GPT with a simple prompt is not worth copyrighting. The future is in making it more automated, not in protecting IP. If you compete against a company using it, you can't ignore the advantage.
If GitHub Copilot can sign my CLA, stating that it is the author of work, that it transfers the IP to me in exchange for the service subscription price and holds responsibility for copyright infringement, that would be acceptable. Otherwise it's a gray area I don't want to go.
If it's trained with GPL licensed code, doesn't that mean the network they use includes it somewhat? Then, someone could sue that their networks must be GPL licensed too, right?
The potential inclusion of GPL'd code, and potentially even unlicensed code, is making me wary of using it. Fair Use doesn't exist here and if someone was to accuse me of stealing code, saying "I pressed a button and some computer somewhere in the world, that has potentially seen your code as well, generated it for me" is probably not the greatest defense.
The core problem which would allow laundering (that there isn't a good way to draw a straight, attributive line between generated code and training examples) to me also presents a potential eventual threat to the viability of co-pilot/codex. It seems like the same thing would prevent it from knowing which published code was written by humans vs which was at least in part an output from the system. Training on an undifferentiated mix of your model's outputs and human-authored code seems like it could eventually lead the model into self-reinforcing over-confidence.
"But snippet proposals call out to GH, so they can know which bits of code they generated!".
Sometimes; but after Bob does a co-pilot assisted session, and Alice refactors to change a snippet's location and rename some variables and some other minor changes and then commits, can you still tell if it's 95% codex-generated?
While I think this will continue to amplify current problems around IP, aren't current applied-ML approaches to writing software the equivalent of automating the drawing of leaves on a tree? Maybe a few small branches? But the whole tree, all its roots, how it fits into the surrounding landscape, the overall composition, the intention? If I'm wrong about that, then I picked either a good or a bad time to finally learn programming. There are only so many ways you can do things in each language, though. Just like in the field of music, there are only so many "original" tunes. The concept of IP is incoherent; you don't own patterns (at least not at arbitrary depth), though you may be owed some form of compensation for the billions made off discovering them.
Musicians, artists, all kinds of athletes, all grow by watching, observing, and learning from others. As if all these open source projects got to where they are without looking at how others did things.
I don't think a single function, similar syntax, or a basic check function is worth arguing about; it's not like co-pilot is stealing an entire code base and just plopping it out by reading your mind and knowing what you want. I know developers that have certainly stolen code and implementation details from past employers, and that was just fine.
I mean this is already happening. When you hire a specialist in C# servers, you're copying code that they already wrote. I find people tend to write the same functions and classes again and again and again all the time.
We have a guy that brought his task manager codebase (he re-wrote it) but it's the same thing he used at 2 other companies.
I have written 3 MPIs (master person/patient index) at this point all with the same fundamental matching engine.
I mean, one thing we can all agree on is that ML is good at copying what we already do.
It may be hard to believe, but there are sick and twisted individuals in this dangerous world who copy from github without even a single glance at the license, and they live among us.
Yes, and those people are violating the licenses of the code when they do that. It's not unreasonable to expect a massive company like Microsoft not to do this at massive scale.
There are always exceptions (maybe they might even be the norm in this case), but it's still not 100%, still not all-encompassing. This "AI" seems to be. I think that is the entire concern: ALL the code is affected, in all instances.
But if I do it under a copyleft license like GPL, I expect those who copy to abide by the license and open source their own code too.
But sure, people shit on IP rights all the time, and I am guilty of it too. Let's say I didn't pay what I should have paid for every piece of software I have used.
If I read a lot of GPL code and absorb naming conventions, structures, patterns, and tricks, then later, when it comes to writing a P2P chat server, I happen to recall similar patterns, naming structures, and conventions, and many of the utility methods end up pretty much as they are in the GPL code bases out there.
Now, is my code also a GPL derivative, because I certainly did read through those code bases to learn how to write larger programs?
"but eevee, humans also learn by reading open source code, so isn't that the same thing"
- no
- humans are capable of abstract understanding and have a breadth of other knowledge to draw from
- statistical models do not
- you have fallen for marketing
> humans are capable of abstract understanding and have a breadth of other knowledge to draw from
this may be a matter of time and thus is not a fundamental objection.
If mankind should fail to answer the perennial question of exploitation of the other and the same, it will be doomed. And rightly so, for mankind must answer this question, it must answer to this question. Instead what we do is increase monetary output then go and brag about efficiency. Neither is this efficient, nor is it about efficiency, nor has the Universe ever cared about efficiency. It just happens to coincide with what Society has decided to be its most looked-upon elements have chosen to be their religion.
I agree that this is different from humans learning to code from examples and reproducing some individual snippets. However, I disagree with the author’s argument that it's because of humans’ ability to abstract. We actually know nothing about the AI’s ability to abstract.
The real difference is that if one human can learn to code from public sources, then so can anyone else. Nobody is explicitly barred from accessing the same material. The AI, however, is kept proprietary. Nobody else can recreate it because people are explicitly barred from doing so. People cannot access the source code of the training algorithm; people cannot access enough hardware to perform the training; and most people cannot even access the training data. It may consist of repos that are technically all publicly available, but try downloading all of GitHub and see if they let you do that quickly, and/or whether you have enough disk space.
This puts the owners of the AI at a significant advantage over everyone else. I think this is the core of the concern.
So it’s using a massive-scale public good (non-rivalrous and non-exclusionary access to source code) to create a private product that is rivalrous in the software labour pool? Or is the problem just that it’s not open-access?
> previous """AI""" generation has been trained on public text and photos, which are harder to make copyright claims on, but this is drawn from large bodies of work with very explicit court-tested licenses
This seems pretty backwards to me. A GPL licensed data point is more permissive than an unlicensed data point.
That said, I'm glad that these data points do have explicit licenses that say "if you use this, you must do XYZ", so that it's clear that our large ML projects are going counter to creators' intent when they made it open.
I’d love to start seeing licenses about use as training data. Then maybe we’d see more open access to these models that benefit from the openness of the web. I’d personally use licenses that say if you want to train on my work, you must publish the model. That goes for my code, my writing, and my photography.
Anyways GitHub is arguing that any use of publicly available data for training is fair use, but they also admit that it’s all new and unprecedented, regarding training data.
If I as an alleged human have learned purely from GPL code would that require code I write to be released under the GPL too?
We should probably start thinking about AI rights at some point. Personally, I'll be crediting GPT-3 like any other contributor, because it sounds cool, but maybe for moral reasons too in the future.
A machine learning isn't really the same as a person learning - people generally can code at a high level without having first read TBs of code, nor can you reasonably expect a person to have memorised GPL code to reproduce it on demand.
What you can expect a person to do is understand the principles behind that GPL code, and write something along the same lines. GitHub Co-Pilot is not a general ai, and it's not touted as one, so we shouldn't be considering whether it really knows code principles, only that it can reliably output code that fits a similar function to what came before, which could reasonably include entire blocks of GPL code.
Well, if it is actually straight up outputting blocks of existing code, then get it in the bin as a failed attempt to sprinkle AI on development, and use this instead.
That's what I wanted to ask, where do we draw the line of copyright when it comes to inputs of generative ML?
It's perfectly fine for me to develop programming skills by reading any code regardless of the license. When a corp snatches an employee from competitors, they get to keep their skills even if they signed an NDA and can't talk about what they worked on. On the other hand there's the non-compete agreement, where you can't. Good luck making a non-compete agreement with a neural network.
Even if someone feeds stolen or illegal data as an input dataset to gain advantage in ML, how do we even prove it if we're only given the trained model and it generalizes well?
Copyright is going to get very muddy in the next few decades. ML systems may be able to generate entire novels in the styles of books they have digested, with only some assist from human editors. True of artwork and music, and perhaps eventually video too. Determining "similarity" too, may soon have to be taken off the hands of the judge and given to another ML system.
> It's perfectly fine for me to develop programming skills by reading any code regardless of the license.
I'd be inclined to agree with this, but whenever a high-profile leak of source code happens, reading that code can have dire consequences for reverse engineers. It turns clean-room reverse engineering into something derivative, as if the code that was read had the ability to infect whatever the programmer wrote later.
>how do we even prove it if we're only given the trained model and it generalizes well?
Someone's going to have to audit the model the training and the data that does it. There's a documentary on black holes on Netflix that did something similar (no idea if it was AI) but each team wrote code to interpret the data independently and without collaboration or hints or information leakage, and they were all within a certain accuracy of one-another for interpreting the raw data at the end of it.
So, as an example, if I can't train something in parallel and get similar results to an already trained model, we know something is up and there is missing or altered data (at least I think that's how it works).
Take it further. You could easily imagine taking a service like this as invisible middleware behind a front-end and asking users to pay for the service. Some could argue it's code generation attributable to those who created the model, but the reality is that the model was trained on code written by thousands of passionate users, at no pay, with the intent of free usage.
Possibly. We won’t know until this is tested in court. Traditionally one would want to clean room [1] this sort of thing. Co-pilot is…really dirty by those standards.
Unless you were using structures directly from said code, probably not?
Compare if you had only learned writing from, say, the Bible. You would probably write in a very Biblical manner, but would you write the Psalms exactly? Most likely not.
That's super cool. As long as you do the things you specify at the bottom of that doc (provide attribution if copied so people can know if it's OK to use) then a lot of the concerns of people on these threads are going to be resolved.
Pretty much! There are only three major fears remaining:
* Co-pilot fails to detect it, and you have a potential lawsuit/ethical concern when someone finds out. Although the devil on my shoulder says that if Co-pilot didn't detect it, what's to say another tool will?
* Co-pilot reuses code in a way that still violates copyright, but is difficult to detect. I.e. If you checked via a syntax tree, you'd notice that the code was the same, but if you looked at it as raw text, you wouldn't.
* Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.
> Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.
It's gonna be really interesting to see how this plays out.
I get where they're coming from but they are kinda just handwaving it back the other way with the "u fell for marketing idiot" vibe. I wish someone smarter than me could simplify the legal ramifications around this but we'll probably have to wait till it kills someone (or at least costs someone a bunch of money) to get any actual laws set up.
I think you are missing the mark here with this comparison, Copilot and its network weights are already the derived work, not just the output it produces.
Perhaps someone at Github can chime in, but I suspect that open source code datasets (the kind they are trained on) should require relatively permissive licenses in the first place. Perhaps they filter for MIT licenses in Github projects and StackOverflow answers used to train the models?
So, I can't see how they can argue that the generated code is not a derivative of at least some of the code it was trained on, and therefore encumbered by complicated copyright claims that are, for anyone other than GitHub, impossible to disentangle. If they haven't even been careful to only use software under a single license that does not require the original author to be attributed, then I don't see how it can even be legal for them to be running the service.
All that said, I'm not confident that anyone will stop them in court anyway. This hasn't tended to be very easy when companies infringe other open source copyright terms.
Until it is cleared up though, it would seem extremely unwise for anyone to use any code from it.
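As a rough illustration of the license-filtering idea floated above, a corpus-building script could keep only repositories whose detected license is on a permissive allow-list. This is a minimal sketch assuming the public GitHub REST API's repository metadata (which exposes a detected `license.spdx_id` field); it is not a description of how Copilot's training set was actually assembled.

```python
import requests

# Hypothetical allow-list: only keep permissively licensed repos in the corpus.
PERMISSIVE = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0", "unlicense"}

def is_permissive(owner: str, repo: str) -> bool:
    """Check the license GitHub detected for a repository.

    Uses the public repository metadata endpoint; repos with no detected
    license (or a copyleft one) would be excluded from this hypothetical corpus.
    """
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    r.raise_for_status()
    license_info = r.json().get("license") or {}
    return (license_info.get("spdx_id") or "").lower() in PERMISSIVE

# Example: decide whether a repo would make it into the training set.
print(is_permissive("torvalds", "linux"))  # False: GPL-2.0, so it is filtered out
```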
You can automate this process by feeding CoPilot existing GPL source code and seeing what it comes up with next (a rough sketch of automating this follows below).
I am sure that at some point it WILL produce exactly the same code snippet as some GPL project, provided you attempt it enough times.
Not sure what the legal interpretation would be though, it is pretty gray-ish in that regard.
There will always be a risk for CoPilot, though: if it has digested certain PII and people find that out, the outcome would be much more interesting to see.
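A minimal sketch of how that probing could be automated, assuming a hypothetical `get_completion()` helper that captures whatever the editor plugin suggests for a given prefix (there is no public Copilot API being invoked here):

```python
import difflib

def get_completion(prefix: str) -> str:
    """Hypothetical stand-in for capturing the model's suggestion for a prefix
    (e.g. scraped from the editor plugin); not a real API."""
    raise NotImplementedError

def verbatim_score(gpl_source: str, split_at: int = 50) -> float:
    """Feed the first `split_at` lines of a GPL-licensed file as the prompt,
    then measure how closely the suggestion matches the file's real continuation.
    A ratio near 1.0 would suggest near-verbatim regurgitation."""
    lines = gpl_source.splitlines()
    prefix = "\n".join(lines[:split_at])
    actual_continuation = "\n".join(lines[split_at:])
    suggestion = get_completion(prefix)
    return difflib.SequenceMatcher(None, suggestion, actual_continuation).ratio()
```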
It doesn't have to be exact to be copyright infringement; see non-literal copying.
The basic idea behind it is that if you copy-paste code and rename the variables, that doesn't mean it's new code.
Yeah, you'd have to assume they are parsing and normalizing this data in some way. There would still be some AST patterns or something similar you could look for in the same way, but it would be much trickier.
Plus considering this is a legal issue ... good luck with "there is a statistically significant similarity in AST outputs related to the most unique sections of this code base" type arguments in court. We're currently at the "what's an API" stage of legal tech understanding.
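To make the AST comparison mentioned above concrete, here is a toy sketch using Python's standard `ast` module: two snippets that differ only in identifier names normalize to the same tree, so renaming alone would not hide the copying. Real non-literal-copying analysis is of course far more involved than this.

```python
import ast

class Normalize(ast.NodeTransformer):
    """Rename every identifier to a positional placeholder so that two
    snippets differing only in names produce identical ASTs."""
    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

def same_structure(a: str, b: str) -> bool:
    # Each snippet gets its own fresh name mapping before comparison.
    dump = lambda src: ast.dump(Normalize().visit(ast.parse(src)))
    return dump(a) == dump(b)

original = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n"
renamed  = "def euclid(x, y):\n    while y:\n        x, y = y, x % y\n    return x\n"
print(same_structure(original, renamed))  # True: renaming alone doesn't change the structure
```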
The real question is whether it constitutes derived work, though. And that is not a question of similarity so much so as provenance - if you start with a codebase that is GPL originally, and it gets gradually modified to the point where it doesn't really look anything like the original, it's still a derived work, and is still subject to the license.
Similarity can be used to prove derivation, but it's not the only way to do so. In this case, all the code that went into the model is (presumably) known, so you don't really need any sort of analysis to prove or disprove it. It is, rather, a legal question - whether the definition on the books applies here, or not.
Regarding PII, I think you have a very good point. I wouldn't be surprised to see working AWS_SECRET_KEY values appear in there. Indeed, given that copy-paste programmers may not understand the code they're given, it's entirely possible that someone may run code which uses remote resources without even realising it.
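As a sketch of the kind of guardrail that would help here, one could scan any accepted suggestion for credential-shaped strings before it ever lands in a repo. The patterns below are illustrative only (the `AKIA` prefix is the well-known AWS access key ID format) and nowhere near an exhaustive secret scanner:

```python
import re

# Illustrative patterns only, not an exhaustive secret scanner.
SUSPICIOUS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID format
    re.compile(r"(?i)aws_secret_access_key\s*=\s*\S{30,}"),   # assignment of a long secret
]

def looks_like_it_leaks(snippet: str) -> bool:
    """Return True if a suggested snippet contains credential-shaped strings."""
    return any(p.search(snippet) for p in SUSPICIOUS)

# The value below is the placeholder key ID from AWS documentation, not a real secret.
suggestion = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'
print(looks_like_it_leaks(suggestion))  # True: review or reject before accepting
```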
This question about how much code is required to be copyrightable starts to sound a lot like the copyright situation in music, where the legal bar for what counts as plagiarism currently seems to be set too low.
If I recall correctly, it has already been determined that using proprietary data to train a machine learning system is not a violation of intellectual property.
I don't think anyone is interested in stealing small code snippets. It's easy enough to rewrite them. Where the GPL does matter is in complete products. In other words, it's never "if only we could use this GPL-licensed function", it's almost always "if only we could link this GPL-licensed library or executable".
And this GitHub co-pilot in no way infringes on full codebases.
I think copyright is a problem for GPL-like licenses. They should have restricted the training data to MIT/BSD-like licenses.
Anyway, there is another problem, patents, and it is much bigger. I think the Apache license has a provision about patents, but code under most other licenses may be covered by patents, and if the AI generates something similar it may fall within a patent's claims.
I think the argument has merit. Unfortunately it won't be decided on technical merit, but likely in the manner expressed in this excellent response I saw on Twitter:
"Can't wait to see a case for this go in front of an 80 year old judge who rules something arbitrary and justifies it with an inaccurate comparison to something nontechnical."
This implies that by just changing the variable names, the snippets are classed as non-verbatim.
I don't buy that this number is anywhere close to the actual figure if you assume that you can't just change function names and variable names and suddenly say you have escaped both the letter and the spirit of the GPL.
There isn't that much enforcement of open source license violations anyway. I bet there are lots of places where open source code gets taken, copyright/license headers stripped off and the code used in something proprietary as well as the bog-standard "not releasing code for modified versions of Linux" violation.
Isn't most of modern coding just googling for someone who has already solved the problem you are currently facing and then copy/pasting from Stack Overflow?
To the extent that GPT-3 / co-pilot is just an over-fitted neural net, its primary value is as an automated search, copy, and paste.
That's brilliant: I would argue that since MS used code under GPL-type licenses to train the Co-Pilot algorithm, it should release the Co-Pilot model in its entirety. Those who differentiate between data and code missed their classes on Gödelization and functional programming.
> github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
I don't understand the second sentence, i.e. where's the proof?
That's like saying that making a blurry, shaky copy of Star Wars is not a derivative but an original work. Thing is, the 'verbatimness' of the generated code is positively correlated with the number of parameters they used to train the model.
So as I understand it, the AGPL was introduced to cover an unforeseen loophole in the GPL: adapted code could be used to power a web service without triggering the distribution requirements. Could another new version of the license block code from being used to train GitHub co-pilot-like models?
Just wanted to say that everything is ultimately "derivative", and this literal ordinary meaning differs from the legal meaning, which is informed by context, policy and what makes sense.
Where do you draw the line? That's for the courts to decide!
I'm getting a lot of suggestions that make no sense. What's worse, the suggested code has invalid types and won't compile. I'm surprised they didn't prune the solution tree via compiler validation.
Our software has violated the world and people's lives, legally and illegally, in many instances. I mean, none of us cared when GPT-3 did the same for text on the internet. :)
Reminder: software engineers, our code, and the GPL are not special.
The amount of people not knowing the difference between Open Source and Free Software is astonishing. With the amount of RMS memes I see regularly I would expect things to be settled by now.
> "but eevee, humans also learn by reading open source code, so isn't that the same thing"
- no
- humans are capable of abstract understanding and have a breadth of other knowledge to draw from
- statistical models do not
- you have fallen for marketing
Machines will draw on other sources of knowledge besides the GPL code. Whether they have the capacity for "abstract thought" is probably up for debate. There's not much else said in those bullets. It's not a good argument.
I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human I still own code that I later write. If copyright was enforced on the outputs of machine learning models based on all content they were trained on it would be incredibly stifling to innovation. Requiring obtaining legal access to data for training but full ownership of output seems like a sensible middle ground.
Reposting a summary of my reply: if you memorize a line of code and then write it down somewhere else without attribution, that is not fair use, you copied that line of code. If this model does the same, it is the same.
I was just musing about whether this kind of tool has been written (or is being written) for music composition, business letter writing, poetry, news copy.
Interesting copyright issues.
Anyone who thinks their profession will continue as-is for the long term is probably mistaken.
There are much bigger things in this world to worry about. I bet you that by the time that this AI has taken your job, it'll have taken many other jobs, completely rearranging entire industries if not society itself.
And even once that happens you shouldn't be worried about your job. Why? Because economically everything will be different and because your job isn't that important, it likely never was. The problems humanity faces are existential. Authoritarianism, ecosystem collapse and mass migration of billions of people.
So if you really want to "prepare", then try to make a difference in what actually matters.
It's astonishing to me that HN+Twitter believe that Github designed this entire project, without speaking to their legal team and confirming that training on GPL code would be possible.
The tone of the responses here is absurd. Guys, be grateful for some progress. Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you. This is primarily about reducing the need to re-type total boilerplate and/or copy/paste from Stackoverflow. If you were to let some of the people here run things we'd never have any form of progress with anything ever.
Questions like this go much deeper and illustrate issues that need to be addressed before the technology becomes standard and widely adopted.
It's not about progress or suppressing it; it's a fundamental question about whether it is OK for huge companies to profit from the work of others without so much as giving credit, and whether using AI this way represents an instance of doing so.
The latter aspect goes beyond productivity or licensing - the OP asserts that AI isn't equivalent to a student who learned from examples how to perform a task, but rather replicates (recalls) or reproduces the works of others (e.g. the training material).
It's a question that goes beyond this particular application: what about GAN-based generators? Do they merely reproduce slight variations of the training material? If so, wouldn't the authors of the training material have some kind of intellectual property rights to the generated works?
This doesn't just concern code snippets, it's a general question about AI, crediting creators, and circumventing licensing and intellectual property rights.
> Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you
We already invented something for that a couple decades ago, and it's called a "library". And unlike this thing, libraries don't launder appropriation of the public commons with total disregard for those who have actually built that commons.
This goes into one of my favorite philosophical topics: John Searle's Chinese Room. I won't go into it here, but the question of whether an AI is actually learning how to code or simply substituting information based on statistically common practices (or if there really is a difference between either) is going to be one hell of a problem for the next few decades as we start to approach fine points of what AI is and how it could be defined.
However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is an API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.
In the near-future, when we have AI-assisted reverse engineering along with Github co-pilot, then, with enough obfuscation there's nothing that can't be legally created or recreated on a computer, proprietary or not. This is simultaneously free software's greatest dream and worst nightmare.
Edit: changed Hilary Putnam to John Searle
Edit 2: spelling
> However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.
That's... a mind-bendingly bad take. Google took an API definition and duplicated it; Copilot is taking general code and (allegedly) duplicating it. This was not done in order to enable any sort of interoperability or compatibility.
The "API defense" would apply if Copilot only produced API-related code, or (against CP) if someone reproduced the interfaces copilot exposes to consumers.
> Microsoft has a good defense.
MS has many good defenses (transformative work, github agreements, etc etc), but this is not one of them.
> the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense
That's a wild misconstrual of what the courts actually ruled in Oracle v. Google.
(And to the reader: don't take cues from people banging out poorly reasoned quasi-legal arguments in off-the-cuff comments.)
'This case implicates two of the limits in the current Copyright Act. First, the Act provides that copyright protection cannot extend to “any idea, procedure, process, system, method of operation, concept, principle, or discovery . . . .” 17 U. S. C. §102(b). Second, the Act provides that a copyright holder may not prevent another person from making a “fair use” of a copyrighted work. §107. Google’s petition asks the Court to apply both provisions to the copying at issue here. To decide no more than is necessary to resolve this case, the Court assumes for argument’s sake that the copied lines can be copyrighted, and focuses on whether Google’s use of those lines was a “fair use.”
"any idea, procedure, process, system, method of operation, concept, principle, or discovery" sounds suspiciously like an API. Continuing:
Pg. 3-4
'To determine whether Google’s limited copying of the API here constitutes fair use, the Court examines the four guiding factors set forth in the Copyright Act’s fair use provision... '
(1) The nature of the work at issue favors fair use. The copied lines of code are part of a “user interface” that provides a way for programmers to access prewritten computer code through the use of simple commands. As a result, this code is different from many other types of code, such as the code that actually instructs the computer to execute a task. As part of an interface, the copied lines are inherently bound together with uncopyrightable ideas (the overall organization of the API) and the creation of new creative expression (the code independently written by Google)...
(2) The inquiry into the “the purpose and character” of the use turns in large measure on whether the copying at issue was “transformative,” i.e., whether it “adds something new, with a further purpose or different character.” Campbell, 510 U. S., at 579. Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language .... The record demonstrates numerous ways in which reimplementing an interface can further the development of computer programs. Google’s purpose was therefore consistent with that creative progress that is the basic constitutional objective of copyright itself.
(3) Google copied approximately 11,500 lines of declaring code from the API, which amounts to virtually all the declaring code needed to call up hundreds of different tasks. Those 11,500 lines, however, are only 0.4 percent of the entire API at issue, which consists of 2.86 million total lines. In considering “the amount and substantiality of the portion used” in this case, the 11,500 lines of code should be viewed as one small part of the considerably greater whole. As part of an interface, the copied lines of code are inextricably bound to other lines of code that are accessed by programmers. Google copied these lines not because of their creativity or beauty but because they would allow programmers to bring their skills to a new smartphone computing environment. The “substantiality” factor will generally weigh in favor of fair use where, as here, the amount of copying was tethered to a valid, and transformative, purpose.
(4) The fourth statutory factor focuses upon the “effect” of the copying in the “market for or value of the copyrighted work.” §107(4). Here the record showed that Google’s new smartphone platform is not a market substitute for Java SE. The record also showed that Java SE’s copyright holder would benefit from the reimplementation of its interface into a different market. Finally, enforcing the copyright on these facts risks causing creativity-related harms to the public. When taken together, these considerations demonstrate that the fourth factor—market effects—also weighs in favor of fair use.
'The fact that computer programs are primarily functional makes it difficult to apply traditional copyright concepts in that technological world. Applying the principles of the Court’s precedents and Congress’ codification of the fair use doctrine to the distinct copyrighted work here, the Court concludes that Google’s copying of the API to reimplement a user interface, taking only what was needed to allow users to put their accrued talents to work in a new and transformative program, constituted a fair use of that material as a matter of law. In reaching this result, the Court does not overturn or modify its earlier cases involving fair use.'
That's John Searle's thought experiment actually. Hilary Putnam had some thoughts in reference to it along the lines that a brain in a vat might think in a language similar to what we would speak, but the words of that language would necessarily encode different meanings due to the different experience of the external world and sensory isolation.
And this applies to everything, not just source code.
I’m just presuming we have a future where you can consume unique content indefinitely. Such as instead of binge watching Star Trek on Netflix you press play and new episodes are generated and played continuously, 24/7, and they are actually really good.
While headway has been made in photo algorithms like StyleGAN, GPT-3's scriptwriting, and AI voice replication, we aren't even close to having AI-generated stick cartoons or anime. At best, AI-generated Star Trek trained on old episodes would produce the live-action equivalent of limited animation; it would reuse the most-liked parts over and over again and rehash the same camerawork and lens focus that you got in the 60's and the 90's. There wouldn't be any new planets explored, no new species, no advances in cinematography, and certainly no self-insert character (in case you wanted to see a simulation of how you'd fare on the Enterprise). It wouldn't add anything new as far as I can see. Now if there was some way to recreate all the characters in photorealistic 3D with Unreal Engine, feed them a script, and use some form of intelligent creature and planet generation, you might get a little closer to creating a truly new episode.