Hacker News
Microsoft sued for open-source piracy through GitHub Copilot (bleepingcomputer.com)
322 points by redbell on Nov 5, 2022 | 288 comments



Microsoft's major arguments:

- as long as the reuse of code is transformative enough, they consider it fair use

- yet Microsoft's Copilot is not trained on Microsoft's own codebases, nor on the Windows source code

- yet Microsoft files DMCA takedown requests on Windows source code leaks

As long as these hold, Microsoft's argument amounts to "everything is considered fair use, except our stuff", which is flawed in a lawsuit. That is not how copyright works.

That is why Copilot has to be considered an attack on a specific group of software, and a specific group of people. And that is my own opinion.

Either get rid of copyright altogether (including Microsoft's arsenal of patents) or don't. But don't try to play the hero of open source while you are stealing their stuff, specifically.


GitHub source code is different from Windows code though. This isn’t a moral argument but a dry legal one. Microsoft doesn’t control the rights for a lot of the Windows source code because it isn’t their copyright to begin with. Meanwhile, even private repos on GitHub are subject to uses like this through the terms of service, iirc. So from a moral perspective I’d agree with you, but I don’t see how this could fly legally. I’m pretty sure Microsoft has had six armies of IP lawyers independently come to this conclusion.


Terms of service do not mean the forfeiting of users' copyright. There is an argument about the extent to which snippets of code are fair use versus when the copying becomes excessive and violates the copyright.

You are right though, battling Microsoft in court will be no easy fight, and it will probably be a long one. However, I think the sharks are in the water and smell blood (money), as many lawyers see weaknesses in Microsoft's arguments and justifications (excuses).


You run into other problems here, though, with people mirroring code on GitHub that they do not have the rights to re-license, which would be required for accepting the terms of service. E.g. if you take a GNU project hosted at Savannah and maintain your own independent fork of it on GitHub for whatever reason, you cannot re-license the bit that is owned by the FSF to conform with the GitHub ToS without getting the FSF's permission (and for other projects without a single copyright holder, every single contributor's). This ofc isn't Microsoft's or GitHub's fault but that of the people making the GitHub repo, but it still creates a problem of broken licenses.

Their FAQ also has some ambiguity on what it has been trained on:

> It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.


> Microsoft doesn’t control the rights for a lot of the Windows source code because it isn’t their copyright to begin with.

I know "we don't own the source" came up before with OS/2, but I'm pretty sure Microsoft owns the bulk of the Windows source code (maybe not in the hardware-driver department). As I mentioned in another comment, though, it has trade secret protection: they simply have never published or licensed it.


"The rain it raineth on the just

And also on the unjust fella;

But chiefly on the just, because

The unjust hath the just’s umbrella."

I always think of this one whenever the "the law applies equally to all" argument comes up.


I feel it's important for this discussion to point out that the actual GitHub source code is, AFAIK, (mostly?) closed source, unlike the open-core GitLab. (No idea whether it's crawled by Copilot?)


> yet Microsoft's Copilot is not trained on Microsoft codebases, and not on Windows source code

I don't think Copilot is trained on any private GitHub repository or anything secret, not just Microsoft's private code.


A major bug in your analysis: Microsoft doesn't publish the bulk of the Windows source code; it's a trade secret. It's still covered by copyright, but fair use is absolutely not in the picture, and whoever does publish it has no license whatsoever.

(and I've hated Microsoft and other monopolies from before 80% (90%?) of you were born, so this isn't me defending Microsoft.)


> It's still covered by copyright, but fair use is absolutely not in the picture, and whoever does publish it has no license whatsoever.

Remind me again, why is the reuse of my GPLv3-licensed code considered fair use, while the reuse of Microsoft's EULA-covered code is not?

If a public repo doesn't carry a very permissive license, Copilot is still committing copyright infringement. And violating GPL-style licenses is still a license violation, which nullifies the right to use the code.


> including Microsoft's arsenal of patents for it

You'll probably be surprised to know that Microsoft is a member of both OIN and LOT: https://www.redhat.com/en/blog/red-hat-welcomes-milestone-ad...


Even if this succeeds (I sincerely hope that it does not, as I believe these to be sufficiently transformative as to be considered fair use), it doesn't functionally matter in the long term:

1. Running and training these models is eventually going to be plausible on a home machine with even modest hardware.

2. These models, e.g. one trained on all of the code on GitHub, will be publicly available via torrent or whatever.

3. People will be able to run it locally as an integrated plug-in in their IDE of choice: JetBrains, VS Code, etc.

4. You'll never know if somebody has lifted a bit of code in violation of a license, any more than you would be able to tell if somebody copy-pasted from Stack Overflow without attribution in any commercial application.


It's not transformative at all, nor quoted for study/discussion, and that's exactly what makes it a problem.

If the snippets merely had attribution, they'd be both legal and acceptable to most open source authors.

The code is not transformative because the quoted code is not used for some other purpose, like as part of an article discussing whatever the code does; it is used to do exactly its original job.

If you wrote a book by quoting choice paragraphs from other books, without crediting any of them, presenting the new book as simply your pile of deep insights, that is not transformative, even though your book is not the same as any single one of the source books. It's also not fair use, even though all the quoted bits are short, both because of the usage and the lack of attribution.

We could have something like Copilot just fine if it were done above board, but as it is right now GitHub is simply an outlaw, with no excuse at all.


I read these kinds of comments on HN and I wonder if the people complaining about copilot have actually used the thing for more than 5 minutes?

Most of the time copilot is giving you one-line completions that are heavily adapted to the surrounding code. It’s not “writing a book by quoting choice paragraphs from other books”. It’s fancy autocomplete.

You can direct copilot to give you anything you ask it for. If you ask it for an entire function, and you don’t give it any context, then you are more likely to get a snippet from inside its memory. But even when this happens you are unlikely to keep the entire snippet unmodified because the chance that it is what you were looking for is low.


There are countless examples of people just letting Copilot's "autocomplete" run wild and it ends up reproducing many lines of code, often verbatim, and not just single lines of code.


You need to bait Copilot really hard to make it output that. To the point where it would be much easier to locate and copy-paste the code directly from the source yourself. It doesn't happen accidentally and it's not how you normally use Copilot.


A lot of people seem to think changing the variable names makes it new code.


The count is like 5


Considering how easy it is to generate additional examples, I think it's higher than 5. Five big, popular blog posts, maybe; but there are way more than 5 examples just on Twitter.


> you are unlikely to keep the entire snippet unmodified

Does that matter? It's not the end users who are being accused of copyright violation, it's the tool itself.


If the argument is that the tool is facilitating copyright infringement by users then it matters whether or not the users actually keep the supposedly infringing code, yes. If the code gets deleted immediately afterwards by the user then the claim of infringement by the user is farcical.

If the argument is that copilot is infringing because it merely displays an allegedly copyrighted snippet for a few moments, even if the user does not accept the completion, or accepts it but then deletes it, then that is a completely different argument because there is very little identifiable damage to the authors, which bolsters an argument of fair use.


The tool itself is infringing copyright. It possesses and reproduces copyrighted code without attribution. Full stop.

If I steal your television, it doesn't matter how much I get for it at the pawn shop.


You can Google and find GPL'ed code. That doesn't mean Google is facilitating license violation, even though the Google search index possesses and reproduces GPL'ed code.

> It possesses and reproduces copyrighted code without attribution.

To be clear - attribution does not fix GPL license violation.


> You can Google and *find* GPL'ed code.

Emphasis mine.

This is different from asking Google and getting code directly from Google without the license - that is, without any indication that this code was taken from somewhere that has a license on all parts of it, not just the parts that Google didn't bother to show you.


Copyright doesn’t go away simply because you edit what was copied; deleting it may work, assuming you don’t recreate too much of the original work.

There is a concept in copyright law of derivative work, which means simply editing Harry Potter isn’t enough to publish it without paying for the rights to do so. Even highly transformative fan fiction can run into this issue when much of the original work’s characters and setting remain.

Disney gets away with this stuff by copying public domain works; Copilot and its users don’t have that defense.


The key word is fan fiction. Almost every single word in fiction is a creative expression and thus copyrightable.

But works of non-fictional literature (of which source code is an example) have to work harder to justify their copyright.

Many of the snippets of code generated by copilot contain zero creative expression. They are just mechanical implementations of an algorithm.

You mentioned setting and plot. The source code equivalent is architecture. Copilot generated snippets are usually too short and too algorithmic to have their own architecture.

What makes a derivative work derivative is that it copies creative expressions from the parent work. If you edit them out, then it’s not a derivative work anymore.


Variable names are just as subject to copyright as character names, and specific implementations of algorithms can vary quite a bit. Just look at the homework assignments of any introductory programming course.

Which is why even very short snippets of code have been an issue.


But what if the algorithm has been designed precisely to extract that creative expression from the original work?

The use is then only helping develop the algorithm, adding more value to the product by improving its extraction of the creative expression.

Copilot is a lose/lose game for developers, regardless of which end they are on.


> it matters whether or not the users actually keep the supposedly infringing code

That's not how this works.

If I give you a blob of uniformly random bits and then give you a passphrase you can use with a tool to turn that into a Hollywood movie, and you do so, and then you delete it, that isn't not copyright infringement.


> If I give you a ... Hollywood movie, ... that isn't not copyright infringement.

It is.


I'm no lawyer, but judging how people reacted to DeCSS (09F9 1102 9D74 E35B D841 56C5 6356 88C0, anyone?)… why do you say this isn't copyright infringement?


Pretty much: just as when you buy a bike and it turns out it was stolen, it's still not your bike, someone who blindly accepts what turns out to be a chunk of someone else's code can still be unknowingly guilty of copyright violation.


Most of the time I enter a bank I don’t rob it


The authors of these comments have never used it. Their understanding of the issue comes from having seen a tweet where someone coaxed CoPilot into almost-reproducing a piece of code that thousands of people have copy and pasted into their GitHub repos over the years, such as the quake square root code or that matrix multiplication code.

They also usually have a Luddite axe to grind.

If they succeed in getting CoPilot outlawed (obviously they won’t), then I will rent $50 worth of GPU time and train my own, like some kind of cyberpunk outlaw.


>Most of the time copilot is giving you one-line completions that are heavily adapted to the surrounding code.

Keyword being "Most of the time"


Sure. I used it for about twenty minutes before I turned it off because it seemed clear to me that it wasn't much different than copying code from open source repositories.


That is absolutely not my experience at all, and I wish people would stop with the hysterical hyperbole. The tool typically produces code that would only make sense in my codebase, because it uses internal types which only exist in my codebase. That is not and cannot be copyright infringement.


Sure, it's clever enough to swap identifiers correctly, but the shape of the code and techniques it uses were familiar to me from elsewhere, at times.

It's not unlike copying code from stack overflow and swapping the identifiers around.


How did you learn the shape of the code and techniques that you recognized? By looking at other code? Are you violating copyright when you are writing similar code yourself?


> By looking at other code? Are you violating copyright when you are writing similar code yourself?

It's certainly possible. Back when Phoenix reverse-engineered the IBM BIOS using published sources with a restrictive license, they did it by having one team read the sources and write a very detailed specification of everything that happens there, and then another team used that spec to write new code from scratch. They did it that way because if the first team were to write the code themselves, it is quite likely that the result would be legally considered derived work.


> By looking at other code? Are you violating copyright when you are writing similar code yourself?

Done in that order, it's likely a violation, but it depends on the concrete case, always.


From decades of looking at interesting code.

I take care to attempt not to duplicate code; it's why I don't use Copilot.


> the shape of the code and techniques it uses were familiar to me from elsewhere.

Isn't this exactly what you'd expect an AI coder to do?


That's precisely the problem with it. AI, as it is, can be considered a lossy compression technique with multiple document recovery, convolution and interpolation ability.


Can't humans be considered a lossy compression technique with multiple document recovery, convolution and interpolation ability?

What is the specific difference?


100% agree. This matches my actual usage of github copilot. All the drama around it seems to just be mostly for headlines to fuel the outrage machine.


This isn't my experience with Co-pilot's suggestions. I've literally been able to have Co-pilot suggest a complete unit test based on a novel structure I hand-coded myself and a few words describing the unit test. The constants are often wrong, but it saves minutes of fidgeting with the syntax for unit tests and assertions.

These are not quotations from other people's code but something about the deep structures of language and programming language semantics. However, I suspect if you knew enough of a snippet from other source you could coax Co-pilot to suggest code learned from that source, but it would likely be washed over by other code in the corpus where it coincided with meanings.


Worth noting with models like Copilot: if you deliberately give one an input similar to the training contents, odds are it'll reiterate them near verbatim.

The main issue is that while you can use copilot to create "new"/transformative code, it's also trivial to get it to pump out licensed works in a form where you could claim "I didn't know it was taken from x project with y license because the tool made it for me".

I personally have no problem with Copilot in concept; however, doing it (or any other AI-model-based text/graphics tool) without infringing on people's copyrights is practically an unsolved problem (excluding just pre-licensing the training data ahead of time).
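To make the regurgitation point above concrete, here's a toy sketch. This is emphatically not Copilot's architecture (that's a large transformer); it's just a character-level n-gram lookup, with an invented snippet standing in for a memorized repository. But it shows the same failure mode: when the training data around a prompt is sparse, the "most likely" completion is the one example, verbatim.

```python
from collections import defaultdict

def train_ngram(corpus, n=8):
    """Record, for every length-n context in the corpus, the character that follows it."""
    model = defaultdict(list)
    for i in range(len(corpus) - n):
        model[corpus[i:i + n]].append(corpus[i + n])
    return model

def complete(model, prefix, n=8, max_len=80):
    """Greedily extend the prefix; with only one training example, this regurgitates it."""
    out = prefix
    while len(out) < max_len:
        followers = model.get(out[-n:])
        if not followers:
            break
        out += followers[0]
    return out

# One "training" snippet stands in for a rarely-duplicated repository.
corpus = "float q_rsqrt(float number) { /* famous bit-hack here */ }"
model = train_ngram(corpus)
print(complete(model, "float q_r"))  # reproduces the snippet verbatim
```

A real language model generalizes far more than this lookup table, but when a context matches only one thing it has seen, the most probable continuation is still the original.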


I mean, you can prompt me (or any other engineer) to spit out copyrighted code. FizzBuzz comes to mind… as do a number of algorithms I’ve written in the past which belongs to my past employers…

I really think we are entering some interesting territory that will likely be an interesting can of worms.


There is something different, somehow, if you knew the entire code to QuickBooks from memory and had an API where I could request any 10-line chunk of it I wanted, as many times as I wanted.


So you’re saying that how fast someone can type and how well they can recall makes a difference? I don’t type over 120 words per min like my grandma, but I have a photographic memory. I can tell you what file & line a chunk of code belongs to, or spit it out verbatim, customized to the current problem I’m working on.

So, you’re saying I can’t work in this industry? That seems a bit harsh.


> So, you’re saying I can’t work in this industry? That seems a bit harsh.

If you're going to be spitting copyrighted code out in violation of any licenses it might be made available under… yes, you can't. Most employers would not appreciate that behaviour. But I doubt you actually do this, even though you're capable of it. You reason about your code; you're not just being a predictive text engine.


It sounds like for you in particular, yes, since you seem to want to go out of your way to find any way to violate copyright, even when the terms are intentionally generous. Indeed such a person should not work in this industry, though I'm sure there are plenty of employers who are happy to have you steal for them, so you will be able to regardless.


My point was that we all do this /not on purpose/ (and me being an exception, I can make sure I personally don't). But when I see code that existed in another company with some variables changed, I don't flag it. There are only so many ways to describe a chair; are they all copyrighted?


"My point was that we all do this /not on purpose/"

I don't concede the equivalence.


It's really simple. If you are outputting licensed code and not abiding by the terms of the license, then yes, that is a problem.


Companies already pay a lot of money for datasets to train models on in other spaces outside of software development. On top of that, they spend a lot of money on labelling and what not.

Software is unique in that there is a cultural trend to share source code, so that makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem; it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.


Not totally unique in that respect; the situation doesn't seem too dissimilar from the one that led Shutterstock to launch their contributor fund.


Commoditized*


To attribution I would add license compliance.


I agree but I'm saying that attribution would be all that's missing from compliance in most cases.


But not, I'll hazard a guess (IANAL), if it's a GNU GPL license. Right?


I didn't want to make an unreadably long comment by trying to solve every detail.

But the GPL just adds that the source be made available to any recipient of the code you used. That's pretty easy to arrange, because even a link satisfies it.

If copilot can be made to spit out attribution, then it can spit out links at the same time.

Another idea: Copilot could be changed to only include code whose authors opted in to an aggregate credit, where your new program only has to declare that it used Copilot and link to Copilot's training set along with your program's source, without trying to itemize each bit of output.

There could also be other versions of Copilot that include other code under other terms, like pure MIT or PD code where the original author already explicitly granted usage with no terms, or paid commercial code where GitHub paid the authors to be able to include their code for use in this way, maybe with terms where the end user does not have to re-share.


The "link" necessary here would be a link to the full source code of the software developed with the use of Copilot, distributed with every copy of said software.


> If you wrote a book by quoting choice paragraphs from other books, without crediting any of them, presenting the new book as simply your pile of deep insights

This is quite literally how most of the Old and New Testaments came into existence. The authors of Matthew and Luke certainly didn't bother mentioning that they copied a bunch of stuff from Mark and whatever the Q source is (my bet is on either the Gospel of Thomas or some source thereof). Nor did the authors of Leviticus and such mention how they ripped large swaths of what's now known as Mosaic Law straight from Hittite laws and the Code of Hammurabi. These works are nonetheless considered pretty transformative.


> This is quite literally how most of the Old and New Testaments came into existence.

That happened a few (thousand) weeks before copyright law was invented.

> The authors of Matthew and Luke certainly didn't bother mentioning that they copied a bunch of stuff from Mark and whatever the Q source is (my bet is on either the Gospel of Thomas or some source thereof). Nor did the authors of Leviticus and such mention how they ripped large swaths of what's now known as Mosaic Law straight from Hittite laws and the Code of Hammurabi.

The Q source (two-source hypothesis and four-source hypothesis both) is speculative, and there are several reasons to criticise the theory – such as lack of corroborating historical evidence that the Q source ever existed as a separate document.

That aside, Leviticus likely got its Hittite and Hammurabi influences via Mosaic Law, not (primarily) the other way around. (Legal systems influence other legal systems? Who knew‽) Merely saying the same thing isn't protected by modern copyright; facts are not copyrightable. (There is such a thing as "database rights", but that doesn't really come into play, here.)

> These works are nonetheless considered pretty transformative.

The merits of our current copyright regime aren't terribly relevant, here. If you want to change the law, change the law; don't try to have the existing law interpreted differently in select cases, especially if you're only doing that where it benefits already-powerful entities like Disney or Microsoft.


> don't try to have the existing law interpreted differently in select cases, especially if you're only doing that where it benefits already-powerful entities like Disney or Microsoft.

Who says it only benefits the already-powerful entities, or that it even benefits them at all in the long run? As evident from my other comments on this topic, my hope (and indeed expectation) is that they'll shoot themselves in the foot and be hoisted by their own copyright-infringing petard. It wasn't that long ago when copyright enforcement wasn't a thing, and that lack of enforcement didn't seem to stop people from creating all sorts of artistic and literary and musical works, great in quantity and quality alike. The more the megacorporations backtrack on their copyright absolutism and admit that ignoring copyright expedites creative work, the more ammunition the rest of us have against their continued insistence on copyright absolutism.

Put simply: Copilot sets a precedent that I suspect will lead to the downfall of intellectual property law entirely. Probably not alone, but very likely one of several dominoes toppling toward that end result.


> and that lack of enforcement didn't seem to stop people from creating all sorts of artistic and literary and musical works, great in quantity and quality alike.

It's why several of Shakespeare's plays are missing, and others are only available in modified form. Since there was no exclusive right of performance, playwrights (and other similar creatives) had to keep their works close to their chest. Copyright does exist for a legitimate reason.

The biggest problems with copyright today, are:

• it's far, far, far too long; and

• corporate gatekeeping allows nigh-monopolistic corporations to take authors' copyrights from them, meaning copyright law doesn't protect authors at all!

(Rebecca Giblin and Cory Doctorow discuss the latter in their book Chokepoint Capitalism, which I haven't read yet.)

> The more the megacorporations backtrack on their copyright absolutism and admit that ignoring copyright expedites creative work, the more ammunition the rest of us have against their continued insistence on copyright absolutism.

I'd love it if it worked that way. I hope you're right – but I believe you're wrong. We can let them do this backtracking without letting copyright law tilt even further in their favour.


> It's why several of Shakespeare's plays are missing, and others are only available in modified form.

If copyright enforcement was a thing in Shakespeare's time, then a lot more of his plays would be missing today; there would've been far fewer "modified forms" even partially preserving works for which the original was lost, and far less opportunity to produce copies of unmodified forms.

> I'd love it if it worked that way. I hope you're right – but I believe you're wrong. We can let them do this backtracking without letting copyright law tilt even further in their favour.

I guess what I'm getting at is that this doesn't tilt the law in their favor. Either Copilot ignoring the licenses of source material is deemed legal (at which point people are free to launder proprietary code through ML models, destroying copyright law as applied to software) or it's not (at which point a multi-billion-dollar corporation loses a revenue source). Lose-lose for corporations, win-win for the rest of us.


> If copyright enforcement was a thing in Shakespeare's time, then a lot more of his plays would be missing today; there would've been far fewer "modified forms" even partially preserving works for which the original was lost, and far less opportunity to produce copies of unmodified forms.

No. His plays were preserved and published by his friends, after his death, so that they weren't lost. Because the curtains had closed, the players had moved on, and the loss of his plays was deemed worse than the loss of monopoly over them. Copyright was originally created to remove the incentives to keep things private and secret; if you have exclusive rights to copy it, and that's protected by law, you're able to release those copies without fear of somebody taking what you've made and pushing you out.

Look at the playwrights who didn't have friends like Shakespeare's. Where are their works, now?

That being said, the world has changed since the 1600s. Copyright (with reasonable term lengths!) was necessary before we had the 'net, but perhaps not so much now.

> Either Copilot ignoring the licenses of source material is deemed legal (at which point people are free to launder proprietary code through ML models, destroying copyright law as applied to software)

The law doesn't work logically like that. If Copilot is deemed lawful, there will be justification of that decision that doesn't allow us to launder proprietary code through ML models. Because copyright isn't about author's rights, and hasn't been since… the 60s? Since some time after the Universal Copyright Convention of 1952, anyway; I'm no historian.


> No. His plays were preserved and published by his friends, after his death, so that they weren't lost.

You literally just said that there are plays of his which only survive in modified forms, i.e. ones which his friends were unable to preserve. Those modified forms are the ones that would disappear in this supposed alternate universe wherein Elizabethan England had copyright enforcement, because those modified forms would've been DMCA'd into nonexistence, and thus those plays would've been lost entirely.

> Look at the playwrights who didn't have friends like Shakespeare's. Where are their works, now?

Christopher Marlowe? https://en.wikisource.org/wiki/Author:Christopher_Marlowe

Ben Jonson? https://en.wikisource.org/wiki/Author:Ben_Jonson

Francis Beaumont? https://en.wikisource.org/wiki/Author:Francis_Beaumont

John Fletcher? https://en.wikisource.org/wiki/Author:John_Fletcher

Thomas Middleton? https://en.wikisource.org/wiki/Author:Thomas_Middleton

John Lyly? https://en.wikisource.org/wiki/Author:John_Lyly

The answer to your question seems to be: they were preserved just fine and dandy.

> If Copilot is deemed lawful, there will be justification of that decision that doesn't allow us to launder proprietary code through ML models.

A license is a license. If it's okay to ignore the terms of the GPL, then it's okay to ignore the terms of any other software license, including Microsoft's EULAs. The law works on precedents, and Copilot's legality sets a precedent that threatens the very concept of software copyright.

Even if Microsoft somehow avoids being immediately eaten alive for lunch by the myriad competitors who'd love nothing more than for Windows to lose its copyright protections (they do have plenty of money to spend on lawyers, after all), their best case scenario of "we somehow managed to carve out an exception for our EULA being enforceable that somehow doesn't apply to other software licenses" would demonstrate rather plainly the fundamental inconsistency in intellectual property law, making it all the easier to justify abolishing it entirely.


Thank you for the solid arguments.


I think your opinion is bad while also misrepresenting machine learning.

> It's not transformative at all, nor quoted for study/discussion, and that's exactly what makes it a problem.

This model works in the same way as GPT-3. It's just predicting what the next most likely word will be, given the previous words. It does this by creating a generalization over the content that was used to train it. This generalization should be similar (or close to the same) to what it would be with a completely different set of training data.

In the same way that you could create 60,000 new training images comparable to the MNIST dataset, and get a model that works at an equivalent level, you could do the same with source code. It doesn't matter where the original data is from because the end result should be very similar when models reach higher levels of accuracy.
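The "predict the next most likely word" mechanism can be sketched at its absolute simplest as frequency counting: a bigram table, nothing like the transformer Copilot actually uses, and the toy "codebases" here are invented. It illustrates the generalization claim: common idioms dominate the prediction regardless of which particular corpus supplied them.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def predict(follows, word):
    """Greedy decoding: return the single most frequent next word, or None."""
    counter = follows.get(word)
    return counter.most_common(1)[0][0] if counter else None

# Two different toy "codebases" that share a common idiom.
model = train("for i in range ( n ) : total += i \n for x in range ( m ) : s += x")
print(predict(model, "in"))  # 'range': the shared idiom dominates, whatever the source
```

Swap either toy corpus for another that uses the same idiom and the prediction for "in" is unchanged, which is the sense in which the learned distribution does not depend on any one source.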

> but as it is right now GitHub is simply an outlaw, with no excuse at all.

This is an insane view on machine learning. When you read content, neural weights are adjusted in your brain to keep it as a memory. Is that copyright infringement too? Do you now owe Disney royalties because you stored their movie in your memories?

It's also an incredibly harmful view. Machine learning is dependent on data. Services like search, translation, recommendations, etc. are all dependent on training data. And, you're making the case, that since you don't like that it was trained on source code, it should all be illegal.


> This model works in the same way as GPT-3: it's just predicting what the next most likely word will be given the previous words. It does this by creating a generalization over the content that was used to train it. This generalization should be similar (or close to the same) to the one it would have learned from a completely different set of training data.

It's not enough of a generalization to be acceptable IMO.

Let's see GPT-3 as something analogous to a brain. Human brains can do multiple things:

- learn through experience and re-use concepts, while writing completely original code,

- copy a program's structure without copy-pasting the code, effectively re-writing it,

- copy the exact code, with a few minor details changed.

Currently, Copilot does a lot of the last point [1]. So just like humans can choose different actions and the legality depends on the action, ML models can do different things, and these aren't automatically legal by virtue of coming from an ML model.

It's not enough to argue about how machine learning works. ML can do tons of things, and GitHub's current methodology leads to something close to copy-paste. Maybe it can learn a semi-original way to write common things like searching in a list, but for more exotic uses like complex algorithms, being unable to actually understand how to code, it basically has to act as a search engine for existing implementations.

> Do you now owe Disney royalties because you stored their movie in your memories?

That argument makes no sense. Nobody would be complaining if GitHub had just trained a model on that code. The potentially illegal part is offering a service that creates "new" Disney-like movies by assembling parts of Disney's IP.

[1] https://twitter.com/docsparse/status/1581461734665367554


> It's not enough to argue about how machine learning works. ML can do tons of things, and GitHub's current methodology leads to something close to copy-paste. Maybe it can learn a semi-original way to write common things like searching in a list, but for more exotic uses like complex algorithms, being unable to actually understand how to code, it basically has to act as a search engine for existing implementations.

I don't disagree that it is over-trained on certain sequences of words, but, overall, I think the generalization is fine. It's often pre-prompted with content it has not seen before, resulting in unique new content. There's nothing copy-pasted about this, just a statistical understanding of what usually happens next.

If the pre-prompt is something very specific, e.g. "a dog runs in the park during the middle of the day, while the sky is the color _____", it will obviously output blue. The same can be true when there are only a few well-known algorithms that have been used more frequently than others. And, in the cases where it does commit something arguably comparable to copyright infringement, the fault would probably be on the programmer, not the model, for deciding to use it.

> The potentially illegal part is offering a service that creates "new" Disney-like movies by assembling parts of Disney's IP.

In very specific cases only does it output copyrighted content. Most of the time it is just outputting a generalization of what is expected. It isn't Disney's content, but human content. Also, just creating content that has some similarity isn't copyright infringement. Satire has been well accepted as fair use.


You are assuming that laws/rights operate on basic properties like learning or remembering, but they operate on goals (e.g. humans can run, but there is no speed limit for a pedestrian. But if we learned to run at 100mph somehow, a limit would be introduced). The goal is to motivate creators to create on terms that would be fair enough and useful for a society. If you “generalize” their work at such scale, it is a very different situation, like when you are using a sidewalk for robots running at 100mph and demotivate anyone from using it.

> Services like search, translation, recommendations, etc. are all dependent on training data

There is an obvious benefit in being represented in search results and recommendations.


> If you “generalize” their work at such scale, it is a very different situation, like when you are using a sidewalk for robots running at 100mph and demotivate anyone from using it.

I don't understand what you're trying to state.

If you make fun of someone on national TV, they won't like it either. That doesn't mean it should be illegal.


The point of attribution and copyright is to create a creator/inventor-friendly environment. The copyright law used realistic points in a problem space to provide it. These new AIs enlarged the problem space, but it doesn’t mean that the initial idea of copyright is not applicable to them by default, or should not be. It’s okay to vote for or against that, but the potential systemic effect of that vote should also be kept in mind.

I hope this makes it clearer why I used this analogy.


> The point of attribution and copyright is to create a creator/inventor-friendly environment.

They're building a class, but beyond that I'm not really confident there are any parties that feel they've been harmed by what's claimed to be a violation. Microsoft didn't just train on repositories of GPL-licensed code, but on other companies' code too. Why didn't those companies care?

It is a good counterpoint though, as Waymo has released datasets that can only be used under a license, while GitHub did not initially use discretion for people who had agreed to its terms of service and were using its hosted repositories.

While I think the legal system might lean towards enforcing consent to use the data in certain circumstances, on the technical side, if the model really is a generalization resulting from training, the initial data is meaningless in the final model. It could be argued that it did not harm this company economically, because a model trained without that data would still have an almost equivalent financial impact.

> It’s okay to vote for or against that, but the potential systemic effect of that vote should also be kept in mind.

My position above was clear: I do not think the law should be changed yet. The systemic change comes from someone asserting a violation where I do not believe one occurred. This lawsuit could result in models like Stable Diffusion becoming illegal, which I view as incredibly harmful to the future of artificial intelligence research.


What if we extrapolate a little into the future? Copilot-likes and SDs become useful tools for knowledge/mastery extraction and application, much better than today's. Will people want to create new knowledge, or will they be satisfied with reshuffling the existing stock? Will opinions change once people realize there is no point in creating something new (it will be generalized away), and that what exists is easily reproduced without them (the classic "took our jobs")? Can it all happen? How do you see it?


Incrediblest full of all the harmfulsts! Not only Machine learning:

"Watching digital Walt Disney movies is dependent on data."

It is illegal as the attribution is missing. Also copyright applies, etc.; see the lawsuit. It's less a technical thing, more a legal one.


> Incrediblest full of all the harmfulsts! Not only Machine learning:

I do not agree that it's a good idea to make using copyrighted data for training illegal. The people who are upset that GPL code was used for training should be happy that a large corporation believes that training on copyrighted works is fair use. It's a more democratized position.

> It is illegal as the attribution is missing. Also copyright applies etc. see the lawsuit. Its less a technical thing, more a legal one.

I think in reality, this is just a lawyer or group of lawyers who realized they had a case and a reactionary audience to possibly make them millions of dollars.


If you wrote a book by clipping small bits of others' work and assembling it into something cohesive that would most definitely be fair use.


Incorrect, if you did what I actually said, which mimics what copilot does.


> The code is not transformative because the quoted code is not used for some other purpose like as part of an article discussing whatever the code does, it is used to do exactly it's original job.

Hmm..

> The printing press is not transformative because the printed text is not used for some other purpose, it is used to do exactly it's original job.

See the error in your logic? The potentially transformative part is not the code itself. It's the impact to the process of creating the code.


For copyright purposes the printing press is not transformative. No contradiction here.


Well, lucky for us, paragraphs of prose are artistic expression, and the snippets of code returned by Copilot are utilitarian.

The only things covered by copyright are creative choices. In software, as established by Whelan v. Jaslow [0] this is the structure, sequence and organization [1].

The district court ruling in the Whelan case drew on the established doctrine that even when the component parts of a work cannot be copyrightable, the structure and organization of a work may be. The court also drew support from the 1985 SAS Inst. Inc. v. S&H Computer Sys. Inc. in which it had been found that copyright protected organizational and structural details, not just specific lines of code. Structure, sequence and organization (SSO) in this case was defined as "the manner in which the program operates, controls and regulates the computer in receiving, assembling, calculating, retaining, correlating, and producing useful information." SSO refers to non-literal elements of computer programs that include "data input formats, file structures, design, organization and flow of the code, screen outputs or user interfaces, and the flow and sequencing of the screens."

However:

The Whelan decision initiated a period of excessively tight protection, suppressing innovation, since almost everything other than the broad purpose of a software work would be protected. The only exception was where the functionality could only be achieved in a very small number of ways. In these cases there could be no protection due to the merger doctrine, which applies when the expression and the idea are inextricably merged. Later the same year, in Broderbund v. Unison the court cited Whelan when finding that the overall structure, sequencing, and arrangement of screens, or the "total concept and feel", could be protected by copyright.

For a brief overview of the merger doctrine:

https://www.lucysnyder.com/index.php/copyright-law-merger-do...

So this excessively tight protection was rectified in CAI Inc v Altai [2], which established the legal doctrine of Abstraction-Filtration-Comparison [3] which says "As we have already noted, a computer program's ultimate function or purpose is the composite result of interacting subroutines. Since each Subroutine is itself a program, and thus, may be said to have its own 'idea,' Whelan's general formulation that a program's overall Purpose equates with the program's idea is descriptively inadequate."

[0] https://en.wikipedia.org/wiki/Whelan_v._Jaslow

[1] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

[2] https://en.wikipedia.org/wiki/Computer_Associates_Internatio....

[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow

Sure, there’s a sense in which copilot is just an alternative way of finding these snippets; if someone used Google to find a piece of code then chose to copy-paste it, the result would be the same. The code is out there on the internet, being indexed - and incorporated into training models.

The difference is that Google will (probably, if they don’t just yank the content into a snippet on the Serp) link you to the GitHub source, where the license should be clearly visible, so if you copy paste it you know what you’re doing. Copilot produces the code without attribution.

But it’s a narrower difference than we might actually wish. What is it about google’s pointing you to a line of code in a GitHub repo that absolves Google from any responsibility to tell you that the code it sent you to might be copyrighted? Presumably, when you land on that page you are expected to recognize from context that this is probably licensed code, and go look for the license terms. Why can’t copilot also rely on that same ‘user must recognize that some of the results produced might be copyrighted’ logic?

Worth noting that Google image search has long presented its results under a caption pointing out that the results may be copyrighted and you need to do your own research before using any images it produces based on your query. How different is searching its image index and producing potentially copyrighted images it found online, from copilot's search of 'how to code' doing the same thing?


> Why can’t copilot also rely on that same ‘user must recognize that some of the results produced might be copyrighted’ logic?

Google shows you where you can find the copyright and license information. GitHub Copilot does not.

> Worth noting that Google image search has long presented its results under a caption pointing out that the results may be copyrighted and you need to do your own research before using any images it produces based on your query.

Because there's a common misconception that "Google images" are legal to use for any purpose. Afaik, Google has no obligation to correct people on this; nonetheless, they choose to do so.


Google search sends you to where it found the code.

That the copyright and license information are maybe available there is not something Google search actually knows to be the case.

That information might not actually be there, and Google does not care either way.

Similar to photos, there's maybe a common misconception that code produced by copilot is free to use for any purpose. Is copilot under an obligation to point out that that might not be the case?


> That the copyright and license information are maybe available there is not something Google search actually knows to be the case.

Google sends you to where it found the code, on a best-effort basis. Google can't actually, in the general case, do any better than this. Google is a search engine; it's widely understood that, as a search engine, it looks for things and then tells you where it found them (not where they originated from; that's not what the verb "search" normally means, in English).

> Is copilot under an obligation to point out that that might not be the case?

GitHub Copilot is not branded as a search engine. It doesn't give people the option to identify, or verify, the terms under which the output it produces may be used. It doesn't provide any way of discovering the attribution, even in principle.

Normally, when something is given to you without any copyright markings, without any attribution of any kind, or licensing information, or a source:

• the computer probably came up with it on its own, just from the input you gave it; and

• you're free to use it.

Neither is the case with Copilot. GitHub Copilot goes against the expectations of the user; its UX design is actively misleading. This misconception only exists because Microsoft created it.


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow without attribution in any commercial application

Wasn't this the subject of millions of dollars of litigation between Oracle and Google?


For others like me, missing context:

> Google’s copying of so-called application programming interfaces from Oracle’s Java SE was an example of fair use, the court held (…).

Source: https://edition.cnn.com/2021/04/05/tech/google-oracle-suprem...


> Wasn't this the subject of millions of dollars of litigation between Oracle and Google?

That was a very different case. It was about whether an API/interface is copyrightable. The outcome of the case was that the implementation was obviously copyrightable, but the interface was not. The background was Google creating its own custom implementation of the JVM.


The outcome (SCOTUS decision) did not decide whether the interface was copyrightable or not. It decided that if the interface were copyrightable, it'd be fair use anyway, so there's no need to bother resolving that question.


I was thinking this, but it is almost the inverse. APIs without their implementation are not proprietary, but is an implementation without its API (and surrounding structure) proprietary?


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow without attribution in any commercial application

It's obvious that you've never endured any sort of deep code analysis for license "problems" because then you would know that it is in fact comically easy to tell if someone did this. It happens all the time.

While some detections are false positives, this is in fact why such deep analysis is performed: because getting caught selling a commerical application you don't have the rights to sell can destroy value very rapidly. It halts mergers and acquisitions. It gets products pulled off shelves. It gets people fired.


This doesn’t fit the definition of fair use, if that’s what you are implying. Fair use is a balance between four factors, transformative nature is just one factor.

To satisfy section 107 of the U.S. Copyright Act, you need to look at four factors. The first is the purpose and character of the use, which is where a determination is made as to whether a work is sufficiently transformative to be considered fair use of copyrighted material. There are other considerations in this test, however: a court will look at what sort of commercial interest and benefit is derived from the use of the material, and even then it may decide that a non-profit motive is not sufficient to satisfy this factor.

The second factor a court will look at is the nature of the copyrighted work. In the case of open source material, it is already published, but most licenses quite rightly ask for attribution. A court may decide that not satisfying the attribution requirement causes it to fail this criterion and prevents it from being fairly used. However, on this point I feel Copilot may be on steadier ground, as the work is already published.

Fair Use doctrine also factors in the amount and substantiality of the portion being reproduced. It is the substantiality issue that will likely trip up Copilot, as the code snippets in use may be complex enough to be considered a substantial part of the code. A court may well consider that an algorithm that took time to create and test, although only a small part of the codebase, took substantial enough effort to create that its use in a larger derived work should be looked upon dimly.

The fourth factor in play here is the one being argued by the litigant, and I feel will be extremely hard for GitHub to justify; that factor is the effect use of the code has on the potential market or value of the code. Open Source projects rely on volunteers and contributions. Substantial effort is made to actively maintain projects. It is not within the spirit of most projects to have people use unattributed code snippets, because it reduces the value of contributions and improvements to the code project. Improvements and contributions are reduced, and it is also hard to distribute improvements and fixes to any project that uses the code. In this regard, CoPilot seems to be completely violating Fair Use doctrine.

So as you can see, Fair Use is way, way more complex than most people realise. Many people think that a sufficiently transformative use of copyrighted material is enough to satisfy a Fair Use defence, but in reality it is not.


Exactly. It's like people calling for Dall-E to be banned. Well Stable Diffusion exists, as do a dozen other similar projects and an entire ecosystem of training data and models. The march of technology rarely stops because of people crying "copyright!"


> "The march of technology"

Who are the stakeholders making the machine run? Who owns the machine?

This is much more subjective than clinical rhetoric praising tech might let on.


The BigCode project is coming, and it's going to be open-sourced the same way Stable Diffusion was.


We can't expect massive improvements in computer performance anymore, so the day will never come when a household computer can finish training a model equivalent to GitHub Copilot in a practical amount of time.


They can just add more cores and more power. With modern languages like Rust making multi-threading more accessible, I expect we will double down on this. You could also crowd source this, some distributed application where everyone puts their home machines towards training.


> You could also crowd source this, some distributed application where everyone puts their home machines towards training.

That's how Leela Chess Zero (Lc0) replicated AlphaZero's performance. In fact, this is actually not that difficult, assuming you have the means to orchestrate it: all it takes is loading the weights and a batch, computing backprop, and submitting the result to a central system, which aggregates the gradient updates, applies them to the whole network, and pushes new weights (kind of how Bitcoin creates a new block).

This is no different from gradient accumulation, just "distributed". In fact, the system could offload a large number of batches, because the returned update is constant-size regardless of batch size; it's just that the constant is the size of the network.
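
A minimal NumPy sketch of the aggregation step described above, with made-up names (worker_gradients, lr, weights); real systems like Lc0 add validation and fault tolerance, but the core step is just averaging fixed-size gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.zeros(4)  # shared model parameters (tiny for illustration)

# Each worker backprops on its own batch and returns one gradient whose
# size equals the network's, regardless of how many samples the batch held.
worker_gradients = [rng.normal(size=4) for _ in range(8)]

# Central system: average the workers' updates and take one step; the new
# weights would then be pushed back out to the workers.
lr = 0.1
avg_grad = np.mean(worker_gradients, axis=0)
weights -= lr * avg_grad
```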


I’d argue that running it on your local machine as your own copilot is a better proposition than what is offered now.

There is still the issue of setting boundaries on what to learn from, but it is much more aligned with you taking responsibility for what your copilot does or doesn't do, and it might give a better window of opportunity to customize it to your needs.

We’d lose economies of scale, but also lose infringement at scale, and gain individual control. Net positive overall?


Any ML model that decides to use my code as part of its code suggestion functionality has a fool for an algorithm.


5. Microsoft, a $1.5e12 company, will not profit from the technology (as easily).


If that is so, people will stop writing open source. Then this model will use its own output until it produces bland garbage like a camera filming its own output stream.


Linux has a lot of corporations contributing code to it. Tensorflow is backed by Google. Plenty of other projects are released as open source by companies, not people. I don't see companies crying about Copilot, so their employees will keep writing open source when instructed as such by their managers.


I write open source code to benefit others and to put more learning material into the world. If I were so paranoid about other people using it, I would have made it proprietary.


“People will stop writing open source” is far from being self-evident.


The desired class-action status of this suit feels like a stretch to me. Typically, in a class action, the defendant's alleged behavior unquestionably harms the members of the class, which is why it's okay to make them party to the suit without their consent.

In this case, I suspect that only a minority of the members of the purported class actually believe they were harmed by the defendant's actions, and some class members are _paying customers_ of the defendant's product based on the conduct at issue. (Which surely creates some kind of estoppel issue, aside from the dubiousness of the class certification in the first place.)

Could some enterprising attorney get a bunch of members of the purported class to sign on to a brief opposing class certification? (I don't know if that would be an amicus or what, but surely those people have some kind of standing here.)

(Enterprising attorneys: my email address is in my profile.)

I guess the problem here is with "enterprising" -- there's no money there, except to Microsoft, who I imagine might be barred by legal ethics from funding something like this...


Ultimately losing a case like this might be better for super large players like Microsoft in the long run: they'll staple blanket permissions (and maybe even indemnities for third party code you submit) in the terms of service and go ahead training their own proprietary models.

But players from startups to open source projects that lack market power will then face an impossible moat if they want to develop their own similar models.


> Could some enterprising attorney get a bunch of members of the purported class to sign on to a brief opposing class certification?

Yes, I'd sign onto this (and I have open source code on GitHub that presumably (hopefully!) is being included).


CopyLoot model of business ruins Open Source model of business (AKA Open Core) and Open Source movement in general.


Letting trillion-dollar companies freely grind up the collective creative output of planet earth into digital meat slurry to fuel automatic content-generators seems like a bad thing.


> Letting trillion-dollar companies freely grind up the collective creative output of planet earth into digital meat slurry to fuel automatic content-generators seems like a bad thing.

The description is pretty loaded, though.

"Grind up" evokes Microsoft taking away the creative output and destroying it forever to create an inferior product. In fact, the original code is still there; you just also have a smart auto-completion tool built with it.

Yes, there are concerns about license laundering and corporations taking advantage of open-source code without contributing to it, but these concerns are a lot more subtle and nuanced than "megacorporations are grinding up the collective output of the planet into slurry".

My hot take: piracy everywhere, and screw intellectual property. It does more good for the world if anyone is allowed to build upon what anyone else has done before without having to ask for permission first, even if it makes monetization harder.


Copyright is indeed pretty fucked. Megacorps repackaging and selling products created by indies is also pretty fucked. I'm pro "freedom of information" but we need laws around the sale of information to prevent that sort of thing.


And letting them do this by exposing your minute code edits in response to issues as a developer, to make your job eventually redundant and have it taken over by ML, seems like an even worse thing.


AI taking jobs is not an argument.

License violation is.


> to make your job eventually redundant and have it taken over by ML

This will last until the programmers run out. These systems will have a bad time writing in any language they haven't got extensive training data in.


Software is only going to become more ubiquitous, programmers won't run out, they'll just shift up to higher level abstractions and more diagnostic/administrative work. Think "AI doctor" and "AI manager" rather than code slinger.


Would be nice if they released the model, stable diffusion style. In fact, this kind of legal action could prevent disclosure of any AI models in the future.


Funny how it's "Microsoft GitHub" when something bad is being said about it. But it's just "GitHub" for positive things.


“A noteworthy discovery was made today in Edinburgh, marking another great British breakthrough.”

“In London today, further advancements were made in <blah> technology, showing the strength of English academia.”


Guess how your son got himself into trouble now

Our son is on the honor roll


I feel that deep down inside, Github is dead or will be dead, we just don't yet have anywhere to go...

That's why people talk like that. There's the GitHub the world loved, then there's the Micro$oft version, which we're starting to see problems with.

I think "GitHub" is the good old place; Microsoft GitHub is the problematic version we're going to see more of.


GitHub was stagnant before and GitLab had basically overtaken it. Since the acquisition they have massively picked up the pace and delivered several exceptionally good features.

GitLab is still a very viable alternative but there is just very little reason to want to leave GitHub. The vast majority of devs either don’t care about or enjoy copilot. It’s just a small number of HN whiners who make most of the noise on the topic.


I think Copilot is very much in sync with the GPL's idea of copyleft and learning from code. All projects using Copilot should therefore also be GPL or AGPL.


It isn't compliant with GPL's attribution requirements. If it were made so, then yeah, I'd agree – with the caveat that it'd also have to be MIT, BSD, Apache etc. compliant.


Everything GitHub has done since Microsoft's acquisition has been great! The product is getting so much better – Copilot is wonderful, Actions are great, issue improvements. More please!


Sounds like you might like: https://sourcehut.org (it's also completely free of crypto/web3 schemes, as an added bonus)

If GitHub (M$) continues this path of stomping on licenses/copyright of the people who made it great in the first place (i.e repo authors/contributors), I'm definitely gone.


Oh they're "YOUR KIDS" when they misbehave, huh??


Anyone who has known Microsoft from back in the day knows attributing anything positive to them is likely an error


Yes, we get it. Microsoft bad. Very good. The reality is that quite a lot of “us” are still using GitHub, including the Microsoft-built/directed parts of it, willingly. A company can be hideously evil and still make a single useful thing in ~50 years.


I use it willingly, but I feel bad. I hate that Microsoft owns it. But it's essentially a social network (at least in the way that I use it), and network effect dictates that I use it (as opposed to, say, GitLab). :c


Keeping ancient software working without rebuilding is quite impressive, to be fair.


Which ancient software?

All of them have been rebuilt IMO.


Windows executables. Applications not touching hardware should still run on any newer version of Windows, provided that Windows build still supports the instruction set; e.g. the original Windows 1.0 "Hello World" demo (a 16-bit exe compiled on Windows 1.0) still runs on 32-bit Windows 10 [1].

1. https://virtuallyfun.com/2020/05/22/examining-windows-1-0-he...


Oh stop it. Microsoft has made/done plenty of positive things, you're just blind to them because you have somewhere between a mild dislike and a hatred for the company. Open your eyes and see.


> you're just blind to them because you have somewhere between a mild dislike and a hatred for the company.

Sophisticated enough to understand the utility of Github, but too stupid to understand their own biases? The idea that someone is faulty in their understanding is an uncharitable take. The benefit of (MSFT's) history, is you can learn from it, or ignore it and claim that those who have are being irrationally ornery.

> Oh stop it.

Microsoft is one of many indifferent profit-seeking tyrants. I will never stop pointing out how positive characterizations are, at best, misinterpreting their intent. I'm much more likely to believe any feature is a stepping stone that MSFT plans to leverage later to extort its own users. This plan may or may not come to fruition, which is incidental.


> Microsoft is one of many indifferent profit-seeking tyrants

As opposed to all of the other companies that aren’t seeking profits?


As opposed to all of the other non-tyrant companies, both profit-seeking ones and non-profits.


Which companies are those that aren’t “tyrants”?


> As opposed to all of the other companies that aren’t seeking profits?

One of many is right in the quote.

The obvious difference is how the power in an industry has been wielded. Railroad barons were the closest historical equivalent, imo.


What power is Microsoft “wielding” as far as hosted git repositories? Literally anyone can set up a git server


> What power is Microsoft “wielding”

Again, this is about historical context. They wielded power widely enough (monopolistic practices, predatory acquisitions, etc.) that they earned the distrust. Every move, however magnanimous, is carefully weighed for how it can be monetized to the maximum (now, without drawing government scrutiny).


Weird post to defend. Are you seriously criticizing my post when the post you're defending is worse?


> Are you seriously criticizing my post

Yes. Personal attacks are unwarranted here.

> the post you're defending is worse?

I'm not sure what you mean by "worse" here.

> Anyone who has known Microsoft from back in the day knows attributing anything positive to them is likely an error

I agree with this fully.


> Personal attacks are unwarranted here.

Neither is hyperbole.


>> Personal attacks are unwarranted here.

> Neither is hyperbole.

So we agree? "Personal attack" and "unwarranted" are correct characterizations. That's nice to hear.


Or anyone using azure at scale today


Xbox was good! (But Windows remains very, very bad.)


What we need is a new license (like the AGPL did for cloud) that prevents machines from ingesting and learning from the code. All open source projects that have trouble with Copilot could then relicense under this specific license to avoid it. Involving government would be a terrible approach that will stifle innovation.

EDIT: Not a lawyer, but, unlike copyright, which just blocks usage, I think a license can be more creative and will help preserve the FOSS spirit. For example, a license can require access to the source code, the other data collected, and the model file that makes use of your code. Having to release the Copilot model in the open should be enough to make MSFT back off.


The applicability of various open source licenses relies on the fact that without accepting the license, the act of distributing software would be a violation of copyright law as that is an exclusive right of the copyright holder and requires their permission (i.e. license). Anyone can refuse the conditions of the GPL (just as any other contract), it's just that without accepting the GPL license they aren't allowed to redistribute their version of the software, and you can sue them for copyright infringement.

"machines learning from the code" is not like that - this is not an exclusive right awarded to the authors (quite the opposite, quoting copyright law, "In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.") and it does not require the permission of the copyright owner. If I have a legitimately obtained copy of some copyrighted work, no matter if it's a book, audio recording or code, and I don't have any contractual restrictions, I'm free to train a ML model on it. And if an open source license like you propose would create such contractual restrictions, I don't need to enter that contract, because I don't need a license for this.


Alongside copyright, similar rights often exist as well, such as protection for databases (which could cover the file system, or the graph database in an SCM).


Database rights are not universal.


The BSD 2-Clause License appears to attach conditions to both "redistribution" and "use". Quote:

  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:
Is that a mistake in the BSD license? I don't think so. I think you are mistaken.

See the discussion here "Can a software license impose restrictions on the place where the software is to be used, so that a court would enforce those restrictions?"

https://law.stackexchange.com/questions/78519/can-a-software...

There are numerous examples there of licenses restricting use: Apple's licenses, the Unreal Engine, etc.

License Restrictions Sample Clauses:

  License Restrictions. Licensor reserves all rights not expressly granted to You. The Software is licensed for Your internal use only. Except as this Agreement expressly allows, You may not (1) copy (except for back-up purposes), modify, alter, create derivative works, reverse engineer, decompile, or disassemble the Software except and only to the extent expressly permitted by applicable law; (2) transfer, assign, pledge, rent, timeshare, host or lease the Software, or sublicense any of Your license grants or rights under this Agreement; in whole or in part, without prior written permission of Licensor; (3) remove any patent, trademark, copyright, trade secret or other proprietary notices or labels on the Software or its documentation; or (4) disclose the results of any performance, functional or other evaluation or benchmarking of the Software to any third party without the prior written permission of Licensor. Hosting Restrictions. In the event that You desire to have a third party manage, host (either remotely or virtually) or use the Software on Your behalf, You shall (1) first enter into a valid and binding agreement with such third party that contains terms and conditions to protect Licensor’s rights in the Software that are no less prohibitive and/or restrictive than those contained in this Agreement, including, without limitation, the Verification section below; (2) prohibit use by such third party except for the sole benefit of You; and (3) be solely responsible to Licensor for any and all breaches of the above terms and conditions by such third party.
If a license can prohibit decompilation or copying, we can obviously prohibit language model training.

And that's what we need to do, for the reasons stated here:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...


It's not uncommon for licenses to try to assert overbroad conditions, just as a discouragement and also a way to ensure that even if they are not relevant in some jurisdictions, they stick elsewhere, and the license you quote is a good example of that. A license or a contract saying something does not make it true (especially so in civil law jurisdictions - I've seen contracts and terms&conditions where the majority of clauses are absolutely void because they contradict relevant law), and of course the validity of the contract also is relevant (e.g. while in USA shrink-wrap licenses may be considered valid contracts, in much of the world they are not binding).

It does not prohibit decompilation, although it tries to do that. All it says it that the licensor does not grant me the right to decompile or disassemble the Software. It does give a nod to "except and only to the extent expressly permitted by applicable law" which is the key part (and would be valid even if they did not say it), because the applicable law (at least for me) does grant me the right to decompile and disassemble the software for various purposes without the permission of the copyright owner, i.e. this license does not actually prohibit decompilation, no matter what it says.

The same applies for language model training.


It's a broad clause which relies on the fact that some types of computer software "use" may require permission of the copyright owners, depending on jurisdiction - in essence, I'd say that the validity of this restriction depends on how the specific law treats the incidental copies of the software created as it is being installed, executed, etc; this (unlike most parts of copyright principles aligned in international conventions) isn't universal globally.

My position is mostly based not on code but on text, as in natural language processing there is a similar but much older situation of models being trained on copyright-protected work, the interests of researchers and publishers obviously differ, and at least currently (laws do change) the legal position is that publishers' requirements can be (and are) ignored, as models can be trained on these texts without their permission and even after their explicit objections / cease and desist requests. And a BSD or GPL or some other license can't do anything more restrictive than the book "license" of "all rights reserved, we don't grant you any permissions".


I don't think so.

Please refer to the material I provided. Clear examples of use being restricted (decompilation, disassembly, copying, publication of performance measurements, etc.)

A license can restrict use, and we need to do exactly that (restrict use in training models and in inference) to address this new kind of threat to intellectual property rights.


I suspect those materials were read carefully. The gist of the rebuttal is that an author cannot reserve rights they do not have and licenses "can" claim restrictions that they are not able to actually claim (depending upon local law)


The No-AI 3-Clause License:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...

This is the only license to explicitly disallow language model training and inference using your code.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  ...

  3. Use in source or binary form for the construction or operation
     of predictive software generation systems is prohibited.
All other licenses seem to apply only when the code is emitted, and therefore can't really protect you. This attacks the problem at the source. If Microsoft trains Copilot on your code, they are unambiguously violating an explicit condition that controls use.

Otherwise, it's just the standard BSD 2-Clause License.


I think it's a good start. Not a lawyer, but my guess is that the definition of "predictive systems" will need to be settled by courts. If a system just collects and curates, but is fully firewalled from the transforming or learning module, is that part of a predictive system? It's a fascinating start and I will be watching its adoption.


unfortunately that's a terrible idea:

1. copyright licenses cannot limit what you do with your legal copy of the software. It's not what copyright is or does.

2. it's also no longer an open source license, OSI, DFSG, GPL are all incompatible and would create practical problems for your would be users

this is a problem that doesn't need to be solved: your license likely already prohibits this (copying without attribution, re-licensing, or license-washing).

what remains is courts fine tuning what is "fair use" snippets or not, and it's not likely to come down as OK to use snippets like the fast sqrt as fair use. same whether it's a human abusing fair use, or a company, or a machine used by either of those. the AI aspect has no legal bearing and changes nothing - it just made the problem practical enough to actually happen.

open source people and projects that have had their licenses violated need to sue and defend their existing copyrights.


It seems that the argument that Microsoft will use is that their use of copyrighted code is actually fair use. If it's fair use, they're free to ignore your license that tells them they can't train Copilot on it.

Ultimately, copyright is a fiction that's created and protected by the government. A lawsuit is how we decide what is legally protected or not by copyright.


Yes, that will require a few court rulings, just like how courts recognised the GPL as a valid license to stop corporations from violating it. Ultimately, we don't want a blanket ban on or approval of MSFT using FOSS code. A license can make it possible to have nuance, on a case-by-case basis.


Not necessarily a single new license; many existing licenses could be upgraded by directly addressing this type of use.

Maybe extending the definition of derivative work.


#LicensesAreForLosers


Maybe you should tell that quote to RMS /s


Every analysis of this filing by legal professionals I have seen online concludes that Microsoft will use a textbook fair use defense and will succeed at it. There was a lot of precedent set in this area by Google v. Oracle.


> There was a lot of precedent set in this area by Google v. Oracle.

I don't see the connection at all. The precedent had more to do with APIs, and possibly some influence around implementations which are trivial enough to disqualify them for copyright.

Sure, some copilot output is in that latter category but a whole lot of it is very obviously not.


I'm curious what that precedent is, given that Copilot consumes, reproduces and distributes much more than simple APIs, such as full implementations.


To start, you have to treat copyrighted works on Github as a whole (say the entire project/repository or at least substantial parts of it) rather than look at the few lines that copilot copied in isolation. With that context:

1. The derived work is transformative. It uses the original as inspiration but the end result is substantially different.

2. Only a trivial amount of the work is copied (a few lines of code out of thousands/millions).

3. It is hard to prove that the creators of the original work were damaged by copilot (for example did they lose any customers or revenue?) This is normally the single biggest test in such cases.


> you have to treat copyrighted works on Github as a whole (say the entire project/repository or at least substantial parts of it

But that's not how copyright works here, so this analysis seems irrelevant.

> 1. The derived work is transformative. It uses the original as inspiration but the end result is substantially different.

It's not. Changing variable names isn't a substantial difference.

> 2. Only a trivial amount of the work is copied (a few lines of code out of thousands/millions).

That doesn't matter. Relative size doesn't matter, absolute size does.

> 3. It is hard to prove that the creators of the original work were damaged by copilot (for example did they lose any customers or revenue?) This is normally the single biggest test in such cases.

??? No? You appear to have no clue how copyright works or what copyright infringement does. It's damaging solely in that you don't have the right to reproduce it without my permission. It's mine.


Well you can have your own interpretation of copyright law, but the Supreme Court – who ruled in favor of Google exactly on the basis of these three tests I shared above – will disagree with you.


Ruling on that basis in one case doesn't make those the only tests for whether something infringes copyright. You have to qualify on all of them.

Also, there were four tests.

The general principle of the ruling is that:

- it was mostly about APIs; organization rather than implementation.

- it was sufficiently transformative

- it was a small amount of code of insubstantial value

- it was serving a different market

So, I suppose you have indeed proven that we can have different interpretations of copyright law.

Taking a small but substantial piece of implementation code and using it in a similar way for a similar purpose to solve a similar problem would appear to fail all of those tests, and at least to me smells like rancid infringement.


I wouldn't be surprised to find out that MS themselves secretly supported this case to win it and to show potential customers that this tool is safe.


I am a legal professional. Convos with colleagues have been all over the place.


Legal professionals conclude Microsofts first move should be to admit copyright violation? What is their hourly rate?


Big discussion a couple of days ago: https://news.ycombinator.com/item?id=33457063


For a while now I have not wanted to use GitHub because of this tool. Why should anyone write code for Microsoft for free when Microsoft charges for its software?

Perhaps options like GitLab and others should be considered instead.


I like Copilot and use it, so I'll be glad to continue using it and it can use my code to help train it. It's a fair exchange in my view.


That's cool for you. For the rest of us, however, it would be terribly simple for Github to provide a profile/account setting that allows people to opt out -- perhaps on a per-repo level as well as per-account. Yet they don't. I wonder why not.


Until someone forks/mirrors your project on GitHub


I've certainly been enjoying Codeberg for my personal development.


> Why should anyone write code for Microsoft for free when Microsoft charges for software

Do you feel the same way about contributing to the Linux kernel, which is used by commercial software companies?


If you don’t want people to use your code, consider making your repos private.


What if you want people to use your code, but not for proprietary software?


Impossible. As soon as you show it to anyone there is a chance they will use something they learned in their proprietary software, likely unknowingly.


And if that happens, I have the right to sue.


GPL?


There are thousands of open source libraries used legally in thousands of commercial products, so I'm not sure I understand your point.


Why are programmers so suicidal?

A giant corporation openly steals code from millions of devs and uses it to try to automate us away, and programmers cheer it on.

A billionaire with known poor labor practices buys Twitter, and Twitter devs make no attempt to organize as labor; instead they just write toothless statements.


> A giant corporation openly steals code from millions of devs and uses to try and automate us and programmers cheer it on.

Actual programmers don't cheer it on. Only modern """programmers""" aka professional CTRL+V pressers, the kinds of people that usually "write" software in languages like javascript by importing a few hundred open source libraries and jamming them together until it vaguely does what they want. Copilot is good for these people because using others' code is all they do anyway, AI just helps them do it more efficiently, and without having to worry about all those pesky licenses.


Excuse me? I know tons of very skilled programmers who use it, simply because of the time it saves.

I use it regularly in C#, JS, CSS, and hell, C++ occasionally.

It's not about using other people's code, it's about saving time by generating pretty much the code I was already going to write (with pretty good accuracy too). Once you have enough knowledge, I see no difference between the code Copilot generates and what I would have written (at least, not any better quality/perf).

So yes, I cheer it on, and I love it. It has made my development life easier. 17+ years of coding, and this has been a big impact for me.


How about a version of Copilot that only learns from your personal career codebase? Or one that works only on a company's internal codebase? That would remove any legal ambiguity.

If these templates are as common as you suggest, it seems learning from your personal codebase should get the job done.


I don't see many people cheering it on here, in fact most of the news about copilot has been negative. Where do you see positivity?



6 posts in a thread with 232 points? This is not good math.


So I recently had a discussion with a copyright academic expert, and based on what he said to me, I'm wondering how they are suing.

Basically, the common conception of all produced work being automatically copyrighted is correct. However, one can only sue for copyright infringement for works that have a registered copyright (one can register after the fact, one is then just limited in the damages they can sue for, i.e. no enhanced damages for violations before registration).

My assumption is that even if GitHub Copilot is violating people's copyright, since the copyright for the vast majority of the code is not registered (does anyone actually pay to register the copyright on their code?), it can be difficult to sue for damages, especially as a class action (where presumably the class of those who have registered their copyright is minimal in size).

Was I informed incorrectly? (I'm also wondering, if this is true, how it impacts enforcement of GPL-type licenses in the USA; again, the vast majority of code licensed under them isn't registered, and while one could register after the fact, how many people really do?)


They're suing for breach of license, because CoPilot copies code and omits any mention of the source license, which they claim is not within Fair Use.

You can read the complaint: https://www.documentcloud.org/documents/23264658-github-comp...

It's a bit .. odd. Seems to be partially a complaint, and partially a code review of the couple of examples they cite. Not quite sure what the point is there. Other parts are clearly copy-n-pasted with minor changes from Wikipedia without attribution, which is funny to see in a document basically claiming that's what the other side is doing.


My partner is a paralegal, and just yesterday I was marveling that there's no concept of citing legal briefs, just wholesale copying of blocks of text without attribution. You cite /cases/, but briefs are fair game for what anyone outside the legal profession would call plagiarism.


If you want to bring a suit, you have to register the copyright. You can have someone infringe upon your unregistered copyrighted work and all you have to do to sue them for that is to then register the copyright after the infringement.


This is big because it sets a precedent for broader usage of these "copy and make similar" ML models. For example, I believe it'll impact the future of AI image generation.


I have a crazy theory Microsoft may have been seeking this outcome.

Microsoft isn't stupid, and Copilot is so obviously legally dubious that it stretches reason that they launched it without better legal justification or discourse. Microsoft employees basically treat any notion that it isn't legal as nonsense, even though they lack any real precedent to justify the position.

Is it plausible they might have launched a product almost certain to face a legal case in order to establish precedent for using training data? The company is working in this industry and it likely needs to know for future projects that are harder or costlier to build.


This is pretty much just stomping on sand. Law and positive norms rarely can constrain a breakthrough tech like this.


>Law and positive norms rarely can constrain a breakthrough tech like this.

True enough, but the point of this type of lawsuit is to define the legal framework within which the tech can operate. The law could end up saying that AI generated code from public, but license constrained code, must abide by the license of the original code. Or maybe there are rulings that help to better define "fair use" within this context. There is clearly significant legal ambiguity so this and other suits will help to clarify things.

Of course those laws can't constrain someone from using the tech on their own computer how they want, but most public companies will create policies to keep their workforce within the lines of the defined legal framework.


It's funny how companies can become their own worst enemy, which Microsoft has been since it acquired Github. Github itself recommends that you use the GPL license (see the Github-run https://choosealicense.com/). This has likely led to many projects choosing GPL/BSD/Apache/MIT by default, simply because the author didn't understand the differences between them. I've found that often, once people take the time to educate themselves about the various licenses, they instead prefer a highly-permissive license like The Unlicense (https://unlicense.org/).


Why did you lump MIT in with the non-permissive licenses?


Any license that requires attribution is not permissive enough to be free from copilot legal uncertainty.


I think this renaissance of nlp for coding will be the death sentence for open source software.

Nobody will be able to convince their manager to make their project open source when it can be so easily crawled and used in seconds by competitors without any attribution.


If that's all it takes to kill open source, then I welcome its demise.

Rather, it may spell doom for "let's throw this over the fence and have unpaid suckers work on it for us" software, which would be a net benefit for society.


> and used in seconds by competitors without any attribution.

Ah yes, because before Copilot it was impossible for competitors to use the code of an open source project.

This is nonsense.


If the code was GPL-licensed, it was indeed impossible (legally) to use in proprietary software.


> it can be so easily crawled and used in seconds by competitors without any attribution.

this can already happen today. ML models and such don't really change this aspect of open source.


But you are still liable if you steal code.

With this NLP approach, it is not obvious where the code came from. It could be a mixture of code that produced the outcome, all stolen, but you will have a difficult time proving it.


> With this NLP approach, it is not obvious where the code came from

so if this exact same action was done by a human today, instead of using an ML approach, how does the copyright holder of today catch them? Why can't the same method be used for an ML produced code base?


What is your take on "steal code?" Are you thinking substantial portions, like a control system or the code behind a semantic image processor, or are you thinking the implementation behind 'reverse a linked list?'


In my opinion stealing is when you benefit from an implementation that you were not supposed to copy-paste.

I do not think that using an existing idea to work on your own implementation is stealing.

But copilot does not reimplement ideas, it blatantly copy-pastes implementations (with comments, bugs and everything)


That was not quite an answer to the question I asked. I'm curious what scale of implementation you have in mind. Are we talking many lines, a block, one line?

Asked another way, could that implementation be one line? Say, an implementation to logging a java exception within a catch block? Or, is the implementation something you would find for a leetcode question? Or are we talking something larger like the implementation of evaluating a mathematical formula in spreadsheet software?


Well, let’s just leak the Windows source code into the training set. Then M$ will have no choice but to shut it down.


Can neural network tensors be seen as an encoding?

If I copy source code and save it in a UTF-8 text file, it is subject to licensing. But if the same information is represented in a neural network, all of a sudden this "trained data" is intelligence.

It's a complex representation of the data, and some of it may be a bit scrambled, but in a way it's still an encoding of the source data, which should be treated as having an origin and an author.
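The "encoding" framing can be made concrete with a toy sketch (this is emphatically not how Copilot works, just an illustration): a trivial "model", here a first-order Markov chain trained on a single snippet, stores enough in its parameters to regenerate the training data verbatim. The snippet and helper names are made up for the example.

```python
# Toy illustration: a first-order Markov chain "trained" on one snippet.
# Because the training set is tiny, the model's parameters effectively
# encode the original, and sampling reproduces it verbatim.

def train(tokens):
    model = {}
    for a, b in zip(tokens, tokens[1:]):
        model.setdefault(a, []).append(b)  # successor table = the "weights"
    return model

def generate(model, start, length):
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(successors[0])  # deterministic: first observed successor
    return out

source = "int fast_inv_sqrt ( float x )".split()
model = train(source)
print(" ".join(generate(model, source[0], len(source) - 1)))
# prints: int fast_inv_sqrt ( float x )
```

Real language models are vastly larger and lossier, but the same question applies: at what point does a parameterized representation that can emit training data verbatim stop being a copy?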


I'm still waiting for someone to contaminate Copilot with e.g. leaked Windows source code bits to "launder" them for free reuse.


This may actually backfire on Butterick. I am sure Microsoft has a very competent legal department and they have given a lot of thought to the matter.

Most of these articles assume that Butterick will win, but I don’t think that’s a given. If the court rules that this is fair use, then that could set precedent for even further applications of CoPilot and remove some of the legal uncertainty around its use.


Letting Copilot fill in portions of code "learned" from open source code is IMHO fair. Most devs worth their salt, even without Copilot, would eventually write approximately the same code given the same requirements. What should be taken to court is when code for an entire feature, or several features, is taken verbatim from somebody else's project or work.


It seems that Microsoft is conflating "public" with "unlicensed."

Sadly, it is a common practice. Humans are really good at rationalizing their way down the slippery slope. "I'll just use this clever way to loop over that structure. Well, actually, the whole function pretty much does what I need, I'll just drop that function in. Hmm, I could just change the API of this module a bit and that will work nicely..." It doesn't take long for learning to turn into copyright violation. I've seen this repeatedly while reviewing code of junior engineers. The engineers aren't malicious. They just sort of slide into it over time.

As Copilot gets better and better, I just don't see how it doesn't end up violating license terms, at least occasionally. Even if much of its output is fair use, as long as some of it is violating licenses, MS has a problem.


Open-source licenses were a mistake. "Do whatever the fuck you want" should be the only open-source license.


Open source licenses are a response to the tragedy of the commons that follows the "do whatever you want" mantra.

It turns out what many developers want is to share code in ways that keep the commons healthy and protect user freedoms.


I do agree with this case: Microsoft is illegally profiting from other people's source code.

Just because the source code is open and sometimes free to use, does not mean Microsoft can sell a service based on it.


Nobody seems to consider the fact that non-US law also applies to this issue. GitHub's ToS are subject to US/California law, but that does not exempt GitHub from respecting copyright in other jurisdictions where the service is provided. Many common open source licenses have no governing-law clause (e.g. BSD and MIT). Since one of the primary defences for Copilot is fair use, and that concept does not exist in the EU, Copilot would seem even more legally iffy in other jurisdictions.


One doesn't simply buy GitHub for $7.5e+9 to foster the community and the ideals of open source.

Moreover, ideals like that don't simply pop out of the blue and right into the collective psyche without comparable infusions.

To drive it home in case the lines between the dots are not obvious: open source ideology, GitHub and Copilot are merely successive stages in the production of profit. The next step is obvious: get rid of human programmers and save on their wages.

This is where ideals lead.


What’s the risk exposure for startups whose engineers are using Copilot to write proprietary private code?


It would be very bad for tech progress in the West if these types of lawsuits succeed while China can still do this. The legislature would have to act to explicitly legalize it, which could take years, and it could be too late by then. The US is already falling behind China in AI progress.


The argument of "the west" vs "china" is a bad one.

"US company sued for underpaying employees" -> "It would be bad if labor laws are enforced here, china has lower minimum wage and exploits workers more, we have to be competitive"

I know those situations aren't exactly the same, but you're trying to use that same justification, that we should think about our laws in terms of a competition with china, in terms of AI progress, not in terms of the arts or workers or happiness.


What you have stated doesn’t discredit the argument at all, it merely points out that there are additional things possibly worth considering, which is basically always true.


The only thing it would mean is that companies would have to pay for their training datasets like they already do for many applications of ML, instead of freeloading off of the work of millions of developers.

Hell, I am sure there are plenty of developers who wouldn't give a shit if Microsoft trained Copilot on their code for free as long as Microsoft abided by the terms of their licenses.

This is like arguing against ending forced prison labor because China will still use it anyway even if we stop. Why do anything if the looming spectre of China is enough to capitulate to their status quo?


Show me one example of Copilot output that you would like protected under a non-open source license owned by Microsoft.

Do you not understand that if a court rules that little snippets of utilitarian code are somehow considered artistic expression that it will basically be impossible to write any software that isn’t infringing on something owned by a large corporation with in-house legal?


That's already the case, hence why clean room reverse engineering exists.

It's not illegal to come up with code that is exactly the same as an existing piece of copyrighted code.

It is illegal to take copyrighted code and reproduce and distribute it against its license, however.

It's the difference between original writing and plagiarism. You might come up with a substantially similar, to exact, point that other people have made. That's not illegal. Copying those points from a book verbatim is a copyright violation, however.

Copilot is reproducing copyrighted code from its training set. That's no different than you reading the leaked Windows source code and then copying it into ReactOS or WINE. But if you came up with the same solution Windows developers happened to come up with, and it's just a coincidence, that's just fine.


> It's not illegal to come up with code that is exactly the same as an existing piece of copyrighted code.

That's not why it isn't copyright infringement to come up with code that is exactly the same as an existing piece of copyrighted code.

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

It's because the code in question is most likely to only be written in one way.

If you happen to come up with the same melody and lyrics as a pop song in a "clean room" do you think you are in the clear? We're talking about copyright! It is meant to cover artistic expression! Not utilitarian inventions.


So the things that are covered by copyright in software are the creative choices... like, the overall structure of the code, the specific classes, and interfaces, etc. Like, there's a zillion ways to organize code (certainly some better than others!) and this is where clean-room design comes into play. It is really unlikely that in a decent sized codebase that the classes, or types, or whatever are going to be substantially similar.


You're both right, but only one of you is seated in reality; the other describes the reality we ought to be in, but are not.


While I know I am right about the current interpretation of how copyright applies to software, I am also of the opinion that this is how things ought to be.

I have yet to hear a persuasive argument otherwise.

Is it because I both write and publish open source code and write and publish music that I am able to clearly delineate idea from expression?

For one, I have never been satisfied with the non-utilitarian aspects of Copilot’s output. When I am writing software the real art has always been in how code is organized. I gain absolutely no aesthetic value from autocompleting unit tests or boilerplate.

You may think that the outputs of Copilot and Stable Diffusion are artistic in nature but all I see is a rhyming dictionary.

I look at attempts to have copyright cover the utilitarian aspect of software as an attempt to claim ownership over chord progressions, scales or time signatures.


I don't follow. Aside from the ignoring of all ethical and legal concerns built up in multiple societies to this point - if Microsoft loses this type of lawsuit, why would AI not continue to be researched? Don't they just have to build in attribution into the tool and everything's legal again?


Terms of many licenses require more than just attribution, but also the inclusion of the copyright notices and licenses verbatim. MIT, BSD, MPL, GPL, LGPL, AGPL, etc all require those at minimum.

Then there is the application of the licenses themselves. GPL licensed code stipulates that its inclusion in a new work means that the new work must be licensed under the GPL, as well. AGPL and LGPL have similar terms, but with different stipulations.


Indeed, not to mention that you cannot just mix licenses. The vast majority of licenses are not compatible. It just so happens that some popular ones are, but in general the output is either GPL-like, MIT-like, or incompatible.


AFAIK Copilot suggestions are sourced from a model that is trained on many sources. Presumably when it offers a solution (an autocomplete suggestion), it is drawing on data taken from dozens, maybe hundreds of examples. How does a person even provide attribution in this case?

It's like saying, I know 'polo' comes after Marco not only because I was at your pool party where we all signed NDAs, but also because I've been to dozens of other pool parties. How do I know where to attribute this knowledge? In part that knowledge is due to the prevalence of examples and not just any single example.


I'm not in ML, can the program not know where its suggestions are drawn from?


It depends; the answer could be no if Copilot identifies solutions by looking for consensus matches.

To illustrate, Google Translate suddenly got better some time ago when it started searching literary texts for phrase matches and then cross-referencing known translations to give you a translation.

Copilot is perhaps doing something similar. Alternatively (somewhat as an analogy), Copilot could be finding exactly one sample that it suggests, in which case there could be attribution.

It is unlikely to offer any solution without a consensus. To further the example: how can you be confident in that translation? What if there is not just one literary match, but many thousands, and 99.9% agree?

Without that high match percentage, it would be difficult to know if the result is accurate; with a small set of disagreeing matches, again it would be hard to give an accurate answer. I don't know for sure, but seemingly Copilot would need some threshold of agreement across multiple matches before it suggests a solution.
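To make the consensus-threshold idea concrete, here is a toy sketch. This is entirely hypothetical — the function, threshold value, and matching scheme are all made up for illustration and say nothing about Copilot's real internals. The idea is just: given candidate snippets matched from a corpus, only surface one if enough independent sources agree on it.

```python
# Hypothetical consensus-threshold suggestion: only suggest a snippet
# when most matched sources agree on the same text.
from collections import Counter

def suggest(candidates, threshold=0.9):
    """Return the dominant snippet, or None if no strong consensus exists."""
    if not candidates:
        return None
    counts = Counter(candidates)
    snippet, hits = counts.most_common(1)[0]
    # Require e.g. 90% of matched sources to agree before suggesting.
    if hits / len(candidates) >= threshold:
        return snippet
    return None

# Near-unanimous corpus: a confident suggestion is made.
print(suggest(["i += 1"] * 99 + ["i = i + 1"]))   # prints: i += 1
# Evenly split corpus: nothing is confidently suggested.
print(suggest(["i += 1"] * 50 + ["i = i + 1"] * 50))  # prints: None
```

Note that even in this toy model, a high-consensus match cuts both ways for attribution: when thousands of sources agree, there is no single author to credit; when exactly one source matches, attribution is possible but the output is then closest to verbatim copying.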


If a court determines that their behavior is illegal, this argument sounds a lot like: "Let us break the law, because someone else is doing much worse."

If it is necessary to reform how we handle copyright and IP licensing to remain competitive, we should find ways to do that.

The law should apply equally to everyone, whether they are competing with Chinese companies or not.


Should the West dismantle copyright and let everyone use leaked proprietary software source code too? Or allow breaking the license of source-available software like Unreal Engine? Or does this only apply to open source software?

After all, in China it happens all the time.


Unironically, yes.


Microsoft can't use their own code to train the model? Researchers can't use OSS to research (but not monetize)?

This isn't progress anyway. It's a fancy autocomplete/plagiarism/stealing dressed up as AI.


Something tells me this might be welcome news to Microsoft.

I would assume this was anticipated.


There's no way that Microsoft didn't spend a lot of time derisking this with lawyers internally before release. This will be interesting to see play out but my bet is that it'll take 5+ years to get a result.

I'm not a lawyer but I think MS will win quite easily.


This must be how Oracle felt when they were trying to protect their java declarations from android's clutches. M$ is likely to win this case.


I can imagine a lot less code being open sourced purely to counter against this kind of mining.

I'd be all for it if creators were compensated.


Now I'm just imagining if you trained co-pilot on all kinds of legal libraries, what lawsuits it could come up with.


I assume that if they train the model only on GPLv2, then the generated code might be legal to be used under GPLv2 license. Similar could be done with some other open source licenses as well.


I hope Microsoft wins.

The result of a lawsuit like this will add cachet to many big companies bullying smaller ones.


I hope they lose so that the wishes and legal rights of the people who made GitHub great in the first place remain respected.

GitHub is almost like a social network in the way it relies on "the community". Stealing the code of the people who made it great by regurgitating it through an ML code-laundering machine isn't very fair imho.


I don't think anyone who wishes others not to learn from their code should upload it publicly to GitHub.


ah yes, copy 'n pasting (or the ML equivalent) is truly learning.

You can learn from code by just looking at it or working with it (i.e. transforming it into a truly derivative/creative work); there's little value in having some automated program spit out chunks of code (ripped from their original context) that you don't fully understand and without the tool never could have conceived.

Also looking at some publicly available code (to learn) still respects the license, co-pilot possibly doesn't.


You can think what you like.

But Microsoft isn't being sued to stop Copilot (which is trash, really); it's to stop further development of an AI which will be to programmers what AlphaGo was to human players.

The workforce in the industry is desperate for a way to slow AI down (every industry workforce) and they found a way to try.


Where to sign a petition against luddites?

They stole my `i` variable from a `for` loop.


#LicensesAreForLosers


Does FSF have a position on this issue already?


The FSF has funded a set of white papers on the topic: https://www.fsf.org/news/publication-of-the-fsf-funded-white... . The one by Bradley Kuhn, which is likely closest to the FSF's official position, concludes that CoPilot is problematic.


And yet copilot is less problematic than RMS.


is this new Oracle vs. Google?


More like SCO vs Linux, but in reverse.


That was just the API. This is fully functional sections of code.


i've come to think that if you open source your code, you should only use MIT, i.e. make it a gift. because once your code is just out there in the open, you can't pretend you still control it. suddenly you're dependent upon lawyers. that's icky. i mean anyone could be using this guy's code to do something against the license, and dude would never know. well, if you give a shit, maybe don't leave your code sitting in the open on a microsoft server? either set it free, or don't.


That's a really bad take. Choosing the GPL or AGPL isn't about not wanting your code out in the open; it's because we believe in fair contributions, ownership, and building projects as a community. Open Source with the GPL is a tit-for-tat system (you either make things better for everybody, or you don't), not a "get altruism for free" system.


>Open Source with GPL is a Tit for Tat system (you either make things better for everybody, or you don't)

The GPL doesn't force you to make software better for everyone. People and communities are free to steal the project, make it better, and keep it all to themselves.


> The GPL doesn't force you to make software better for everyone.

It forces you to contribute. Read what GPLv3 states:

- Include a copy of the full license text

- State all significant changes made to the original software

- Make available the original source code when you distribute any binaries based on the licensed work

- Include a copy of the original copyright notice

Watch Linus' interview with the Intel CEO where he talks about what open source is. He clearly mentions that it's a system where personal motivations end up benefitting everyone involved. When someone contributes code for their own need, collectively everyone ends up benefitting from the individual changes. It's not an altruistic system; nobody is doing this to please corpos, the way the MIT license usually does.


You ignored the

> and keep it all to themselves

bit. The GPL allows you to make modifications to software, and then not publish the code of those modifications, as long as you don't distribute the software. If you only use the software in-house, you can make unpublished modifications as much as you like. That's what "keep it all to themselves" most likely referred to.


You can distribute the software and still not publish the source publicly. The source just has to be available to whomever you distribute the binaries to. It could be the case that your community is non-technical and doesn't care about the source code. It could be the case that having the changes is "cool", and leaking them publicly wouldn't make it cool anymore and could get them kicked out or shamed.


>Make available the original source code when you distribute any binaries based on the licensed work

You only have to make it available to people you distribute the binaries to. If people with the binaries don't want it or if they want to keep it to themselves they are free to do so.

The reasons why corpos contribute back to upstream projects have nothing to do with the license. Usually it is just to shift the maintenance costs to the upstream developers.


but this doesn't really explain what makes GPL superior to MIT. in my experience, GPL is expensive, and i have seen plenty of fair contribution, ownership, and community making things better with MIT. what i was trying to say was that code published in the open will inevitably be abused by humans and machines alike. the difference between GPL and MIT, in my view, is that more permissive licenses are less at odds with this reality. with copyleft there will always be lawsuits


Those lawsuits are necessary in order to protect free software.

They have also brought us much good, as without the GPL you wouldn't have OpenWRT or LineageOS, projects which typically depend on vendor-provided kernels or drivers in order to function.


MIT is not a gift. It still requires attribution: inclusion of the copyright notice and the license terms.

This lawsuit isn't even about copyleft.


so? the gist of the mit license is, "use without restriction". without the license the default is that you have no right to the work, so the license is what makes it a gift. you're welcome


no, it very much isn't use without restriction - you must credit the author and otherwise follow the license. I can't copy and republish code from the Nvidia driver leaks because those are copyrighted by Nvidia (unless I have permission from them). In the same way, you cannot copy and republish code from an MIT project unless you have permission from them (via the MIT license).


i would suggest reading the text [0] to educate yourself

[0] https://opensource.org/licenses/MIT


I would suggest to read the text and then contribute to the discussion. We have access to search engines too.


But doesn't it say:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

So that's a required attribution if you want to follow the MIT license?



