Hacker News
Microsoft sued for open-source piracy through GitHub Copilot (bleepingcomputer.com)
322 points by redbell on Nov 5, 2022 | 288 comments



Microsoft's major arguments:

- as long as the reuse of code is transformative enough, they consider it fair use

- yet Microsoft's Copilot is not trained on Microsoft's own codebases, nor on the Windows source code

- yet Microsoft files DMCA takedown requests on Windows source code leaks

As long as these hold, Microsoft's argument amounts to "everything is considered fair use, except our stuff", which is flawed in a lawsuit. That is not how copyright works.

That is why Copilot has to be considered an attack on a specific group of software, and a specific group of people. And that is my own opinion.

Either get rid of copyright altogether (including Microsoft's arsenal of patents) or don't. But don't try to play the hero of open source while you are stealing their stuff, specifically.


GitHub source code is different from Windows code though. This isn’t a moral argument but a dry legal one. Microsoft doesn’t control the rights for a lot of the Windows source code because it isn’t their copyright to begin with. Meanwhile, even private repos on GitHub are subject to uses like this through the terms of service, iirc. So from a moral perspective I’d agree with you, but I don’t see how this could fly legally. I’m pretty sure Microsoft has had six armies of IP lawyers independently come to this conclusion.


Terms of service do not mean the forfeiting of users' copyright. There is an argument about the extent to which snippets of code are fair use versus when the copying becomes excessive and violates the copyright.

You are right though, battling Microsoft in court will be no easy fight, and it will probably be a long one. However, I think the sharks are in the water and smell blood (money), as many lawyers see weaknesses in Microsoft's arguments and justifications (excuses).


You run into other problems here, though, with people mirroring code on GitHub that they do not have the rights to re-license, which would be required for accepting the terms of service. E.g. if you take a GNU project hosted at Savannah and maintain your own independent fork of it on GitHub for whatever reason, you cannot re-license the bit that is owned by the FSF to conform with the GitHub ToS without getting the FSF's permission (and for other projects without a single copyright holder, every single contributor's). This ofc isn't Microsoft's or GitHub's fault but that of the people making the GitHub repo, but it still creates a problem of broken licenses.

Their FAQ also has some ambiguity on what it has been trained on:

> It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.


> Microsoft doesn’t control the rights for a lot of the Windows source code because it isn’t their copyright to begin with.

I know "we don't own the source" came up before with OS/2, but I'm pretty sure Microsoft owns the bulk of the Windows source code (maybe not in the hardware-driver department). As I mentioned in another comment, though, it has trade secret protection: they simply have never published or licensed it.


"The rain it raineth on the just

And also on the unjust fella;

But chiefly on the just, because

The unjust hath the just’s umbrella."

I always think of this one whenever the "the law applies equally to all" argument comes up.


I feel it's important for this discussion to point out that the actual GitHub source code is, AFAIK, (mostly?) closed source, unlike the open-core GitLab. (No idea whether it's crawled by Copilot?)


> yet Microsoft's Copilot is not trained on Microsoft codebases, and not on Windows source code

I don't think Copilot is trained on any private GitHub repository or anything secret, not just Microsoft's private code.


A major bug in your analysis: Microsoft doesn't publish the bulk of the Windows source code; it's a trade secret. It's still covered by copyright, but fair use is absolutely not in the picture, and whoever does publish it has no license whatsoever.

(and I've hated Microsoft and other monopolies from before 80% (90%?) of you were born, so this isn't me defending Microsoft.)


> It's still covered by copyright, but fair use is absolutely not in the picture, and whoever does publish it has no license whatsoever.

Remind me again, why is the reuse of my GPLv3-licensed code considered fair use, while the reuse of Microsoft's EULA-covered code is not?

If a public repo doesn't carry a very permissive license, Copilot is still committing copyright infringement. And violating GPL-style licenses is still a license violation, which nullifies the right to use the code.


> including Microsoft's arsenal of patents for it

You'll probably be surprised to know that Microsoft is a member of both OIN and LOT: https://www.redhat.com/en/blog/red-hat-welcomes-milestone-ad...


Even if this succeeds (I sincerely hope that it does not, as I believe these to be sufficiently transformative as to be considered fair use), it doesn't functionally matter in the long term:

1. Running and training these models is eventually going to be plausible on a home machine with even modest hardware.

2. These models, e.g. one trained on all of the code on GitHub, will be publicly available via torrent or whatever.

3. People will be able to run it locally as an integrated plug-in in their IDE of choice: JetBrains, VS Code, etc.

4. You'll never know if somebody has lifted a bit of code in violation of a license, any more than you would be able to tell if somebody copy-pasted from Stack Overflow without attribution in any commercial application.


It's not transformative at all, nor quoted for study/discussion, and that's exactly what makes it a problem.

If the snippets merely had attribution, they'd be both legal and acceptable to most open source authors.

The code is not transformative because the quoted code is not used for some other purpose, like as part of an article discussing whatever the code does; it is used to do exactly its original job.

If you wrote a book by quoting choice paragraphs from other books, without crediting any of them, presenting the new book as simply your pile of deep insights, that is not transformative, even though your book is not the same as any single one of the source books. It's also not fair use, even though all the quoted bits are short, both because of the usage and the lack of attribution.

We could have something like Copilot just fine if it were done above board, but as it is right now GitHub is simply an outlaw, with no excuse at all.


I read these kinds of comments on HN and I wonder if the people complaining about copilot have actually used the thing for more than 5 minutes?

Most of the time copilot is giving you one-line completions that are heavily adapted to the surrounding code. It’s not “writing a book by quoting choice paragraphs from other books”. It’s fancy autocomplete.

You can direct copilot to give you anything you ask it for. If you ask it for an entire function, and you don’t give it any context, then you are more likely to get a snippet from inside its memory. But even when this happens you are unlikely to keep the entire snippet unmodified because the chance that it is what you were looking for is low.


There are countless examples of people just letting Copilot's "autocomplete" run wild and it ends up reproducing many lines of code, often verbatim, and not just single lines of code.


You need to bait Copilot really hard to make it output that. To the point where it would be much easier to locate and copy-paste the code directly from the source yourself. It doesn't happen accidentally and it's not how you normally use Copilot.


A lot of people seem to think changing the variable names makes it new code.


The count is like 5


Considering how easy it is to generate additional examples, I think it's higher than 5. Five big, popular blog posts, maybe; but there are way more than 5 examples just on Twitter.


> you are unlikely to keep the entire snippet unmodified

Does that matter? It's not the end users who are being accused of copyright violation, it's the tool itself.


If the argument is that the tool is facilitating copyright infringement by users then it matters whether or not the users actually keep the supposedly infringing code, yes. If the code gets deleted immediately afterwards by the user then the claim of infringement by the user is farcical.

If the argument is that copilot is infringing because it merely displays an allegedly copyrighted snippet for a few moments, even if the user does not accept the completion, or accepts it but then deletes it, then that is a completely different argument because there is very little identifiable damage to the authors, which bolsters an argument of fair use.


The tool itself is infringing copyright. It possesses and reproduces copyrighted code without attribution. Full stop.

If I steal your television, it doesn't matter how much I get for it at the pawn shop.


You can Google and find GPL'ed code. That doesn't mean Google is facilitating license violation, even though the Google search index possesses and reproduces GPL'ed code.

> It possesses and reproduces copyrighted code without attribution.

To be clear - attribution does not fix GPL license violation.


> You can Google and *find* GPL'ed code.

Emphasis mine.

This is different from asking Google and getting code directly from Google without the license - that is, without any indication that this code was taken from somewhere that has a license on all parts of it, not just the parts that Google didn't bother to show you.


Copyright doesn’t go away simply because you edit what was copied; deleting it may work, assuming you don’t recreate too much of the original work.

There is a concept in copyright law of derivative work, which means simply editing Harry Potter isn’t enough to publish it without paying for the rights to do so. Even highly transformative fan fiction can run into this issue when much of the original work’s characters and setting remain.

Disney gets away with this stuff by copying public domain works; Copilot and its users don’t have that defense.


The key word is fan fiction. Almost every single word in fiction is a creative expression and thus copyrightable.

But works of non-fictional literature (of which source code is an example) have to work harder to justify their copyright.

Many of the snippets of code generated by copilot contain zero creative expression. They are just mechanical implementations of an algorithm.

You mentioned setting and plot. The source code equivalent is architecture. Copilot generated snippets are usually too short and too algorithmic to have their own architecture.

What makes a derivative work derivative is that it copies creative expressions from the parent work. If you edit them out, then it’s not a derivative work anymore.


Variable names are just as subject to copyright as character names, and specific implementations of algorithms can vary quite a bit. Just look at the homework assignments of any introductory programming course.

Which is why even very short snippets of code have been an issue.


But what if the algorithm has been designed precisely to extract that creative expression from the original work?

The use is then only helping develop the algorithm, adding more value to the product by improving its extraction of the creative expression.

Copilot is a lose/lose game for developers, regardless of which end they are on.


> it matters whether or not the users actually keep the supposedly infringing code

That's not how this works.

If I give you a blob of uniformly random bits and then give you a passphrase you can use with a tool to turn that into a Hollywood movie, and you do so, and then you delete it, that isn't not copyright infringement.


> If I give you a ... Hollywood movie, ... that isn't not copyright infringement.

It is.


I'm no lawyer, but judging how people reacted to DeCSS (09F9 1102 9D74 E35B D841 56C5 6356 88C0, anyone?)… why do you say this isn't copyright infringement?


Pretty much: just as when you buy a bike and it turns out it was stolen, it's still not your bike, someone who blindly accepts what turns out to be a chunk of someone else's code can still be unknowingly guilty of copyright violation.


Most of the time I enter a bank I don’t rob it


The authors of these comments have never used it. Their understanding of the issue comes from having seen a tweet where someone coaxed CoPilot into almost-reproducing a piece of code that thousands of people have copy and pasted into their GitHub repos over the years, such as the quake square root code or that matrix multiplication code.

They also usually have a Luddite axe to grind.

If they succeed in getting CoPilot outlawed (obviously they won’t), then I will rent $50 worth of GPU time and train my own, like some kind of cyberpunk outlaw.


>Most of the time copilot is giving you one-line completions that are heavily adapted to the surrounding code.

Keyword being "Most of the time"


Sure. I used it for about twenty minutes before I turned it off because it seemed clear to me that it wasn't much different than copying code from open source repositories.


That is absolutely not my experience at all, and I wish people would stop with the hysterical hyperbole. The tool typically produces code that would only make sense in my codebase, because it uses internal types which only exist in my codebase. That is not and cannot be copyright infringement.


Sure, it's clever enough to swap identifiers correctly, but the shape of the code and techniques it uses were familiar to me from elsewhere, at times.

It's not unlike copying code from stack overflow and swapping the identifiers around.


How did you learn the shape of the code and techniques that you recognized? By looking at other code? Are you violating copyright when you are writing similar code yourself?


> By looking at other code? Are you violating copyright when you are writing similar code yourself?

It's certainly possible. Back when Phoenix reverse-engineered the IBM BIOS using published sources with a restrictive license, they did it by having one team read the sources and write a very detailed specification of everything that happens there, and then another team used that spec to write new code from scratch. They did it that way because if the first team were to write the code themselves, it is quite likely that the result would be legally considered derived work.


> By looking at other code? Are you violating copyright when you are writing similar code yourself?

Done in that order, it's likely a violation, but it depends on the concrete case, always.


From decades of looking at interesting code.

I take care to attempt not to duplicate code; it's why I don't use Copilot.


> the shape of the code and techniques it uses were familiar to me from elsewhere.

Isn't this exactly what you'd expect an AI coder to do?


That's precisely the problem with it. AI, as it is, can be considered a lossy compression technique with multiple document recovery, convolution and interpolation ability.


Can't humans be considered a lossy compression technique with multiple document recovery, convolution and interpolation ability?

What is the specific difference?


100% agree. This matches my actual usage of github copilot. All the drama around it seems to just be mostly for headlines to fuel the outrage machine.


This isn't my experience with Co-pilot's suggestions. I've literally been able to have Co-pilot suggest a complete unit test based on a novel structure I hand-coded myself and a few words describing the unit test. The constants are often wrong, but it saves minutes of fidgeting with the syntax for unit tests and assertions.

These are not quotations from other people's code but something about the deep structures of language and programming language semantics. However, I suspect if you knew enough of a snippet from other source you could coax Co-pilot to suggest code learned from that source, but it would likely be washed over by other code in the corpus where it coincided with meanings.


Worth noting with models like Copilot: if you deliberately give one an input similar to the training contents, odds are it'll reiterate them near verbatim.

The main issue is that while you can use copilot to create "new"/transformative code, it's also trivial to get it to pump out licensed works in a form where you could claim "I didn't know it was taken from x project with y license because the tool made it for me".

I personally have no problem with Copilot in concept; however, doing it (or any other AI-model-based text/graphics tool) without infringing on people's copyrights is practically an unsolved problem (excluding just pre-licensing the training data ahead of time).
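To make the regurgitation point above concrete, here's a toy sketch. This is emphatically not Copilot's architecture (that's a large transformer); it's just a character-level n-gram lookup, with an invented snippet standing in for a memorized repository. But it shows the same failure mode: when the training data around a prompt is sparse, the "most likely" completion is the one example, verbatim.

```python
from collections import defaultdict

def train_ngram(corpus, n=8):
    """Record, for every length-n context in the corpus, the character that follows it."""
    model = defaultdict(list)
    for i in range(len(corpus) - n):
        model[corpus[i:i + n]].append(corpus[i + n])
    return model

def complete(model, prefix, n=8, max_len=80):
    """Greedily extend the prefix; with only one training example, this regurgitates it."""
    out = prefix
    while len(out) < max_len:
        followers = model.get(out[-n:])
        if not followers:
            break
        out += followers[0]
    return out

# One "training" snippet stands in for a rarely-duplicated repository.
corpus = "float q_rsqrt(float number) { /* famous bit-hack here */ }"
model = train_ngram(corpus)
print(complete(model, "float q_r"))  # reproduces the snippet verbatim
```

A real language model generalizes far more than this lookup table, but when a context matches only one thing it has seen, the most probable continuation is still the original.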


I mean, you can prompt me (or any other engineer) to spit out copyrighted code. FizzBuzz comes to mind… as do a number of algorithms I’ve written in the past which belongs to my past employers…

I really think we are entering some interesting territory that will likely be an interesting can of worms.


There is something different, somehow, if you knew the entire code to QuickBooks from memory and had an API where I could request any 10-line chunk of it I wanted, as many times as I wanted.


So you’re saying that how fast someone can type and how well they can recall makes a difference? I don’t type over 120 words per min like my grandma, but I have a photographic memory. I can tell you what file & line a chunk of code belongs to, or spit it out verbatim, customized to the current problem I’m working on.

So, you’re saying I can’t work in this industry? That seems a bit harsh.


> So, you’re saying I can’t work in this industry? That seems a bit harsh.

If you're going to be spitting copyrighted code out in violation of any licenses it might be made available under… yes, you can't. Most employers would not appreciate that behaviour. But I doubt you actually do this, even though you're capable of it. You reason about your code; you're not just being a predictive text engine.


It sounds like for you in particular, yes, since you seem to want to go out of your way to find any way to violate copyright, even when the terms are intentionally generous. Indeed such a person should not work in this industry, though I'm sure there are plenty of employers who are happy to have you steal for them, so you will be able to regardless.


My point was that we all do this /not on purpose/ (and me being an exception, I can make sure I personally don't). But when I see code that existed in another company with some variables changed, I don't flag it. There are only so many ways to describe a chair; are they all copyrighted?


"My point was that we all do this /not on purpose/"

I don't concede the equivalence.


It's really simple. If you are outputting licensed code and not abiding by the terms of the license, then yes, that is a problem.


Companies already pay a lot of money for datasets to train models on in other spaces outside of software development. On top of that, they spend a lot of money on labelling and what not.

Software is unique in that there is a cultural trend to share source code, so that makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem; it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.


Not totally unique in that respect; the situation doesn't seem too dissimilar from the one that led Shutterstock to launch their contributor fund.


Commoditized*


To attribution I would add license compliance.


I agree but I'm saying that attribution would be all that's missing from compliance in most cases.


But not, I'll hazard a guess (IANAL), if it's a GNU GPL license. Right?


I didn't want to make an unreadably long comment by trying to solve every detail.

But the GPL just adds that the source be made available to any recipient of the code you used. That's pretty easy to arrange, because even a link satisfies it.

If copilot can be made to spit out attribution, then it can spit out links at the same time.

Another idea: Copilot could be changed to only include code whose authors opted in to an aggregate credit, where your new program only has to declare that it used Copilot and link to Copilot's training set along with your program's source, without trying to itemize each bit of output.

There could also be other versions of Copilot that include other code under other terms, like pure MIT or PD code where the original author already explicitly granted usage with no terms, or paid commercial code where GitHub paid the authors to be able to include their code for use in this way, maybe with terms where the end user does not have to re-share.


The "link" necessary here would be a link to the full source code of the software developed with the use of Copilot, distributed with every copy of said software.


> If you wrote a book by quoting choice paragraphs from other books, without crediting any of them, presenting the new book as simply your pile of deep insights

This is quite literally how most of the Old and New Testaments came into existence. The authors of Matthew and Luke certainly didn't bother mentioning that they copied a bunch of stuff from Mark and whatever the Q source is (my bet is on either the Gospel of Thomas or some source thereof). Nor did the authors of Leviticus and such mention how they ripped large swaths of what's now known as Mosaic Law straight from Hittite laws and the Code of Hammurabi. These works are nonetheless considered pretty transformative.


> This is quite literally how most of the Old and New Testaments came into existence.

That happened a few (thousand) weeks before copyright law was invented.

> The authors of Matthew and Luke certainly didn't bother mentioning that they copied a bunch of stuff from Mark and whatever the Q source is (my bet is on either the Gospel of Thomas or some source thereof). Nor did the authors of Leviticus and such mention how they ripped large swaths of what's now known as Mosaic Law straight from Hittite laws and the Code of Hammurabi.

The Q source (two-source hypothesis and four-source hypothesis both) is speculative, and there are several reasons to criticise the theory – such as lack of corroborating historical evidence that the Q source ever existed as a separate document.

That aside, Leviticus likely got its Hittite and Hammurabi influences via Mosaic Law, not (primarily) the other way around. (Legal systems influence other legal systems? Who knew‽) Merely saying the same thing isn't protected by modern copyright; facts are not copyrightable. (There is such a thing as "database rights", but that doesn't really come into play, here.)

> These works are nonetheless considered pretty transformative.

The merits of our current copyright regime aren't terribly relevant, here. If you want to change the law, change the law; don't try to have the existing law interpreted differently in select cases, especially if you're only doing that where it benefits already-powerful entities like Disney or Microsoft.


> don't try to have the existing law interpreted differently in select cases, especially if you're only doing that where it benefits already-powerful entities like Disney or Microsoft.

Who says it only benefits the already-powerful entities, or that it even benefits them at all in the long run? As evident from my other comments on this topic, my hope (and indeed expectation) is that they'll shoot themselves in the foot and be hoisted by their own copyright-infringing petard. It wasn't that long ago when copyright enforcement wasn't a thing, and that lack of enforcement didn't seem to stop people from creating all sorts of artistic and literary and musical works, great in quantity and quality alike. The more the megacorporations backtrack on their copyright absolutism and admit that ignoring copyright expedites creative work, the more ammunition the rest of us have against their continued insistence on copyright absolutism.

Put simply: Copilot sets a precedent that I suspect will lead to the downfall of intellectual property law entirely. Probably not alone, but very likely one of several dominoes toppling toward that end result.


> and that lack of enforcement didn't seem to stop people from creating all sorts of artistic and literary and musical works, great in quantity and quality alike.

It's why several of Shakespeare's plays are missing, and others are only available in modified form. Since there was no exclusive right of performance, playwrights (and other similar creatives) had to keep their works close to their chest. Copyright does exist for a legitimate reason.

The biggest problems with copyright today, are:

• it's far, far, far too long; and

• corporate gatekeeping allows nigh-monopolistic corporations to take authors' copyrights from them, meaning copyright law doesn't protect authors at all!

(Rebecca Giblin and Cory Doctorow discuss the latter in their book Chokepoint Capitalism, which I haven't read yet.)

> The more the megacorporations backtrack on their copyright absolutism and admit that ignoring copyright expedites creative work, the more ammunition the rest of us have against their continued insistence on copyright absolutism.

I'd love it if it worked that way. I hope you're right – but I believe you're wrong. We can let them do this backtracking without letting copyright law tilt even further in their favour.


> It's why several of Shakespeare's plays are missing, and others are only available in modified form.

If copyright enforcement was a thing in Shakespeare's time, then a lot more of his plays would be missing today; there would've been far fewer "modified forms" even partially preserving works for which the original was lost, and far less opportunity to produce copies of unmodified forms.

> I'd love it if it worked that way. I hope you're right – but I believe you're wrong. We can let them do this backtracking without letting copyright law tilt even further in their favour.

I guess what I'm getting at is that this doesn't tilt the law in their favor. Either Copilot ignoring the licenses of source material is deemed legal (at which point people are free to launder proprietary code through ML models, destroying copyright law as applied to software) or it's not (at which point a multi-billion-dollar corporation loses a revenue source). Lose-lose for corporations, win-win for the rest of us.


> If copyright enforcement was a thing in Shakespeare's time, then a lot more of his plays would be missing today; there would've been far fewer "modified forms" even partially preserving works for which the original was lost, and far less opportunity to produce copies of unmodified forms.

No. His plays were preserved and published by his friends, after his death, so that they weren't lost. Because the curtains had closed, the players had moved on, and the loss of his plays was deemed worse than the loss of monopoly over them. Copyright was originally created to remove the incentives to keep things private and secret; if you have exclusive rights to copy it, and that's protected by law, you're able to release those copies without fear of somebody taking what you've made and pushing you out.

Look at the playwrights who didn't have friends like Shakespeare's. Where are their works, now?

That being said, the world has changed since the 1600s. Copyright (with reasonable term lengths!) was necessary before we had the 'net, but perhaps not so much now.

> Either Copilot ignoring the licenses of source material is deemed legal (at which point people are free to launder proprietary code through ML models, destroying copyright law as applied to software)

The law doesn't work logically like that. If Copilot is deemed lawful, there will be justification of that decision that doesn't allow us to launder proprietary code through ML models. Because copyright isn't about author's rights, and hasn't been since… the 60s? Since some time after the Universal Copyright Convention of 1952, anyway; I'm no historian.


> No. His plays were preserved and published by his friends, after his death, so that they weren't lost.

You literally just said that there are plays of his which only survive in modified forms, i.e. ones which his friends were unable to preserve. Those modified forms are the ones that would disappear in this supposed alternate universe wherein Elizabethan England had copyright enforcement, because those modified forms would've been DMCA'd into nonexistence, and thus those plays would've been lost entirely.

> Look at the playwrights who didn't have friends like Shakespeare's. Where are their works, now?

Christopher Marlowe? https://en.wikisource.org/wiki/Author:Christopher_Marlowe

Ben Jonson? https://en.wikisource.org/wiki/Author:Ben_Jonson

Francis Beaumont? https://en.wikisource.org/wiki/Author:Francis_Beaumont

John Fletcher? https://en.wikisource.org/wiki/Author:John_Fletcher

Thomas Middleton? https://en.wikisource.org/wiki/Author:Thomas_Middleton

John Lyly? https://en.wikisource.org/wiki/Author:John_Lyly

The answer to your question seems to be: they were preserved just fine and dandy.

> If Copilot is deemed lawful, there will be justification of that decision that doesn't allow us to launder proprietary code through ML models.

A license is a license. If it's okay to ignore the terms of the GPL, then it's okay to ignore the terms of any other software license, including Microsoft's EULAs. The law works on precedents, and Copilot's legality sets a precedent that threatens the very concept of software copyright.

Even if Microsoft somehow avoids being immediately eaten alive for lunch by the myriad competitors who'd love nothing more than for Windows to lose its copyright protections (they do have plenty of money to spend on lawyers, after all), their best case scenario of "we somehow managed to carve out an exception for our EULA being enforceable that somehow doesn't apply to other software licenses" would demonstrate rather plainly the fundamental inconsistency in intellectual property law, making it all the easier to justify abolishing it entirely.


Thank you for the solid arguments.


I think your opinion is bad while also misrepresenting machine learning.

> It's not transformative at all, nor quoted for study/discussion, and that's exactly what makes it a problem.

This model works in the same way as GPT-3. It's just predicting what the next most likely word will be, given the previous words. It does this by creating a generalization over the content that was used to train it. This generalization should be similar (or close to the same) to what it would be with a completely different set of training data.

In the same way that you could create 60,000 new training images comparable to the MNIST dataset, and get a model that works at an equivalent level, you could do the same with source code. It doesn't matter where the original data is from because the end result should be very similar when models reach higher levels of accuracy.
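The "predict the next most likely word" mechanism can be sketched at its absolute simplest as frequency counting: a bigram table, nothing like the transformer Copilot actually uses, and the toy "codebases" here are invented. It illustrates the generalization claim: common idioms dominate the prediction regardless of which particular corpus supplied them.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def predict(follows, word):
    """Greedy decoding: return the single most frequent next word, or None."""
    counter = follows.get(word)
    return counter.most_common(1)[0][0] if counter else None

# Two different toy "codebases" that share a common idiom.
model = train("for i in range ( n ) : total += i \n for x in range ( m ) : s += x")
print(predict(model, "in"))  # 'range': the shared idiom dominates, whatever the source
```

Swap either toy corpus for another that uses the same idiom and the prediction for "in" is unchanged, which is the sense in which the learned distribution does not depend on any one source.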

> but as it is right now GitHub is simply an outlaw, with no excuse at all.

This is an insane view on machine learning. When you read content, neural weights are adjusted in your brain to keep it as a memory. Is that copyright infringement too? Do you now owe Disney royalties because you stored their movie in your memories?

It's also an incredibly harmful view. Machine learning is dependent on data. Services like search, translation, recommendations, etc. are all dependent on training data. And, you're making the case, that since you don't like that it was trained on source code, it should all be illegal.


> This model works in the same way as GPT-3: it's just predicting what the next most likely word will be given the previous words. It does this by creating a generalization over the content that was used to train it. This generalization should be similar (or close to the same) to the one it would have learned from a completely different set of training data.

It's not enough of a generalization to be acceptable IMO.

Let's see GPT-3 as something analogous to a brain. Human brains can do multiple things:

- learn through experience and re-use concepts, while writing completely original code,

- copy a program's structure without copy-pasting the code, effectively re-writing it,

- copy the exact code, with a few minor details changed.

Currently, Copilot does a lot of the last point [1]. So just like humans can choose different actions and the legality depends on the action, ML models can do different things, and these aren't automatically legal by virtue of coming from an ML model.

It's not enough to argue about how machine learning works. ML can do tons of things, and GitHub's current methodology leads to something close to copy-paste. Maybe it can learn a semi-original way to write common things like searching in a list, but for more exotic uses like complex algorithms, being unable to actually understand how to code, it basically has to act as a search engine for existing implementations.

> Do you now owe Disney royalties because you stored their movie in your memories?

That argument makes no sense. Nobody would be complaining if GitHub had just trained a model on that code. The potentially illegal part is offering a service that creates "new" Disney-like movies by assembling parts of Disney's IP.

[1] https://twitter.com/docsparse/status/1581461734665367554


> It's not enough to argue about how machine learning works. ML can do tons of things, and GitHub's current methodology leads to something close to copy-paste. Maybe it can learn a semi-original way to write common things like searching in a list, but for more exotic uses like complex algorithms, being unable to actually understand how to code, it basically has to act as a search engine for existing implementations.

I don't disagree that it is over-trained on certain sequences of words, but, overall, I think the generalization is fine. It's often pre-prompted with content it has not seen before, resulting in unique new content. There's nothing copy-pasted about this, just a statistical understanding of what usually happens next.

If the pre-prompt is something very specific, e.g. "a dog runs in the park during the middle of the day, while the sky is the color _____", it will obviously output blue. The same can be true when there are only a few well-known algorithms that have been used more frequently than others. And, in the cases where it does commit something arguably comparable to copyright infringement, the fault would probably be on the programmer, not the model, for deciding to use it.

> The potentially illegal part is offering a service that creates "new" Disney-like movies by assembling parts of Disney's IP.

In very specific cases only does it output copyrighted content. Most of the time it is just outputting a generalization of what is expected. It isn't Disney's content, but human content. Also, just creating content that has some similarity isn't copyright infringement. Satire has been well accepted as fair use.


You are assuming that laws/rights operate on basic properties like learning or remembering, but they operate on goals (e.g. humans can run, but there is no speed limit for a pedestrian. But if we learned to run at 100mph somehow, a limit would be introduced). The goal is to motivate creators to create on terms that would be fair enough and useful for a society. If you “generalize” their work at such scale, it is a very different situation, like when you are using a sidewalk for robots running at 100mph and demotivate anyone from using it.

> Services like search, translation, recommendations, etc. are all dependent on training data

There is an obvious benefit in being represented in search results and recommendations.


> If you “generalize” their work at such scale, it is a very different situation, like when you are using a sidewalk for robots running at 100mph and demotivate anyone from using it.

I don't understand what you're trying to state.

If you make fun of someone on national TV, they won't like it either. That doesn't mean it should be illegal.


The point of attribution and copyright is to create a creator/inventor-friendly environment. The copyright law used realistic points in a problem space to provide it. These new AIs enlarged the problem space, but it doesn’t mean that the initial idea of copyright is not applicable to them by default, or should not be. It’s okay to vote for or against that, but the potential systemic effect of that vote should also be kept in mind.

I hope this makes it clearer why I used this analogy.


> The point of attribution and copyright is to create a creator/inventor-friendly environment.

They're building a class, but beyond that I'm not really confident there are any parties that feel they've been harmed by what's claimed to be a violation. Microsoft didn't just train on repositories of GPL-licensed code, but on other companies' code too. Why didn't those companies care?

It is a good counterpoint though, as Waymo has released datasets that can only be used under a license, while GitHub did not initially use discretion for people who had agreed to its terms of service and were using its hosted repositories.

While I think the legal system might lean towards enforcing consent to use the data in certain circumstances, on the technical side, if the model really is a generalization resulting from training, the initial data is meaningless in the final model. It could be argued that it did not harm this company economically, because a model trained without that data would still have an almost equivalent financial impact.

> It’s okay to vote for or against that, but the potential systemic effect of that vote should also be kept in mind.

My position above was clear: I do not think the law should be changed yet. The systemic change comes from someone asserting a violation where I do not believe one occurred. This lawsuit could result in models like Stable Diffusion becoming illegal, which I view as incredibly harmful to the future of artificial intelligence research.


What if we extrapolate a little into the future? Copilot-likes and SDs become useful tools for knowledge/mastery extraction and application, much better than today's. Will people want to create new knowledge, or will they be satisfied with reshuffling the existing stock? Will opinions change once people realize there is no point in creating something new (it will be generalized away), and that what exists is easily reproduced without them (the classic "took our jobs")? Can it all happen? How do you see it?


Incrediblest full of all the harmfulsts! Not only Machine learning:

"Watching digital Walt Disney movies is dependent on data."

It is illegal as the attribution is missing. Also copyright applies, etc.; see the lawsuit. It's less a technical thing, more a legal one.


> Incrediblest full of all the harmfulsts! Not only Machine learning:

I do not agree that it's a good idea to make using copyrighted data for training illegal. The people who are upset that GPL code was used for training should be happy that a large corporation believes that training on copyrighted works is fair use. It's a more democratized position.

> It is illegal as the attribution is missing. Also copyright applies etc. see the lawsuit. Its less a technical thing, more a legal one.

I think in reality, this is just a lawyer or group of lawyers who realized they had a case and a reactionary audience to possibly make them millions of dollars.


If you wrote a book by clipping small bits of others' work and assembling it into something cohesive that would most definitely be fair use.


Incorrect, if you did what I actually said, which mimics what copilot does.


> The code is not transformative because the quoted code is not used for some other purpose like as part of an article discussing whatever the code does, it is used to do exactly it's original job.

Hmm..

> The printing press is not transformative because the printed text is not used for some other purpose, it is used to do exactly it's original job.

See the error in your logic? The potentially transformative part is not the code itself. It's the impact to the process of creating the code.


For copyright purposes the printing press is not transformative. No contradiction here.


Well, lucky for us, paragraphs of prose are artistic expression, and the snippets of code returned by Copilot are utilitarian.

The only things covered by copyright are creative choices. In software, as established by Whelan v. Jaslow [0] this is the structure, sequence and organization [1].

The district court ruling in the Whelan case drew on the established doctrine that even when the component parts of a work cannot be copyrightable, the structure and organization of a work may be. The court also drew support from the 1985 SAS Inst. Inc. v. S&H Computer Sys. Inc. in which it had been found that copyright protected organizational and structural details, not just specific lines of code. Structure, sequence and organization (SSO) in this case was defined as "the manner in which the program operates, controls and regulates the computer in receiving, assembling, calculating, retaining, correlating, and producing useful information." SSO refers to non-literal elements of computer programs that include "data input formats, file structures, design, organization and flow of the code, screen outputs or user interfaces, and the flow and sequencing of the screens."

However:

The Whelan decision initiated a period of excessively tight protection, suppressing innovation, since almost everything other than the broad purpose of a software work would be protected. The only exception was where the functionality could only be achieved in a very small number of ways. In these cases there could be no protection due to the merger doctrine, which applies when the expression and the idea are inextricably merged. Later the same year, in Broderbund v. Unison the court cited Whelan when finding that the overall structure, sequencing, and arrangement of screens, or the "total concept and feel", could be protected by copyright.

For a brief overview of the merger doctrine:

https://www.lucysnyder.com/index.php/copyright-law-merger-do...

So this excessively tight protection was rectified in CAI Inc v Altai [2], which established the legal doctrine of Abstraction-Filtration-Comparison [3] which says "As we have already noted, a computer program's ultimate function or purpose is the composite result of interacting subroutines. Since each Subroutine is itself a program, and thus, may be said to have its own 'idea,' Whelan's general formulation that a program's overall Purpose equates with the program's idea is descriptively inadequate."

[0] https://en.wikipedia.org/wiki/Whelan_v._Jaslow

[1] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

[2] https://en.wikipedia.org/wiki/Computer_Associates_Internatio....

[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow

Sure, there’s a sense in which copilot is just an alternative way of finding these snippets; if someone used Google to find a piece of code then chose to copy-paste it, the result would be the same. The code is out there on the internet, being indexed - and incorporated into training models.

The difference is that Google will (probably, if they don’t just yank the content into a snippet on the Serp) link you to the GitHub source, where the license should be clearly visible, so if you copy paste it you know what you’re doing. Copilot produces the code without attribution.

But it’s a narrower difference than we might actually wish. What is it about google’s pointing you to a line of code in a GitHub repo that absolves Google from any responsibility to tell you that the code it sent you to might be copyrighted? Presumably, when you land on that page you are expected to recognize from context that this is probably licensed code, and go look for the license terms. Why can’t copilot also rely on that same ‘user must recognize that some of the results produced might be copyrighted’ logic?

Worth noting that Google image search has long presented its results under a caption pointing out that the results may be copyrighted and you need to do your own research before using any images it produces based on your query. How different is searching its image index and producing potentially copyrighted images it found online, from copilot's search of 'how to code' doing the same thing?


> Why can’t copilot also rely on that same ‘user must recognize that some of the results produced might be copyrighted’ logic?

Google shows you where you can find the copyright and license information. GitHub Copilot does not.

> Worth noting that Google image search has long presented its results under a caption pointing out that the results may be copyrighted and you need to do your own research before using any images it produces based on your query.

Because there's a common misconception that "Google images" are legal to use for any purpose. Afaik, Google has no obligation to correct people on this; nonetheless, they choose to do so.


Google search sends you to where it found the code.

That the copyright and license information are maybe available there is not something Google search actually knows to be the case.

That information might not actually be there, and Google does not care either way.

Similar to photos, there's maybe a common misconception that code produced by copilot is free to use for any purpose. Is copilot under an obligation to point out that that might not be the case?


> That the copyright and license information are maybe available there is not something Google search actually knows to be the case.

Google sends you to where it found the code, on a best-effort basis. Google can't actually, in the general case, do any better than this. Google is a search engine; it's widely understood that, as a search engine, it looks for things and then tells you where it found them (not where they originated from; that's not what the verb "search" normally means, in English).

> Is copilot under an obligation to point out that that might not be the case?

GitHub Copilot is not branded as a search engine. It doesn't give people the option to identify, or verify, the terms under which the output it produces may be used. It doesn't provide any way of discovering the attribution, even in principle.

Normally, when something is given to you without any copyright markings, without any attribution of any kind, or licensing information, or a source:

• the computer probably came up with it on its own, just from the input you gave it; and

• you're free to use it.

Neither is the case with Copilot. GitHub Copilot goes against the expectations of the user; its UX design is actively misleading. This misconception only exists because Microsoft created it.


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow without attribution in any commercial application

Wasn't this the subject of millions of dollars of litigation between Oracle and Google?


For others like me, missing context:

> Google’s copying of so-called application programming interfaces from Oracle’s Java SE was an example of fair use, the court held (…).

Source: https://edition.cnn.com/2021/04/05/tech/google-oracle-suprem...


> Wasn't this the subject of millions of dollars of litigation between Oracle and Google?

That was a very different case. It was about whether an API/interface is copyrightable. The outcome of the case was that the implementation was obviously copyrightable, but the interface was not. The background was Google creating its own custom implementation of the JVM.


The outcome (SCOTUS decision) did not decide whether the interface was copyrightable or not. It decided that if the interface were copyrightable, it'd be fair use anyway, so there's no need to bother resolving that question.


I was thinking this, but it is almost the inverse. APIs without their implementation are not proprietary, but is an implementation without its API (and surrounding structure) proprietary?


> You'll never know if somebody has lifted a bit of code in violation of a license anymore than you would be able to tell if somebody copy-pasted from stack overflow without attribution in any commercial application

It's obvious that you've never endured any sort of deep code analysis for license "problems" because then you would know that it is in fact comically easy to tell if someone did this. It happens all the time.

While some detections are false positives, this is in fact why such deep analysis is performed: because getting caught selling a commerical application you don't have the rights to sell can destroy value very rapidly. It halts mergers and acquisitions. It gets products pulled off shelves. It gets people fired.


This doesn’t fit the definition of fair use, if that’s what you are implying. Fair use is a balance between four factors, transformative nature is just one factor.

To satisfy section 107 of the U.S. Copyright Act, you need to look at four factors. The first is the purpose and character of the use, which is where a determination is made as to whether a work is sufficiently transformative to be considered fair use of copyrighted material. There are other considerations in this test, however: a court will look at what sort of commercial interest and benefit is derived from the use of the material, and even then it may decide that a non-profit motive is not sufficient to satisfy this factor.

The second factor a court will look at is the nature of the copyrighted work. In the case of open source material, it is already published, but most licenses quite rightly ask for attribution. A court may decide that not satisfying the attribution requirement causes it to fail this criterion and prevents it from being fairly used. However, on this point I feel Copilot may be on steadier ground, as the work is already published.

Fair Use doctrine also factors in the amount and substantiality of the portion being reproduced. It is the substantiality issue that will likely trip up Copilot, as the code snippets in use may be complex enough to be considered a substantial part of the code. A court may well consider that an algorithm that took time to create and test, although only a small part of the codebase, took substantial enough effort to create that its use in a larger derived work should be looked upon dimly.

The fourth factor in play here is the one being argued by the litigant, and I feel will be extremely hard for GitHub to justify; that factor is the effect use of the code has on the potential market or value of the code. Open Source projects rely on volunteers and contributions. Substantial effort is made to actively maintain projects. It is not within the spirit of most projects to have people use unattributed code snippets, because it reduces the value of contributions and improvements to the code project. Improvements and contributions are reduced, and it is also hard to distribute improvements and fixes to any project that uses the code. In this regard, CoPilot seems to be completely violating Fair Use doctrine.

So as you can see, Fair Use is way, way more complex than most people realise. Many people think that a sufficiently transformative use of copyrighted material is enough to satisfy a Fair Use defence, but in reality it is not.


Exactly. It's like people calling for Dall-E to be banned. Well Stable Diffusion exists, as do a dozen other similar projects and an entire ecosystem of training data and models. The march of technology rarely stops because of people crying "copyright!"


> "The march of technology"

Who are the stakeholders making the machine run? Who owns the machine?

This is much more subjective than clinical rhetoric praising tech might let on.


The BigCode project is coming, and it's going to be open-sourced the same way Stable Diffusion was.


We can't expect massive improvements in computer performance anymore, so the day will never come when a household computer can finish training a model equivalent to GitHub Copilot in a practical amount of time.


They can just add more cores and more power. With modern languages like Rust making multi-threading more accessible, I expect we will double down on this. You could also crowd source this, some distributed application where everyone puts their home machines towards training.


> You could also crowd source this, some distributed application where everyone puts their home machines towards training.

That's how Leela Chess Zero (Lc0) replicated AlphaZero's performance. In fact, this is actually not that difficult, assuming you have the means to orchestrate it: all it takes is loading the weights and a batch, computing backprop, and submitting the result to a central system, which aggregates the gradient updates, applies them to the whole network, and pushes new weights (kind of how Bitcoin creates a new block).

This is no different from gradient accumulation, just "distributed". In fact, the system could offload a large number of batches, because the returned update is constant-size regardless of batch size; it's just that the constant is the size of the network.
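
A minimal NumPy sketch of the aggregation step described above, with made-up names (worker_gradients, lr, weights); real systems like Lc0 add validation and fault tolerance, but the core step is just averaging fixed-size gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.zeros(4)  # shared model parameters (tiny for illustration)

# Each worker backprops on its own batch and returns one gradient whose
# size equals the network's, regardless of how many samples the batch held.
worker_gradients = [rng.normal(size=4) for _ in range(8)]

# Central system: average the workers' updates and take one step; the new
# weights would then be pushed back out to the workers.
lr = 0.1
avg_grad = np.mean(worker_gradients, axis=0)
weights -= lr * avg_grad
```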


I’d argue that running it on your local machine as your own copilot is a better proposition than what is offered now.

There is still the issue of setting boundaries on what to learn from, but it is much more aligned with you taking responsibility for what your copilot does or doesn't do, and it might give a better window of opportunity to customize it to your needs.

We’d lose economies of scale, but also lose infringement at scale, and gain individual control. Net positive overall?


Any ML model that decides to use my code as part of its code suggestion functionality has a fool for an algorithm.


5. Microsoft, a $1.5e12 company, will not profit from the technology (as easily).


If that is so, people will stop writing open source. Then this model will use its own output until it produces bland garbage like a camera filming its own output stream.


Linux has a lot of corporations contributing code to it. Tensorflow is backed by Google. Plenty of other projects are released as open source by companies, not people. I don't see companies crying about Copilot, so their employees will keep writing open source when instructed as such by their managers.


I write open source code to benefit others and to put more learning material into the world. If I were so paranoid about other people using it, I would have made it proprietary.


“People will stop writing open source” is far from being self-evident.


The desired class-action status of this suit feels like a stretch to me. Typically, in a class action, the defendant's alleged behavior unquestionably harms the members of the class, which is why it's okay to make them party to the suit without their consent.

In this case, I suspect that only a minority of the members of the purported class actually believe they were harmed by the defendant's actions, and some class members are _paying customers_ of the defendant's product based on the conduct at issue. (Which surely creates some kind of estoppel issue, aside from the dubiousness of the class certification in the first place.)

Could some enterprising attorney get a bunch of members of the purported class to sign on to a brief opposing class certification? (I don't know if that would be an amicus or what, but surely those people have some kind of standing here.)

(Enterprising attorneys: my email address is in my profile.)

I guess the problem here is with "enterprising" -- there's no money there, except to Microsoft, who I imagine might be barred by legal ethics from funding something like this...


Ultimately losing a case like this might be better for super large players like Microsoft in the long run: they'll staple blanket permissions (and maybe even indemnities for third party code you submit) in the terms of service and go ahead training their own proprietary models.

But players from startups to open source projects that lack market power will then face an impossible moat if they want to develop their own similar models.


> Could some enterprising attorney get a bunch of members of the purported class to sign on to a brief opposing class certification?

Yes, I'd sign onto this (and I have open source code on GitHub that presumably (hopefully!) is being included).


CopyLoot model of business ruins Open Source model of business (AKA Open Core) and Open Source movement in general.


Letting trillion-dollar companies freely grind up the collective creative output of planet earth into digital meat slurry to fuel automatic content-generators seems like a bad thing.


> Letting trillion-dollar companies freely grind up the collective creative output of planet earth into digital meat slurry to fuel automatic content-generators seems like a bad thing.

The description is pretty loaded, though.

"Grind up" evokes Microsoft taking away the creative output and destroying it forever to create an inferior product. In fact, the original code is still there; you just also have a smart auto-completion tool built with it.

Yes, there are concerns about license laundering and corporations taking advantage of open-source code without contributing to it, but these concerns are a lot more subtle and nuanced than "megacorporations are grinding up the collective output of the planet into slurry".

My hot take: piracy everywhere, and screw intellectual property. It does more good for the world if anyone is allowed to build upon what anyone else has done before without having to ask for permission first, even if it makes monetization harder.


Copyright is indeed pretty fucked. Megacorps repackaging and selling products created by indies is also pretty fucked. I'm pro "freedom of information" but we need laws around the sale of information to prevent that sort of thing.


And letting them do this by exposing your minute code edits in response to issues as a developer, to make your job eventually redundant and have it taken over by ML, seems like an even worse thing.


AI taking jobs is not an argument.

License violation is.


> to make your job eventually redundant and have it taken over by ML

This will last until the programmers run out. These systems will have a bad time writing in any language they haven't got extensive training data in.


Software is only going to become more ubiquitous, programmers won't run out, they'll just shift up to higher level abstractions and more diagnostic/administrative work. Think "AI doctor" and "AI manager" rather than code slinger.


Would be nice if they released the model, stable diffusion style. In fact, this kind of legal action could prevent disclosure of any AI models in the future.


Funny how it's "Microsoft GitHub" when something bad is being said about it. But it's just "GitHub" for positive things.


“A noteworthy discovery was made today in Edinburgh, marking another great British breakthrough.”

“In London today, further advancements were made in <blah> technology, showing the strength of English academia.”


Guess how your son got himself into trouble now

Our son is on the honor roll


I feel that deep down inside, Github is dead or will be dead, we just don't yet have anywhere to go...

That's why people talk like that. There's the GitHub the world loved, then there's the Micro$oft version, which we're starting to see problems with.

I think "GitHub" is the good old place; Microsoft GitHub is the problematic version we're going to see more of.


GitHub was stagnant before and GitLab had basically overtaken it. Since the acquisition they have massively picked up the pace and delivered several exceptionally good features.

GitLab is still a very viable alternative but there is just very little reason to want to leave GitHub. The vast majority of devs either don’t care about or enjoy copilot. It’s just a small number of HN whiners who make most of the noise on the topic.


I think Copilot is very much in sync with the GPL's idea of copyleft and learning from code. All projects using Copilot should therefore also be GPL or AGPL.


It isn't compliant with GPL's attribution requirements. If it were made so, then yeah, I'd agree – with the caveat that it'd also have to be MIT, BSD, Apache etc. compliant.


Everything GitHub has done since Microsoft's acquisition has been great! The product is getting so much better – Copilot is wonderful, Actions are great, issue improvements. More please!


Sounds like you might like: https://sourcehut.org (it's also completely free of crypto/web3 schemes, as an added bonus)

If GitHub (M$) continues this path of stomping on licenses/copyright of the people who made it great in the first place (i.e repo authors/contributors), I'm definitely gone.


Oh they're "YOUR KIDS" when they misbehave, huh??


Anyone who has known Microsoft from back in the day knows attributing anything positive to them is likely an error


Yes, we get it. Microsoft bad. Very good. The reality is that quite a lot of “us” are still using GitHub, including the Microsoft-built/directed parts of it, willingly. A company can be hideously evil and still make a single useful thing in ~50 years.


I use it willingly, but I feel bad. I hate that Microsoft owns it. But it's essentially a social network (at least in the way that I use it), and network effect dictates that I use it (as opposed to, say, GitLab). :c


Keeping ancient software working without rebuilding is quite impressive, to be fair.


Which ancient software?

All of them have been rebuilt IMO.


Windows executables. Applications not touching hardware should still run on any newer version of Windows, provided that Windows build still supports the instruction set; e.g. the original Windows 1.0 "Hello World" demo (a 16-bit exe compiled on Windows 1.0) still runs on 32-bit Windows 10 [1].

1. https://virtuallyfun.com/2020/05/22/examining-windows-1-0-he...


Oh stop it. Microsoft has made/done plenty of positive things, you're just blind to them because you have somewhere between a mild dislike and a hatred for the company. Open your eyes and see.


> you're just blind to them because you have somewhere between a mild dislike and a hatred for the company.

Sophisticated enough to understand the utility of Github, but too stupid to understand their own biases? The idea that someone is faulty in their understanding is an uncharitable take. The benefit of (MSFT's) history, is you can learn from it, or ignore it and claim that those who have are being irrationally ornery.

> Oh stop it.

Microsoft is one of many indifferent profit-seeking tyrants. I will never stop pointing out how positive characterizations are, at best, misinterpreting their intent. I'm much more likely to believe any feature is a stepping stone that MSFT plans to leverage later to extort its own users. This plan may or may not come to fruition, which is incidental.


> Microsoft is one of many indifferent profit-seeking tyrants

As opposed to all of the other companies that aren’t seeking profits?


As opposed to all of the other non-tyrant companies, both profit-seeking ones and non-profits.


Which companies are those that aren’t “tyrants”?


> As opposed to all of the other companies that aren’t seeking profits?

One of many is right in the quote.

The obvious difference is how the power in an industry has been wielded. Railroad barons were the closest historical equivalent, imo.


What power is Microsoft “wielding” as far as hosted git repositories? Literally anyone can set up a git server


> What power is Microsoft “wielding”

Again, this is about historical context. They wielded power widely enough (monopolistic practices, predatory acquisitions, etc.) that they earned the distrust. Every move, however magnanimous, is carefully weighed for how it can be monetized to the maximum (now, without drawing government scrutiny).


Weird post to defend. Are you seriously criticizing my post when the post you're defending is worse?


> Are you seriously criticizing my post

Yes. Personal attacks are unwarranted here.

> the post you're defending is worse?

I'm not sure what you mean by "worse" here.

> Anyone who has known Microsoft from back in the day knows attributing anything positive to them is likely an error

I agree with this fully.


> Personal attacks are unwarranted here.

Neither is hyperbole.


>> Personal attacks are unwarranted here.

> Neither is hyperbole.

So we agree? "Personal attack" and "unwarranted" are correct characterizations. That's nice to hear.


Or anyone using azure at scale today


Xbox was good! (But Windows remains very, very bad.)


What we need is a new license (like the AGPL did for cloud) that prevents machines from ingesting and learning from the code. All open source projects that have trouble with Copilot could then relicense under this specific license to avoid it. Involving government would be a terrible approach that will stifle innovation.

EDIT: Not a lawyer, but, unlike copyright, which just blocks usage, I think a license can be more creative and will help preserve the FOSS spirit. For example, a license can require access to the source code, the other data collected, and the model file that makes use of your code. Having to release the Copilot model in the open should be enough to make MSFT back off.


The applicability of various open source licenses relies on the fact that without accepting the license, the act of distributing software would be a violation of copyright law as that is an exclusive right of the copyright holder and requires their permission (i.e. license). Anyone can refuse the conditions of the GPL (just as any other contract), it's just that without accepting the GPL license they aren't allowed to redistribute their version of the software, and you can sue them for copyright infringement.

"machines learning from the code" is not like that - this is not an exclusive right awarded to the authors (quite the opposite, quoting copyright law, "In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.") and it does not require the permission of the copyright owner. If I have a legitimately obtained copy of some copyrighted work, no matter if it's a book, audio recording or code, and I don't have any contractual restrictions, I'm free to train a ML model on it. And if an open source license like you propose would create such contractual restrictions, I don't need to enter that contract, because I don't need a license for this.


Alongside copyright, similar rights often exist as well, such as protection for databases (which could cover the file system, or the graph database in an SCM).


Database rights are not universal.


The BSD 2-Clause License appears to attach conditions to both "redistribution" and "use". Quote:

  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:
Is that a mistake in the BSD license? I don't think so. I think you are mistaken.

See the discussion here "Can a software license impose restrictions on the place where the software is to be used, so that a court would enforce those restrictions?"

https://law.stackexchange.com/questions/78519/can-a-software...

There are numerous examples there of licenses restricting use: Apple's licenses, the Unreal Engine, etc.

License Restrictions Sample Clauses:

  License Restrictions. Licensor reserves all rights not expressly granted to You. The Software is licensed for Your internal use only. Except as this Agreement expressly allows, You may not (1) copy (except for back-up purposes), modify, alter, create derivative works, reverse engineer, decompile, or disassemble the Software except and only to the extent expressly permitted by applicable law; (2) transfer, assign, pledge, rent, timeshare, host or lease the Software, or sublicense any of Your license grants or rights under this Agreement; in whole or in part, without prior written permission of Licensor; (3) remove any patent, trademark, copyright, trade secret or other proprietary notices or labels on the Software or its documentation; or (4) disclose the results of any performance, functional or other evaluation or benchmarking of the Software to any third party without the prior written permission of Licensor. Hosting Restrictions. In the event that You desire to have a third party manage, host (either remotely or virtually) or use the Software on Your behalf, You shall (1) first enter into a valid and binding agreement with such third party that contains terms and conditions to protect Licensor’s rights in the Software that are no less prohibitive and/or restrictive than those contained in this Agreement, including, without limitation, the Verification section below; (2) prohibit use by such third party except for the sole benefit of You; and (3) be solely responsible to Licensor for any and all breaches of the above terms and conditions by such third party.
If a license can prohibit decompilation or copying, we can obviously prohibit language model training.

And that's what we need to do, for the reasons stated here:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...


It's not uncommon for licenses to try to assert overbroad conditions, just as a discouragement and also a way to ensure that even if they are not relevant in some jurisdictions, they stick elsewhere, and the license you quote is a good example of that. A license or a contract saying something does not make it true (especially so in civil law jurisdictions - I've seen contracts and terms&conditions where the majority of clauses are absolutely void because they contradict relevant law), and of course the validity of the contract also is relevant (e.g. while in USA shrink-wrap licenses may be considered valid contracts, in much of the world they are not binding).

It does not prohibit decompilation, although it tries to do that. All it says it that the licensor does not grant me the right to decompile or disassemble the Software. It does give a nod to "except and only to the extent expressly permitted by applicable law" which is the key part (and would be valid even if they did not say it), because the applicable law (at least for me) does grant me the right to decompile and disassemble the software for various purposes without the permission of the copyright owner, i.e. this license does not actually prohibit decompilation, no matter what it says.

The same applies for language model training.


It's a broad clause which relies on the fact that some types of computer software "use" may require permission of the copyright owners, depending on jurisdiction - in essence, I'd say that the validity of this restriction depends on how the specific law treats the incidental copies of the software created as it is being installed, executed, etc; this (unlike most parts of copyright principles aligned in international conventions) isn't universal globally.

My position is mostly based not on code but on text, as in natural language processing there is a similar but much older situation of models being trained on copyright-protected work, the interests of researchers and publishers obviously differ, and at least currently (laws do change) the legal position is that publishers' requirements can be (and are) ignored, as models can be trained on these texts without their permission and even after their explicit objections / cease and desist requests. And a BSD or GPL or some other license can't do anything more restrictive than the book "license" of "all rights reserved, we don't grant you any permissions".


I don't think so.

Please refer to the material I provided. Clear examples of use being restricted (decompilation, disassembly, copying, publication of performance measurements, etc.)

A license can restrict use, and we need to do exactly that (restrict use in training models and in inference) to address this new kind of threat to intellectual property rights.


I suspect those materials were read carefully. The gist of the rebuttal is that an author cannot reserve rights they do not have and licenses "can" claim restrictions that they are not able to actually claim (depending upon local law)


The No-AI 3-Clause License:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...

This is the only license to explicitly disallow language model training and inference using your code.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  ...

  3. Use in source or binary form for the construction or operation
     of predictive software generation systems is prohibited.
All other licenses seem to apply only when the code is emitted, and therefore can't really protect you. This attacks the problem at the source. If Microsoft trains Copilot on your code, they are unambiguously violating an explicit condition that controls use.

Otherwise, it's just the standard BSD 2-Clause License.


I think it's a good start. Not a lawyer, but my guess is that the definition of "predictive systems" will need to be settled by courts. If a system just collects and curates, but is fully firewalled from the transforming or learning module, is that part of a predictive system? It's a fascinating start and I will be watching its adoption.


unfortunately that's a terrible idea:

1. copyright licenses cannot limit what you do with your legal copy of the software. It's not what copyright is or does.

2. it's also no longer an open source license, OSI, DFSG, GPL are all incompatible and would create practical problems for your would be users

this is a problem that doesn't need to be solved: your license likely already prohibits this (copying without attribution, re-licensing, or license-washing).

what remains is courts fine tuning what is "fair use" snippets or not, and it's not likely to come down as OK to use snippets like the fast sqrt as fair use. same whether it's a human abusing fair use, or a company, or a machine used by either of those. the AI aspect has no legal bearing and changes nothing - it just made the problem practical enough to actually happen.

open source people and projects that have had their licenses violated need to sue and defend their existing copyrights.


It seems that the argument that Microsoft will use is that their use of copyrighted code is actually fair use. If it's fair use, they're free to ignore your license that tells them they can't train Copilot on it.

Ultimately, copyright is a fiction that's created and protected by the government. A lawsuit is how we decide what is legally protected or not by copyright.


Yes, that will require a few court rulings, just like how courts recognised the GPL as a valid license to stop corporations from violating it. Ultimately, we don't want a blanket ban on or approval of MSFT using FOSS code. A license can make it possible to have nuance, on a case-by-case basis.


Not necessarily a single new license; many existing licenses could be upgraded by directly addressing this type of use.

Maybe extending the definition of derivative work.


#LicensesAreForLosers


Maybe you should tell that quote to RMS /s


Every analysis of this filing by legal professionals I have seen online concludes that Microsoft will use a textbook fair use defense and will succeed at it. There was a lot of precedent set in this area by Google v. Oracle.


> There was a lot of precedent set in this area by Google v. Oracle.

I don't see the connection at all. The precedent had more to do with APIs, and possibly some influence around implementations which are trivial enough to disqualify them for copyright.

Sure, some copilot output is in that latter category but a whole lot of it is very obviously not.


I'm curious what that precedent is, given that Copilot consumes, reproduces and distributes much more than simple APIs, such as full implementations.


To start, you have to treat copyrighted works on Github as a whole (say the entire project/repository or at least substantial parts of it) rather than look at the few lines that copilot copied in isolation. With that context:

1. The derived work is transformative. It uses the original as inspiration but the end result is substantially different.

2. Only a trivial amount of the work is copied (a few lines of code out of thousands/millions).

3. It is hard to prove that the creators of the original work were damaged by copilot (for example did they lose any customers or revenue?) This is normally the single biggest test in such cases.


> you have to treat copyrighted works on Github as a whole (say the entire project/repository or at least substantial parts of it

But that's not how copyright works here, so this analysis seems irrelevant.

> 1. The derived work is transformative. It uses the original as inspiration but the end result is substantially different.

It's not. Changing variable names isn't a substantial difference.

> 2. Only a trivial amount of the work is copied (a few lines of code out of thousands/millions).

That doesn't matter. Relative size doesn't matter, absolute size does.

> 3. It is hard to prove that the creators of the original work were damaged by copilot (for example did they lose any customers or revenue?) This is normally the single biggest test in such cases.

??? No? You appear to have no clue how copyright works or what copyright infringement does. It's damaging solely in that you don't have the right to reproduce it without my permission. It's mine.


Well you can have your own interpretation of copyright law, but the Supreme Court – who ruled in favor of Google exactly on the basis of these three tests I shared above – will disagree with you.


Ruling on that basis in one case doesn't make those the only tests for whether something infringes copyright. You have to qualify on all of them.

Also, there were four tests.

The general principle of the ruling is that:

- it was mostly about APIs; organization rather than implementation.

- it was sufficiently transformative

- it was a small amount of code of insubstantial value

- it was serving a different market

So, I suppose you have indeed proven that we can have different interpretations of copyright law.

Taking a small but substantial piece of implementation code and using it in a similar way for a similar purpose to solve a similar problem would appear to fail all of those tests, and at least to me smells like rancid infringement.


I wouldn't be surprised to find out that MS themselves secretly supported this case to win it and to show potential customers that this tool is safe.


I am a legal professional. Convos with colleagues have been all over the place.


Legal professionals conclude Microsofts first move should be to admit copyright violation? What is their hourly rate?


Big discussion a couple of days ago: https://news.ycombinator.com/item?id=33457063


For a while now I have not wanted to use GitHub because of this tool. Why should anyone write code for Microsoft for free when Microsoft charges for its software?

Perhaps options like GitLab and others should be considered instead.


I like Copilot and use it, so I'll be glad to continue using it and it can use my code to help train it. It's a fair exchange in my view.


That's cool for you. For the rest of us, however, it would be terribly simple for Github to provide a profile/account setting that allows people to opt out -- perhaps on a per-repo level as well as per-account. Yet they don't. I wonder why not.


Until someone forks/mirrors your project on GitHub


I've certainly been enjoying Codeberg for my personal development.


> Why should anyone write code for Microsoft for free when Microsoft charges for software

Do you feel the same way about contributing to the Linux kernel, which is used by commercial software companies?


If you don’t want people to use your code, consider making your repos private.


What if you want people to use your code, but not for proprietary software?


Impossible. As soon as you show it to anyone there is a chance they will use something they learned in their proprietary software, likely unknowingly.


And if that happens, I have the right to sue.


GPL?


There are thousands of open source libraries used legally in thousands of commercial products, so I'm not sure I understand your point.


Why are programmers so suicidal?

A giant corporation openly steals code from millions of devs and uses it to try to automate us away, and programmers cheer it on.

A billionaire with known poor labor practices buys Twitter, and Twitter devs make no attempt to organize as labor; instead they just write toothless statements.


> A giant corporation openly steals code from millions of devs and uses to try and automate us and programmers cheer it on.

Actual programmers don't cheer it on. Only modern """programmers""" aka professional CTRL+V pressers, the kinds of people that usually "write" software in languages like javascript by importing a few hundred open source libraries and jamming them together until it vaguely does what they want. Copilot is good for these people because using others' code is all they do anyway, AI just helps them do it more efficiently, and without having to worry about all those pesky licenses.


Excuse me? I know tons of very skilled programmers who use it, simply because of the time it saves.

I use it regularly in C#, JS, CSS, and hell, C++ occasionally.

It's not about using other people's code, it's about saving time by generating pretty much the code I was already going to write (with pretty good accuracy too). Once you have enough knowledge, I see no difference between the code Copilot generates and what I would have written (at least, not any better quality/perf).

So yes, I cheer it on, and I love it. It has made my development life easier. 17+ years of coding, and this has been a big impact for me.


How about a version of Copilot that only learns from your personal career codebase? Or one that works only on a company's internal codebase? That would remove any legal ambiguity.

If these templates are as common as you suggest, it seems learning from your personal codebase should get the job done.


I don't see many people cheering it on here, in fact most of the news about copilot has been negative. Where do you see positivity?



6 posts in a thread with 232 points? This is not good math.


So I recently had a discussion with a copyright academic expert, and based on what he said to me, I'm wondering how they are suing.

Basically, the common conception of all produced work being automatically copyrighted is correct. However, one can only sue for copyright infringement for works that have a registered copyright (one can register after the fact, one is then just limited in the damages they can sue for, i.e. no enhanced damages for violations before registration).

My assumption is that even if GitHub Copilot is violating people's copyright, since the copyright for the vast majority of the code is not registered (does anyone actually pay to register the copyright on their code?), it can be difficult to sue for damages, especially as a class action (where presumably the class of those who have registered their copyright is minimal in size).

Was I informed incorrectly? (I'm also wondering, if this is true, how it impacts enforcement of GPL-type licenses in the USA; again, the vast majority of code licensed under them isn't registered, and while one could register after the fact, how many people really do?)


They're suing for breach of license, because CoPilot copies code and omits any mention of the source license, which they claim is not within Fair Use.

You can read the complaint: https://www.documentcloud.org/documents/23264658-github-comp...

It's a bit .. odd. Seems to be partially a complaint, and partially a code review of the couple of examples they cite. Not quite sure what the point is there. Other parts are clearly copy-n-pasted with minor changes from Wikipedia without attribution, which is funny to see in a document basically claiming that's what the other side is doing.


My partner is a paralegal, and just yesterday I was marveling that there's no concept of citing legal briefs, just wholesale copying of blocks of text without attribution. You cite /cases/, but briefs are fair game for what anyone outside the legal profession would call plagiarism.


If you want to bring a suit, you have to register the copyright. You can have someone infringe upon your unregistered copyrighted work and all you have to do to sue them for that is to then register the copyright after the infringement.


This is big because it sets a precedent for broader usage of these "copy and make similar" ML models. For example, I believe it'll impact the future of AI image generation.


I have a crazy theory Microsoft may have been seeking this outcome.

Microsoft isn't stupid, and Copilot is so obviously legally dubious that it stretches reason that they launched it without better legal justification or discourse. Microsoft employees basically treat any notion that it isn't legal as nonsense, even though they lack any real precedent to justify the position.

Is it plausible they might have launched a product almost certain to face a legal case in order to establish precedent for using training data? The company is working in this industry and it likely needs to know for future projects that are harder or costlier to build.


This is pretty much just stomping on sand. Law and positive norms rarely can constrain a breakthrough tech like this.


>Law and positive norms rarely can constrain a breakthrough tech like this.

True enough, but the point of this type of lawsuit is to define the legal framework within which the tech can operate. The law could end up saying that AI generated code from public, but license constrained code, must abide by the license of the original code. Or maybe there are rulings that help to better define "fair use" within this context. There is clearly significant legal ambiguity so this and other suits will help to clarify things.

Of course those laws can't constrain someone from using the tech on their own computer how they want, but most public companies will create policies to keep their workforce within the lines of the defined legal framework.


It's funny how companies can become their own worst enemy, which Microsoft has been since it acquired Github. Github itself recommends that you use the GPL license (see the Github-run https://choosealicense.com/). This has likely led to many projects choosing GPL/BSD/Apache/MIT by default, simply because the author didn't understand the differences between them. I've found that often, once people take the time to educate themselves about the various licenses, they instead prefer a highly-permissive license like The Unlicense (https://unlicense.org/).


Why did you lump MIT in with the non-permissive licenses?


Any license that requires attribution is not permissive enough to be free from copilot legal uncertainty.


I think this renaissance of nlp for coding will be the death sentence for open source software.

Nobody will be able to convince their manager to make their project open source when it can be so easily crawled and used in seconds by competitors without any attribution.


If that's all it takes to kill open source, then I welcome its demise.

Rather, it may spell doom for "let's throw this over the fence and have unpaid suckers work on it for us" software, which would be a net benefit for society.


> and used in seconds by competitors without any attribution.

Ah yes, because before Copilot it was impossible for competitors to use the code of an open source project.

This is nonsense.


If the code was GPL-licensed, it was indeed impossible (legally) to use in proprietary software.


> it can be so easily crawled and used in seconds by competitors without any attribution.

this can already happen today. ML models and such don't really change this aspect of open source.


But you are still liable if you steal code.

With this NLP approach, it is not obvious where the code came from. It could be a mixture of code that produced the outcome, all stolen, but you will have a difficult time proving it.


> With this NLP approach, it is not obvious where the code came from

so if this exact same action was done by a human today, instead of using an ML approach, how does the copyright holder of today catch them? Why can't the same method be used for an ML produced code base?


What is your take on "steal code?" Are you thinking substantial portions, like a control system or the code behind a semantic image processor, or are you thinking the implementation behind 'reverse a linked list?'


In my opinion stealing is when you benefit from an implementation that you were not supposed to copy-paste.

I do not think that using an existing idea to work on your own implementation is stealing.

But copilot does not reimplement ideas, it blatantly copy-pastes implementations (with comments, bugs and everything)


That was not quite an answer to the question I asked. I'm curious what scale of implementation you have in mind. Are we talking many lines, a block, one line?

Asked another way, could that implementation be one line? Say, an implementation to logging a java exception within a catch block? Or, is the implementation something you would find for a leetcode question? Or are we talking something larger like the implementation of evaluating a mathematical formula in spreadsheet software?


Well, let’s just leak the Windows source code into the training set. Then M$ will have no choice but to shut it down.


Can neural network tensors be seen as an encoding?

If I copy source code and save it in a UTF-8 text file, it is subject to licensing. But if the same information is represented in a neural network, all of a sudden this "trained data" is intelligence.

It's a complex representation of the data, and some of it may be a bit scrambled, but in a way it's still an encoding of the source data, which should be treated as having an origin and an author.
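The "encoding" framing can be made concrete with a toy sketch (this is emphatically not how Copilot works, just an illustration): a trivial "model", here a first-order Markov chain trained on a single snippet, stores enough in its parameters to regenerate the training data verbatim. The snippet and helper names are made up for the example.

```python
# Toy illustration: a first-order Markov chain "trained" on one snippet.
# Because the training set is tiny, the model's parameters effectively
# encode the original, and sampling reproduces it verbatim.

def train(tokens):
    model = {}
    for a, b in zip(tokens, tokens[1:]):
        model.setdefault(a, []).append(b)  # successor table = the "weights"
    return model

def generate(model, start, length):
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(successors[0])  # deterministic: first observed successor
    return out

source = "int fast_inv_sqrt ( float x )".split()
model = train(source)
print(" ".join(generate(model, source[0], len(source) - 1)))
# prints: int fast_inv_sqrt ( float x )
```

Real language models are vastly larger and lossier, but the same question applies: at what point does a parameterized representation that can emit training data verbatim stop being a copy?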


I'm still waiting for someone to contaminate Copilot with e.g. leaked Windows source code bits to "launder" them for free reuse.


This may actually backfire on Butterick. I am sure Microsoft has a very competent legal department and they have given a lot of thought to the matter.

Most of these articles assume that Butterick will win, but I don’t think that’s a given. If the court rules that this is fair use, then that could set precedent for even further applications of CoPilot and remove some of the legal uncertainty around its use.


Letting Copilot fill in portions of code "learned" from open source code is IMHO fair. Most devs worth their salt, even without Copilot, would eventually write approximately the same code given the same requirements. What should be taken to court is when code for an entire feature, or several features, is taken verbatim from somebody else's project or work.


It seems that Microsoft is conflating "public" with "unlicensed."

Sadly, it is a common practice. Humans are really good at rationalizing their way down the slippery slope. "I'll just use this clever way to loop over that structure. Well, actually, the whole function pretty much does what I need, I'll just drop that function in. Hmm, I could just change the API of this module a bit and that will work nicely..." It doesn't take long for learning to turn into copyright violation. I've seen this repeatedly while reviewing code of junior engineers. The engineers aren't malicious. They just sort of slide into it over time.

As Copilot gets better and better, I just don't see how it doesn't end up violating license terms, at least occasionally. Even if much of its output is fair use, as long as some of it is violating licenses, MS has a problem.


Open-source licenses were a mistake. "Do whatever the fuck you want" should be the only open-source license.


Open source licenses are a response to the tragedy of the commons that follows the "do whatever you want" mantra.

It turns out what many developers want is to share code in ways that keep the commons healthy and protect user freedoms.


I do agree with this case: Microsoft is illegally profiting from other people's source code.

Just because the source code is open and sometimes free to use, does not mean Microsoft can sell a service based on it.


Nobody seems to consider the fact that non-US law also applies to this issue. GitHub's ToS are subject to US/California law, but that does not exempt GitHub from respecting copyright in other jurisdictions where the service is provided. Many common open source licenses have no governing-law clause (e.g. BSD and MIT). Since one of the primary defences for Copilot is fair use, and that concept does not exist in the EU, Copilot would seem even more legally iffy in other jurisdictions.


One doesn't simply buy GitHub for $7.5e+9 to foster the community and the ideals of open source.

Moreover, ideals like that don't simply pop out of the blue and right into the collective psyche without comparable infusions.

To drive it home in case the lines between the dots are not obvious: open source ideology, GitHub and Copilot are merely successive stages in the production of profit. The next step is obvious: get rid of human programmers and save on their wages.

This is where ideals lead.


What’s the risk exposure for startups whose engineers are using Copilot to write proprietary private code?


It would be very bad for tech progress in the West if these types of lawsuits succeed while China can still do this. The legislature would have to act to explicitly legalize it, which could take years, and it could be too late by then. The US is already falling behind China in AI progress.


The argument of "the west" vs "china" is a bad one.

"US company sued for underpaying employees" -> "It would be bad if labor laws are enforced here, china has lower minimum wage and exploits workers more, we have to be competitive"

I know those situations aren't exactly the same, but you're trying to use that same justification, that we should think about our laws in terms of a competition with china, in terms of AI progress, not in terms of the arts or workers or happiness.


What you have stated doesn’t discredit the argument at all, it merely points out that there are additional things possibly worth considering, which is basically always true.


The only thing it would mean is that companies would have to pay for their training datasets like they already do for many applications of ML, instead of freeloading off of the work of millions of developers.

Hell, I am sure there are plenty of developers who wouldn't give a shit if Microsoft trained Copilot on their code for free as long as Microsoft abided by the terms of their licenses.

This is like arguing against ending forced prison labor because China will still use it anyway even if we stop. Why do anything if the looming spectre of China is enough to capitulate to their status quo?


Show me one example of Copilot output that you would like protected under a non-open source license owned by Microsoft.

Do you not understand that if a court rules that little snippets of utilitarian code are somehow considered artistic expression that it will basically be impossible to write any software that isn’t infringing on something owned by a large corporation with in-house legal?


That's already the case, hence why clean room reverse engineering exists.

It's not illegal to come up with code that is exactly the same as an existing piece of copyrighted code.

It is illegal to take copyrighted code and reproduce and distribute it against its license, however.

It's the difference between original writing and plagiarism. You might come up with a substantially similar, to exact, point that other people have made. That's not illegal. Copying those points from a book verbatim is a copyright violation, however.

Copilot is reproducing copyrighted code from its training set. That's no different than you reading the leaked Windows source code and then copying it into ReactOS or WINE. But if you came up with the same solution Windows developers happened to come up with, and it's just a coincidence, that's just fine.


> It's not illegal to come up with code that is exactly the same as an existing piece of copyrighted code.

That's not why it isn't copyright infringement to come up with code that is exactly the same as an existing piece of copyrighted code.

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

It's because the code in question is most likely to only be written in one way.

If you happen to come up with the same melody and lyrics as a pop song in a "clean room" do you think you are in the clear? We're talking about copyright! It is meant to cover artistic expression! Not utilitarian inventions.


So the things that are covered by copyright in software are the creative choices... like, the overall structure of the code, the specific classes, and interfaces, etc. Like, there's a zillion ways to organize code (certainly some better than others!) and this is where clean-room design comes into play. It is really unlikely that in a decent sized codebase that the classes, or types, or whatever are going to be substantially similar.


You're both right, but only one of you is seated in reality; the other describes the reality we ought to be in, but are not.


While I know I am right about the current interpretation of how copyright applies to software, I am also of the opinion that this is how things ought to be.

I have yet to hear a persuasive argument otherwise.

Is it because I both write and publish open source code and write and publish music that I am able to clearly delineate idea from expression?

For one, I have never been satisfied with the non-utilitarian aspects of Copilot’s output. When I am writing software the real art has always been in how code is organized. I gain absolutely no aesthetic value from autocompleting unit tests or boilerplate.

You may think that the outputs of Copilot and Stable Diffusion are artistic in nature but all I see is a rhyming dictionary.

I look at attempts to have copyright cover the utilitarian aspect of software as an attempt to claim ownership over chord progressions, scales or time signatures.


I don't follow. Aside from the ignoring of all ethical and legal concerns built up in multiple societies to this point - if Microsoft loses this type of lawsuit, why would AI not continue to be researched? Don't they just have to build in attribution into the tool and everything's legal again?


Terms of many licenses require more than just attribution, but also the inclusion of the copyright notices and licenses verbatim. MIT, BSD, MPL, GPL, LGPL, AGPL, etc all require those at minimum.

Then there is the application of the licenses themselves. GPL licensed code stipulates that its inclusion in a new work means that the new work must be licensed under the GPL, as well. AGPL and LGPL have similar terms, but with different stipulations.


Indeed, not to mention that you cannot just mix licenses. The vast majority of licenses are not compatible. It just so happens that some popular ones are, but in general the output is either GPL-like, MIT-like, or incompatible.


AFAIK Copilot suggestions are sourced from a model that is trained on many sources. Presumably when it offers a solution (an autocomplete suggestion), it is drawing on data taken from dozens, maybe hundreds of examples. How does a person even provide attribution in this case?

It's like saying, I know 'polo' comes after Marco not only because I was at your pool party where we all signed NDAs, but also because I've been to dozens of other pool parties. How do I know where to attribute this knowledge? In part that knowledge is due to the prevalence of examples and not just any single example.


I'm not in ML, can the program not know where its suggestions are drawn from?


It depends; the answer could be no if Copilot identifies solutions by looking for consensus matches.

To illustrate, Google Translate suddenly got better some time ago when it started searching literary texts for phrase matches and then cross-referencing known translations to give you a translation.

Copilot is perhaps doing something similar. Alternatively (somewhat as an analogy), Copilot could be finding exactly one sample that it suggests, in which case there could be attribution.

It is unlikely to offer any solution without a consensus. To further the example: how can you be confident in that translation? What if there is not just one literary match, but many thousands, and 99.9% agree?

Without that high match percentage, it would be difficult to know if the result is accurate; with a small set of disagreeing matches, again it would be hard to give an accurate answer. I don't know for sure, but seemingly Copilot would need some threshold of agreement across multiple matches before it suggests a solution.
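To make the consensus-threshold idea concrete, here is a toy sketch. This is entirely hypothetical — the function, threshold value, and matching scheme are all made up for illustration and say nothing about Copilot's real internals. The idea is just: given candidate snippets matched from a corpus, only surface one if enough independent sources agree on it.

```python
# Hypothetical consensus-threshold suggestion: only suggest a snippet
# when most matched sources agree on the same text.
from collections import Counter

def suggest(candidates, threshold=0.9):
    """Return the dominant snippet, or None if no strong consensus exists."""
    if not candidates:
        return None
    counts = Counter(candidates)
    snippet, hits = counts.most_common(1)[0]
    # Require e.g. 90% of matched sources to agree before suggesting.
    if hits / len(candidates) >= threshold:
        return snippet
    return None

# Near-unanimous corpus: a confident suggestion is made.
print(suggest(["i += 1"] * 99 + ["i = i + 1"]))   # prints: i += 1
# Evenly split corpus: nothing is confidently suggested.
print(suggest(["i += 1"] * 50 + ["i = i + 1"] * 50))  # prints: None
```

Note that even in this toy model, a high-consensus match cuts both ways for attribution: when thousands of sources agree, there is no single author to credit; when exactly one source matches, attribution is possible but the output is then closest to verbatim copying.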


If a court determines that their behavior is illegal, this argument sounds a lot like: "Let us break the law, because someone else is doing much worse."

If it is necessary to reform how we handle copyright and IP licensing to remain competitive, we should find ways to do that.

The law should apply equally to everyone, whether they are competing with Chinese companies or not.


Should the West dismantle copyright and let everyone use leaked proprietary software source code too? Or allow breaking the license of source-available software like Unreal Engine? Or does this only apply to open source software?

After all, in China it happens all the time.


Unironically, yes.


Microsoft can't use their own code to train the model? Researchers can't use OSS to research (but not monetize)?

This isn't progress anyway. It's a fancy autocomplete/plagiarism/stealing dressed up as AI.


Something tells me this might be welcome news to Microsoft.

I would assume this was anticipated.


There's no way that Microsoft didn't spend a lot of time derisking this with lawyers internally before release. This will be interesting to see play out but my bet is that it'll take 5+ years to get a result.

I'm not a lawyer but I think MS will win quite easily.


This must be how Oracle felt when they were trying to protect their java declarations from android's clutches. M$ is likely to win this case.


I can imagine a lot less code being open sourced purely to counter against this kind of mining.

I'd be all for it if creators were compensated.


Now I'm just imagining if you trained co-pilot on all kinds of legal libraries, what lawsuits it could come up with.


I assume that if they train the model only on GPLv2, then the generated code might be legal to be used under GPLv2 license. Similar could be done with some other open source licenses as well.


I hope Microsoft wins.

The result of a lawsuit like this will add cachet to many big companies bullying smaller ones.


I hope they lose so that the wishes and legal rights of the people who made GitHub great in the first place remain respected.

GitHub is almost like a social network in the way it relies on "the community". Stealing the code of the people who made it great by regurgitating it through an ML code-laundering machine isn't very fair imho.


I don't think anyone who wishes others not to learn from their code should upload it publicly to GitHub.


ah yes, copy 'n pasting (or the ML equivalent) is truly learning.

You can learn from code by just looking at it or working with it (i.e. transforming it into a truly derivative/creative work); there's little value in having some automated program spit out chunks of code (ripped from their original context) that you don't fully understand and without the tool never could have conceived.

Also looking at some publicly available code (to learn) still respects the license, co-pilot possibly doesn't.


You can think what you like.

But Microsoft isn't being sued to stop Copilot (which is trash, really); it's to stop further development of an AI which will be to programmers what AlphaGo was to human players.

The workforce in the industry is desperate for a way to slow AI down (every industry workforce) and they found a way to try.


Where to sign a petition against luddites?

They stole my `i` variable from a `for` loop.


#LicensesAreForLosers


Does FSF have a position on this issue already?


The FSF has funded a set of white papers on the topic: https://www.fsf.org/news/publication-of-the-fsf-funded-white... . The one by Bradley Kuhn, which is likely closest to the FSF's official position, concludes that CoPilot is problematic.


And yet copilot is less problematic than RMS.


is this new Oracle vs. Google?


More like SCO vs Linux, but in reverse.


That was just the API. This is fully functional sections of code.


i've come to think that if you open source your code, you should only use MIT, i.e. make it a gift. because once your code is just out there in the open, you can't pretend you still control it. suddenly you're dependent upon lawyers. that's icky. i mean anyone could be using this guy's code to do something against the license, and dude would never know. well, if you give a shit, maybe don't leave your code sitting in the open on a microsoft server? either set it free, or don't.


That's a really bad take. Choosing the GPL or AGPL isn't about not wanting your code out in the open; it's because we believe in fair contributions, ownership, and building projects as a community. Open Source with the GPL is a tit-for-tat system (you either make things better for everybody, or you don't), not a "get altruism for free" system.


>Open Source with GPL is a Tit for Tat system (you either make things better for everybody, or you don't)

The GPL doesn't force you to make software better for everyone. People and communities are free to steal the project, make it better, and keep it all to themselves.


> The GPL doesn't force you to make software better for everyone.

It forces you to contribute. Read what GPLv3 states:

- Include a copy of the full license text

- State all significant changes made to the original software

- Make available the original source code when you distribute any binaries based on the licensed work

- Include a copy of the original copyright notice

Watch Linus' interview with the Intel CEO where he talks about what open source is. He clearly mentions that it's a system where personal motivations end up benefitting everyone involved. When someone contributes code for their own need, collectively everyone ends up benefitting from the individual changes. It's not an altruistic system; nobody is doing this to please corpos, the way the MIT license usually does.


You ignored the

> and keep it all to themselves

bit. The GPL allows you to make modifications to software, and then not publish the code of those modifications, as long as you don't distribute the software. If you only use the software in-house, you can make unpublished modifications as much as you like. That's what "keep it all to themselves" most likely referred to.


You can distribute the software and still not publish the source publicly. The source just has to be available to whomever you distribute the binaries to. It could be the case that your community is non-technical and doesn't care about the source code. It could be the case that having the changes is "cool", and leaking them publicly wouldn't make it cool anymore and could get them kicked out or shamed.


>Make available the original source code when you distribute any binaries based on the licensed work

You only have to make it available to people you distribute the binaries to. If people with the binaries don't want it or if they want to keep it to themselves they are free to do so.

The reasons why corpos contribute back to upstream projects have nothing to do with the license. Usually it is just to shift the maintenance costs to the upstream developers.


but this doesn't really explain what makes GPL superior to MIT. in my experience, GPL is expensive, and i have seen plenty of fair contribution, ownership, and community making things better with MIT. what i was trying to say was that code published in the open will inevitably be abused by humans and machines alike. the difference between GPL and MIT, in my view, is that more permissive licenses are less at odds with this reality. with copyleft there will always be lawsuits


Those lawsuits are necessary in order to protect free software.

They have also brought us much good, as without the GPL you wouldn't have OpenWRT or LineageOS, projects which typically depend on vendor-provided kernels or drivers in order to function.


MIT is not a gift. It still requires attribution: inclusion of the copyright notice and the license terms.

This lawsuit isn't even about copyleft.


so? the gist of the mit license is, "use without restriction". without the license the default is that you have no right to the work, so the license is what makes it a gift. you're welcome


no, it very much isn't use without restriction - you must credit the author and otherwise follow the license. I can't copy and republish code from the Nvidia driver leaks because those are copyrighted by Nvidia (unless I have permission from them). In the same way, you cannot copy and republish code from an MIT project unless you have permission from them (via the MIT license).


i would suggest reading the text [0] to educate yourself

[0] https://opensource.org/licenses/MIT


I would suggest to read the text and then contribute to the discussion. We have access to search engines too.


But doesn't it say:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

So that's a required attribution if you want to follow the MIT license?



