GitHub is sued, and we may learn something about Creative Commons licensing (sspnet.org)
403 points by doener on Jan 6, 2023 | hide | past | favorite | 451 comments



Excellent. GitHub is, in my opinion, crossing a whole pile of lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built. Copyright is not something that can be signed over by a terms-of-use change at a hosting provider; the expectation is that your host does not automatically claim the rights to anything you store there.

Such projects should always be opt-in, not just because it is the law but also because it is common sense and the right thing to do from an ethical perspective.


>lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built.

Fyi... Google Books (scanned and OCR'd books) eventually won against the authors who filed copyright-infringement lawsuits. So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

https://www.google.com/search?q=google+books+%22is+transform...

But courts in Europe may judge things differently.


A number of points are in Google's favor: they are not passing off Google Books content as their own, and they limit your access to a small fraction of the offering.

The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.


> The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.

They did appeal it. SCOTUS declined to hear the case.

https://www.nytimes.com/2016/04/19/technology/google-books-c...


Yes, sorry I could have worded that more precisely.


Google also used all of this to improve their OCR algorithms, almost certainly used in Google Cloud Vision[0], but I doubt this was a consideration when deciding if it was transformative/fair use.

0: https://cloud.google.com/vision


Yet they did not build and market a service to authors that would write novels for them based on their OCR-ed catalog.


> Yet they did not build and market a service to authors that would write novels for them based on their OCR-ed catalog.

I find this to be a very appropriate analogy. If Google had done such a thing, they would be facing the same kinds of lawsuits that Microsoft is facing now. And despite Microsoft's money, I don't see how they can wiggle their way out of this one. They basically ignored the license terms and attribution requirements of the authors. Something Microsoft would never stand for, if "the shoe was on the other foot".


Maybe they will? They still have that data and the kind of people to make such a service.


I guess we'll discuss that if and when they do.


That's not really what Copilot proposes to do either.


Indeed; that would be an excellent topic for litigation, and they would fight it with every lawyer they have, because it could invalidate their efforts to zero out human labor costs in all possible areas.


As well, Google is, because of these things, somewhat acting as a library.

And libraries are very special entities.


Not really. Google won because Google Books was not actually a new concept; someone else had already built a book search engine the same way Google did, also got sued by the Authors Guild, and also prevailed. The only thing different about Google Books was that it'd give you two pages' worth of excerpts from the book. So it was very easy for a court to extend the fair-use logic that they had already woven into the law.

I still think "training is fair use" has a leg to stand on, though. But it doesn't save GitHub Copilot, because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to those outputs (i.e. sublicensing). Fair use is not transitive; if I make 100 Google Books searches to get all the pages out of a book, I don't suddenly own the book. There is no "copyright laundry" here.


> I still think "training is fair use" has a leg to stand on, though

If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

Meanwhile, I think the most reasonable solution is that an AI should always produce content compatible with the licenses of its training material. So if you want to use GPL training sets, you can only use them to create GPL-compatible code. If you use public-domain (or e.g. 0BSD?) training sets, you can produce any code, I guess.


> You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".


It would be an essential feature, imo, to have this 'near-verbatim check' for copyleft code.
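One way to sketch such a near-verbatim check is token n-gram shingling with Jaccard similarity; everything here (the whitespace tokenizer, the 0.6 threshold, the snippets) is an illustrative assumption, not Copilot's actual filter:

```python
# Sketch of a near-verbatim check: compare generated code against a
# corpus of copyleft snippets via token n-gram (shingle) overlap.
# Tokenizer and the 0.6 threshold are illustrative assumptions only.

def shingles(code: str, n: int = 5) -> set:
    """Return the set of n-token shingles found in a piece of code."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_verbatim(generated: str, corpus: list, threshold: float = 0.6) -> bool:
    """Flag output whose overlap with any corpus entry exceeds the threshold."""
    g = shingles(generated)
    return any(jaccard(g, shingles(src)) >= threshold for src in corpus)
```

A real filter would need a language-aware tokenizer and identifier normalization, but the shape of the check would be similar.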

Overall it feels like too much specialized learning on GPL/copyleft code to be fair. It's not like a human who reads some source code and gets an idea of how it works: the model really learns to code from scratch on copyleft code, without which it would likely perform much worse and fail to generate a number of its examples. It's not just copy-paste, but it sits closer on the spectrum to copy-paste than to the kind of super-abstract inspiration that would feel fair.

As others have said, I don't think it would be fine (especially from big companies' point of view) to decompile proprietary code (or just grab publicly available but illegal-to-reproduce code) and have AIs learn from it, in a way that differs in scope and ability from human research and reverse engineering.

I think we need a good tradeoff that isn't Luddism (which would reject a benefit for us all), but that still promotes and maintains open source software. In this case a real public good is being seized and commercialized, and that doesn't seem quite right: make Copilot public, or use only permitted code (or share your revenue with developers -- although that would seem more complicated and up to each copyright holder to re-license for this usage). I remember not long ago MS declaring Open Source was a kind of "cancer"; now they're relying on it to sell their programming AIs. I personally think Open Source is quite the opposite of cancer: it is usually an unmitigated social good.

Much of the same could be said for the case of artists and generative AI art.

And this isn't even starting on how we move forward as a society that has highly automated most jobs and needs to distribute the resources and wealth in a good way to enable greatest wellbeing for all beings.


You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

I think, for the desired outcome to occur, you should instead ask:

You write closed-source code, then I make an AI that learns from that code; shouldn't its output be licensed as well?

Ask the above, and suddenly Microsoft will agree.


Depends if you think the GPL means "copyright is great!" vs "let's use their biggest weapon against them..."

It's a surprisingly subtle distinction.

EDIT - if I squint hard enough in exactly the right way, there's a sense in which CoPilot etc aligns perfectly with the goals of the free software movement. A world in which you can use it as a code copyright laundry might be a world where code is actually free.

Is that any weirder than bizarre legal contortions such as the Google/Oracle "9 lines of code"? Or the whole dance around reverse engineering: "It's OK if you never saw the actual code but you're allowed to read comprehensive notes from someone who did"..?

There's a ton of examples like this. Tell me with a straight face that there's a clear moral line in either copyright or patent law as it relates to software.

IP is a mess and it's not clear who benefits. Is a world where code isn't subject to copyright so bad?


If Copilot was released as FOSS with trained model weights, I don't think the Free Software movement would have "shot first" in the resulting copyright fight.

It is specifically the idea of using copyright to eat itself that is harmed by AI training. In the world we currently live in, only source code can be trained on. If I want to train an AI on, say, the NT kernel, I have to decompile it first, and even then it's not going to be good training data because there are no comments or variable names to guide the AI. The whole point of the GPL was to force other companies not to lock down programs and withhold source code, after all.

Keep in mind too that AI is basically proprietary software's final form. Not even the creator of an AI program has anything that resembles "source code"; and a good chunk of AI safety research boils down to "here's a program you can't comprehend except through gradient descent, how do we design it to have an incentive to not do bad things".

If you like copyright licensing and just view the GPL as an exception sales vehicle, then AI is less of a threat, because it's just another thing to sell licenses for.


> You write closed-source code, then I make an AI that learns from that code; shouldn't its output be licensed as well?

> Ask the above, and suddenly Microsoft will agree.

Does Microsoft actually agree? Many people have posted leaked/stolen Microsoft code (such as Windows, MS-DOS 6) to GitHub. Microsoft doesn't seem to make a very serious effort to stop it – sometimes they DMCA repos hosting it, but others have stayed up for ages. They could easily build some system to automatically detect and takedown leaks of their own code, but they haven't. Given this reality, if they trained GitHub Copilot on all public GitHub repos, it seems likely that its training included leaked Microsoft source code. If true, that means Microsoft doesn't actually have a problem with people using the outputs of an AI trained on their own closed source code.


> If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

Is that new? If I include some excerpt from copyrighted material in my own work and it's deemed to be fair use, that doesn't limit my right to profit from the work, sell the copyright to someone else, and so on, does it?


If open source code authors (and other content creators) don't want their IP to be used in AI training data sets then they can simply change the license terms to prohibit that use. And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place. Of course Microsoft is going to look for ways to monetize that data.


> they can simply change the license terms to prohibit that use

GitHub's argument* is not that they're following the license but that the license does not apply to their use. So they would continue to ignore any provision that says they can't use the material for training.

Previously discussed: https://news.ycombinator.com/item?id=27740001

Moving off GitHub is a better step at a practical level. But again they claim the license doesn't matter, so even if it's hosted publicly elsewhere they would (presumably) maintain that they can still scoop it up. It just becomes more work, for them, to do so.

*Which is completely wrong in my opinion, for the record


And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place

No. It's called copyright, it is enforceable, and that's the control.

GPL source code is available everywhere, in all formats, in textbooks, on CDs, on websites, but it is still gpl.

And Microsoft doesn't get to scrub the license.


> But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing).

But if you read the source code of 100 different projects to learn how they worked and then someone hired you to write a program that uses this knowledge, that should be legit. I'm not sure if the law currently makes a distinction between learning vs. remixing, and if Copilot would qualify as learning.


That's not necessarily true at all. There are even techniques designed to demonstrably avoid such knowledge contamination.

https://en.m.wikipedia.org/wiki/Clean_room_design


That kind of legal ass-covering is expedient when you are going to explicitly reproduce someone else's source-available work. It's cheaper in that case to go through the whole clean-room hassle than to risk an intractable argument in court about how your code, which does exactly the same thing as someone else's code, came to resemble it so much.

But, for the general case, the argument still stands. I have looked at GPL code before. I might have even learned something from it. Is my brain infected? Am I required by law to license everything I ever make as GPL for the remainder of my days?


Yes, it will sometimes depend on the unique qualities of the code. For instance, if you learned a new sorting algorithm from a C repo, and then wrote a comparable imperative solution in OCaml, that might be a derivative work. But if you wrote a purely functional equivalent of that algorithm, I don't think that could be considered a derivative work.


Precisely, that is the key point.


And Google kept the copyright notices and attributions; probably not super relevant, but it's a difference between the two cases.

I mean, in essence GitHub is a library; they did have a license, up to a point, to do with the code as they pleased, but they then started to create a derivative work in the form of an AI, without correctly crediting the source materials.

I think they made a gamble on it; as far as I'm aware, AI training sets were as yet unchallenged in a court of law, so not fully legally defined. These lawsuits - and the ones (if any) aimed at the image generators, using CC artwork from e.g. ArtStation - will lay the legal groundwork for future AI/ML development.


Libraries are really not very special. They mostly exist on the basis of first-sale doctrine and have to subscribe to electronic services like everyone else.

Entities like the Internet Archive skate by (at least before their book lending stunt during COVID) by being non-profit and bending over backwards to respect even retrospective robots.txt instructions, meaning that it's not really worth suing them given they'll mostly do what you ask anyway.

But I guarantee you that if I set up a best comic strips of all time library I'll probably be in court.


What if a library builds AI models from the content they host? The Internet Archive, for example.


There are two important differences.

Google Books retains the bibliographical information so you can properly cite the authors or contact them for permission to use their material.

And Google Books does not automatically write new books for you that you can then send off to Penguin Books or self-publish on Amazon.


Copilot also isn't retaining the actual content of the source code repositories and then deriving works from that. If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source code, and sold that to people as a "GitHub code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work. And I'm not sure that the fact that a program can then take that associative data and generate new code suddenly makes it not OK.
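A toy version of that hypothetical "table of token frequencies": a bigram table plus a sampler that generates new sequences from it. All names and training strings below are invented for illustration:

```python
from collections import defaultdict
import random

def build_table(sources):
    """Count, for each token, how often each following token appears."""
    table = defaultdict(lambda: defaultdict(int))
    for src in sources:
        tokens = src.split()
        for a, b in zip(tokens, tokens[1:]):
            table[a][b] += 1
    return table

def generate(table, start, length=8, seed=0):
    """Sample a token sequence by repeatedly following the frequency table."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = table.get(out[-1])
        if not nxt:
            break  # dead end: token never appeared with a successor
        tokens, weights = zip(*nxt.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return " ".join(out)
```

The table itself stores only aggregate counts, not the sources; whether output sampled from it is a derivative work is exactly the open question.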


> If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source, and sold that to people as a "github code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work.

That sounds to me somewhat close to "if I take an FFT of each of those copyrighted images, glue them together, and sell this as a picture, is that a derivative work?" - I'd say yes, or perhaps even a different encoding of the original work, since you can reverse the frequency domain representation and get the original spatial representation - the original images - back.
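The reversibility point is easy to demonstrate with a tiny numpy sketch (the random array stands in for pixel data):

```python
import numpy as np

# A 2-D FFT is lossless and invertible: the frequency-domain
# representation carries exactly the information of the original image.
rng = np.random.default_rng(0)
image = rng.random((8, 8))               # stand-in for pixel data

spectrum = np.fft.fft2(image)            # "encode" into the frequency domain
recovered = np.fft.ifft2(spectrum).real  # invert back to the spatial domain

assert np.allclose(image, recovered)     # the original comes back exactly
```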


Sure, ROT13 encoding is a derivative work because the entire original work is still there, encoded. Ditto for FFT. Large language models are not that.

Sometimes parts of the original works are still encoded, which we've seen when some code is reproduced verbatim. I'm sure that happens to people as well, i.e., they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.

Once they iron out those wrinkles, it's not clear to me that a large language model is a directly reversible function of the original works. At least, not any more than a human learning from reading a bunch of code and then going on to have a career selling his skills at writing code.

Edit: by which I mean, LLMs are lossy encodings, not lossless encodings.


> Ditto for FFT. Large language models are not that.

They're not, but the "giant table of token frequencies and associative keywords" reminded me of doing FFT on images, and I wanted to communicate the idea that transformations like this can actually retain the original information, and reproduce it back through inverse transform.

> by which I mean, LLMs are lossy encodings, not lossless encodings

Exactly. And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. As you move away from lossless, it's not obvious when, or if at all, the result becomes clear of the original author's copyright. Compare e.g. with JPEG, which employs a less sophisticated lossy encoding: no matter how hard you compress a source image, the result would still likely retain the copyright of the source image's author, as provenance matters.

(IANAL, though.)


> And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. [...] Compare e.g. with JPEG

I'll just finally note that LLMs are not lossy encodings in the same sense as JPEG. LLMs are closer to human-like learning, where learning from data enables us to create entirely new expressions of the same concepts contained in that data, rather than acting as pure functions of the source data. That's why this will be interesting to see play out in the courts.


My belief is there is no fundamental difference here. That is, learning is a form of compression, and learning concepts is just a more complex way of achieving much greater (if lossy) compression. If the courts see it the same way, things will get truly interesting.


Yes learning concepts is a form of compression, but I'm not sure that implies there's no "fundamental" difference. I see it as akin to a programming language having only first-order functions vs. having higher-order functions. Higher-order functions give you more expressive power but not any more computational power.

You could say a higher order program can "just" be transformed into a first-order program via defunctionalization, but I think the expressive difference is in and of itself meaningful. I hope the courts can tease that out in the end, and we'll see if LLMs cross that line, or if we need something even more general to qualify.


> I see it as akin to a programming language having only first-order functions vs. having higher-order functions.

Interesting analogy, and I think there are a couple different "levels" of looking at it. E.g. fundamentally, they're the same thing under Turing equivalence, and in practice one can be transformed into the other - but then, I agree there is a meaningful difference for humans having to read or think in those languages. Additionally, if those are typical programming languages, you can't really have the code in the "weaker" language self-upgrade to the point the upgraded language has the same expressive power as the "stronger" one. If the "weaker" one is Lisp though, you can lift it like this.

In this sense I see traditional compression algorithms - like the ones we use for archiving, images and sound - to be like those typical weaker languages. There's a fixed set of features they exploit in their compression. But human learning vs. neural network models (or sophisticated enough non-DNN ML) is to me like Lisp vs. that stronger programming language, or even Lisp vs. a better Lisp - both can arbitrarily raise their conceptual levels as needed. But it's still fundamentally compression / programming Turing machines.


> that happens to people as well, ie. they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.

And if such an algorithm is copyrighted, that would be infringing! It doesn't matter if you copy on purpose or by chance.


You can't copyright algorithms, you can only copyright specific expressions of code.


What do you mean by "glue them together"?

If you overlap a hundred different FFTs, then the result is likely fine copyright-wise.

These networks are not [supposed to] contain much of the original data. Like the trivia point that Stable Diffusion has less than two bytes per source image, on average.


> What do you mean by "glue them together"?

Stitch them side by side. Yes, this is not how those DNNs work, but the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it.

> Stable Diffusion has less than two bytes per source image, on average.

I'm not convinced by this trivia point, though. Stable Diffusion is, effectively, a lossy compression of the training data. Nothing says lossy compression algorithms can't exploit some higher-level conceptual structures in the inputs[0], and applying lossy compression to some work doesn't automatically erase the copyrights of the original input's author.

--

[0] - SD isn't compressing arbitrary byte sequences, it's compressing images - which is a small subset of all possible byte sequences as large as the largest image used in training. "Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.


> the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it

That depends entirely on how many frequencies you're keeping.

> high-level patterns in such data

High level patterns across thousands of images are generally not copyrightable.

I might even describe the purpose of stable diffusion as extracting just the patterns and zero specifics.


>"Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.

Two bytes would only let you uniquely identify ~65k images though, which to me doesn't sound plausible for a lossy compressor.
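The back-of-the-envelope arithmetic behind both figures, using rough public numbers (~4 GB of Stable Diffusion v1 weights, ~2.3 billion LAION training images) as assumptions:

```python
# Rough arithmetic behind "less than two bytes per source image".
# Assumed figures: ~4 GB of Stable Diffusion v1 weights and ~2.3
# billion LAION training images (both approximate public numbers).
model_bytes = 4 * 10**9
training_images = 2.3 * 10**9

bytes_per_image = model_bytes / training_images
print(round(bytes_per_image, 2))  # -> 1.74, well under two bytes

# Two bytes (16 bits) can distinguish at most 2**16 distinct items,
# far fewer than billions of images - so the model cannot even be
# storing a per-image index, let alone the images themselves.
print(2**16)  # -> 65536
```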


> since you can reverse the frequency domain representation and get the original spatial representation - the original images - back

I'd have thought that's exactly what you can't do with CoPilot.


Yeah, I guess the court will be the real test of what is allowed here.


> So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

Curiously, from the article, copyright infringement is not alleged:

> As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement.

Perhaps the plaintiffs are trying to avoid exactly this prior law?


GitHub is actively driving a product, as opposed to merely duplicating content, and those products may go on to generate income.


If this lawsuit succeeds, I have a startup idea that I think would be effective.

Create a for-profit copyright registry for code snippets that are long enough to qualify for copyright protection. You can be the canonical owner of the copyright for a given piece of code! For a premium fee, we can generate and submit a patent on your behalf as well.

Once I have a large corpus (perhaps millions of entries of code, most one or two lines long), I can automatically scan new repositories and send cease-and-desist letters for violating my clients' copyright. Even if a piece of code is very common, that doesn't mean it's unoriginal; it just means that there are many people violating its copyright, after all. According to the logic of the folks in this thread, at least.


Do note that there’s the concept of https://en.wikipedia.org/wiki/Threshold_of_originality, which may be substantial for mere code snippets.

This may be one of the reasons why the lawsuit isn’t based on copyright.


This is the code of the future to ensure your code remains original: https://twitter.com/TylerGlaiel/status/1611115741809627139


Copyright grants an author of a copyrightable work the exclusive right to make more copies of it. However, if people independently come up with the same exact thing, copying has not occurred and that exclusive right was not violated (and then the court battle effectively becomes one about proving whether copying did in fact occur).

In copyright law there is no such concept as "code snippets that are long enough to qualify for copyright protection" or "canonical owners". Quite explicitly, copyright does not give a monopoly over an idea, but merely protects against the unlawful reproduction of an original work.

If you take some snippet from a work in which you own copyright and find that in the world multiple people have somehow managed to write the exact snippet, but they did it independently without copying it from you, then copyright law effectively states the following things:

1) They definitely aren't violating your copyright, and you have no claim on them whatsoever - independent creation is a complete defense to copyright infringement;

2) Perhaps this snippet might be judged uncopyrightable, as the existence of multiple independent recreations is some evidence that it lacks originality and thus would not qualify for copyright protection at all.


Length does come into play. Copyright is about creativity: if there is no creativity, there is no copyright, and a code snippet that is too short probably doesn't have enough creativity to be copyrightable. If you wrote some code and then registered it so it has a date, it will be difficult for someone who wrote the exact same code later to prove it wasn't copied.


You don't need to prove anything in court to send someone a cease and desist. It's often cheaper to settle.

There does exist the concept of 'originality' in copyright law, which I was erroneously conflating with length.


If the code snippets are so "obvious" that many people solve the problem exactly the same way, you're going to have a lot of trouble asserting a copyright or patent over them.

But your idea is pretty much what almost all manufacturers do, and have been doing for decades.


Automated code scanning working on similar principles is already in use in many large tech companies, but in reverse. Basically, to prevent shipping improperly licensed code, missing attribution notices etc.


Oh, the internet!

Just yesterday, I bought a nice domain name for an idea that's very close to what you mention, monetize on snippets of code.

If you want to team up, hit me up!


That sounds like leftpad but with more steps.


I just hope it doesn't end in Microsoft paying some (from their perspective) small fine that is just the cost of doing business.


The range of possible outcomes is enormous, I'll just wait by the sidelines but cherish the thought that moving out of GitHub when Microsoft bought it was the right decision. They can't be trusted, this has been proven over and over again and yet people keep falling for it. It's the fox guarding the chickens. I wrote about my misgivings at the time:

https://jacquesmattheij.com/what-is-wrong-with-microsoft-buy...


> The range of possible outcomes is enormous

The most likely of which – if this lawsuit ends up winning – is that corporations will have new ways to sue everyone and that the world will be a worse place.

Copyright expansion has never benefited the "little guy" such as Open Source authors, only large entities with deep pockets who can litigate to no end.


It's not so much copyright expansion as it is copyright re-affirmation. In this case it is especially open source authors whose rights are in play. Keep in mind that all of open source relies on copyright; without that, everything is PD from Microsoft's point of view, if they hosted it. Think of Copilot as a trial balloon: if they get away with it, they will likely use that as a stepping stone to the next level, and bit by bit your rights are salamied out of existence. This is the first slice; it should stop right here.


I don't see any rights being taken away from me. CoPilot doesn't copy my code, it just learns from it, just as you can. "Others can learn from this" is one reason I release stuff as open source in the first place.

People will quote that John Carmack Doom example where it copies the function verbatim, but as far as I can tell that's a rare thing, and it's a function that's been widely copied around without proper licensing; a human could also get it wrong by copying it from github.com/random-person/mit-project with the wrong license (and since then there's also been work to prevent this kind of thing).

Co-pilot isn't unique, or the first AI/ML project to use copyrighted works; all the GPT models use copyrighted works as their input. Some doubts have been raised over the legality of that too, but it's received nowhere near the amount of criticism that Co-Pilot has, certainly not on HN, and I've never seen anyone doubt the morality of it – only the legality.

If you were to go through my public open source code I'm sure you can find stuff that's very similar to some code from my previous employers or other open source projects. Not because I copy/pasted anything, but because my brain was trained on that dataset: you see or write something that works, you face a similar problem a few years later, you write a similar solution.

"Using existing works as input" is common throughout creative works. As Phil Anselmo once said: "with Pantera we took our five favourite bands and ripped 'em off to hell".

People are already getting sued because "that one melody sounds a bit similar to this other melody"; fair use is already widely ignored/disrespected. Much will depend on the exact details, but any win in this lawsuit has a very real chance of empowering that sort of nonsense.


Agreed. It's quite bemusing how so many people are now copyright maximalists because they somehow think Microsoft will be hurt in such a world.


One way to address this is to somehow turn "little guy" into a "giant". For labor, this was accomplished via unions.

A union of creative minds seems long overdue. It can provide a copyright trust, addressing your concerns, as well as removing the oft-used excuse that it is impossible to get permission from n thousand creators. It can also address matters beyond OSS, such as overreaching employment agreement clauses that assert ownership of everything in your head.

[Possibly 'trust' is more suitable than 'union'. Something like Creative Commons Trust.]


Then you consider that there are many countries in the world and the whole system breaks down. Sorry, we don't allow this copyrighted code in your country!


> moving out of GitHub when Microsoft bought it was the right decision

What do you use instead?

The top alternatives in my opinion are:

- SourceHut https://sr.ht/

- Codeberg https://codeberg.org/

- Self-hosted using Forgejo https://forgejo.org/ (fork of Gitea)

I was self-hosting my code with Gitea for a while but currently I’m using GitHub. Planning on setting up a Forgejo instance in the coming weeks.

At work we use GitLab, but personally it is one of my least favourite platforms, so I am excluding GitLab from the list above.
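For anyone weighing the self-hosted Forgejo option mentioned above, a minimal Docker Compose sketch looks something like the following. The image name, tag, ports, and volume layout are assumptions based on common setups; check the official Forgejo documentation for the currently recommended configuration.

```yaml
# docker-compose.yml -- minimal Forgejo instance (illustrative sketch)
services:
  forgejo:
    image: codeberg.org/forgejo/forgejo:9   # pin to a real release tag
    restart: unless-stopped
    environment:
      - USER_UID=1000
      - USER_GID=1000
    volumes:
      - ./forgejo-data:/data                # repos, config, and DB live here
    ports:
      - "3000:3000"                         # web UI
      - "2222:22"                           # SSH for git push/pull
```

After `docker compose up -d`, the web installer is served on port 3000 and git-over-SSH on the remapped port 2222.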


> - Self-hosted using Forgejo https://forgejo.org/ (fork of Gitlab)

Forgejo is a fork of Gitea, not Gitlab.


Sorry, that was an autocorrect mistake. I meant to write fork of Gitea. Edited it now.


Depending on how big an install you're running and exactly what feature set you need, if you're planning on self-hosting: I've been happily using Gitea for years.


I agree, I've yet to see any sufficiently large organisation successfully change their DNA. Gates hired people who thought like him, those people hired people who thought like them, and so on. The culture can and does change, but it takes an enormous amount of energy and time to change the course of such a large ship, if it's even possible at all. And I don't really see them trying - I've never even seen them apologise for the stunts they pulled with Netscape and all that. They put on a new coat of paint, hoping people forget. But I suppose they can't help themselves.


> moving out of GitHub when Microsoft bought it was the right decision

But what prevents Microsoft from harvesting open-source code from any hosting site? What have you gained?


[flagged]


This is so irrelevant it's annoying. None of those other companies sell the data after stripping the copyrights. Stop comparing them.


Why make the same comment twice in one thread?


While the fine is a cost of doing business, if they don't change behavior they can be sued again, and courts tend to impose very large fines if they discover you were already fined for this and didn't change afterwards.


The "fine" is 9 billion dollars.


Against a company which makes 6-7x that in yearly profits, that's still not an effective deterrent.


1/6th or 1/7th of profits is significant. 9 million dollars could be ignored by a company making 54,000 or 63,000 million dollars, but 9,000 million is a lot.


Exactly. That's why you're arguing for fines against people to be significantly more than 1/6th of their net pay, since harsher punishments are effective deterrents?

Parking tickets should start at 50% of your yearly take home income. Didn't feed the meter an extra quarter? $10k minimum sounds fair.


> fines against people to be significantly more than 1/6th of their net pay

Pay is not profits, it's revenue.


Look up the word net


Look up the word gross


Companies aren’t people (even if they are legally defined as such) so you can’t treat the way you fine them equally.


Stating something isn't proving it (even if you learned it as such), so you can't just state something as fact without proving it.


What?


But why stop there? What's the difference between Microsoft, Google, Meta, and OpenAI in this regard?

All of those build their models based on the same sources and it's therefore a much more general issue than just one particular company being sued.


That's a good point, but the subject of the thread is Microsoft. I'm pretty sure that Google will happily train their models on the contents of your Gmail account, I wouldn't trust Facebook with my birthdate and OpenAI is likely doing the exact same thing.

But that doesn't make it right in this case and, conveniently, someone has decided to bring suit. The funny thing is that Microsoft depends on Copyright law for their existence and now they want to change the rules to favor them when it suits them. In fact one of the first things that Bill Gates ever did that I remember is bitch about people copying the software that he wrote.


Google and Meta aren't redistributing your code claiming that it's theirs and that you can re-license freely.


Only because they arrived later at the party.


Hm, I don't know about that. I think they do plenty of stuff with other data that crosses that same line, but they have not crossed it with (explicitly copyrighted) code. Until they do, I don't think we should argue that they would have done it had they been there first; they had ample opportunity.


I mean, yeah.


If the fine is higher than what they can feasibly gain from the product in a reasonable time, it'd make them drop the product entirely even if they are able to adjust it to conform to the laws.


> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider,

I mean, it's obvious that uploading code requires you to grant the hosting provider a license to host it (which is not signing over copyright); although feel free to argue that the license doesn't or shouldn't extend to CoPilot usage.


With Creative Commons and the GPL it is fairly common for a work to include multiple authors and rights holders. When a single user uploads such a work to a hosting provider, the permission given to the provider is limited to the permission that the user had. They can't give out permissions that they themselves do not have.

It is a similar case when a single user uploads a movie or game to a pirate torrent site. The site can have terms of use that give a license to the hosting provider, but naturally the users who upload the content might not have the permission to grant anything to the hosting provider. Depending on how aware the hosting provider is or should be, hosting the content can still be illegal.


> They can't give out permissions that they themselves do not have.

Then, chances are, it's technically illegal to upload those other contributors' code, although if that code was contributed via GitHub itself then the code in the pull request has already been licensed to GH.

It boils down to copyright/DMCA not requiring that hosting providers verify that people actually hold the rights to the code they submit, so GitHub now has tons of examples where people themselves lied about the permission when they uploaded code that wasn't theirs, and this will probably be a valid legal defense, at least for the argument of "does GitHub have the right to use the source in their ML model" (it might really boil down to "are GH's terms vague enough that nobody thought the license included the ability to train artificial intelligence").


"The person who uploaded the code lied about their permissions" won't be a valid defense in a copyright lawsuit by the actual copyright owner, at least in the case where there is no other copy of that code also on GitHub that was uploaded by the copyright holder.

In the US what it will be is good evidence to support a claim by GitHub that they were an "innocent infringer"--someone who did not know they were infringing and had no reason to believe that they were.

What that does, in the case where the plaintiff seeks statutory damages (which they almost certainly will¹), is lower the lower limit. Statutory damages are normally $750 to $30,000 (amount determined by the court). If a defendant proves they are an innocent infringer, that lower limit drops to $200. If the plaintiff can prove that the infringement was "willful", the upper limit goes up to $150,000.

Statutory damages are per work infringed, not per infringement, so we aren't talking $200 or so multiplied by the number of copies GitHub distributed. We are talking of a likely award of $200 or so total (plus maybe attorney fees).

¹It is usually way too hard to determine actual monetary damages in cases like this, and actual damages are likely to be quite low anyway, so plaintiffs almost certainly will go for statutory damages.
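The per-work point lends itself to a toy calculation. The dollar figures below are the statutory ranges described above; the copy count is invented purely for illustration:

```python
# Toy arithmetic illustrating "statutory damages are per work, not per copy".
# Dollar ranges are the US statutory ones discussed above (17 U.S.C. 504(c));
# the work/copy counts below are made up for illustration only.
INNOCENT_MIN = 200               # floor if the infringer proves innocence
NORMAL_MIN, NORMAL_MAX = 750, 30_000
WILLFUL_MAX = 150_000

works_infringed = 1              # one copyrighted work in the training set
copies_distributed = 1_000_000   # irrelevant to the statutory calculation

per_copy_misreading = INNOCENT_MIN * copies_distributed  # NOT how it works
likely_award_floor = INNOCENT_MIN * works_infringed      # 200
```

The intuitive (wrong) per-copy reading would yield $200,000,000; the per-work rule yields $200.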


"someone who did not know they were infringing and had no reason to believe that they were."

Can this be said by Microsoft? They explicitly chose not to include private repositories belonging to their paid customers, likely because they knew that those customers would sue them if proprietary code was used as training data.

Apple seems to have chosen not to allow GPL software in the App Store for very similar reasons. Their terms of service require a permission which is incompatible with the terms of the GPL, and knowing that GPL software tends to include multiple rights owners, Apple chose the route of not allowing it.

And last, authors have requested to have their works removed from the training data. It is part of the lawsuit. Can Microsoft then still claim that they did not know they were infringing?


The comment I was responding to was about the case where person X uploads code to GitHub, and that code contains code from person Y whose license to X does not give X permission to grant GitHub the rights that GitHub requires from the uploader, and so GitHub's use of Y's code is without copyright permission.

I believe GitHub would likely be seen as an innocent infringer in that case.


Would that still be the case if Microsoft knew that such infringement was likely to occur? Microsoft has been in the software industry for 50 years, has, like Apple, an app store, and has distributed software from millions of different rights owners. Can they in good faith argue that they had no idea that software often has multiple rights owners, and thus that a single person who uploads software to GitHub is unlikely to have sole copyright ownership?

I doubt Microsoft would make that argument. It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, they seem to show that they themselves have doubts about the legal status of using other people's copyrighted work for Copilot.


> It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, it seems to show that they themselves have doubt about the legal status of using other peoples copyrighted work for copilot.

Or they're worried about leaking secrets, which is a different matter entirely. The amount of copying needed to leak secrets is far lower than the amount needed to commit copyright infringement.

If Copilot is trained on Microsoft's code and accidentally regurgitates a comment, "// for 2024 Xbox", it has done one but not the other.


When Copilot was released there were people who got it to print out accounts and passwords that had been put into the training data. Microsoft should at minimum have sanitized the training data so it would not include such information. There is also likely personal information stored in some of those open repositories.

Copyright infringement doesn't have a fixed size. It depends on context and what kind of information is copied. This demonstrates that Copilot has not actually learned how to code (as many people like to claim), but is simply an algorithm for copying code. If it had learned to code like a human it wouldn't divulge secrets.
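The sanitization step the comment argues for is a well-understood preprocessing pass: scan source files for obvious secrets before they enter a training corpus. A minimal sketch (the regex patterns here are illustrative only; real scanners such as gitleaks or trufflehog use far larger rule sets):

```python
# Sketch of pre-training sanitization: redact lines that look like secrets.
# Patterns are illustrative, not exhaustive.
import re

SECRET_PATTERNS = [
    # assignments like: password = "hunter2", api_key: 'abc123'
    re.compile(r'(?i)(password|passwd|secret|api[_-]?key)\s*[:=]\s*["\'][^"\']+["\']'),
    # strings shaped like AWS access key IDs
    re.compile(r'AKIA[0-9A-Z]{16}'),
    # PEM private key headers
    re.compile(r'-----BEGIN (?:RSA |EC )?PRIVATE KEY-----'),
]

def scrub(source: str) -> str:
    """Replace any line matching a secret pattern with a redaction marker."""
    out = []
    for line in source.splitlines():
        if any(p.search(line) for p in SECRET_PATTERNS):
            out.append("# [REDACTED BEFORE TRAINING]")
        else:
            out.append(line)
    return "\n".join(out)
```

For example, `scrub('password = "hunter2"')` yields the redaction marker while ordinary code passes through unchanged.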


You're right, but GitHub's TOS doesn't (or at least shouldn't) change the conditions of the original license. You're giving GitHub a copy of the source code, not the ability to dictate your license for you. There's certainly a lot of legal ambiguity in the copyright sense, but one thing seems clear: Microsoft trained Copilot on code they weren't certain they could use.


Technically the GitHub TOS is in itself a license; much like how you can dual-license code, uploading to GitHub is its own license grant, separate from the license you're granting to anyone else who wants to use the code for their own purposes. LICENSE.txt/md is not the only way to grant access to code you write.


GitHub's TOS is subject to change so that doesn't hold water. Tomorrow they could claim in their TOS you owe them your firstborn if you upload code to GitHub but that doesn't mean that you are bound by those terms because they cross the reasonable expectation of what you are signing up for. Granting Microsoft a blanket license to use your code in any way they see fit was not a part of the deal for GitHub, and as far as I know it still isn't for code that you claim copyright on. If you release your code into the public domain or use a license that is so permissive that anybody can use it at will, even without attribution that would make it fair game.


Well, that's why major TOS changes are accompanied by the option to discontinue using that service. Usually they say that continuing to use the service after a certain date constitutes your agreement to the new terms.

I think we're over here in our armchairs weirdly assuming that GitHub doesn't have any lawyers working for them. I think they know they're legally in the clear on CoPilot.

I'm not at all a lawyer, but in my opinion we observe that the non-automated version of AI-generated works (the act of making art and prose in the style of an existing copyright work based on the artist's observation of that work) is not illegal. The only thing that AI introduces is automation.


I see Copilot as a trial balloon. If they get away with it, you can expect the next move: appropriating the body of open source that is GitHub. Why the archenemy of open source should suddenly be trusted to play nice is something I really can't grasp.


It's not sudden - they've owned it for a while now.

What I can't understand is people feel locked into Github because of the social features. To me they seem the least important part of Github, particularly with so many OSS projects running communities on Discord or Slack.


> The only thing that AI introduces is automation.

Hmm...Automated Inference? Automatic Infringement? Maybe we can make a nice backronym out of this.


GitHub is still subject to the terms of your license though; they can impose whatever rules they want on you service-wise, but their use of your software should be dictated by the accompanying LICENSE file.

To illustrate: GitHub could delete any project they want, and there would be no real recourse for the project's author. That is a service decision that they reserve the right to impose via their TOS. However, if they were to steal code from a user's private repository and violate the license therein, the author could sue for theft of intellectual property.


> GitHub is still subject to the terms of your license though; they can impose whatever rules they want on you service-wise, but their use of your software should be dictated by the accompanying LICENSE file.

Again, the LICENSE file in the repo is not the only license for that code. A copyright holder can grant people licenses to their work with or without documentation and with or without that license being accompanied within their work itself.

By uploading code to GitHub, you are asserting that you can legally grant GitHub a license to that code for hosting as described below.

> If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post.

Note that this is literally only limited to the provisions set below; uploading to GH doesn't allow them to import or use your code in Windows or the Github codebase or anything like that, doing so would indeed be bound by the license terms you've granted the world via the repo's LICENSE file.

> 4. License Grant to Us We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


The TOS do say users grant GitHub a license to host, copy, and distribute works as comes up while they're providing the GitHub service. So saving copies on servers, making backups, and "distributing" it via their website.

IANAL, but I think Copilot is not a reasonable thing to include in these services.


I’m skeptical of this interpretation since it seems to imply that you could upload copyrighted code and now GitHub has a license to do whatever they want with the code, which is obviously not true. An example would be someone uploading Microsoft Windows source code illegally, and GitHub can’t just use it because it was uploaded to their service. I would argue that this then extends to CoPilot, in that just because they have a license to host it, they don’t have a license to do whatever they want with it.


The license is quite limited, but does include "improving the service over time" which might be their key to CoPilot being okayed by their legal team, at least originally:

> 4. License Grant to Us We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


If that is in fact how CoPilot got the green light from their legal team, and what the case will eventually hinge upon, I really wonder if the argument that CoPilot is part of the "service" will hold up. I can imagine a judge or jury not being convinced here, because the majority of the paragraph is clearly about the general use of the service (parsing it into a search index so you can search in your repo; making backups so that service isn't disrupted in the case of some server failure; sharing it with others so that others can access the content you uploaded). In other words, if this is what they argue makes CoPilot okay, I can imagine the plaintiff's lawyers successfully arguing against it: CoPilot isn't really part of the normal service the way e.g. the repos are, and the claim that it falls under the "otherwise analyze" clause is flimsy, since it's not clear how "analyze" is defined and it's arguable that feeding code into an AI training model is not the same as, or similar to, indexing it for search.

I suspect that the main argument will hinge not on the permission, though, but rather on whether the use of copyrighted code in an AI model is transformative enough to fall under fair use. Obviously it's to be decided, but I would imagine that because it wasn't a human transforming the code and/or hand-selecting the code to put into the AI model, it won't be considered transformative and therefore the use of the code doesn't fall under fair use.

I'm very curious how this case will play out.


That requires the uploader to own copyright on that code. What if the uploader only has access to the code through the license?


Luckily by agreeing to the TOS you've indemnified Github against the consequences of that scenario.


I don’t think it works that way.


Copyright is pretty complex legal material but the one thing that stands out for me is that you receive it upon creation and it requires a positive act on your part to relinquish it.


There's also code that GitHub themselves uploaded, which they were permitted to do under the open source licenses, under the `mirrors` user. I know some of these repos have since been moved as the authors became active on GitHub (e.g. mirrors/linux is now torvalds/linux, indicating Linus has control of it even if it's a read-only mirror), but I'm sure there's a few of them remaining.


The legal argument made in this court case is that there is a substantial difference: redistributing the source as-is (keeping all the attached copyright, attribution and license notifications) is explicitly permitted by every open source license, but in the CoPilot usage the attribution gets removed, which is something even the repository owner (assuming they're not the sole author/copyright holder) does not have the right to do themselves, much less grant permission to others to do.


It doesn't actually. You can upload anything to github. The website doesn't stop you at all, even if you don't have the ability to grant anything to Github at all.


That's not obvious, because you don't necessarily own the code you're uploading. I can upload any sort of MIT-licensed, BSD-licensed, Apache-licensed, Creative-Commons-licensed, or GNU-copylefted works I want, anywhere within reason and compatible with those licenses, but if I didn't write them then I don't have the legal right to relicense, grant exclusive or restricted license to any specified parties.

So in a way this would void parts of many TOS agreements where you do relicense your User-Generated Content. If we're uploading memes to Facebook, they're gonna have to work out license terms with the copyright holders, not the uploaders.


Uploading someone else's code without permissions is, in itself, copyright infringement. Just like you can't take someone else's code and license it to GitHub without the copyright holder's permission, you can't take images off of someone's website and sell/license them to Getty Images for profit.


I'm not sure that's the case if the repo is private and the code is not shared/disseminated to others. If you buy a book, you are not allowed to distribute copies of it, but you can loan it or sell your copy, or store it wherever you like. Copilot does allegedly distribute copies of code that it doesn't have the copyright to.


> Uploading someone else's code without permissions is, in itself, copyright infringement

Suppose person A committed a crime; that does not mean you are now allowed to profit from someone else's crime.


But imagine Getty Images sells the stolen photo 10,000 times. They had no idea it was illegally stolen and fraudulently passed off as the fraudster's own work. If they get sued for infringement, they can just sue the actual fraudster for damages.

Same will be for GitHub: if people really didn't have the legal authority to bind someone else's code to GitHub's TOS, then GitHub can go after the $x million of users that have uploaded code they shouldn't have.


>GitHub can go after the $x million of users that have uploaded code they shouldn't have.

Ok, but can they go after them in an efficient manner that doesn't end up costing more than it's worth?


> They had no idea it was illegally stolen and fraudulently passed off

That doesn't mean Getty can keep the money.


Not "excellent" at all. This Richard Roe plaintiff is trying to use the notorious DMCA as an end-run around having to prove copyright infringement and withstand a possible fair use defense. That shouldn't be allowed, as a matter of Constitutionally-relevant protections.


But hang on, there would only be a DMCA violation if copyright infringement had in fact occurred, right? So fair use would be a perfectly legitimate defence, causing the DMCA not to apply.

Look, proving copyright infringement is downright trivial here, if copyright law applies. And that shows where GitHub’s defence will—must—lie.

(And for other readers unfamiliar with the parent comment’s phrasing: “end-run” is apparently an American sporting term which here makes “as an end-run around” mean “to circumvent” or “to work around”.)


That, unfortunately, is not how the DMCA works.

The DMCA was designed to catch not just normal pirates who share copied content, but also "crackers" who figure out how to share content that is protected somehow; as such, it offers various ways to violate it without infringing the copyright itself.

Basically they are accusing MS of behaving like crackers, by removing stuff from code to allow it to be shared illegally.


You say that like Github Copilot could have been trained in a different way. There's just too much utility and progress in Github Copilot to let this lawsuit win.


Why should Microsoft's ability to create a new revenue stream be more important than anyone else's ability to enforce their licensing terms?


You’re saying this like users of github copilot are out of the eq


The mafia used to steal dresses from New York garment factory delivery trucks and hawk them door to door in poorer neighbourhoods. The users (poorer households) definitely got value out of this by being able to obtain dresses they could not afford, but that doesn't make what the mafia was doing right.


I feel pretty strongly that getting rid of Copilot will slow down progress at a massive scale. Not only by affecting users who are getting a huge benefit from it, but by setting a precedent for how you can train AI.


So ignoring licenses is perfectly fine as long as you do it:

a) at scale

and

b) make some other population happy

In other words, piracy should be perfectly fine, too, right? After all there are a huge number of users who benefit from it?


I’d rephrase this as “open source is open source”


That’s really up to everyone whose license may have been violated to determine. Personally I get zero utility from copilot (although, I only have a little code on GitHub so it isn’t a pressing issue).

Signing up for dispersed litigation like this seems like a pretty ballsy move by GitHub, but hey, Microsoft presumably has in-house lawyers with lots of spare time.


That's not how the law works.


The law around copyrights is from a different era


Could have been trained with the code owners' consent, or at least warning.


> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider

Agreeing to GitHub's terms doesn't try to assign copyright over your code; it grants them a licence to use your code however they see fit, which is¹ legally quite different.

Of course the real fun comes if someone agrees to their terms then uploads some of my code, which they have no right to license to GitHub. What come-back do I get in that case if I don't want my stuff used that way?

It seems odd to me that MS² who for many years strongly spoke against touching anything with the remotest whiff of GPL because of what it could legally do to your release requirements, are now more than happy to hoover up all the GPL covered code in GitHub and potentially mix it into their users' work output via copilot.

----

[1] in my not-at-all-legally-trained understanding

[2] current owners of GitHub, for those not paying attention


> Agreeing to GitHub's terms doesn't try to assign copyright over your code, it grabs licence to use your code however they see fit which is¹ legally quite different.

I disagree, IANAL, and I'm happy they are getting sued. They are foremost a code hosting/collaboration company, and the terms of service we all agreed to when creating our accounts were to have them host our code and use it however they need in order to provide the service. The fact that they changed the service provided post-agreement (from mere hosting/collaboration to feeding it into Copilot) should have been opt-in. I hope what "the service" is gets tested in court, because if you have a feature that (let's say) 1% of your users use, that's not the service, is it?


The service is displaying code... also I'm unaware of any TOS/EULA that cannot be amended or changed post agreement.


Not really about disallowing amendments, but at least sending out a notice of the changing terms. Like you get with your privacy policy.

I'm pretty sure I didn't receive one about them using my public (although unpopular) open source code into their NN mixer.

Edit: Anyway, a bit outside the point. The point being: when your ever-expanding set of services incorporates your work in ways unforeseen when the agreement was made, opt-in would have been the agreeable approach in my opinion. Even ignoring the licensing woes, as that's something to be tested in the courts with this lawsuit, and interesting to follow.


They didn't change the terms, you can't expect privacy when you're out in public. I'm also curious how you're certain your project was used?


> you can't expect privacy when you're out in public

This isn't about privacy, it is about licensing (and possibly copyright). mhitza mentioned privacy as another policy, that you agree to upon sign-up like the terms of service, one for which updates are regularly announced.

> I'm also curious how you're certain your project was used?

Hasn't it been suggested that all public repositories at least could have been used? It makes sense to give the training pool as much information as possible.


> This isn't about privacy, it is about licensing (and possibly copyright). mhitza mentioned privacy as another policy, that you agree to upon sign-up like the terms of service, one for which updates are regularly announced.

The terms of service say you grant GitHub an implicit license to display your code. They also say:

"We may modify this agreement, but we will give you 30 days' notice of material changes."

Are you claiming that hasn't happened?

> Hasn't it been suggested that all public repositories at least could have been used? It makes sense to give the training pool as much information as possible.

Has it? I don't like to make assumptions.


> > "We may modify this agreement, but we will give you 30 days' notice of material changes."

> Are you claiming that hasn't happened?

Your post that I replied to explicitly stated that it hasn't.

Is that the case, or was that one of the assumptions you don't like to make?


it grabs licence to use your code however they see fit

Not your code. Anyone's code that's uploaded to github by any third party. Under open source licenses, that's expressly permitted. However, it seems you're arguing that Github is not bound by the license under which they (and their users) acquired the code because of their TOS.

How many projects on github are put there by the original copyright holders? Perhaps it's more than 50%, but it certainly is less than 100%. So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it, or that they received permission from third-party rights holders that did not agree to their TOS?


> So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it

Under other circumstances they don't need it. But if CoPilot is creating a derivative work including parts of that code without including the licence terms or attribution (as required by many licences), things are far more grey, or possibly fully black.

Some argue that the AI is unaware of the terms so can't be held responsible. Two possible counters to that: 1. it is the licence that gives you the right to use the copyrighted code; if you are unaware of the licence, why assume you have the right to use the code? 2. if I found some useful code that happened, unbeknownst to me, to be from MS, and used it in a way that I wasn't licensed to, and MS noticed, it is a pretty safe bet that they'd state that ignorance of the copyright terms doesn't mean you can't be held to them.

Or another angle: the tool is allowing, even encouraging, people to use code or other materials in a way that infringes copyright (again: you don't have the right to use the code under most licences unless you give correct attribution and such) – the very conditions often stated as reasons for trying to ban other tools.

Plus of course the general argument: if this is entirely a non-issue, why is no Windows, Office, or SQL Server code in the training set? Surely they are great examples of how to do things to train the AI with?


This is a good point. If I vendor my dependencies and upload to Github, does that give them a licence to use that code however they wish?

It's interesting, because Github certainly have the right to set whatever terms they like on their website. The dependency authors certainly have the right to set the licence terms on their code, and that gives me the right to vendor their work and include it in my upload to Github. But I, obviously, don't have the right to agree to Github's terms on behalf of the dependency authors.

I think the problem here is that Github assumes that everything I upload is my property, and that I have the ability to assign a licence to what I upload. This is not true for any project that vendors its dependencies.


You would need to agree for every copyrighted work that you intend to allow them to use it for different purposes. Since GitHub can be completely invisible from the point of a contributor I highly doubt that clears the bar for such an invasive and irrevocable act.


It seems a lot of this stems from how the DMCA does not require that these hosting providers actually check for code ownership at submission, or maybe just how they don't have an explicit checkbox for "I affirm that I can license this code to GitHub" every time someone is uploading code.

If it is shown that the license in the TOS is valid, the legal question might boil down to "is the TOS License broad enough to where nobody thought that it allowed their code to be used in for-profit ML models?"


> how the DMCA does not require that these hosting providers actually check for code ownership at submission

How does one check who owns a work if the work does not include the authorship information? (or if the work has been altered to have incorrect authorship information)


That's the tough part, and the technological infeasibility is probably why the DMCA has no such provision. Even for video, only YouTube has come up with a system that is "mostly right" for "most content" they have to deal with, that being Content ID, and it takes a lot of horsepower and a lot of money to run (and it requires that every rights holder upload their content to the service for scanning; quite a feat).


A checkbox would make no difference in the case where the uploader has not been granted the right to agree to such terms with respect to the code in question, leaving the matter in the same situation.


I'd like to propose a "golden rule" test. If Copilot was truly not at risk of regurgitating large blocks of code verbatim, why didn't Microsoft train it on proprietary Microsoft code as well? Why was it limited to user-submitted code on GitHub? If there is any argument pointing to licensing or copyright or patents, it stands to reason those concerns would apply to any corpus of user-submitted code since users could easily misrepresent the licensing.


Precisely. I made this exact point a while ago, how come Microsoft didn't submit the source code to Windows as part of the training data, that is at least code that they can plausibly claim they have the rights to.


There's a good point made over here: https://news.ycombinator.com/item?id=34282407 that it might accidentally spit out some kind of secret that is much smaller than a copyrightable piece of code. That's not a risk for code in public repositories.


Perhaps you want to check the comment section here: https://lwn.net/Articles/914150/

User bluca works at Microsoft, but I think their opinions are their own.


How did you type the superscript footnotes?

Edit: Wow, this is game changing. Markdown parsers need to implement superscript ascii character support!

Lowercase ⁽ᵃ⁾ Uppercase ⁽ᴬ⁾ Numbers ⁽⁹⁹⁾


They are available in most fonts with reasonable-or-better Unicode coverage (https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc...). 1, 2 and 3 are available in ISO-8859-1 so can sometimes be used in 8-bit-only text, but I'd use them with care in that context.

To type them easily you'll usually need composition (sometimes called chording) support. Some Linux (and other Unix) distributions still have this built in by default, though last time I used Linux for much desktop use it seemed to be fading from common availability; otherwise you'll have to hunt for another method. On Windows I use http://wincompose.info/ (here [altgr][^][1] produces “¹”, for instance, in the default settings) which is useful for a number of other things (I first started using it for accented characters like á on a UK keyboard). If you have a keyboard with programmable function keys then you could use its customisation tool to map some of them to produce the super-script (or sub-script, or other) characters you commonly want.

For less convenient typing, use your OS's Character Map or similar tool.

On Android, unless you have a different keyboard in use which doesn't support this of course, long press on the number on the touch keyboard gives superscripts as an option.
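If you'd rather generate these programmatically than type them, the digit mapping is small enough to inline. A minimal Python sketch (my own illustration, not from any of the tools mentioned above):

```python
# Map ASCII digits to their Unicode superscript forms. Note the code
# points are not contiguous: ¹, ², and ³ live in the Latin-1 block,
# while the rest are in U+2070..U+2079.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

def superscript(text: str) -> str:
    """Replace every ASCII digit in `text` with its superscript form."""
    return text.translate(SUPERSCRIPT_DIGITS)

print(superscript("x2"))    # → x²
print(superscript("[99]"))  # → [⁹⁹]
```

The same `str.maketrans` trick works for subscripts (U+2080..U+2089) if you need those too.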


> On Android, unless you have a different keyboard in use which doesn't support this of course, long press on the number on the touch keyboard gives superscripts as an option

chucks iPad out the window

We only get the standard shift character as an option. E.g. 1 shows !, 2 shows @, etc.

I’d use superscripts all the time if it was on the keyboard. Anyone know if MacOS can do it? Other than pressing the weird globe key and searching.


Easiest way I can think of is that you can use text substitution (System Settings -> Keyboard -> Text Replacements...) to set a string of characters that should substitute to the superscript characters (you can look it up with the globe key to set it up initially). So you could make e.g. [1] map to ¹, [2] map to ², etc. You could also use this trick on i(Pad)OS (Settings -> General -> Keyboard -> Text Substitutions). In fact, the substitutions should sync across if you're logged into the same Apple ID.


The “UniChar” app, among others, can be used as an alternative keyboard “language” to access Unicode symbols with an interface similar to the emoji picker. Includes “favorites” and “recently used” sections.

https://apps.apple.com/us/app/unichar-unicode-keyboard/id880...


> implement superscript ascii character support

They are _not_ in ASCII. A few are available in some 8-bit code-pages that expand on ASCII's 7-bit character set, otherwise you need to be working in a Unicode-supporting environment (which is most these days, thankfully).


I interpreted the phrase as superscript variants of characters that are in ASCII.


There are Unicode characters for superscripted digits; you can probably find them in some kind of character-map application, depending on the platform you're using.


[flagged]


If you want people to actually read your site, you might want to not set an unreadably small font size. In fact, you might not want to set a font size at all, since you are extremely unlikely to know more than the reader does about what font size works for them. Browsers have default font size settings for a reason.


> If you want people to actually read your site

The slowest day in the past 2 weeks got over 600 readers, with the peaks many times that.

> unreadably small font size

The font size is actually larger than the NYTimes used for 100 years (https://www.amazon.com/York-Times-Complete-Front-Pages/dp/07...). I'm pretty sure the smartest publishers in the world knew what they were doing.

So on both points, your data is wrong.

You are the exception, who prefers a larger font size. Nothing wrong with that. It's one key press: cmd-+. You can even set that as your default.


I'm sorry but Copyright is very much the law of the land no matter how often you post your links.


Physical slavery was once the law of the land too. I like to think I would have been on the right side of history at that time, as well.


You are simply not making much sense, and to compare physical slavery with copyright is ridiculous.


> and to compare physical slavery with copyright is ridiculous.

On the contrary, you cannot mathematically distinguish (c)opywrong laws as anything but a kind of slavery.

Define a person A as a slave to person B if person B has legal control over person A at all times.

Now imagine person A is hanging out with Person C. With (c)opywrong laws, Person B has legal control over a subset of person A's behavior in this scenario (they are forbidden from sharing certain files with Person C by Person B). Hence, Person A is a partial slave to Person B.

It is not a metaphor; it is literally a subset of the same thing. Intellectual slavery is just slavery from many masters.


It made perfect sense. Laws change all the time... in some cases they're invalidated by the court.


> Laws change all the time

That still doesn't make breck right.

If I say "slave ownership can't be signed over by a ToS change" that's a true statement, despite me being anti-slavery.

Copyright could be many things. But that doesn't change what copyright is right now. And even if they are offensive, they don't "void logic".


I'm still baffled as to why people treat Github like a public library despite being owned by what was at one time the greatest enemy of free and open source software in existence. Not saying they haven't changed their tune somewhat, but a library owned by Barnes and Noble is going to have very different incentives than an actual library.

Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month and that's probably overpaying.


>Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month

Ah, but GitHub isn't selling git hosting. It's a social network that also does git. That's the cake that Microsoft bought, not the UI and API frosting.


Hmmm I wonder if there is any work on federated github alternatives. Seems like a much more consequential network effect to tackle than social media tbh


check out https://forgefed.org/

> ForgeFed is an upcoming federation protocol for enabling interoperability between version control services. It’s built as an extension to the ActivityPub protocol, allowing users of any ForgeFed-compliant service to interact with the repositories hosted on other instances.


And Dropbox can also be hosted yourself with a oneliner rsync-script…

The value GitHub provides is far from unique in any way, but let's not pretend it's trivial. Especially for an open source project already struggling to get contributors to their main code base, even more so for any ops work.


That's because there's no feedback loop of what you're saying, in the active lives of the people interfacing with GitHub. Consider an example of ingesting poison. If the poison tastes bad, I'll be sure to spit it out immediately, either by involuntary disgust, or because I associate with that negative feeling of being poisoned, something I don't want, so I react. But what if the poison tastes good? And what if it not only tastes good, it actually rewards me for ingesting it, in some way? People tell me it's poison, it might not say so on the label, and many are also ingesting it. Is it even believable that it's poison, given that I don't experience the negatives at all?


To take the analogy further, it also (arguably) wasn't poisonous for many years. The poison would have been added in June of 2018.


your analogy is spot on and aligns with humanity very well.

Drugs, alcohol and sugar all fit this description very neatly.


I saw this a few months ago.

  local> ssh user@example.com
  user@example> git init --bare $DIR
  user@example> exit
  local> git clone user@example.com:$DIR
I've seen VPS services for as low as $4 a month.

I'm with you in camp baffled.


And Dropbox is just rsync with a bit of cute UI, basically worthless. These comments are peak examples of how disconnected some Hacker News users are from real life.


Nah. I'll still charge that it's laziness if this is your job. I get that this kind of practice is common, but I still find it lazy.

I compare it to using gmail as your professional email. It's a bad idea; even if it never bites you, because the cost is so little and the harm is so great if it ever screws up.

Except I can see why average joe user might not think about it, and I don't think developers have that excuse. You should know and understand that tech is risky enough. Same reason there was no excuse for the kik zero padding debacle. The fact that a LOT of people did a lazy dumb thing doesn't make it not dumb or lazy.


That's not a fair comparison. The argument is more like a version of dropbox that's only used by programmers and owned by a company programmers should hate. And the rsync is capable of doing full multi-directional sync. In that scenario the hypothetical dropbox loses a ton of value.


What about backups? Managing access to the repository? Making the repository easy to discover? Can you browse the code in a browser, or read the README without cloning?

Of course you could do all of these things with enough work. But why would the average developer want to? Do you really think most developers care so much about Microsoft owning GitHub?


Our nonprofit host provider has a git repo option on their control panel.

They use this for access:

https://gitlist.org/

(somewhat ironically, code hosted on GitHub. Not sure it's still being updated.)

We currently don't use it, just because it's on the same network as our site, though because it's git and it replicates the repo everywhere, it's less of an issue.


> But why would the average developer want to?

As for myself, I don’t like GitHub UI, which is slow, has low information density, and is generally lollipop-like; and its social media aspect, which often turns the bug tracker into a Twitter equivalent with viral bugs, emojis to cheer and boo people.

So I set up a cgit instance for myself. Patches can be sent over mail as attachments generated with git format-patch (attaching files to mails isn’t hard). Issues can similarly be sent over mail and described in a BUGS file.

That’s sufficient for my single person projects. Probably also for n-person projects for small values of n (let’s say 7). Past that I’d set up a Gerrit instance, which IMO has the best UX of all code review tools available, free or not; and a Redmine for ticket tracking.
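For what it's worth, the mail-based flow described above is just two stock git commands on each side (patch filenames are whatever git derives from the commit subjects):

```shell
# Contributor: export the last two commits as mailable patch files
# (writes 0001-<subject>.patch, 0002-<subject>.patch)
git format-patch -2 HEAD

# Maintainer: apply the received patches, preserving the original
# authorship and commit messages
git am 0001-*.patch 0002-*.patch
```

`git format-patch` can also send the patches directly via `git send-email` if your mail setup allows it, but attaching the files manually works just as well.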


Honestly -- if this is too much for the average developer then we have WAY too many underskilled, and perhaps useless, developers.


I would consider myself pretty knowledgeable and I love going on tangents when setting up projects. I love tinkering and learning.

There is no situation in which I'd want to do all of the above work for every single repository I setup.

I have hundreds of repositories that I own on GitHub for things like school assignments and personal projects. If I start a weekend project as you describe then the first 4 hours are going to be setting my repository up.

Nobody wants to do this. Just because something is difficult or time consuming doesn't mean that it is good or useful. Doing this once would be a fun learning experience. Doing it more than once is a useless chore.


> There is no situation in which I'd want to do all of the above work for every single repository I setup.

What the heck are you talking about?

The only work you'd have to do "for every single repository" is the single git command.

Getting a server, managing access and discoverability, setting it up for browser access, setting up backups, those are things you would do once.

If that's a waste of four hours, valid argument, but it's not a waste of four hundred hours. You're grossly exaggerating the cost.

Oh and you didn't mention keeping the server updated but that's probably ten minutes of effort once a month.

> Doing this once would be a fun learning experience. Doing it more than once is a useless chore.

Then there's no problem.


Heh - a former employer of mine used something similar to this to sync code between laptops and development VMs :) We had a script that made a temporary commit, and pushed to a git repository in the way you describe, then reverted that commit.


Eh, I have no love for Github but this is huge bikeshedding. This can apply to so many pieces of a software project that at some point you just aren't even working on a project. Every tool has tradeoffs and based on use github and gitlab are the kind of tradeoffs developers are willing to make


I'd disagree. I think it's a "Black Swan" esque problem. Software developers shouldn't use Github for their bread and butter, in the same way that I argue real businesses should pay for email and not use Gmail.

Sure, it might work fine forever, but when it doesn't, you're really screwed and you could have avoided that in a relatively simple way. Reminds me of seatbelts and fire extinguishers.


It's been a bit over four years since the acquisition and Microsoft hasn't screwed it up yet.

It's good to have a backup plan, but it's convenient to keep using it for now.


GitHub built goodwill over the years. There were many controversies, but there were also many die-hard fans. That didn't evaporate overnight. Microsoft bought GitHub (and minted 3 billionaires in the process) specifically to acquire that goodwill and monetize it.


It's not goodwill, it's features and comfort. GitHub has the UI that almost every developer is used to, easy-to-use CI/CD, great issue and pull request handling. And more importantly, everything is free.

Even ignoring the value and the features, employers don't ask for your git link, they ask for your GitHub account. And since most projects are on GitHub, having all of your projects there too makes it easier to see all of your commits, making your profile look more active.

That's without mentioning the ease of discovery and issue reporting since everyone has an account.


Agreed. And I think the road to technology hell is paved with convenience and "free."


> acquire that goodwill and monetize it

Embrace

Extend <-- here

Extinguish


Uh, what's the extend that has happened since microsoft bought github?

Do you mean copilot? I would not classify that as extending anything. It's just a thing they made.

"Monetizing goodwill" is not an extend.


Of course that's "extending." It's a very clever attack on open source generally.


But it's not open source specific. Not in creation and definitely not in use.

And the idea of the extend of EEE applying to all of open source at once, the way you could apply to a product or a standard or a protocol, doesn't really make sense.


You're thinking way too narrowly. EEE doesn't require a specific plan -- if you look at the history of Microsoft engaging in EEE, you can see it's been much more experimental than you're suggesting.

It's basically Microsoft saying "Here is a thing that looks like it might in some way eventually be a threat to our business model and/or we can make some money off of it -- let's get our hooks in now and see what happens, we have the money to do it."


But again, copilot does not get their hooks into open source any more than it gets their hooks into code in general. And they're not doing EEE against "code".

EEE doesn't need a specific plan but it does need some kind of target standard. Getting into an emerging market, without specific kinds of integration or malfeasance, is just competition.


> Microsoft <heart> Open Source


That's not what EEE refers to. "Embrace" does not mean "buy".


It can, I think. The EEE concept is useful enough to go slightly outside its original intended use.


So the idea here is

1. Acquire company for tens of billions.

2. Intentionally ruin own investment.

3. ???

4. PROFIT!

?


Well.

1. Acquire company for $7.5bn in stock, so fortunes are joined rather than cash paid

2. Still charge money for it

3. Create an AI product out of the open source bit of it, justifying the price tag, as well as build links to Microsoft dev tools to entice OSS back into the Microsoft ecosystem

4. Maybe profit, but almost certainly not loss


There are a lot of less-than-optimal possible outcomes between "create the best possible free open source library" and "utterly ruin."


I refuse to believe that Microsoft is any different from before (not just wrt to foss but general attitude). Doesn't matter if there's a new CEO. Look at what they did to Minecraft logins. Or this.


> I'm still baffled as to why people treat Github like a public library

Because too many open source projects rely on it. Projects like crates.io force you to have a github account to use it. Most (neo)vim plugin managers give preferential treatment to github over other forges.


I'd like to know more about how MSFT qualifies for the moniker "they're the greatest enemy of free and open source software". From my understanding they've invested in many open web resources previously (like jQuery).


I did say "at one time," not necessarily now.

But if you've been watching, literally the only reason they don't hate it now is because they lost the open source v. proprietary battle.


And they opened sourced .NET


This made me wonder. Is there a not-for-profit SCM host that does act like a library?


> “Your honor, we needed so many works that it was simply not practical to ask permission of the creators.” I don’t find this argument convincing given the ability today to license many content types at scale for TDM, including images, music and yes, journal articles (See “Full disclosure” above), but it is an argument often offered by infringers.

Why is this type of argument even valid? Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

So it is not okay to infringe copyright at a small scale but okay to do it in a large scale? How can such a line of argument be sensible in court? But apparently infringers are using this line of argument. So how? Is it not absurd?


On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).


But the thing is that we explicitly allow humans to learn and develop their own skills by learning from other humans, while we have our own taboos around directly copying people's work without permission and passing it off as your own. The debate is that copilot isn't a human; it's a machine that outputs copied work on a statistical basis.

Humans are allowed to be unoriginal, uncreative, boring, mediocre, and all sorts of things. But they’re not copying whole cloth the way copilot is.


> But they’re not copying whole cloth the way copilot is.

Stack Overflow content is CC-BY-SA 4.0 yet I can bet most corporate codebases include tons of code snippets without a link or citation to the original answer


Don't you have to work pretty hard to get copilot to reproduce snippets verbatim? My understanding is that, while its possible to make copilot reproduce snippets so long as they appear in a large number of files, this basically never happens under normal usage.


This argument doesn't work.

I don't know whether copilot is giving me infringing content or not. I'm always at risk that my question was one of the ones that trigger infringing replies.


Whenever I've used Copilot it never seems to copy whole sections of code. Can you provide examples of this?

From what I've seen it is producing fairly generic boilerplate that has been modified based on the rest of the code in my repo so that it works with the other functions and even incorporates other pieces of my code in the same style that I'm using. The boilerplate aspect makes sense because this would be the most common sequence of tokens that it observed during training. It's somewhat miraculous that it can incorporate code on the fly from my repo. I've never seen anything that looks like a direct copy paste from elsewhere though. If you have a different observation I'd love to see it.


Behold: https://twitter.com/StefanKarpinski/status/14109710611816816...

Probably helps that this is from a codebase that's been forked quite a bit.


Yeah, I wanted an example from a real project, not a one-file demo. The high fork count, and probably also its existence in thousands of other projects, likely results in this behaviour if you have no surrounding context.

This is also easily solved by checking the box in Copilot that says not to produce any code matching public code.


Can I circumvent the new OGL revocation by training an AI on 100 copies of the D&D rulebook, and using its output?


You can't even code search in forked repos so maybe forks were excluded (besides commits on top of the fork)?


Most forks probably happened before GitHub existed.


> On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet

And that would make sense and it would be argued on its own merit. The judge/jury will decide if this argument is correct and legal.

But the article implies that there are lawyers and infringers out there who are arguing that they could not have possibly afforded the cost of not infringing, so they were justified in their infringement. Since when did the massive cost of avoiding infringement become a valid reason to carry on with infringement? This seems just plain absurd by common sense. How do lawyers and infringers make this argument? How is it even entertained in court? What am I missing?


So if I make a script that automatically downloads every torrent in existence it's suddenly ok, since it is infeasible to check the copyright of them all?


We allow humans to do what copilot does because we take into account that the human brain is very limited in this regard. If we could scan all of GitHub in under a week and recall perfectly what we saw, we would already have different laws. Now that machines are able to somewhat learn like humans, but 1,000,000 times faster, we need new laws.

That's why I don't believe "but that's like humans doing X" is a strong argument.


>On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).

It might not be that easy. I think Wine developers are not allowed to read code related to Windows, even if this code is published on GitHub. The fact that you looked at the code was decided to be a risk.

You also have cases of a NN producing an identical output, so you either prove your NN NEVER produces copyrighted code, or you have to have a second process that is 100% correct and double-checks the NN output for plagiarism.

I am against Microsoft in this case because they decided not to put their proprietary code in the NN; it would have been funny to have the AI write an open source Windows re-implementation when you feed it the Win API documentation.


The Wine developers are allowed to read whatever they want. They may choose to have a policy not to, because it makes it easier for them to prove that they didn't make unlawful use of proprietary code: they can't have copied something that they never read. If you reproduce something independent of knowledge of the original then that is a defense against copyright infringement. This is essentially the "Clean Room" tactic: https://en.wikipedia.org/wiki/Clean_room_design


If Copilot is so advanced that we need to grant it the rights a human has, then it has a right to freedom, and owning Copilot is a crime.

I don't think Microsoft wants to go down this path.


It is not a human though. It is a function approximated from inputs and outputs. The laws are different and the licenses call out derivative works.


If enough code is recalled verbatim, I can sue the author of that code. That seems to fit entirely with this case -- they are suing the owner of Copilot, partially because it reproduces chunks of code.


Why are wine developers and similar required to do clean room implementations to not be sued then?

Simply reading the leaked source code of Windows makes you not eligible to contribute to wine.

Why is Windows source code so much more important than mine?

The other thing is that copilot is not a human, so it doesn't matter anyways.

Humans are a special exception with laws, because they are intended to protect and benefit humans while also being fair. I don't think you can just substitute something in and assume that the same rules apply.


> Why are wine developers and similar required to do clean room implementations to not be sued then?

That's the neat part, they don't. It's essentially a self-imposed limitation which contradicts actual court rulings on the matter such as Sony v. Connectix, in which the court commented on clean-room being "inefficient" and the kind of inefficiency that fair use was "designed to prevent".


> Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

As such, if those objectives can be provably achieved in a better way without copyright (or rather cutting copyright a bit smaller), there might well be a case for foregoing punishment.

It's an exceptionally-hard argument to make, but it's not illogical.


This seems like a good argument for adjusting copyright law, but seems unhelpful in interpreting it. "This law isn't a good way to achieve the government's objectives" is not the same as "this law wasn't broken". Judges do have some discretionary power in interpretation and that can take into account congress's intent, but here that would be a massive stretch. A judge would simply say it's congress's job to fix copyright if it's not the best way to achieve certain policy goals.


As I said, exceptionally hard in practical terms - just not as baffling as the parent poster painted it. With the right judge anything is possible, and US history is full of controversial "overreaching" judgements.

(and with the deep pockets GH/MS have, it doesn't really matter if the case eventually loses on a big principle - it's just a case of dragging it long enough that, some time through the whole process of appeals, the plaintiff will get broke enough to give up or accept a deal. This line is likely just one of many that defendants will employ.)


And the purpose of copyright is to aid and encourage the progress of science and the useful arts:

>[The Congress shall have power] “To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”


> Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

I don't know of a better definition for "natural human right" than "a right/privilege/protection given to everyone automatically, even if they don't know about it or claim it, unless they specifically opt out of it." We get to decide what our "natural human rights" are, and we've decided that you automatically get copyright on your creative works even if you don't know what copyright is. Seems like a good thing, and a natural human right.


That's certainly not the conventional definition of "natural right", nor was it historically. Interestingly, Jefferson, when arguing against patents, made a point that regular property ownership is not a natural right:

"It has been pretended by some, (and in England especially,) that inventors have a natural and exclusive right to their inventions, and not merely for their own lives, but inheritable to their heirs. But while it is a moot question whether the origin of any kind of property is derived from nature at all, it would be singular to admit a natural and even an hereditary right to inventors. It is agreed by those who have seriously considered the subject, that no individual has, of natural right, a separate property in an acre of land, for instance. By an universal law, indeed, whatever, whether fixed or movable, belongs to all men equally and in common, is the property for the moment of him who occupies it, but when he relinquishes the occupation, the property goes with it. Stable ownership is the gift of social law, and is given late in the progress of society."

(https://press-pubs.uchicago.edu/founders/documents/v1ch16s25...)


> We get to decide what our "natural human rights" are

Words have meaning. You can’t just make things up. The rights we invent through human constructs are the opposite of natural. Copyright is totally arbitrary and nothing at all like (for example) the right one has to their own life.

https://en.m.wikipedia.org/wiki/Natural_rights_and_legal_rig...


> "a right/privilege/protection given to everyone automatically, even if they don't know about it or claim it

That's not what copyright was, when it entered the legal landscape; and it still isn't in so many countries. And even where your definition is somewhat accepted, people disagree on what exactly it means (How long should it last? Can it be inherited? Does it apply to this or that? Etc etc). Even simply the fact that it can literally be bought and sold would indicate that it is not a human right at all - those are typically unalienable. It is a commercial right at best.


>So how? Is it not absurd?

it is.

"Your honor, I've been drink-driving so many times I honestly can't tell you an accurate estimate any more so in both our interests let's agree it was basically uncountable, or 0."

wait a minute...


Or after a crypto heist:

"Your honor, it would have been onerous to ask each account owner if we could have their tokens so we just took all of them at once"


> So it is not okay to infringe copyright at a small scale but okay to do it in a large scale?

No, I think you're missing the "transformative" part.

The line of argument isn't "we're going to resell millions of codebases as-is for pure profit", which would be undisputed copyright infringement.

The argument is that something highly transformative (e.g. training models) isn't infringement at all, because transformative works are covered by fair use. And that, even if we wanted to explore interpreting/changing the law to force opt-in for highly transformative uses, it would be logistically unreasonable, to such an extent that the transformative thing couldn't occur at all. So it's a waste of time to even be discussing asking for permission as some kind of potential compromise or requirement. If it's transformative and therefore fair use, asking for permission is an irrelevant distraction.

That's why this type of argument is valid. I'm not saying whether the argument will/should win in this particular case, but I'm definitely saying there's nothing absurd whatsoever about it.


Yes, transformative works may be allowed. So I'd guess that creating a model is probably OK (speaking as a non-lawyer!). But using output generated by that model is another matter. The "model" is fundamentally a machine that produces output that is derived from the input it was given. And that output might not be sufficiently transformative to "escape" copyright/licensing restrictions.

In the extreme case, the model's output might be a verbatim copy of a large portion of the original input ("training materials"); but even if it has been extensively modified, e.g. to conform to the coding style of a target repository or to follow a different language standard, this might not be "transformative".

(Compare: A translation of Harry Potter to French looks superficially quite different from the English original, yet it is still a derivative work; and if you're planning to publish one, Ms Rowling (or her publisher) may want a word with you. And that would apply whether you translated it "manually" or pushed it through Google Translate.)


I'm not sure I buy that training a model is transformative in the fair use sense. How is training a model different from lossy compression?


Great answer! Thanks for taking the time to write this answer. Learnt something new!


> But apparently infringers are using this line of argument. So how? Is it not absurd?

You realise that they haven’t actually used that line of argument, right? The article author speculated that it might be part of the defence and then said they didn’t find it compelling. Set up a straw man and then knocked it down in virtually the same breath.

Don’t waste your time complaining about legal arguments that have not been made except in the imagination of one author.


It's the same argument people make about why crypto doesn't have to follow the laws on Know Your Customer. Because someone designed the crypto to break that law, so their hands are tied, it's too technically hard to comply.


I don’t understand the author’s position in this article. It spends a long time talking about details of the licenses, but I can’t see any way the suit will actually be about licenses, because if it’s about licenses then it seems patently obvious to me that GitHub will lose very quickly, because they have undoubtedly violated the terms of the licenses.

As I see it, the only leg GitHub can possibly stand on is the “fair use” exemption of copyright law—that the license is irrelevant, because they weren’t using it under that license.

So then you get to the last paragraph of the article, and the “fair use” claim is finally mentioned—as something the plaintiffs seem to be seeking to avoid bringing into it because that would make things messy. But… GitHub’s defence must be “fair use”, I can see no other response. Yes, the plaintiffs “chose to focus on something that is beyond factual dispute”, but how are GitHub ever going to do anything other than bring fair use into it? So I don’t see how they could expect it to “still provide the same damages” without bringing fair use into it. (And I can’t imagine GitHub will settle for anything other than total vindication here—even settlement would doom Copilot.)

Returning to the title: I cannot imagine any way that We May Learn Something About Creative Commons Licensing from this suit. About the interactions between copyright law and machine learning, maybe. But about CC-*, GPL, Apache-2.0, MIT, whatever? Nah, there’s nothing interesting about them in the suit, because if they were involved, it’d be cut and dried.


That's entirely correct, and is why the suit will most likely [1] fail.

[1] 35% chance of success on https://manifold.markets/JeffKaufman/will-the-github-copilot...


IMO fair use is still not a strong argument for Microsoft. They commercialized the product and made money out of it.

Fair use is only allowed if the work you're doing is purely for the greater good. I might be wrong though, IANAL.


You are wrong, though public interest is certainly the basis of the purpose of fair use doctrine. But the simplest way of demonstrating that fair use still allows commercialisation is probably this example: search engines absolutely depend on fair use if they include any content from the linked pages. (And some countries have even tried to call the act of linking copyright infringement, though they’ve tended to back off at least a little, to requiring at least the title or other content for it to be infringement, and not just the URL.)

https://en.wikipedia.org/wiki/Fair_use, lots of good reading there.


Search engines link to the original website


It might be a problem to treat it as copyright. Copyright applies to reproduction, distribution, public performance... if I go to a library or bookstore and I read books and look at their covers, copyright does not apply. Would an android that walks around learning things be subject to copyright? To what extent does it need a body and mobility to be more like a person and less like a scraper?

It might seem stupid, but I worry that if copyright begins applying to "mining" then the next thing is that it applies to humans watching things.

Of course, if an AI re-creates copyrighted content, copyright should apply. Just like it applies when I redraw and sell the Mona Lisa, but not when I store it in my memory. I would pass on the responsibility to users. I don't fear my use of Github Copilot because it's far from infringing any reasonable copyright... then again, I'm assuming the most likely way to infringe copyright with GPT is to use a prompt that almost explicitly requests it.


Funny thing is that recreations are not necessarily covered by copyright law. The clearest example of this is fonts, where (using imprecise terminology but I think it’ll be clear enough) the US only grants copyright protection to font files, but not the shapes—so tracing a commercial font is perfectly legal, even if you happened to end up with an identical result (though good luck proving to a court that that’s what you did).


I mean, today it's pretty easy to prove that if you're setting out to copy one of these fonts. Just film the whole process and upload it to YouTube.


I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

The current understanding of screen scraping is that it is allowed, despite what is in the website's terms. Effectively, if a human can access the content freely without having to actively agree to a license or terms, you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that. This is how the legal status of all AI training data scraped from the web is being interpreted.

When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

Obviously, in this case, is GitHub "scraping" its own site for the training data? Probably not, and that may come back to bite them.

This then also opens up all sorts of interesting questions of whether you can copy paste code from a website and use it internally (not republishing), despite the license attached to the code. If it is freely visible.

Clearly a test case, this one, is needed to clarify the situation. And just because it's legal, it doesn't mean it's moral or ethical.

We may yet see the outcome of this case change the current interpretation of legal screen scraping; it's going to be an interesting time.

On top of all this there is then the question of an AI model reproducing code (or and image or music) verbatim. That obviously needs to be clarified by the courts too.


> When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

I don't see the scraping case applying here -- the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain, which seems ridiculous on its face (especially when the author(s) is including an explicit license alongside that code, which the scraper can theoretically also read and account for).


I think we're facing a copyright extinction event. The whole concept is out of touch with the new reality - when you can generate 100 variations for your text, code or image with the click of a button, what does it even mean to hold copyright over the original?

"In the style of" killed copyright in 2022.


This is a pipe dream. There's too much money behind strictly enforcing copyright protections on commercial products. If anything gets killed, it's going to be automatic copyright protection for "little guys". Microsoft will be able to copy your publicly shared code/art/images willy-nilly but will still send their compliance officers to check that your company has a valid Office 365 license if they notice you writing a private letter in Word.

Edit: If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?


> If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?

I hear this argument a lot but I think the answer is actually pretty mundane: the model behind Copilot was trained by OpenAI, not Microsoft. Microsoft has a large investment in OpenAI, but they don't own the company, and AFAICT OpenAI did all of the scraping for Copilot on their own, without any special access to MS code.


>If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?

Presumably because there might be trade secrets in there that they don't want to leak. That seems entirely separate from copyright to me.


I am sure they have an in-house model trained on all their code. It would be immensely more useful than a generic model.


Haven't stepped on the right toes yet. Create code that will generate art "in the style of Disney/Nintendo" and hey would you look at that, copyright is back and alive again!


As someone who would vote to repeal copyright entirely, because I think its downsides outweigh its benefits, this is a good thing.


Agreed. I don't understand how people on threads like these turn into copyright hawks just because it seems to undermine their livelihoods, same as in the AI art threads. We should be abolishing copyright, not extending it.


I'd actually cut the term of all existing copyrights/patents by a factor of 10 (i.e. 70 years goes down to 7, plus 1 additional year if you aren't dead yet).

I'd then make a new combined copyright/patent system. You would get 1 year on anything creative. You can double the remaining time if you publish all the info necessary to easily recreate what you did. For software, that would be the source code. For music, the score and source recordings. For paintings, the source material and types of paint used, etc. For toys, the 3d design files, etc.

One or two years of head start is plenty to get your business going. I might make, say, a 10 year attribution requirement. I'd leave trademark law mostly as-is.


How many people want to read AI generated text in the style of Lord of the Rings vs how many people want to read Lord of the Rings?


If an AI can generate an infinite number of stories that take place in the LotR world and do it well and faithfully in the style of JRR Tolkien then I would happily read it. You can only read the trilogy and The Hobbit so many times.


You'd happily read it, but that's an answer to a different question. You've already read LotR. An AI generated novel in a similar style doesn't take money from Tolkien's estate.

Some people will want to read the AI generated knockoff. That number of people will usually be much smaller than the number of people who want to read the original and many of those people interested in the AI generated knockoff will also read the original.


The proportion really depends on 1) the quality of what the AI generates, and 2) expectations imposed by the society - in particular, how consumption of AI-generated stuff is perceived as a matter signaling of one's social class. Both are going to evolve rapidly, so it's really hard to tell where we're going to be in 10 years.


With sufficient quality of the model I can certainly imagine people preferring to read AI generated text in the style of Lord of the Rings instead of the original, as the AI generation can adjust the content and style towards the interests of the particular reader, customizing it to deliver what they want.

For example, one obvious observation from looking at fan fiction is that 'shipping' is popular: certain people would strongly prefer certain characters to have romantic relations. An AI-generated Lord of the Rings lookalike could tell the story with the particular relationships that a particular reader would prefer, based on an AI analysis of their earlier reactions to other books; this is not something we have working today (as far as I know), but we're not that far from it becoming real.


Maybe it's not interesting to see what other people generate with it.

I want to have chatGPT answer my questions, and reference materials when doing that. For example I could paste an article and start asking probing questions, debating it, asking for summary, ELI5, etc.

It would be a research assistant and tutor.


>the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain

It clearly isn't in the public domain. But suppose we define something new: a "public knowledge domain". This would be material that it is legal for a human to look at and learn from. They gain no copyright or IP rights, but they can learn from it and use it within the existing limits of copyright and IP. Saying that anything posted online enters this public knowledge domain seems agreeable. Few things wouldn't be allowed here, generally material agreed to be illegal worldwide (and some countries may have tighter limits, like a theocracy banning learning from material deemed blasphemous).

Then it is a question of if an AI can also learn off of such material as long as it doesn't produce works that violate existing copyright or IP laws, same as a human. This doesn't seem, on its face, inherently ridiculous. There are still corner cases and potential for abuse, but those also exist with copyright law yet we don't throw the whole system away and just ban all forms of copying or selling the right to copy.


I see lots of folks equate "trained on" to "available verbatim" and that simply isn't the case for the vast majority of training data. It becomes hard to have a productive discussion when there is such focus on the examples that are regurgitated verbatim (often by people with explicit knowledge of the expected output, so they would *know* that they are going to infringe if they republished it) to the exclusion of talking about an in-general system that is trained on data and outputs unique data.

IMO that second case is a FAR more interesting question.


The reason why people focus on snippets regurgitated verbatim is because even one such snippet, if sufficiently long and non-trivial, could be sufficient to claim the model itself as a derived work.


I'm not clear how it's particularly different from ingesting it into a browser, and then rendering it as part of the html into pixels


> I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

Screen scraping is essentially a question of whether or not the actions constitute something akin to hacking, which is almost completely orthogonal to copyright. The main intersection you get is that many screen scraping scenarios are about things that aren't copyrightable (the US doesn't recognize "sweat of the brow" doctrine, so databases aren't copyrightable). When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.


> When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.

Was google dinged for scraping the lyrics or publishing them verbatim? My understanding is that it was the latter.


> Effectively if a human can access the content freely without having to actively agree to a license or terms you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that.

Would this apply to books? I can walk into a library or book store and OCR countless pages of countless books without agreeing to any license.


Try publishing those OCRed pages and see what happens.


As I understand it, "fair use" would allow you to do various kinds of analysis of all those pages, and to quote fragments of them in your own work, which is arguably similar to what copilot etc might do.

But republishing large chunks of the content, whether verbatim or in a "derivative" form such as a translation, would generally not be allowed without explicit permission.

How much "fair use" allows when it comes to copilot spitting out copies of functions it saw in somebody's repo.... well, that's a line that hasn't been defined yet, afaik.


Steve Ballmer once called Linux and the GPL License a cancer because to copy a portion of code from a copyleft project, minimal as it may be, would make the whole project require a copyleft license.

If Github Co-Pilot includes GPL code then produced works should have GPL too, right? It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

Alternatives may be to copy CC0-only (which can't be guaranteed, also who'd believe you lol) or target license-only - as in, if my project is MIT, go with MIT projects to source from, if I'm GPL, include GPL and so on.


>It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

For reference, it's been shown to cough up code from Quake verbatim. This code, from John Carmack(?), also includes his profanity-laden comments:

https://twitter.com/mitsuhiko/status/1410886329924194309


To be fair, that's probably one of the most copied pieces of code already.


And it was (apparently) copied from something or someone at Xerox


> It is known that it produces verbatim copies of sections of code

This happens only rarely, under 1% of the time. It happens mostly for widely replicated code and not so much for code that appears only once. It can be filtered out with search and Bloom filters of n-gram hashes.

But the prompter can goad the model into copyright infringement by quoting the start of a copyrighted text verbatim, and asking for completion. The longer and more precise the prompt, the higher the chance of regurgitation. So, when it happens, we're often "asking for it".

Both regurgitation and hallucination seem to be LM problems we can tackle. They are complementary - in one we don't want the model to replicate the training data exactly (be creative), in the other we don't want the model to invent facts out of thin air (be factual). Both can be tackled by using search for reference testing.
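The n-gram/Bloom-filter filtering idea mentioned above can be sketched roughly like this. A minimal, hypothetical Python illustration: the Bloom filter parameters, the 8-token n-gram length, and the 50% overlap threshold are all assumptions for the sketch, not anything Copilot is known to use.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per item over a fixed bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting a SHA-256 hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def ngrams(tokens, n=8):
    """Yield overlapping n-token windows joined into strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_index(corpus_docs, n=8):
    """One-time, offline pass: hash every training n-gram into the filter."""
    bf = BloomFilter()
    for doc in corpus_docs:
        for g in ngrams(doc.split(), n):
            bf.add(g)
    return bf

def looks_regurgitated(output, index, n=8, threshold=0.5):
    """At generation time, flag output whose n-grams mostly hit the index."""
    grams = list(ngrams(output.split(), n))
    if not grams:
        return False
    hits = sum(1 for g in grams if g in index)
    return hits / len(grams) >= threshold
```

A Bloom filter can report false positives but never false negatives, so verbatim training text is always caught, at the cost of occasionally flagging original output; a production system would presumably follow up with an exact search before suppressing a suggestion.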


> I don't see regurgitation as a long term problem, it's just waiting for attention, probably wasn't top priority

The fact that Microsoft is wary of providing it with things like the Windows source code says all we need to know about how much it can be trusted.


It's not just trained on GPL code. It's trained on code under incompatible licenses, which means that the code it produces is potentially unlicensable in general. There's of course also the question of attribution, which many licenses require.


I wonder whose decision it was to train the bots on 'code that is accessible to our scraper' and not on 'code we can sell derivative products of'. I can tell the second group is quite small, so I understand the incentive, at least.

Maybe they chose the Uber strategy of 'What we are doing is bordering on illegal but by the time the bell rings we'll be valuable enough to write the law ourselves'.


Microsoft chose not to use all code that is accessible to their scraper. All the proprietary closed repositories on GitHub that large companies use are excluded. I can easily imagine all the lawsuits that would have happened if paying customers' company secrets hosted at GitHub had been used and leaked by Copilot.

The decision to only train it on "open" repositories was one of self preservation.


But the question, realistically, is whether it is the developer's responsibility, because they are the ones creating the program.

It's not like copilot made you use the code, copilot didn't commit or release the app with the code.

Copilot didn't breach the copyright, you did. It's a tool. You used it. You released it.

Maybe there's a product which you can include to see what code of yours violates copyright in the future?


> If Github Co-Pilot includes GPL code then produced works should have GPL too, right

No.


It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

Innovation in code should be heralded but if in the majority of cases the coder using Copilot and similar tools is just saving time on bog standard functions they could write themselves, it's difficult to understand why that needs to be attributed.


> It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

I don't know if you watch YouTube, but this is probably how every creator hit with a bullshit DMCA claim for 5s of audio from a song feels. Why does OpenAI's work demand special consideration here? Or to put it another way - if we're going to be ignoring copyright, everyone should be able to do it.


> if we're going to be ignoring copyright, everyone should be able to do it.

Yes, yesss. You're getting to the logical conclusion. Now I don't think Microsoft have become 'based' and want to break the copyright system but I hope they have inadvertently done so through their actions.


Oh I have no problem with protecting genuinely innovative work, including code. That's not the vast majority of code produced by or derived from these tools though.


I don't think it would be a big deal if OpenAI/Copilot get shut down. Honestly it might be a good thing. Then we can generate new versions of these tools that are truly open using data that has been freely contributed, rather than obtained by for profit companies in shady cash grab.


And those tools will be similarly illegal if the court strikes down Copilot. Also it costs hundreds of thousands of dollars to train things like Copilot and GPT-3, so can we really rely on innovation happening in open source without any way to recoup costs? I get that you might not like OpenAI/Copilot for creating these tools in the way that they did, but surely you have to see that this decision goes WAY beyond what you blithely call a "shady cash grab".


You seem to completely miss the point about using data which was freely given. I would say that most of us like the idea of Copilot what OpenAI is accomplishing. The main issue stems from violating licenses which require attribution etc. As the article noted, one can get around attribution by getting express permission from the copyright holder (or by not using their work at all).

The fact of the matter is that some companies have made a paid service by violating the copyright of individuals. That's fundamentally not okay.


So we are going to end up with a less powerful version of Copilot, which would benefit who? Copilot competitors?


This field is moving so fast that copilot is already way behind state of the art. New tools, even with more limited data sets, are going to be more powerful, not less powerful.


It's very easy to be generous with other people's property, intellectual or otherwise.


It's also very easy to invent supposed "intellectual property" rights out of thin air that conveniently last for a century or more (!) for your own work.

But instead of either of these views that focus on selfishness, it's much more productive to instead think about what benefits society as a whole.


Microsoft has been one of the most aggressive enforcers of intellectual property rights ever known, and it can be easily argued that their stranglehold retarded the development of computing and the internet as a whole.

It's so bizarre to see these prima facie bad arguments being used by someone who isn't being paid by Microsoft to make them. If Microsoft puts all of their code in, requires anyone who uses copilot to share their code with the algorithm, and allows anyone to run the model on their own platform complete with updates, you'll see all of the FOSS objections dry up in an instant.


It's also pretty easy to be generous with your own when it comes to copilot. I have not been damaged in any way by copilot learning from my code, and no one else has either.


I'm sure that once you've informed everyone involved that you've declared this to be true, they'll call off the lawsuits and apologize.


You could make this kind of "just" and "bog standard" argument for anything. Just using an image for educational or illustrative purpose, just using a song for a political rally etc etc.

The fact is as a society we have decided to reward creators with copyright as a means to commercialise their creation and get compensation. Who is to say programmers are not creators and the compensation they want for open source licenses is attribution?

Microsoft really should have known better than to touch OpenAI without a 10 feet barge pole.


How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

Most coding is not innovative, and that is the kind of code that these tools are producing and derived from in most cases.


> How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

So your requirements are pseudo-code which you simply have to translate. I see. No creativity required. Yep.

> Most coding is not innovative, and that is the kind of code that these tools are producing and derived from in most cases.

I see what you want to suggest. Then it wouldn't be necessary to learn from these datasets; you could simply build a "fair use" product that covers these cases with a snippet engine.

Don't be naive.


> How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

If programming is nothing but translating unoriginal old algorithms, then you should train copilot on those. Nobody would complain. The fact that they don't is an unassailable proof of the unsurprising fact that programmers add value to programs.


I agree with you. I love using Copilot (and similar tools) and I’ve found it is exceedingly good at predicting patterns in my own coding. It has saved me a lot of time, and I would hate for it to go away or be crippled because of lawsuits like this. I couldn’t care less if my own code is used for training. My code isn’t precious; it’s what I do with it that is important.


You're arguing that it should be allowed because you don't feel it harms you personally.

Literally no one objects to people like you donating your code to Microsoft.


Protecting the property rights of the rich is protecting freedom and civilization, protecting the property rights of people who share their work is ego.


In my mind I don't want MS to rehash my code and sell it using "OpenAI" as some laundry machine.

Given how insane copyright laws are I would be pleased if they for once worked in my favour.


Completely agreed. It's crazy that people who claim to value freedom and being open quibble over copyright laws and licenses written for and by lawyers.

After reading the comments here, apparently everyone is a genius with code so special and unique, it would be unfathomable for two or more people to arrive at the same exact outcome.


Nobody is asking for AI in general to be illegal, only training on code you don't own and then emitting it for profit.

Why can't they train on code that they own, such as the windows source code?

Why doesn't copyright law apply to open source code but applies strongly to windows source code?


Tell me you don't know the history of and reasoning for free software (and attribution) licenses, without telling me you don't know the history of... etc.

Mind-bogglingly entitled.


Who's more entitled? The coder who has no issue with their unoriginal code being copied and mixed with millions of other samples and churned out in a helpful way for others, or the one who demands attribution in the most trivial of cases, or denies the access in these forms as it doesn't credit their brilliance in implementing a sort function?

I'm in the first category; I'm guessing by your abusive response you are in the second?


Nice straw-men you're collecting there.

First one: Uses of attribution and copyleft licenses are just ego-boosting, instead of legitimate protection of authorship against corporate piracy.

Second: Criticism of said corporate exploitation of community work is the actual entitled behaviour. Oh, it's also abusive.

Third straw man: that people who oppose CoPilot in its current form just want to defend copyright around boilerplate stack-overflowish type code.

All false.

I can only assume... You're either too young and inexperienced to remember the early days of the copyleft, free software, and open source movements and why these licenses exist (and still need to exist)... or your values are so backwards that you just think it's OK to harvest other people's hard work for your own (or your employer's) profit.

To be clear: there is no heuristic at work in something like CoPilot that can distinguish between boilerplate code and genuine innovation. It has been shown multiple times to just freely copy and paste novel, copyrighted code, without attribution or conforming to license restrictions. That is unacceptable and deserving of legal countermeasures.

I would have no problem with CoPilot copying only the code of people who have opened their code for that kind of use. But that's not what it does.

Notable that Microsoft, its owner, is not training CoPilot on its own massive corpus of code. Just other people's code.


"Second: Criticism of said corporate exploitation of community work is the actual entitled behaviour. Oh, it's also abusive."

No. You are being abusive when you throw out insults to a commenter who argues something you disagree with - please don't try and obfuscate what you were doing, and check the HN guidelines before commenting further, as you are repeating the hostile and condescending tone and should know better.


Looking at this thread, I don't think you have much room to call someone else out for being condescending. I think it'd be useful for you to slow down and write a more considered position, taking time to address the valid concerns others have raised.


There are definitely valid concerns and counter arguments; they will be more powerful and persuasive when presented on their own merits rather than with the assumption and accusation the poster they are responding to is unethical, stupid or commenting in bad faith. I think I've been polite in response.

However, as I started the thread with a comment that I guess was more provocative (and perhaps more personally felt by others here) than I intended, I'll accept your criticism and bow out.


The coder that thinks that making the first decision about their own code entitles them to everyone else's.


[flagged]


Not that it matters but I contribute to open source myself.

Wrt "devious" and "asshole" I see you are a new poster, you may want to check the site guidelines linked at the bottom of this page.


"Copyright" is not a natural right, it's an artificial right we invented ostensibly to benefit society. If innovations like Copilot provide more benefit then they could get exemptions. That's why fair use is an exemption.


The company that owns Copilot believes in the aggressive enforcement of copyright.


AI advancements are shining a light on how warped our society has become. We _should_ invent technology to better our lives, but instead it's become something to fear as it might destroy our livelihoods. I understand the threatening feeling AI brings, but rather than squashing innovation we should be rethinking the role of our economy, copyright, and intellectual property.


I agree with the sentiment. However, given the current political realities, it seems we are happier to let people lose their livelihoods without any clear replacement, training, or economic plan of any kind. See manufacturing in the rust belt for an example of what happened in the last two decades when we moved to the knowledge-worker economy without a plan for the workers.


AI had enormous potential for interesting artistic effects. Instead it's being used to make (for now) shoddy knockoffs, and maybe someday better-than-the-original knockoffs. I was hoping for the invention of a car, but all we got was a faster horse.

Would anyone have been able to sell a car if faster horses were already clogging up the streets?


Copyright covers expression, but not the ideas themselves. So it should be ok to mine ideas from projects, open or not, as long as the model doesn't reproduce expression. And even expression can be copied if it is small enough, trivial, public knowledge, the only obvious way to do something or an API call.

If you want idea protection you need to look at patents.


Google couldn't get this argument (that software APIs are not copyrightable) to fly at SCOTUS. And that was a case where pretty much every computer person except Oracle agreed that Google was right.

Arguing that AI is mining ideas and not expressions is going to be a lot less successful when you've got a large pool of expert witnesses who are going to be able and willing to say that AI is only capable of mimicking the form of what it sees.


Wait, what do you mean Google couldn't get that argument to fly? Google won the case. Are you referring to the fact that the SCOTUS didn't directly address the copyrightability of APIs and instead ruled in favor of Google on the basis that Google's use was fair use *even if* APIs are copyrightable?

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....


Yes. Google's argument that software APIs weren't copyrightable because of the idea-expression divide didn't find purchase. It won the case on other (fair use) grounds. Indeed, if you read the opinion closely, there are a few places where it looks like there used to be a section in the opinion on software API copyrightability (given Breyer was the author, non-copyrightability is probably more apt) that was ripped out in later drafts, presumably because there weren't the votes for it.


If Sally wrote a program that generated a giant dataset of token frequencies and associations by analyzing Github source code, that doesn't violate copyright and she could sell that.

If Bob wrote a program that took Sally's dataset and produced source code from prompts, that too would in principle not violate copyright and he could sell that.

But you're suggesting that if one person did both at once, that would violate copyright?


AI is not just that. That's a myth.

Yes, when you train on a human generated corpus you're going to get an interpolative AI. But when you train on a dataset generated by AI, you can surpass human level.

AlphaZero started, as the name says, from zero. No examples of how we play the game. In just three days of self-play it surpassed the best human. How was that possible? We had a 2,000-year head start, larger brains, and many players, not just one model. AlphaZero did it by creating its own training data.

Here is an example of using code language models for this kind of dataset-creation by play.

> Evolution through Large Models

https://arxiv.org/abs/2206.08896

By the way, GPT-3 and ChatGPT are not simple interpolative language models. They have also been trained on many tasks, problems, and code, which is the kind of data that will awaken the skills of the model to a whole new level. What I am saying is that we can auto-generate more problem data and increase their skills.


>And even expression can be copied if it is small enough

This is what I would argue if I was Microsoft's lawyer. You can't win a copyright lawsuit over one bar of music, one dance move, or a few words. Similarly, copilot can't be considered to be infringing on anyone's copyright because the snippets it might copy verbatim are too short to be copyrightable.


> You can't win a copyright lawsuit over one bar of music

Actually, Kraftwerk did

https://www.factmag.com/2019/07/30/kraftwerk-sample-lawsuit-...


In Europe - not where this lawsuit is happening


Somebody needs to apologize to Biz Markie and all of hiphop for establishing precedents that say the opposite, and for all the work they have to go through to clear a sampled 50 year old drum fill (usually including credit as coauthors and publishing residuals.)

There are bands that are giving all of the profits of entire albums to the people whose sample they failed to clear on a single song.


> You can't win a copyright lawsuit over one bar of music, one dance move, or a few words.

Actually... The smallest successful copyright claim is I believe over 10 words or so.


Yes, this case: https://fkks.com/news/can-borrowing-ten-words-be-copyright-i...

However, this definitely pushed the boundaries of copyright law. Audi never appealed the $1m verdict, so it's hardly settled case law.


> If you want idea protection you need to look at patents.

No. You can't.

https://www.legalzoom.com/articles/can-you-patent-an-idea

'A machine sucking dust from a surface in order to clean it' is an idea.

Dyson design is an invention.

https://www.dyson.com/vacuum-cleaners/cordless/v15/detect/ye...

Here's a broad description of the differences between copyright and patent (trademark included):

https://copyrightalliance.org/faqs/difference-copyright-pate...

None of them covers the principle of idea.

And if you want an example of a dubious (in my opinion) attempt to patent an idea (patent troll):

https://patents.google.com/patent/EP3811791A1/en

Why is it a patent trolling attempt (in my opinion)?

Because it's basically an idea (adding a natural compound to e-cigarette e-liquids as a sweetener) padded with an extensive list of claims covering general principles of everything related to e-cigarettes and e-liquids, to pretend it's a specific formula (which it is not) or an invention (which it also is not, because it's a natural compound known for its sweetener properties since the 19th century).


If it reaches a decision, I’m curious how it will affect “education”. If you study a hundred repos to learn how to do a thing, and then produce something of your own that happens to be similar (because that’s how you learned), are you under any particular obligations?

One hopes Oracle et al are not further inspired by such an outcome.


Humans aren't machines, so it won't affect education at all. This argument is a distraction.


Humans are machines actually. Whether what Copilot does can actually qualify as "learning" vs. merely generating derivative works might actually be relevant.


Machines aren't humans.


Sure, but can machines learn and create new works the way we do? That's the only relevant question.


Learning from others is copyright infringement, better make sure none of your code contains any sequences longer than 150 characters that match any other code ever written.


> Learning from others is copyright infringement

Isn't this the reason for clean room implementations?


That's specific to reverse engineering a product that has the same features as an existing one though. Part of the concept of open source is allowing others to learn from your implementation. In many cases one doesn't want to include an open source project directly in their code base for a variety of reasons, maintainability being a big factor. I don't see a clean room implementation being a good solution to the problem of wanting to use accepted standards and practices which must be gleaned from experience working with or learning from other people's code. I don't think you can get far arguing that all code should be written in a vacuum with no ability to learn from outside sources.


Wouldn't this kind of ruling effectively put a halt to ChatGPT and other AI's training on publicly accessible data? What's the difference between Copilot creating output based on code on Github, and ChatGPT giving answers based on a NYT article (without attribution)?


IMO it should be treated like a human. Your output is 99% similar to this <code/article>? Copyright infringement: you should have mixed your own thoughts and reasoning into your output. Humans can plagiarize just as easily as ChatGPT/CoPilot can generate verbatim text from its training set.


At what point can one say that the source is unique enough to qualify for protection? Otherwise, I can't use `print "Hello world"` because I didn't mix my own thoughts and reasoning into the output.


It's no different from how current copyright works for us humans. Something is only copyright protected if it's a "sufficiently original" work and "possesses at least a minimal degree of creativity" https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...


So (just thinking out loud), if Copilot suggests something only seen in one codebase, the code owners have a decent copyright case. But if copilot suggests something that's frequent across multiple, there's really no case to be made.


That's at least one of the rules that GH is trying to enforce on CoPilot, but legally I imagine that even repeating code that appears multiple times on the internet could be considered copyright infringement (i.e. if multiple people copied that code from one person).

The problem here ends up being that code, especially in popular languages, will often look similar when you're doing something like finding the best implementation of an algorithm. So if you invoke CoPilot on a common problem, chances are it can pull the exact code it needs from its dataset, but it also could have generated that same code snippet had the solution not existed in its training dataset. And when you start out solving a problem and then ask it to continue writing more code, it just assumes you're solving the exact same problem that the original source code was solving.

This could probably be remedied if CoPilot spat out a "this is X% similar to <x> source code from the internet" notice, so that you could know just how unique CoPilot is being. Legally, copyright is just a mess and was ready for neither the scale of the internet nor the advances in ML, now that there are machines with a 50% chance of infringing on someone's copyright and a 50% chance of creating something new.
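The similarity disclosure imagined above could, in the simplest case, be a sequence-ratio check of each suggestion against known sources. A minimal sketch (the use of difflib and the repo names here are illustrative assumptions, not anything GitHub has described):

```python
import difflib

def best_match(generated: str, known_sources: dict) -> tuple:
    """Return (source_name, similarity) of the closest known snippet."""
    best = ("", 0.0)
    for name, snippet in known_sources.items():
        # Ratio of matching characters between the two strings, in [0, 1].
        ratio = difflib.SequenceMatcher(None, generated, snippet).ratio()
        if ratio > best[1]:
            best = (name, ratio)
    return best

# Hypothetical indexed training snippets.
sources = {
    "repo-a/sort.py": "def bubble_sort(xs):\n    for i in range(len(xs)):",
    "repo-b/io.py": "with open(path) as f:\n    data = f.read()",
}

suggestion = "def bubble_sort(xs):\n    for i in range(len(xs)):"
name, score = best_match(suggestion, sources)
# A Copilot-style UI could then warn: f"{score:.0%} similar to {name}"
```

At GitHub's scale a pairwise scan like this is obviously infeasible; a real system would need some indexed fingerprinting scheme, but the disclosure itself is the point.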


My understanding is that code presented for learning/demonstration purposes is allowed to be copied verbatim.


Good question. I think that's exactly what we need to decide as a society: both whether these things are violating existing copyright laws, and whether the laws should be changed to specifically handle this new situation. I don't know where I land, myself. My gut instinct is on the side of the creators and that these AI tools are illegally infringing on the creators' work. But I think there's reasonable arguments on both sides.


There is a difference between products and research in this case. Frequently, research is allowed by law as an exception, where building a product requires some more extensive agreement.

Unfortunately this is frequently abused where researchers build a model under the exemptions, and then others use that model commercially, even if they wouldn’t be allowed to build that model directly themselves.

Anyways, the scientific progress would continue, but products would halt until product developers get some kind of agreements with content creators (eg maybe people start adopting a new kind of open-ish license).


There is no difference, which is why this lawsuit won't be the only one


I just skimmed through the case and liked how they defined Artificial intelligence

“Artificial Intelligence’ is referred to herein as ‘AI’. AI is defined for the purposes of this Complaint as a computer program that algorithmically simulates human reasoning or inference, often using statistical methods. Machine Learning (‘ML’) is a subset of AI in which the behavior of the program is derived from studying a corpus of material called training data.”


Has there been a lawsuit filed on the image generation side? DALL-E and Stable Diffusion trained on images on the web, many of which weren't even freely licensed. So I would think similar legal arguments would apply there.


Definitely about to happen as soon as lawyers can establish that damages are large enough to be worth the effort, or the companies piss off some rich artist willing to underwrite the cost, à la Thiel with Gawker. There was a ruckus just a few weeks ago about an artist feeling that her style of art has become very easy to rip off.


The big difference there is that you never get a 1 to 1 copy of the source content out of the image models, where you often do with copilot. Whether your use is transformative is a part of the fair use legal test.


> where you often do with copilot

Not at all. You need to bait it really hard and push it into a corner for it to reproduce anything. At that point you might just as well go to the repo and copy-paste the code directly.

I've used Copilot since day one and still haven't seen anything that felt like a 1 to 1 copy of something. It's highly contextual and all about using the code I have already written to craft its suggestions. Even if I ask it explicitly for a known algorithm it will use my code style, patterns and naming conventions to write it.


Except that Stability AI have not only admitted to training on copyrighted images without the permission or attribution of many artists [0], they already run a commercial SaaS API platform that uses the model [1], effectively undermining the sloppy 'fair use' or 'transformative purpose' claims: any artist can see all the digital art that Stability trained on without permission or attribution [2], and the model can even produce famous copyrighted images verbatim.

There is no difference between the two. Another Stability project, Dance Diffusion, was trained on public domain music and audio with the permission of musicians [3]; clearly Stability knew they would be sued into the ground if it had been trained on copyrighted music. They seem to have understood they were trampling over the copyright and watermarks of images, and knowingly avoided doing the same with music.

My point is: use and train only on public domain content and content used with the creator's permission. This applies to all of them: DALL-E, Copilot, Stable Diffusion. Clearly it wasn't a problem to use public domain music with Dance Diffusion, was it?

[0] https://venturebeat.com/ai/stability-ai-to-honor-artist-opt-...

[1] https://platform.stability.ai/

[2] https://twitter.com/EMostaque/status/1603147709229170695

[3] https://techcrunch.com/2022/10/07/ai-music-generator-dance-d...


I've seen quite a few exact copies of watermarks "transformatively" and "artistically" generated in AI images.


An identical/similar blob of pixels across tens of thousands of otherwise unrelated images is exactly the kind of thing I would expect to see mashed up. I'm not sure what your argument is here.


I suspect that a lot of the people lining up on the other side against GitHub/Microsoft won't be so happy if the courts further lock down permissible uses under copyright across the board.

More specifically, if Copilot breaks the "rules," so too does (probably--IANAL) pretty much every generative AI project out there. Restricting training to public domain datasets would be very limiting.


All copyrights and patents slow progress. If the desire for generative models is greater than the desire to hoover up cash, refine or end copyright and patent laws. Free Software people would build a statue to Microsoft if they started campaigning for an end to software copyrights.


>Free Software people would build a statue to Microsoft if they started campaigning for an end to software copyrights.

Some might. But FOSS licenses can exist because of copyright. So if, hypothetically, there weren't software copyrights, anyone could take any code and monetize it however they wanted with no restrictions. That might or might not be a big deal--the general trend has been towards more permissive licenses anyway.


I consider myself to be kind of a Free Software person and I wouldn't cheer for the end to software copyright, I worry that the abolition would harm the cause I care for, which is preserving the freedoms of users. Getting rid of copyright wouldn't stop bad actors from trying to circumvent their users' rights, and with no recourse to copyleft it would become harder to fight that.


I’m surprised it’s taken this long to see something like this on HN.

Maybe something like this could be a propellant for legitimate copyright reform.


Given who has more money for lawyers, I worry that copyright reform will take the form of small creators being given weaker protection than large entities - for example by making formally registering a copyright on a work necessary to get "full" protections, while self-applied licenses are treated as merely suggestions.


There have been plenty of discussions of this subject here.

Eg

https://hn.algolia.com/?q=copilot+copyright


Well, I clearly saw this coming from miles away. [0] [1]

Effectively you can't even use the code since it has trampled on licenses which are incompatible with each other. But it is more of a problem for Copilot.

[0] https://news.ycombinator.com/item?id=27725322

[1] https://news.ycombinator.com/item?id=27772446


I see lots of anti GitHub (really Microsoft) sentiment here, but doesn’t a ruling against GitHub have massive implications for any “all powerful ML trained AI model” period?

Like, we’re all swooning over ChatGPT, but how can ChatGPT be legal if this isn’t? I can literally ask it “write me a song about cryptocurrency in the style of Taylor Swift” and it will. It couldn’t do that if it hadn’t trained on Taylor Swift song lyrics.

Doesn’t this kill ChatGPT? Doesn’t this lawsuit potentially kill lots of training models?


It very much depends on the exact ruling. Judges usually don't like to make rulings broader than they need to be, so it may well be something decided on a technicality. Or the ruling may rely on some particular property of the Copilot model that does not apply to ChatGPT.


Good riddance.



I think MSFT will regret not using some kind of Bloom filter to avoid producing existing code verbatim. If they could show that it will never reproduce a long sequence (unless that sequence appears in more than x independent repos), they would have a much stronger argument for being transformative.
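A minimal sketch of the idea: hash fixed-size token windows of the training corpus into a Bloom filter, then flag any generated output whose windows have all been "seen". The window size, filter size, and hash count below are arbitrary illustrative choices, not anything GitHub has described:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Index every 5-token window of the (stand-in) training corpus...
corpus_filter = BloomFilter()
training_code = "def quicksort ( arr ) : ...".split()
for gram in ngrams(training_code, n=5):
    corpus_filter.add(gram)

# ...then flag a generated sequence all of whose windows are "seen".
candidate = "def quicksort ( arr ) :".split()
verbatim = all(g in corpus_filter for g in ngrams(candidate, n=5))
```

A nice property: Bloom filter false positives can only cause over-blocking, never under-blocking, which is the safe direction for a verbatim-copy guard.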


I've produced a few pieces of software that has user generated content as a feature and every lawyer I've used to draft the ToS has added provisions that gave my organization explicit rights that extend beyond whatever other licensing may be applicable. Both in being able to improve our own features or create new ones with that content and other rights that would indemnify us from a wide range of legal issues.

I'm curious to see how this will be litigated because of that. The differences here seem to be how public Copilot's use of user-generated content has been, the wide range of licensing that content falls under, and the fact that this is an entirely new product produced by a third party through a partnership, versus a first-party tool.


We've discussed the suit here before a few times (ex: https://news.ycombinator.com/item?id=33485544) and also whether something like co-pilot even needs a license (https://news.ycombinator.com/item?id=27736650).

There's a prediction market on this suit's success, which is currently at 35% (https://manifold.markets/JeffKaufman/will-the-github-copilot...).


Here's a link to the case itself so you won't have to guess about anything. https://githubcopilotinvestigation.com/


Would love to see an overlap of the commenters here versus the ones in the "piracy is totally okay!" thread with people saying it's morally just fine to pirate videos and music just because they wanna.


Good, I put all my code on GitHub under the MIT license because I'd like people to be able to use it as they see fit.

That doesn't include not giving me credit.


Isn't it strange how Microsoft's GitHub now finds itself party to source-code licensing lawsuits, after Microsoft's 'association' with the SCO Unix cases of yesteryear.

[1] https://www.cnet.com/tech/tech-industry/fact-and-fiction-in-...


Related. Others?

The lawsuit against Microsoft, GitHub and OpenAI that could change rules of AI - https://news.ycombinator.com/item?id=33546009 - Nov 2022 (5 comments)

An open source lawyer’s view on the copilot class action lawsuit - https://news.ycombinator.com/item?id=33542813 - Nov 2022 (175 comments)

Microsoft sued for open-source piracy through GitHub Copilot - https://news.ycombinator.com/item?id=33485544 - Nov 2022 (288 comments)

We've filed a lawsuit against GitHub Copilot - https://news.ycombinator.com/item?id=33457063 - Nov 2022 (781 comments)

GitHub Copilot may steer Microsoft into a copyright lawsuit - https://news.ycombinator.com/item?id=33278726 - Oct 2022 (11 comments)

GitHub Copi­lot inves­ti­ga­tion - https://news.ycombinator.com/item?id=33240341 - Oct 2022 (1219 comments)

What Copilot means for open source - https://news.ycombinator.com/item?id=31878290 - June 2022 (137 comments)

Should GitHub be sued for training Copilot on GPL code? - https://news.ycombinator.com/item?id=31847931 - June 2022 (300 comments)


This seems to ignore the widely repeated claim that GitHub's terms of service explicitly grant them a license beyond the actual open source license attached to the code, and thus transfer the burden of liability to the uploader when it comes to code whose licensing they cannot control.

So either this is about code authored by people who did not use GitHub (in which case GitHub would be immediately liable, though they could try to sue whoever uploaded that code to GitHub for damages) or it's going to have to argue that the terms of service can't smuggle in a provision that effectively sidesteps even the most permissive open source licenses.


The relevant part of the terms would be https://docs.github.com/en/site-policy/github-terms/github-t..., plus the definition of Service, which would include Copilot. I don’t believe GitHub have ever claimed or suggested that they are relying on this, and I think it would be a very shaky claim in court due to the second paragraph.

Rather, GitHub have consistently cited “fair use”, as also noted in the suit, including in the summary at https://unicourt.com/case/pc-db5-doe-1-et-al-v-github-inc-et.... I also don’t believe GitHub have ever claimed to only use GitHub repositories, though I know of no obvious evidence of them having fetched from other sources, and they may well not have simply because it’s more convenient not to and they’ve got enough already, even if it honestly weakens their position (“if you’re relying on ‘fair use’, why haven’t you added closed-source software like the GitHub backend to show you mean it?”).


I think they're challenging the validity of what's in the user license agreement. Companies can put whatever they want in there, but not everything is enforceable.


I think it wouldn't be enforceable in a consumer service (at least in my jurisdiction, see a German court ruling against WhatsApp banning a user for using a third-party client by claiming doing so violated their ToS).

But given that implicitly or explicitly GitHub users act more like users of a commercial service (remember: commercial doesn't mean paid or b2b), things might be different given that consumer protections don't necessarily apply.

Personally I'd love to see the same "you can't hide surprises in your ToS to obtain 'consent'" yardstick be applied here though, commercial service or not.


I wouldn't be surprised if the result varied by country.


I suggest reading the actual complaint. Your interpretation is wrong.


Is the actual complaint linked or quoted somewhere in the article? I've re-read it twice and it spends most of its wordcount explaining what open source is, mentioning a previous case and describing the implications of the attribution requirement with regard to the DMCA. There are plenty of links but they go in all kinds of places except to the ruling itself.

Your response is not very helpful beyond telling me I lack information I'm unaware of and couldn't find.


Ethics aside, by the time this case is resolved it likely won't matter anymore. I see models getting more granular and specific (and thus smaller) while typical personal computer resources get larger. At some point you'll just download the largest 'javascript inference model' that will run on your hardware and be done with it.

There are a huge number of motivated developers that want this to exist, the techniques themselves are not novel, and the code itself cannot be kept from whomever wants to train a model. It's a lost cause.


Thought exercise: How can GitHub claim ANY license to my code that was uploaded by someone else without my permission or even notification? Lots of my code is there, uploaded by others.


Here is a dedicated website from the plaintiffs team. https://githubcopilotlitigation.com/

Interestingly, Matthew Butterick (of Practical Typography fame) is co-counsel on this. Programmer, lawyer and typesetting expert, damn. https://matthewbutterick.com/


If the model is being trained on the code, and is not copying and pasting it or including it directly from the various repositories, then I would think a blanket attribution covering all material used to train the model (basically a giant list of all the authors), added to the Copilot repo, should satisfy the attribution requirement.

It'll be interesting to see how this plays out. To my understanding, these language models are strictly statistical in nature, so they aren't building a database of code to paste snippets from. They look at all the examples and encode the statistical likelihood that one token follows another; then they take the preamble (the code you wrote) as input and generate the chain of tokens most likely to follow it. It seems like the same process a person goes through when reading a lot of code, identifying patterns (e.g. an <a> tag has an href= attribute, or other more complex configurations), and then writing code based on that understanding. If you can prove that is infringing, then you could potentially prove that reading other people's code and writing your own based on what you learned is infringement, even if it doesn't exactly match the code other people have written! I hope this can be effectively explained in court.
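The statistical process described above can be illustrated with a toy bigram model. This is purely pedagogical: real models like Copilot are neural networks conditioning on far longer contexts, not raw bigram counts:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus_tokens):
    # Count how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def complete(counts, prompt_tokens, length=5):
    # Greedily append the statistically most likely next token.
    out = list(prompt_tokens)
    for _ in range(length):
        successors = counts.get(out[-1])
        if not successors:
            break
        out.append(successors.most_common(1)[0][0])
    return out

# Tiny "training set" echoing the <a href=...> example above.
corpus = "<a href= url > text </a> <a href= url2 > more </a>".split()
model = train_bigrams(corpus)
print(complete(model, ["<a"], length=3))
```

Even this toy reproduces its training text verbatim wherever the statistics are sparse, which is essentially the regurgitation behavior at issue in the suit.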


The main sources of discomfort come from when Copilot really does copy-paste code blocks verbatim. It doesn't happen all the time, but there are still many examples like the following: https://twitter.com/mitsuhiko/status/1410886329924194309


I consider it a derivative work: it’s using statistical models to string tokens together and has no ‘knowledge’ of a given block of code.


The issues stem from Copilot frequently regurgitating code verbatim, as seen in examples like this: https://twitter.com/mitsuhiko/status/1410886329924194309


The article is once again mixing up the production of copyrighted work, which is illegal, and training on copyrighted but publicly available work, which AFAIK isn't illegal. And I don't see how it could be illegal when the code is public (although I don't doubt that lawyers will find a way).


The lawsuit is about attributions and licenses, not about copyright infringement.


It's the same issue. It's only valid if the code produced is the same as the code it learned from.


I don’t think this is necessarily true for derived works. It’s a gray area in any case, so the outcome will be interesting.


Even if it reproduces (proportionally) small sections of code it's been trained on verbatim, its function is transformative and will likely be deemed fair use. This lawsuit is going nowhere other than the HN front page.


The actual class-action lawsuit page, curiously not linked from the article: https://githubcopilotlitigation.com/


So really it boils down to attribution, therefore if GitHub were to disclose attribution to all the copyright owners of the code used to train the model, then this issue will be mute.

It will be a long list, but just a list.


It would be very difficult to track what record in the training set contributed to what weight adjustment, especially after all the tokenization that is done.

s/mute/moot/; moot: having little or no practical relevance, typically because the subject is too uncertain to allow a decision.


And there is a list right?

https://github.com/search?q=license%3Acc0-1.0&type=Repositor...

All repos with a particular license.
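For what it's worth, the same license filter is exposed by GitHub's repository search API, so a list like this could be pulled programmatically (a sketch that only builds the query URL; note the search API caps results at 1,000 per query, so a truly complete list would have to be sliced with additional qualifiers):

```python
import urllib.parse

API = "https://api.github.com/search/repositories"

def license_query_url(license_key, page=1, per_page=100):
    """Build a repository-search URL filtered by SPDX license key."""
    params = urllib.parse.urlencode(
        {"q": f"license:{license_key}", "page": page, "per_page": per_page}
    )
    return f"{API}?{params}"

print(license_query_url("cc0-1.0"))
```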


full screen popup/overlay "Sign up for daily email alerts"

This popup appears within a half second of the page loading. It's so abrupt and disruptive to my mind. My eyes have JUST located where to start reading and then BAM THIS GIANT POPUP TAKES OVER THE SCREEN.

I'm exiting your page immediately. I don't care if you have a free recipe for alchemy. I wish I had an easy way to outright block the domain so I could keep track of the offenders.

Popups and overlays have ruined the modern web.


If I create a news website that generates articles by scraping all the major news websites across the world, and monetize it via ads, would that be legal? Not rhetorical.


I don't understand this case. Is Copilot copying code wholesale and presenting it as its own? Because if not, then it doesn't need to attribute anything, any more than I need to attribute John Go or Edward PHP every time I use a trick I picked up by reading their code. Obviously, if they trained Copilot on private code repos, that's a whole other discussion. But I assume they didn't, so you don't even have the argument that they should have paid the repo authors to be able to train on their code.


Yes, Copilot has had issues with copying entire blocks of code verbatim. There's a decent number of examples like the following: https://twitter.com/mitsuhiko/status/1410886329924194309.


yeah that's great but we still unlocked tons of productivity

who cares man

just build stuff

if you don't want people to learn off your code just don't share it!


Interestingly, the same companies who made the paid service by analyzing all of that open-source code would never, ever consider open-sourcing the code for that service.

> if you don't want people to learn off your code

"Learn" is a strange verb to use here. No one at Microsoft or OpenAI was scraping all of GitHub so that they could learn. They took people's licensed works, fed it into a very sophisticated copy-paste machine, and started making money off of it.

> just don't share it!

It's almost like licenses and copyright exist to protect the rights of their holders or something.

The entire point of licenses is to be able to share your work in a way that respects your wishes. "Just don't share it" is completely non-productive.


the cost of training is exponentially decreasing. It's only a matter of time before a codex-like model is released to the open.


There has to be an option, something like a "repository opt-out from Copilot" button.


This needs to be opt-in, not opt-out.


FOSS licenses need to add provisions that only 100% open and free to download AI models can be trained on works licensed under them. They need to add this yesterday.


That's almost certainly not enforceable though. See any of the cases where people have successfully defended web scraping while violating ToS.


I can see that in like GPLv4, they could consider AI training and such as derivative work.


Do you consider WTFPL a FOSS license? The polite versions are MIT-0, 0BSD and CC0. I guess anyone using them is fine with Copilot being closed and doing whatever it wants. But people using AGPL may have a very different opinion.


Doesn't sound very Free to me...


I hope people hoping this lawsuit succeeds will also accept any ruling favorable to Github.


Is encrypting your code and pushing that to GH an option, or am I missing something obvious?


This is going to have ramifications for things like ChatGPT as well.


This may be of interest to others here. After reviewing the existing OSS licenses, I decided to write my own (SAUCR: Source Available Under Commercial Restriction—pronounced "saucer"). I'm still working on formalizing the details of it so others can use it, but if you're curious there's an example here [1].

tl;dr it gives specific permissions as to what derivative works are and are not permitted while making the source available for others. The key being: you can be as permissive or as limited as you want in how your code is used.

[1] https://github.com/cheatcode/joystick/blob/development/LICEN...


This is why we can't have nice things.


> Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?

> Does the attribution need to be included in the data set at every stage?

The above two questions seem identical. My gut feeling is no; you don't need to intentionally train on attributions nor do you need to ensure that the data set columns have attribution data in them when provided to your training code. The attribution requirement of CC-BY and CC-BY-SA triggers when you do any of the things copyright law says you have to get permission in order to do, and the license further restricts that requirement to public instances of such. So privately shoving Creative Commons data into a neural net trainer is probably fine. Not having attribution on the data the model sees does not foreclose the possibility of providing attribution alongside the model at the time of publication.

> Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

This is an active legal question.

My personal opinion is that if you can draw a line from a particular output to something in the training set, then you're either copying or creating a derivative work, and you need to follow any relevant licenses. Generative models are capable of outputting their training set data, especially if overfit; so using them exposes you to the licensing requirements of anything the model saw that matches its output domain.
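One crude way to "draw a line" from an output back to the training set is a longest-shared-substring check (a sketch; the 40-character threshold is an arbitrary assumption, and overlap detection at real training-set scale would use suffix arrays or hashing rather than pairwise comparison):

```python
from difflib import SequenceMatcher

def longest_shared_run(output, training_doc):
    """Length of the longest verbatim character run the two strings share."""
    m = SequenceMatcher(None, output, training_doc, autojunk=False)
    return m.find_longest_match(0, len(output), 0, len(training_doc)).size

def looks_copied(output, corpus, threshold=40):
    """Flag outputs that reproduce a long verbatim run from any training doc."""
    return any(longest_shared_run(output, doc) >= threshold for doc in corpus)
```

Anything such a check flags would then carry the license obligations of the matched source.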

I'm actually considering this as part of PD-Diffusion; which is my attempt at building an art generator trained on public-domain images. Since I'm scraping Wikimedia Commons to get both images and labels, the labels are CC-BY-SA[0]. This means that my trained models will also need to be CC-BY-SA and ship with a very, very long text file listing attributions for all the labels I used[1]. But, notably, because this is an art generator and not a label generator, I don't need to worry about attributing model outputs. None of the label data will make its way into the final image.

If I did have a reliable way to attribute model outputs, then I could make "CC-Diffusion", trained on all CC-BY and CC-BY-SA images on Wikimedia Commons. But even then there's noncopyright ethical concerns with doing that. I'm not even using the full public domain as-is, just the PD-Old category, because Wikimedia Commons has a lot of uncopyrighted Italian images of living people that I do not want in my model.

> Also sued were a confusing mishmash of for profit and non-profit related entities all using a variation of the name OpenAI (OpenAI, Inc., OpenAI, LLC, OpenAI Startup Fund GP I, L.L.C.; you get the picture). OpenAI received one billion dollars in funding from Microsoft although they seem “officially unrelated.”

OpenAI's ownership structure is hilariously convoluted, even by the standards of, say, Mozilla having separate 501(c)(3) and for-profit arms. The goal of the company is to launder noncommercial research into commercial products, and they even have a laughable "capped profit corporation" explanation for this.

[0] There are two exceptions to this:

- Structured data, i.e. the caption field and Wikidata, is considered to be copyright-free and explicitly has a CC0 license applied to it.

- Some public domain images have contradicting copyright terms applied to their wikitext; i.e. CC-BY-NC-SA. Those will need to be detected and filtered out of the label set, but I haven't written the code to do this yet. That's also why I haven't released any trained models.

[1] Currently this is going to be in the form of Wikimedia Commons usernames, specifically all the users that were in the revision history for the images. I believe there are also some wikitext attributions that I need to write code to find and reproduce.
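For what it's worth, the NC-SA detection described in footnote [0] could start out as a simple wikitext filter (a hypothetical sketch, since that code isn't written yet; the template spellings below are assumptions about how the license tags appear in Commons wikitext, and real template names have many more variants):

```python
import re

# License templates that contradict a public-domain claim; a hypothetical,
# incomplete list of noncommercial Creative Commons tags.
FORBIDDEN = re.compile(r"\{\{\s*cc-by-nc(-sa)?[^}]*\}\}", re.IGNORECASE)

def usable_label(wikitext):
    """Drop labels whose wikitext carries a noncommercial license tag."""
    return FORBIDDEN.search(wikitext) is None

pages = {
    "old_painting.jpg": "{{PD-old-100}} A 17th-century portrait.",
    "photo.jpg": "{{cc-by-nc-sa-3.0}} Photo of a living person.",
}
kept = {k: v for k, v in pages.items() if usable_label(v)}
print(sorted(kept))  # only the PD-tagged page survives
```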


> Some open code carries relatively light requirements, for example: “Don’t use my code commercially (don’t sell it or use it in something you sell)” and, very basically.

How can anyone even enforce this? Why can't I take some code, create a SaaS product for drug dealers, and then go sell it to my Opp Daquavion Marshawn III down the block? Who will ever find out?


The reality is that open source license violation is rampant, even among big, recognized names. Enforcement, as you suspect, is sparse: usually an entity has to notice, and then make the effort to react to the situation, which doesn't happen often. There are entities who work specifically on license violations; for example, you can report them on gnu.org[0]. Given that there are large cases like TikTok using OBS code and not contributing back[1], I'm sure there are lots of cases where the community simply never finds out in the first place, similar to how software piracy is rampant in some parts of the world, even among commercial entities.

In case you're interested in more: https://en.wikipedia.org/wiki/Open_source_license_litigation

[0] https://www.gnu.org/philosophy/enforcing-gpl.en.html

[1] https://www.theverge.com/2021/12/20/22847213/tiktoks-live-st...


[flagged]


Thank you.

After pondering for months on this I have come to the same conclusions as you and reading your link made perfect sense.

We are fighting to work harder because we value currency above humanity. What a silly fight. Most jobs can already be done by an AI and we should work towards that, not the other way.

What is the point of free software again?

The tech is out of the bag. The hard part is done, the part where you are learning the unknown is infinitely harder than copying it afterwards. Nobody will ever control it.

Time to set ideas free.


>What is the point of free software again?

The point of the free software movement, and the licenses, is to make sure that in the future, they will still be free software. It's the Paradox of tolerance[0], expressed in legal language, for software source code. Digital goods, while often posing as tangible goods, actually have near zero replication cost, unlike tangible goods. This, combined with the special circumstances provided by free software licenses, enable a special economy, where the barrier to entry is the lowest possible, and the contributions to it are maximized for further enabling the special economy, for all current, and future participants.

[0] https://en.wikipedia.org/wiki/Paradox_of_tolerance



