GitHub is sued, and we may learn something about Creative Commons licensing (sspnet.org)
403 points by doener on Jan 6, 2023 | hide | past | favorite | 451 comments



Excellent. GitHub is, in my opinion, crossing a whole pile of lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built. Copyright is not something that can be signed over by a terms-of-use change at a hosting provider; the expectation is that your host does not automatically claim the rights to anything you store there.

Such projects should always be opt-in, not just because it is the law but also because it is common sense and the right thing to do from an ethical perspective.


>lines here that should not have been crossed without the authors' explicit permission, regardless of the utility of the tool they built.

Fyi... Google Books (scanned and OCR'd books) eventually won against the authors who filed copyright-infringement lawsuits. So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

https://www.google.com/search?q=google+books+%22is+transform...

But courts in Europe may judge things differently.


A number of points are in Google's favor: they are not passing off Google Books content as their own, and they limit your access to a small fraction of the offering.

The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.


> The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.

They did appeal it. SCOTUS declined to hear the case.

https://www.nytimes.com/2016/04/19/technology/google-books-c...


Yes, sorry I could have worded that more precisely.


Google also used all of this to improve their OCR algorithms, almost certainly used in Google Cloud Vision[0], but I doubt this was a consideration when deciding if it was transformative/fair use.

0: https://cloud.google.com/vision


Yet they did not build and market a service to authors that would write novels for them based on their OCR-ed catalog.


> Yet they did not build and market a service to authors that would write novels for them based on their OCR-ed catalog.

I find this to be a very appropriate analogy. If Google had done such a thing, they would be facing the same kinds of lawsuits that Microsoft is facing now. And despite Microsoft's money, I don't see how they can wiggle their way out of this one. They basically ignored the license terms and attribution requirements of the authors. Something Microsoft would never stand for, if "the shoe was on the other foot".


Maybe they will? They still have that data and the kind of people to make such a service.


I guess we'll discuss that if and when they do.


That's not really what Copilot proposes to do either.


Indeed; that would be an excellent topic for litigation, and they would fight it with every lawyer they have, because it could invalidate their efforts to zero out human labor costs in all possible areas.


As well, Google is, because of these things, somewhat acting as a library.

And libraries are very special entities.


Not really. Google won because Google Books was not actually a new concept; someone else had already built a book search engine the same way Google did, also got sued by the Authors Guild, and also prevailed. The only thing different about Google Books was that it'd give you two pages' worth of excerpts from the book. So it was very easy for a court to extend the fair-use logic that they had already woven into the law.

I still think "training is fair use" has a leg to stand on, though. But it doesn't save GitHub Copilot, because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to those outputs (i.e. sublicensing). Fair use is not transitive; if I make 100 Google Books searches to get all the pages out of a book, I don't suddenly own the book. There is no "copyright laundry" here.


> I still think "training is fair use" has a leg to stand on, though

If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

Meanwhile, I think the most reasonable solution is that an AI should always produce content compatible with the licenses of its training material. So if you want to use GPL training sets, you can only use them to create GPL-compatible code. If you use public-domain (or e.g. 0BSD?) training sets, you can produce any code, I guess.


> You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".


It would be an essential feature, imo, to have this 'near-verbatim check' for copyleft code.
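One way to sketch such a near-verbatim check is token n-gram shingling with Jaccard similarity; everything here (the whitespace tokenizer, the 0.6 threshold, the snippets) is an illustrative assumption, not Copilot's actual filter:

```python
# Sketch of a near-verbatim check: compare generated code against a
# corpus of copyleft snippets via token n-gram (shingle) overlap.
# Tokenizer and the 0.6 threshold are illustrative assumptions only.

def shingles(code: str, n: int = 5) -> set:
    """Return the set of n-token shingles found in a piece of code."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_verbatim(generated: str, corpus: list, threshold: float = 0.6) -> bool:
    """Flag output whose overlap with any corpus entry exceeds the threshold."""
    g = shingles(generated)
    return any(jaccard(g, shingles(src)) >= threshold for src in corpus)
```

A real filter would need a language-aware tokenizer and identifier normalization, but the shape of the check would be similar.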

Overall it feels like too much specialized learning on GPL/copyleft code to be fair. It's not like a human who reads some source code and gets an idea of how it works: the model really learns to code from scratch on copyleft code, without which it would likely perform much worse and fail to generate a number of its examples. It's not just copy-paste, but it sits closer on the spectrum to copy-paste than to the kind of super-abstract inspiration that would feel fair.

As others have said, I don't think it would be fine (especially from big companies' point of view) to decompile proprietary code (or just grab publicly available but illegal-to-reproduce code) and have AIs learn from it, in a way that differs in scope and ability from human research and reverse engineering.

I think we need a good tradeoff that isn't Luddism (which would reject a benefit for us all), but that still promotes and maintains open source software. In this case a real public good is being seized and commercialized, and that doesn't seem quite right: make Copilot public, or use only permitted code (or share your revenue with developers -- although that would seem more complicated and up to each copyright holder to re-license for this usage). I remember not long ago MS declaring Open Source was a kind of "cancer"; now they're relying on it to sell their programming AIs. I personally think Open Source is quite the opposite of cancer: it is usually an unmitigated social good.

Much of the same could be said for the case of artists and generative AI art.

And this isn't even starting on how we move forward as a society that has highly automated most jobs and needs to distribute the resources and wealth in a good way to enable greatest wellbeing for all beings.


You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL-licensed as well?

I think, for the desired outcome to occur, you should instead ask:

You write closed-source code, then I make an AI that learns from that code; shouldn't its output be licensed as well?

Ask the above, and suddenly Microsoft will agree.


Depends if you think the GPL means "copyright is great!" vs "let's use their biggest weapon against them..."

It's a surprisingly subtle distinction.

EDIT - if I squint hard enough in exactly the right way, there's a sense in which CoPilot etc aligns perfectly with the goals of the free software movement. A world in which you can use it as a code copyright laundry might be a world where code is actually free.

Is that any weirder than bizarre legal contortions such as the Google/Oracle "9 lines of code"? Or the whole dance around reverse engineering: "It's OK if you never saw the actual code but you're allowed to read comprehensive notes from someone who did"..?

There's a ton of examples like this. Tell me with a straight face that there's a clear moral line in either copyright or patent law as it relates to software.

IP is a mess and it's not clear who benefits. Is a world where code isn't subject to copyright so bad?


If Copilot was released as FOSS with trained model weights, I don't think the Free Software movement would have "shot first" in the resulting copyright fight.

It is specifically the idea of using copyright to eat itself that is harmed by AI training. In the world we currently live in, only source code can be trained on. If I want to train an AI on, say, the NT kernel, I have to decompile it first, and even then it's not going to be good training data because there are no comments or variable names to guide the AI. The whole point of the GPL was to force other companies not to lock down programs and withhold source code, after all.

Keep in mind too that AI is basically proprietary software's final form. Not even the creator of an AI program has anything that resembles "source code"; and a good chunk of AI safety research boils down to "here's a program you can't comprehend except through gradient descent, how do we design it to have an incentive to not do bad things".

If you like copyright licensing and just view the GPL as an exception sales vehicle, then AI is less of a threat, because it's just another thing to sell licenses for.


> You write closed-source code, then I make an AI that learns from that code; shouldn't its output be licensed as well?

> Ask the above, and suddenly Microsoft will agree.

Does Microsoft actually agree? Many people have posted leaked/stolen Microsoft code (such as Windows, MS-DOS 6) to GitHub. Microsoft doesn't seem to make a very serious effort to stop it – sometimes they DMCA repos hosting it, but others have stayed up for ages. They could easily build some system to automatically detect and takedown leaks of their own code, but they haven't. Given this reality, if they trained GitHub Copilot on all public GitHub repos, it seems likely that its training included leaked Microsoft source code. If true, that means Microsoft doesn't actually have a problem with people using the outputs of an AI trained on their own closed source code.


> If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

Is that new? If I include some excerpt from copyrighted material in my own work and it's deemed to be fair use, that doesn't limit my right to profit from the work, sell the copyright to someone else, and so on, does it?


If open source code authors (and other content creators) don't want their IP to be used in AI training data sets then they can simply change the license terms to prohibit that use. And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place. Of course Microsoft is going to look for ways to monetize that data.


> they can simply change the license terms to prohibit that use

GitHub's argument* is not that they're following the license but that the license does not apply to their use. So they would continue to ignore any provision that says they can't use the material for training.

Previously discussed: https://news.ycombinator.com/item?id=27740001

Moving off GitHub is a better step at a practical level. But again they claim the license doesn't matter, so even if it's hosted publicly elsewhere they would (presumably) maintain that they can still scoop it up. It just becomes more work, for them, to do so.

*Which is completely wrong in my opinion, for the record


And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place

No. It's called copyright, it is enforceable, and that's the control.

GPL source code is available everywhere, in all formats, in textbooks, on CDs, on websites, but it is still gpl.

And Microsoft doesn't get to scrub the license.


> But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing).

But if you read the source code of 100 different projects to learn how they worked and then someone hired you to write a program that uses this knowledge, that should be legit. I'm not sure if the law currently makes a distinction between learning vs. remixing, and if Copilot would qualify as learning.


That's not necessarily true at all. There are even techniques designed to demonstrably avoid such knowledge contamination.

https://en.m.wikipedia.org/wiki/Clean_room_design


That kind of legal ass-covering is expedient when you are going to explicitly reproduce someone else's source-available work. It's cheaper in that case to go through the whole clean-room hassle than to risk an intractable argument in court about how your code, which does exactly the same thing as someone else's code, came to resemble it so much.

But, for the general case, the argument still stands. I have looked at GPL code before. I might have even learned something from it. Is my brain infected? Am I required by law to license everything I ever make as GPL for the remainder of my days?


Yes, it will sometimes depend on the unique qualities of the code. For instance, if you learned a new sorting algorithm from a C repo, and then wrote a comparable imperative solution in OCaml, that might be a derivative work. But if you wrote a purely functional equivalent of that algorithm, I don't think that could be considered a derivative work.


Precisely, that is the key point.


And Google kept the copyright notices and attributions; probably not super relevant, but it's a difference between the two cases.

I mean, in essence GitHub is a library; they did have a license, up to a point, to do with the code as they pleased, but they then started to create a derivative work in the form of an AI, without correctly crediting the source materials.

I think they made a gamble on it; as far as I'm aware, AI training sets were as yet unchallenged in a court of law, so not fully legally defined. These lawsuits - and the ones (if any) aimed at the image generators, using CC artwork from e.g. ArtStation - will lay the legal groundwork for future AI/ML development.


Libraries are really not very special. They mostly exist on the basis of first-sale doctrine and have to subscribe to electronic services like everyone else.

Entities like the Internet Archive skate by (at least before their book lending stunt during COVID) by being non-profit and bending over backwards to respect even retrospective robots.txt instructions, meaning that it's not really worth suing them given they'll mostly do what you ask anyway.

But I guarantee you that if I set up a best comic strips of all time library I'll probably be in court.


What if a library builds AI models from the content they host? The Internet Archive, for example.


There are two important differences.

Google Books retains the bibliographical information so you can properly cite the authors or contact them for permission to use their material.

And Google Books does not automatically write new books for you that you can then send off to Penguin Books or self-publish on Amazon.


Copilot also isn't retaining the actual content of the source code repositories and then deriving works from that. If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source code, and sold that to people as a "GitHub code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work. And I'm not sure that the fact that a program can then take that associative data and generate new code suddenly makes it not OK.
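A toy version of that hypothetical "table of token frequencies": a bigram table plus a sampler that generates new sequences from it. All names and training strings below are invented for illustration:

```python
from collections import defaultdict
import random

def build_table(sources):
    """Count, for each token, how often each following token appears."""
    table = defaultdict(lambda: defaultdict(int))
    for src in sources:
        tokens = src.split()
        for a, b in zip(tokens, tokens[1:]):
            table[a][b] += 1
    return table

def generate(table, start, length=8, seed=0):
    """Sample a token sequence by repeatedly following the frequency table."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = table.get(out[-1])
        if not nxt:
            break  # dead end: token never appeared with a successor
        tokens, weights = zip(*nxt.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return " ".join(out)
```

The table itself stores only aggregate counts, not the sources; whether output sampled from it is a derivative work is exactly the open question.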


> If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source, and sold that to people as a "github code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work.

That sounds to me somewhat close to "if I take an FFT of each of those copyrighted images, glue them together, and sell this as a picture, is that a derivative work?" - I'd say yes, or perhaps even a different encoding of the original work, since you can reverse the frequency domain representation and get the original spatial representation - the original images - back.
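The reversibility point is easy to demonstrate with a tiny numpy sketch (the random array stands in for pixel data):

```python
import numpy as np

# A 2-D FFT is lossless and invertible: the frequency-domain
# representation carries exactly the information of the original image.
rng = np.random.default_rng(0)
image = rng.random((8, 8))               # stand-in for pixel data

spectrum = np.fft.fft2(image)            # "encode" into the frequency domain
recovered = np.fft.ifft2(spectrum).real  # invert back to the spatial domain

assert np.allclose(image, recovered)     # the original comes back exactly
```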


Sure, ROT13 encoding is a derivative work because the entire original work is still there, encoded. Ditto for FFT. Large language models are not that.

Sometimes parts of the original works are still encoded, which we've seen when some code is reproduced verbatim. I'm sure that happens to people as well, i.e., they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.

Once they iron out those wrinkles, it's not clear to me that a large language model is a directly reversible function of the original works. At least, not any more than a human learning from reading a bunch of code and then going on to have a career selling his skills at writing code.

Edit: by which I mean, LLMs are lossy encodings, not lossless encodings.


> Ditto for FFT. Large language models are not that.

They're not, but the "giant table of token frequencies and associative keywords" reminded me of doing FFT on images, and I wanted to communicate the idea that transformations like this can actually retain the original information, and reproduce it back through inverse transform.

> by which I mean, LLMs are lossy encodings, not lossless encodings

Exactly. And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. As you move away from lossless, it's not obvious when, or if at all, the result becomes clear of the original author's copyright. Compare e.g. with JPEG, which employs a less sophisticated lossy encoding: no matter how hard you compress a source image, the result would still likely retain the copyright of the source image's author, as provenance matters.

(IANAL, though.)


> And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. [...] Compare e.g. with JPEG

I'll just finally note that LLMs are not lossy encodings in the same sense as JPEG. LLMs are closer to human-like learning, where learning from data enables us to create entirely new expressions of the same concepts contained in that data, rather than acting as pure functions of the source data. That's why this will be interesting to see play out in the courts.


My belief is there is no fundamental difference here. That is, learning is a form of compression, and learning concepts is just a more complex way of achieving much greater (if lossy) compression. If the courts see it the same way, things will get truly interesting.


Yes learning concepts is a form of compression, but I'm not sure that implies there's no "fundamental" difference. I see it as akin to a programming language having only first-order functions vs. having higher-order functions. Higher-order functions give you more expressive power but not any more computational power.

You could say a higher order program can "just" be transformed into a first-order program via defunctionalization, but I think the expressive difference is in and of itself meaningful. I hope the courts can tease that out in the end, and we'll see if LLMs cross that line, or if we need something even more general to qualify.


> I see it as akin to a programming language having only first-order functions vs. having higher-order functions.

Interesting analogy, and I think there are a couple different "levels" of looking at it. E.g. fundamentally, they're the same thing under Turing equivalence, and in practice one can be transformed into the other - but then, I agree there is a meaningful difference for humans having to read or think in those languages. Additionally, if those are typical programming languages, you can't really have the code in the "weaker" language self-upgrade to the point the upgraded language has the same expressive power as the "stronger" one. If the "weaker" one is Lisp though, you can lift it like this.

In this sense I see traditional compression algorithms - like the ones we use for archiving, images and sound - to be like those typical weaker languages. There's a fixed set of features they exploit in their compression. But human learning vs. neural network models (or sophisticated enough non-DNN ML) is to me like Lisp vs. that stronger programming language, or even Lisp vs. a better Lisp - both can arbitrarily raise their conceptual levels as needed. But it's still fundamentally compression / programming Turing machines.


> that happens to people as well, ie. they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.

And if such an algorithm is copyrighted, that would be infringing! It doesn't matter if you copy on purpose or by chance.


You can't copyright algorithms, you can only copyright specific expressions of code.


What do you mean by "glue them together"?

If you overlap a hundred different FFTs, then the result is likely fine copyright-wise.

These networks are not [supposed to] contain much of the original data. Like the trivia point that Stable Diffusion has less than two bytes per source image, on average.


> What do you mean by "glue them together"?

Stitch them side by side. Yes, this is not how those DNNs work, but the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it.

> Stable Diffusion has less than two bytes per source image, on average.

I'm not convinced by this trivia point, though. Stable Diffusion is, effectively, a lossy compression of the training data. Nothing says lossy compression algorithms can't exploit some higher-level conceptual structures in the inputs[0], and applying lossy compression to some work doesn't automatically erase the copyrights of the original input's author.

--

[0] - SD isn't compressing arbitrary byte sequences, it's compressing images - which is a small subset of all possible byte sequences as large as the largest image used in training. "Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.


> the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it

That depends entirely on how many frequencies you're keeping.

> high-level patterns in such data

High level patterns across thousands of images are generally not copyrightable.

I might even describe the purpose of stable diffusion as extracting just the patterns and zero specifics.


>"Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.

Two bytes would only let you uniquely identify ~65k images though, which to me doesn't sound plausible for a lossy compressor.
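The back-of-the-envelope arithmetic behind both figures, using rough public numbers (~4 GB of Stable Diffusion v1 weights, ~2.3 billion LAION training images) as assumptions:

```python
# Rough arithmetic behind "less than two bytes per source image".
# Assumed figures: ~4 GB of Stable Diffusion v1 weights and ~2.3
# billion LAION training images (both approximate public numbers).
model_bytes = 4 * 10**9
training_images = 2.3 * 10**9

bytes_per_image = model_bytes / training_images
print(round(bytes_per_image, 2))  # -> 1.74, well under two bytes

# Two bytes (16 bits) can distinguish at most 2**16 distinct items,
# far fewer than billions of images - so the model cannot even be
# storing a per-image index, let alone the images themselves.
print(2**16)  # -> 65536
```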


> since you can reverse the frequency domain representation and get the original spatial representation - the original images - back

I'd have thought that's exactly what you can't do with CoPilot.


Yeah, I guess the court will be the real test of what is allowed here.


> So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.

Curiously, from the article, copyright infringement is not alleged:

> As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement.

Perhaps the plaintiffs are trying to avoid exactly this prior law?


GitHub is actively driving a product, as opposed to merely duplicating content, and those products may go on to generate income.


If this lawsuit succeeds, I have a startup idea that I think would be effective.

Create a for-profit copyright registry for code snippets that are long enough to qualify for copyright protection. You can be the canonical owner of the copyright for a given piece of code! For a premium fee, we can generate and submit a patent on your behalf as well.

Once I have a large corpus (perhaps millions of entries of code, most one or two lines long), I can automatically scan new repositories and send cease-and-desist letters for violating my clients' copyright. Even if a piece of code is very common, that doesn't mean it's unoriginal; it just means that there are many people violating its copyright, after all. According to the logic of the folks in this thread, at least.


Do note that there’s the concept of https://en.wikipedia.org/wiki/Threshold_of_originality, which may be substantial for mere code snippets.

This may be one of the reasons why the lawsuit isn’t based on copyright.


This is the code of the future to ensure your code remains original: https://twitter.com/TylerGlaiel/status/1611115741809627139


Copyright grants an author of a copyrightable work the exclusive right to make more copies of it. However, if people independently come up with the same exact thing, copying has not occurred and that exclusive right was not violated (and then the court battle effectively becomes one about proving whether copying did in fact occur).

In copyright law there is no such concept as "code snippets that are long enough to qualify for copyright protection" or "canonical owners". Quite explicitly, copyright does not give a monopoly over an idea, but merely protects against the unlawful reproduction of an original work.

If you take some snippet from a work in which you own copyright and find that in the world multiple people have somehow managed to write the exact snippet, but they did it independently without copying it from you, then copyright law effectively states the following things:

1) They definitely aren't violating your copyright, and you have no claim on them whatsoever - independent creation is a complete defense to copyright infringement;

2) Perhaps this snippet might be judged uncopyrightable, as the existence of multiple independent recreations is some evidence that it lacks originality and thus would not qualify for copyright protection at all.


Length does come into play. Copyright is about creativity: if there is no creativity, there is no copyright, and a code snippet that is too short probably doesn't have enough creativity to be copyrightable. If you wrote some code and then registered it so it has a date, it will be difficult for someone who wrote the exact same code later to prove it wasn't copied.


You don't need to prove anything in court to send someone a cease and desist. It's often cheaper to settle.

There does exist the concept of 'originality' in copyright law, which I was erroneously conflating with length.


If the code snippets are so "obvious" that many people solve the problem exactly the same way, you're going to have a lot of trouble asserting a copyright or patent over them.

But your idea is pretty much what almost all manufacturers do, and have been doing for decades.


Automated code scanning working on similar principles is already in use in many large tech companies, but in reverse. Basically, to prevent shipping improperly licensed code, missing attribution notices etc.


Oh, the internet!

Just yesterday, I bought a nice domain name for an idea that's very close to what you mention, monetize on snippets of code.

If you want to team up, hit me up!


That sounds like leftpad but with more steps.


I just hope it doesn't end in Microsoft paying some (from their perspective) small fine that is just the cost of doing business.


The range of possible outcomes is enormous, I'll just wait by the sidelines but cherish the thought that moving out of GitHub when Microsoft bought it was the right decision. They can't be trusted, this has been proven over and over again and yet people keep falling for it. It's the fox guarding the chickens. I wrote about my misgivings at the time:

https://jacquesmattheij.com/what-is-wrong-with-microsoft-buy...


> The range of possible outcomes is enormous

The most likely of which – if this lawsuit ends up winning – is that corporations will have new ways to sue everyone and that the world will be a worse place.

Copyright expansion has never benefited the "little guy" such as Open Source authors, only large entities with deep pockets who can litigate to no end.


It's not so much copyright expansion as it is copyright re-affirmation. In this case it is especially open source authors whose rights are in play. Keep in mind that all of open source relies on copyright; without that, everything is PD from Microsoft's point of view, if they hosted it. Think of Copilot as a trial balloon: if they get away with it, they will likely use that as a stepping stone to the next level, and bit by bit your rights are salamied out of existence. This is the first slice; it should stop right here.


I don't see any rights being taken away from me. CoPilot doesn't copy my code, it just learns from it, just as you can. "Others can learn from this" is one reason I release stuff as open source in the first place.

People will quote that John Carmack Doom example where it copies the function verbatim, but as far as I can tell that's a rare thing, and it's a function that's been widely copied around without proper licensing; a human could also get it wrong by copying it from github.com/random-person/mit-project with the wrong license (and since then there's also been work to prevent this kind of thing).

Co-pilot isn't unique, or the first AI/ML project to use copyrighted works; all the GPT models use copyrighted works as their input. Some doubts have been raised over the legality of that too, but it's received nowhere near the amount of criticism that Co-Pilot has, certainly not on HN, and I've never seen anyone doubt the morality of it – only the legality.

If you were to go through my public open source code I'm sure you can find stuff that's very similar to some code from my previous employers or other open source projects. Not because I copy/pasted anything, but because my brain was trained on that dataset: you see or write something that works, you face a similar problem a few years later, you write a similar solution.

"Using existing works as input" is common throughout creative works. As Phil Anselmo once said: "with Pantera we took our five favourite bands and ripped 'em off to hell".

People are already getting sued because "that one melody sounds a bit similar to this other melody"; fair use is already widely ignored/disrespected. Much will depend on the exact details, but any win in this lawsuit has a very real chance of empowering that sort of nonsense.


Agreed. It's quite bemusing how so many people are now copyright maximalists because they somehow think Microsoft will be hurt in such a world.


One way to address this is to somehow turn "little guy" into a "giant". For labor, this was accomplished via unions.

A union of creative minds seems long overdue. It can provide a copyright trust, addressing your concerns, as well as removing the oft-used excuse that it is impossible to get permission from n thousand creators. It can also address matters beyond OSS, such as overreaching employment agreement clauses that assert ownership of everything in your head.

[Possibly 'trust' is more suitable than 'union'. Something like Creative Commons Trust.]


Then you consider that there are many countries in the world and the whole system breaks down. Sorry, we don't allow this copyrighted code in your country!


> moving out of GitHub when Microsoft bought it was the right decision

What do you use instead?

The top alternatives in my opinion are:

- SourceHut https://sr.ht/

- Codeberg https://codeberg.org/

- Self-hosted using Forgejo https://forgejo.org/ (fork of Gitea)

I was self-hosting my code with Gitea for a while but currently I’m using GitHub. Planning on setting up a Forgejo instance in the coming weeks.

At work we use GitLab, but personally it is one of my least favourite platforms, so I am excluding GitLab from the list above.
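For anyone weighing the self-hosted Forgejo option mentioned above, a minimal Docker Compose sketch looks something like the following. The image name, tag, ports, and volume layout are assumptions based on common setups; check the official Forgejo documentation for the currently recommended configuration.

```yaml
# docker-compose.yml -- minimal Forgejo instance (illustrative sketch)
services:
  forgejo:
    image: codeberg.org/forgejo/forgejo:9   # pin to a real release tag
    restart: unless-stopped
    environment:
      - USER_UID=1000
      - USER_GID=1000
    volumes:
      - ./forgejo-data:/data                # repos, config, and DB live here
    ports:
      - "3000:3000"                         # web UI
      - "2222:22"                           # SSH for git push/pull
```

After `docker compose up -d`, the web installer is served on port 3000 and git-over-SSH on the remapped port 2222.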


> - Self-hosted using Forgejo https://forgejo.org/ (fork of Gitlab)

Forgejo is a fork of Gitea, not Gitlab.


Sorry, that was an autocorrect mistake. I meant to write fork of Gitea. Edited it now.


Depending on how big an install you're running and exactly what feature set you need, if you're planning on self-hosting: I've been happily using Gitea for years.


I agree, I've yet to see any sufficiently large organisation successfully change their DNA. Gates hired people who thought like him, those people hired people who thought like them, and so on. The culture can and does change, but it takes an enormous amount of energy and time to change the course of such a large ship, if it's even possible at all. And I don't really see them trying - I've never even seen them apologise for the stunts they pulled with Netscape and all that. They put on a new coat of paint, hoping people forget. But I suppose they can't help themselves.


> moving out of GitHub when Microsoft bought it was the right decision

But what prevents Microsoft from harvesting open-source code from any hosting site? What have you gained?


[flagged]


This is so irrelevant it's annoying. None of those other companies sell the data after stripping the copyrights. Stop comparing them.


Why make the same comment twice in one thread?


While the fine is a cost of doing business, if they don't change behavior they can be sued again, and courts tend to impose very large fines if they discover you were already fined for this and didn't change afterwards.


The "fine" is 9 billion dollars.


Against a company which makes 6-7x that in yearly profits, that's still not an effective deterrent.


1/6th or 1/7th of profits is significant. 9 million dollars could be ignored by a company making 54,000 or 63,000 million dollars, but 9,000 million is a lot.


Exactly. That's why you're arguing for fines against people to be significantly more than 1/6th of their net pay, since harsher punishments are effective deterrents?

Parking tickets should start at 50% of your yearly take home income. Didn't feed the meter an extra quarter? $10k minimum sounds fair.


> fines against people to be significantly more than 1/6th of their net pay

Pay is not profits, it's revenue.


Look up the word net


Look up the word gross


Companies aren’t people (even if they are legally defined as such) so you can’t treat the way you fine them equally.


Stating something isn't proving it (even if you learned it as such), so you can't just state something as fact without proving it.


What?


But why stop there? What's the difference between Microsoft, Google, Meta, and OpenAI in this regard?

All of those build their models based on the same sources and it's therefore a much more general issue than just one particular company being sued.


That's a good point, but the subject of the thread is Microsoft. I'm pretty sure that Google will happily train their models on the contents of your Gmail account, I wouldn't trust Facebook with my birthdate and OpenAI is likely doing the exact same thing.

But that doesn't make it right in this case and, conveniently, someone has decided to bring suit. The funny thing is that Microsoft depends on Copyright law for their existence and now they want to change the rules to favor them when it suits them. In fact one of the first things that Bill Gates ever did that I remember is bitch about people copying the software that he wrote.


Google and Meta aren't redistributing your code claiming that it's theirs and that you can re-license freely.


Only because they arrived later at the party.


Hm, I don't know about that. I think they do plenty of stuff with other data that crosses that same line, but they have not crossed it with (explicitly copyrighted) code. Until they do, I don't think we should argue that they would have done it had they been there first; they had ample opportunity.


I mean, yeah.


If the fine is higher than what they can feasibly gain from the product in a reasonable time, it'd make them drop the product entirely even if they are able to adjust it to conform to the laws.


> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider,

I mean, it's obvious that uploading code requires you to grant the hosting provider a license to host it (which is not signing over copyright); although feel free to argue that the license doesn't or shouldn't extend to CoPilot usage.


With Creative Commons and the GPL it is fairly common for a work to include multiple authors and rights holders. When a single user uploads such a work to a hosting provider, the permission given to the provider is limited to the permission that the user had. They can't give out permissions that they themselves do not have.

It is a similar case when a single user uploads a movie or game to a pirate torrent site. The site can have terms of use that give a license to the hosting provider, but naturally the users who upload the content might not have the permission to grant anything to the hosting provider. Depending on how aware the hosting provider is or should be, hosting the content can still be illegal.


> They can't give out permissions that they themselves do not have.

Then, chances are, it's technically illegal to upload those other contributors' code, although if that code was contributed via GitHub itself then the code in the pull request has already been licensed to GH.

It boils down to copyright/DMCA not requiring that hosting providers verify that people actually hold the rights to the code they submit, so GitHub now has tons of examples where people themselves lied about the permission when they uploaded code that wasn't theirs, and this will probably be a valid legal defense, at least for the argument of "does GitHub have the right to use the source in their ML model" (it might really boil down to "are GH's terms vague enough that nobody thought the license included the ability to train artificial intelligence").


"The person who uploaded the code lied about their permissions" won't be a valid defense in a copyright lawsuit by the actual copyright owner, at least in the case where there is no other copy of that code also on GitHub that was uploaded by the copyright holder.

In the US what it will be is good evidence to support a claim by GitHub that they were an "innocent infringer"--someone who did not know they were infringing and had no reason to believe that they were.

What that does, in the case where the plaintiff seeks statutory damages (which they almost certainly will¹), is lower the lower limit. Statutory damages are normally $750 to $30,000 (amount determined by the court). If a defendant proves they are an innocent infringer, that lower limit drops to $200. If the plaintiff can prove that the infringement was "willful", the upper limit goes up to $150,000.

Statutory damages are per work infringed, not per infringement, so we aren't talking $200 or so multiplied by the number of copies GitHub distributed. We are talking of a likely award of $200 or so total (plus maybe attorney fees).

¹It is usually way too hard to determine actual monetary damages in cases like this, and actual damages are likely to be quite low anyway, so plaintiffs almost certainly will go for statutory damages.
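The per-work point lends itself to a toy calculation. The dollar figures below are the statutory ranges described above; the copy count is invented purely for illustration:

```python
# Toy arithmetic illustrating "statutory damages are per work, not per copy".
# Dollar ranges are the US statutory ones discussed above (17 U.S.C. 504(c));
# the work/copy counts below are made up for illustration only.
INNOCENT_MIN = 200               # floor if the infringer proves innocence
NORMAL_MIN, NORMAL_MAX = 750, 30_000
WILLFUL_MAX = 150_000

works_infringed = 1              # one copyrighted work in the training set
copies_distributed = 1_000_000   # irrelevant to the statutory calculation

per_copy_misreading = INNOCENT_MIN * copies_distributed  # NOT how it works
likely_award_floor = INNOCENT_MIN * works_infringed      # 200
```

The intuitive (wrong) per-copy reading would yield $200,000,000; the per-work rule yields $200.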


"someone who did not know they were infringing and had no reason to believe that they were."

Can this be said by Microsoft? They explicitly chose not to include private repositories belonging to their paid customers, likely because they knew that those customers would sue them if proprietary code was used as training data.

Apple seems to have chosen not to allow GPL software in the App Store for very similar reasons. Their terms of service require a permission which is incompatible with the terms of the GPL, and knowing that GPL software tends to include multiple rights owners, Apple chose the route of not allowing it.

And last, authors have requested to have their works removed from the training data. It is part of the lawsuit. Can Microsoft then still claim that they did not know they were infringing?


The comment I was responding to was about the case where person X uploads code to GitHub, and that code contains code from person Y whose license to X does not give X permission to grant GitHub the rights that GitHub requires from the uploader, and so GitHub's use of Y's code is without copyright permission.

I believe GitHub would likely be seen as an innocent infringer in that case.


Would that still be the case if Microsoft knew that such infringement was likely to occur? Microsoft has been in the software industry for 50 years, has, like Apple, an app store, and has distributed software from millions of different rights owners. Can they in good faith argue that they had no idea that software often has multiple rights owners, and thus that a single person who uploads software to GitHub is unlikely to have sole copyright ownership?

I doubt Microsoft would make that argument. It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, they seem to show that they themselves have doubts about the legal status of using other people's copyrighted work for Copilot.


> It is more likely they will argue fair use, but by not using closed repositories owned by paying customers, it seems to show that they themselves have doubt about the legal status of using other peoples copyrighted work for copilot.

Or they're worried about leaking secrets, which is a different matter entirely. The amount of copying needed to leak secrets is far lower than the amount needed to commit copyright infringement.

If Copilot is trained on Microsoft's code and accidentally regurgitates a comment, "// for 2024 Xbox", it has done one but not the other.


When Copilot was released there were people who got it to print out accounts and passwords that had been put into the training data. Microsoft should at minimum have sanitized the training data so it would not include such information. There is also likely personal information stored in some of those open repositories.

Copyright infringement doesn't have a fixed size. It depends on context and what kind of information is copied. This demonstrates that Copilot has not actually learned how to code (as many people like to claim), but is simply an algorithm for copying code. If it had learned to code like a human it wouldn't divulge secrets.
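The sanitization step the comment argues for is a well-understood preprocessing pass: scan source files for obvious secrets before they enter a training corpus. A minimal sketch (the regex patterns here are illustrative only; real scanners such as gitleaks or trufflehog use far larger rule sets):

```python
# Sketch of pre-training sanitization: redact lines that look like secrets.
# Patterns are illustrative, not exhaustive.
import re

SECRET_PATTERNS = [
    # assignments like: password = "hunter2", api_key: 'abc123'
    re.compile(r'(?i)(password|passwd|secret|api[_-]?key)\s*[:=]\s*["\'][^"\']+["\']'),
    # strings shaped like AWS access key IDs
    re.compile(r'AKIA[0-9A-Z]{16}'),
    # PEM private key headers
    re.compile(r'-----BEGIN (?:RSA |EC )?PRIVATE KEY-----'),
]

def scrub(source: str) -> str:
    """Replace any line matching a secret pattern with a redaction marker."""
    out = []
    for line in source.splitlines():
        if any(p.search(line) for p in SECRET_PATTERNS):
            out.append("# [REDACTED BEFORE TRAINING]")
        else:
            out.append(line)
    return "\n".join(out)
```

For example, `scrub('password = "hunter2"')` yields the redaction marker while ordinary code passes through unchanged.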


You're right, but GitHub's TOS doesn't (or at least shouldn't) change the conditions of the original license. You're giving GitHub a copy of the source code, not the ability to dictate your license for you. There's certainly a lot of legal ambiguity in the copyright sense, but one thing seems clear: Microsoft trained Copilot on code they weren't certain they could use.


Technically the GitHub TOS is in itself a license; much like how you can dual-license code, uploading to GitHub is its own license grant, separate from the license you're granting to anyone else who wants to use the code for their own purposes. LICENSE.txt/md is not the only way to grant access to code you write.


GitHub's TOS is subject to change so that doesn't hold water. Tomorrow they could claim in their TOS you owe them your firstborn if you upload code to GitHub but that doesn't mean that you are bound by those terms because they cross the reasonable expectation of what you are signing up for. Granting Microsoft a blanket license to use your code in any way they see fit was not a part of the deal for GitHub, and as far as I know it still isn't for code that you claim copyright on. If you release your code into the public domain or use a license that is so permissive that anybody can use it at will, even without attribution that would make it fair game.


Well, that's why major TOS changes are accompanied by the option to discontinue using that service. Usually they say that continuing to use the service after a certain date constitutes your agreement to the new terms.

I think we're over here in our armchairs weirdly assuming that GitHub doesn't have any lawyers working for them. I think they know they're legally in the clear on CoPilot.

I'm not at all a lawyer, but in my opinion we observe that the non-automated version of AI-generated works (the act of making art and prose in the style of an existing copyright work based on the artist's observation of that work) is not illegal. The only thing that AI introduces is automation.


I see Copilot as a trial balloon. If they get away with it, you can expect the next move: appropriating the body of open source that is GitHub. Why the archenemy of open source should suddenly be trusted to play nice is something I really can't grasp.


It's not sudden - they've owned it for a while now.

What I can't understand is people feel locked into Github because of the social features. To me they seem the least important part of Github, particularly with so many OSS projects running communities on Discord or Slack.


> The only thing that AI introduces is automation.

Hmm...Automated Inference? Automatic Infringement? Maybe we can make a nice backronym out of this.


GitHub is still subject to the terms of your license though; they can impose whatever rules they want on you service-wise, but their use of your software should be dictated by the accompanying LICENSE file.

To illustrate: GitHub could delete any project they want, and there would be no real recourse for the project's author. That is a service decision that they reserve the right to impose via their TOS. However, if they were to steal code from a user's private repository and violate the license therein, the author could sue for theft of intellectual property.


> GitHub is still subject to the terms of your license though; they can impose whatever rules they want on you service-wise, but their use of your software should be dictated by the accompanying LICENSE file.

Again, the LICENSE file in the repo is not the only license for that code. A copyright holder can grant people licenses to their work with or without documentation and with or without that license being accompanied within their work itself.

By uploading code to GitHub, you are asserting that you can legally grant GitHub a license to that code for hosting as described below.

> If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post.

Note that this is literally only limited to the provisions set below; uploading to GH doesn't allow them to import or use your code in Windows or the Github codebase or anything like that, doing so would indeed be bound by the license terms you've granted the world via the repo's LICENSE file.

> 4. License Grant to Us We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


The TOS do say users grant GitHub a license to host, copy, and distribute works as comes up while they're providing the GitHub service. So saving copies on servers, making backups, and "distributing" it via their website.

IANAL, but I think Copilot is not a reasonable thing to include in these services.


I’m skeptical of this interpretation since it seems to imply that you could upload copyrighted code and now GitHub has a license to do whatever they want with the code, which is obviously not true. An example would be someone uploading Microsoft Windows source code illegally, and GitHub can’t just use it because it was uploaded to their service. I would argue that this then extends to CoPilot, in that just because they have a license to host it, they don’t have a license to do whatever they want with it.


The license is quite limited, but does include "improving the service over time" which might be their key to CoPilot being okayed by their legal team, at least originally:

> 4. License Grant to Us We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


If that is in fact how CoPilot got the green light from their legal team, and what the case will eventually hinge upon, I really wonder if the argument that CoPilot is part of the "service" will hold up. I can imagine a judge or jury not being convinced here, because the majority of the paragraph is clearly about the general use of the service (parsing it into a search index so you can search in your repo; making backups so that service isn't disrupted in the case of some server failure; sharing it with others so that others can access the content you uploaded). In other words, if this is what they argue makes CoPilot okay, I can imagine the plaintiff's lawyers successfully arguing against it: CoPilot isn't really part of the normal service the way e.g. the repos are, and the claim that it falls under the "otherwise analyze" clause is flimsy, since it's not clear how "analyze" is defined and it's arguable that feeding code into an AI training model is not the same as, or similar to, indexing it for search.

I suspect that the main argument will hinge not on the permission, though, but rather on whether the use of copyrighted code in an AI model is transformative enough to fall under fair use. Obviously it's to be decided, but I would imagine that because it wasn't a human transforming the code and/or hand-selecting the code to put into the AI model, it won't be considered transformative and therefore the use of the code doesn't fall under fair use.

I'm very curious how this case will play out.


That requires the uploader to own copyright on that code. What if the uploader only has access to the code through the license?


Luckily by agreeing to the TOS you've indemnified Github against the consequences of that scenario.


I don’t think it works that way.


Copyright is pretty complex legal material but the one thing that stands out for me is that you receive it upon creation and it requires a positive act on your part to relinquish it.


There's also code that GitHub themselves uploaded, which they were permitted to do under the open source licenses, under the `mirrors` user. I know some of these repos have since been moved as the authors became active on GitHub (e.g. mirrors/linux is now torvalds/linux, indicating Linus has control of it even if it's a read-only mirror), but I'm sure there's a few of them remaining.


The legal argument made in this court case is that there is a substantial difference: redistributing the source as-is (keeping all the attached copyright, attribution and license notifications) is explicitly permitted by every open source license, but in the CoPilot usage the attribution gets removed, which is something even the repository owner (assuming they're not the sole author/copyright holder) does not have the right to do themselves, much less grant permission to others to do.


It doesn't actually. You can upload anything to github. The website doesn't stop you at all, even if you don't have the ability to grant anything to Github at all.


That's not obvious, because you don't necessarily own the code you're uploading. I can upload any sort of MIT-licensed, BSD-licensed, Apache-licensed, Creative-Commons-licensed, or GNU-copylefted works I want, anywhere within reason and compatible with those licenses, but if I didn't write them then I don't have the legal right to relicense, grant exclusive or restricted license to any specified parties.

So in a way this would void parts of many TOS agreements where you do relicense your User-Generated Content. If we're uploading memes to Facebook, they're gonna have to work out license terms with the copyright holders, not the uploaders.


Uploading someone else's code without permissions is, in itself, copyright infringement. Just like you can't take someone else's code and license it to GitHub without the copyright holder's permission, you can't take images off of someone's website and sell/license them to Getty Images for profit.


I'm not sure that's the case if the repo is private and the code is not shared/disseminated to others. If you buy a book, you are not allowed to distribute copies of it, but you can loan it or sell your copy, or store it wherever you like. Copilot does allegedly distribute copies of code that it doesn't have the copyright to.


> Uploading someone else's code without permissions is, in itself, copyright infringement

Suppose person A committed a crime; that does not mean you are now allowed to profit from someone else's crime.


But imagine Getty Images sells the stolen photo 10,000 times. They had no idea it was illegally stolen and fraudulently passed off as the fraudster's own work. If they get sued for infringement, they can just sue the actual fraudster for damages.

Same will be for GitHub: if people really didn't have the legal authority to bind someone else's code to GitHub's TOS, then GitHub can go after the $x million of users that have uploaded code they shouldn't have.


>GitHub can go after the $x million of users that have uploaded code they shouldn't have.

Ok, but can they go after them in an efficient manner that doesn't end up costing more than it's worth?


> They had no idea it was illegally stolen and fraudulently passed off

That doesn't mean Getty can keep the money.


Not "excellent" at all. This Richard Roe plaintiff is trying to use the notorious DMCA as an end-run around having to prove copyright infringement and withstand a possible fair use defense. That shouldn't be allowed, as a matter of Constitutionally-relevant protections.


But hang on, there would only be a DMCA violation if copyright infringement had in fact occurred, right? So fair use would be a perfectly legitimate defence, causing the DMCA not to apply.

Look, proving copyright infringement is downright trivial here, if copyright law applies. And that shows where GitHub’s defence will—must—lie.

(And for other readers unfamiliar with the parent comment’s phrasing: “end-run” is apparently an American sporting term which here makes “as an end-run around” mean “to circumvent” or “to work around”.)


That, unfortunately, is not how the DMCA works.

The DMCA was designed to catch not just normal pirates who share copied content, but also "crackers" who figure out how to share content that is protected somehow; as such, it offers various ways to violate it without infringing the copyright itself.

Basically they are accusing MS of behaving like crackers, by removing stuff from code to allow it to be shared illegally.


You say that like Github Copilot could have been trained in a different way. There's just too much utility and progress in Github Copilot to let this lawsuit win.


Why should Microsoft's ability to create a new revenue stream be more important than anyone else's ability to enforce their licensing terms?


You’re saying this like users of github copilot are out of the eq


The mafia used to steal dresses from New York garment factory delivery trucks and hawk them door to door in poorer neighbourhoods. The users (poorer households) definitely got value out of this by being able to obtain dresses they could not afford, but that doesn't make what the mafia was doing right.


I feel pretty strongly that getting rid of Copilot will slow down progress at a massive scale. Not only by affecting users who are getting a huge benefit from it, but by setting a precedent for how you can train AI.


So ignoring licenses is perfectly fine as long as you do it:

a) at scale

and

b) make some other population happy

In other words, piracy should be perfectly fine, too, right? After all there are a huge number of users who benefit from it?


I’d rephrase this as “open source is open source”


That’s really up to everyone whose license may have been violated to determine. Personally I get zero utility from copilot (although, I only have a little code on GitHub so it isn’t a pressing issue).

Signing up for dispersed litigation like this seems like a pretty ballsy move by GitHub, but hey, Microsoft presumably has in-house lawyers with lots of spare time.


That's not how the law works.


The law around copyrights is from a different era


Could have been trained with the code owners' consent, or at least warning.


> Copyright is not something that can be signed over by a terms-of-use change of a hosting provider

Agreeing to GitHub's terms doesn't try to assign copyright over your code; it grants them a licence to use your code however they see fit, which is¹ legally quite different.

Of course the real fun comes if someone agrees to their terms then uploads some of my code, which they have no right to license to GitHub. What come-back do I get in that case if I don't want my stuff used that way?

It seems odd to me that MS² who for many years strongly spoke against touching anything with the remotest whiff of GPL because of what it could legally do to your release requirements, are now more than happy to hoover up all the GPL covered code in GitHub and potentially mix it into their users' work output via copilot.

----

[1] in my not-at-all-legally-trained understanding

[2] current owners of GitHub, for those not paying attention


> Agreeing to GitHub's terms doesn't try to assign copyright over your code, it grabs licence to use your code however they see fit which is¹ legally quite different.

I disagree, IANAL, and I'm happy they are getting sued. They are foremost a code hosting/collaboration company, and the terms of service we all agreed to when creating our accounts were to have them host our code and use it however they need in order to provide the service. The fact that they changed the service provided post-agreement (from mere hosting/collaboration to feeding it into Copilot) should have been opt-in. I hope what "the service" is gets tested in court, because if you have a feature that (let's say) 1% of your users use, that's not the service, is it?


The service is displaying code... also I'm unaware of any TOS/EULA that cannot be amended or changed post agreement.


Not really about disallowing amendments, but at least sending out a notice of the changing terms. Like you get with your privacy policy.

I'm pretty sure I didn't receive one about them using my public (although unpopular) open source code into their NN mixer.

Edit: Anyway, a bit outside the point. The point being: when your ever-expanding set of services incorporates your work in ways unforeseen when the agreement was made, opt-in would have been the agreeable approach in my opinion. Even ignoring the licensing woes, as that's something to be tested in the courts with this lawsuit, and interesting to follow.


They didn't change the terms, you can't expect privacy when you're out in public. I'm also curious how you're certain your project was used?


> you can't expect privacy when you're out in public

This isn't about privacy, it is about licensing (and possibly copyright). mhitza mentioned privacy as another policy, that you agree to upon sign-up like the terms of service, one for which updates are regularly announced.

> I'm also curious how you're certain your project was used?

Hasn't it been suggested that all public repositories at least could have been used? It makes sense to give the training pool as much information as possible.


> This isn't about privacy, it is about licensing (and possibly copyright). mhitza mentioned privacy as another policy, that you agree to upon sign-up like the terms of service, one for which updates are regularly announced.

The terms of service say you grant GitHub an implicit license to display your code. They also say:

"We may modify this agreement, but we will give you 30 days' notice of material changes."

Are you claiming that hasn't happened?

> Hasn't it been suggested that all public repositories at least could have been used? It makes sense to give the training pool as much information as possible.

Has it? I don't like to make assumptions.


> > "We may modify this agreement, but we will give you 30 days' notice of material changes."

> Are you claiming that hasn't happened?

Your post that I replied to explicitly stated that it hasn't.

Is that the case, or was that one of the assumptions you don't like to make?


it grabs licence to use your code however they see fit

Not your code. Anyone's code that's uploaded to github by any third party. Under open source licenses, that's expressly permitted. However, it seems you're arguing that Github is not bound by the license under which they (and their users) acquired the code because of their TOS.

How many projects on github are put there by the original copyright holders? Perhaps it's more than 50%, but it certainly is less than 100%. So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it, or that they received permission from third-party rights holders that did not agree to their TOS?


> So where is github's legal paperwork that shows that they're only processing the code that's copyrighted by the user who uploaded it

Under other circumstances they don't need it. But if CoPilot is creating a derivative work including parts of that code without including the licence terms or attribution (as required by many licences), things are far more grey, or possibly fully black.

Some argue that the AI is unaware of the terms so can't be held responsible. Two possible counters to that: 1. it is the licence that gives you the right to use the copyrighted code; if you are unaware of the licence, why assume you have the right to use the code? 2. if I found some useful code that happened, unbeknownst to me, to be from MS, and used it in a way that I wasn't licensed to, and MS noticed, it is a pretty safe bet that they'd state that ignorance of the copyright terms doesn't mean you can't be held to them.

Or another angle: the tool is allowing, even encouraging, people to use code or other materials in a way that infringes copyright (again: you don't have the right to use the code under most licences unless you give correct attribution and such) – the very conditions often stated as reasons for trying to ban other tools.

Plus of course the general argument: if this is entirely a non-issue, why is no Windows, Office, or SQL Server code in the training set? Surely they are great examples of how to do things to train the AI with?


This is a good point. If I vendor my dependencies and upload to Github, does that give them a licence to use that code however they wish?

It's interesting, because Github certainly have the right to set whatever terms they like on their website. The dependency authors certainly have the right to set the licence terms on their code, and that gives me the right to vendor their work and include it in my upload to Github. But I, obviously, don't have the right to agree to Github's terms on behalf of the dependency authors.

I think the problem here is that Github assumes that everything I upload is my property, and that I have the ability to assign a licence to what I upload. This is not true for any project that vendors its dependencies.


You would need to agree for every copyrighted work that you intend to allow them to use it for different purposes. Since GitHub can be completely invisible from the point of a contributor I highly doubt that clears the bar for such an invasive and irrevocable act.


It seems a lot of this stems from how the DMCA does not require that these hosting providers actually check for code ownership at submission, or maybe just how they don't have an explicit checkbox for "I affirm that I can license this code to GitHub" every time someone is uploading code.

If it is shown that the license in the TOS is valid, the legal question might boil down to "is the TOS License broad enough to where nobody thought that it allowed their code to be used in for-profit ML models?"


> how the DMCA does not require that these hosting providers actually check for code ownership at submission

How does one check who owns a work if the work does not include the authorship information? (or if the work has been altered to have incorrect authorship information)


That's the tough part, and the technological infeasibility is probably why the DMCA has no such provision. Even for video, only YouTube has come up with a system that is "mostly right" for "most content" they have to deal with, that being Content ID, and it takes a lot of horsepower and a lot of money to run (and it requires that every rights holder upload their content to the service for scanning; quite a feat).


A checkbox would make no difference in the case where the uploader has not been granted the right to agree to such terms with respect to the code in question, leaving the matter in the same situation.


I'd like to propose a "golden rule" test. If Copilot was truly not at risk of regurgitating large blocks of code verbatim, why didn't Microsoft train it on proprietary Microsoft code as well? Why was it limited to user-submitted code on GitHub? If there is any argument pointing to licensing or copyright or patents, it stands to reason those concerns would apply to any corpus of user-submitted code since users could easily misrepresent the licensing.


Precisely. I made this exact point a while ago, how come Microsoft didn't submit the source code to Windows as part of the training data, that is at least code that they can plausibly claim they have the rights to.


There's a good point made over here: https://news.ycombinator.com/item?id=34282407 that it might accidentally spit out some kind of secret that is much smaller than a copyrightable piece of code. That's not a risk for code in public repositories.


Perhaps you want to check the comment section here: https://lwn.net/Articles/914150/

User bluca works at Microsoft, but I think their opinions are their own.


How did you type the superscript footnotes?

Edit: Wow, this is game changing. Markdown parsers need to implement superscript ascii character support!

Lowercase ⁽ᵃ⁾ Uppercase ⁽ᴬ⁾ Numbers ⁽⁹⁹⁾


They are available in most fonts with reasonable-or-better Unicode coverage (https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc...). 1, 2 and 3 are available in ISO-8859-1 so can sometimes be used in 8-bit-only text, but I'd use them with care in that context.

To type them easily you'll usually need composition (sometimes called chording) support. Some Linux (and other Unix) distributions still have this built in by default, though last time I used Linux for much desktop use it seemed to be fading from common availability; otherwise you'll have to hunt for another method. On Windows I use http://wincompose.info/ (here [altgr][^][1] produces “¹”, for instance, in the default settings) which is useful for a number of other things (I first started using it for accented characters like á on a UK keyboard). If you have a keyboard with programmable function keys then you could use its customisation tool to map some of them to produce the super-script (or sub-script, or other) characters you commonly want.

For less convenient typing, use your OS's Character Map or similar tool.

On Android, unless you have a different keyboard in use which doesn't support this of course, long press on the number on the touch keyboard gives superscripts as an option.
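If you'd rather generate these programmatically than type them, the digit mapping is small enough to inline. A minimal Python sketch (my own illustration, not from any of the tools mentioned above):

```python
# Map ASCII digits to their Unicode superscript forms. Note the code
# points are not contiguous: ¹, ², and ³ live in the Latin-1 block,
# while the rest are in U+2070..U+2079.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

def superscript(text: str) -> str:
    """Replace every ASCII digit in `text` with its superscript form."""
    return text.translate(SUPERSCRIPT_DIGITS)

print(superscript("x2"))    # → x²
print(superscript("[99]"))  # → [⁹⁹]
```

The same `str.maketrans` trick works for subscripts (U+2080..U+2089) if you need those too.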


> On Android, unless you have a different keyboard in use which doesn't support this of course, long press on the number on the touch keyboard gives superscripts as an option

chucks iPad out the window

We only get the standard shift character as an option. E.g. 1 shows !, 2 shows @, etc.

I’d use superscripts all the time if it was on the keyboard. Anyone know if MacOS can do it? Other than pressing the weird globe key and searching.


Easiest way I can think of is that you can use text substitution (System Settings -> Keyboard -> Text Replacements...) to set a string of characters that should substitute to the superscript characters (you can look it up with the globe key to set it up initially). So you could make e.g. [1] map to ¹, [2] map to ², etc. You could also use this trick on i(Pad)OS (Settings -> General -> Keyboard -> Text Substitutions). In fact, the substitutions should sync across if you're logged into the same Apple ID.


The “UniChar” app, among others, can be used as an alternative keyboard “language” to access Unicode symbols with an interface similar to the emoji picker. Includes “favorites” and “recently used” sections.

https://apps.apple.com/us/app/unichar-unicode-keyboard/id880...


> implement superscript ascii character support

They are _not_ in ASCII. A few are available in some 8-bit code-pages that expand on ASCII's 7-bit character set, otherwise you need to be working in a Unicode-supporting environment (which is most these days, thankfully).


I interpreted the phrase as superscript variants of characters that are in ASCII.


There are Unicode characters for superscripted digits; you can probably find them in some kind of character-map application, depending on the platform you're using.


[flagged]


If you want people to actually read your site, you might want to not set an unreadably small font size. In fact, you might not want to set a font size at all, since you are extremely unlikely to know more than the reader does about what font size works for them. Browsers have default font size settings for a reason.


> If you want people to actually read your site

The slowest day in the past 2 weeks got over 600 readers, with the peaks many times that.

> unreadably small font size

The font size is actually larger than the NYTimes used for 100 years (https://www.amazon.com/York-Times-Complete-Front-Pages/dp/07...). I'm pretty sure the smartest publishers in the world knew what they were doing.

So on both points, your data is wrong.

You are the exception, who prefers a larger font size. Nothing wrong with that. It's one key press: cmd-+. You can even set that as your default.


I'm sorry but Copyright is very much the law of the land no matter how often you post your links.


Physical slavery was once the law of the land too. I like to think I would have been on the right side of history at that time, as well.


You are simply not making much sense, and to compare physical slavery with copyright is ridiculous.


> and to compare physical slavery with copyright is ridiculous.

On the contrary, you cannot mathematically distinguish (c)opywrong laws as anything but a kind of slavery.

Define a person A as a slave to person B if person B has legal control over person A at all times.

Now imagine person A is hanging out with Person C. With (c)opywrong laws, Person B has legal control over a subset of person A's behavior in this scenario (they are forbidden from sharing certain files with Person C by Person B). Hence, Person A is a partial slave to Person B.

It is not a metaphor; it is literally a subset of the same thing. Intellectual slavery is just slavery from many masters.


It made perfect sense. Laws change all the time... in some cases they're invalidated by the court.


> Laws change all the time

That still doesn't make breck right.

If I say "slave ownership can't be signed over by a ToS change" that's a true statement, despite me being anti-slavery.

Copyright could be many things. But that doesn't change what copyright is right now. And even if they are offensive, they don't "void logic".


I'm still baffled as to why people treat Github like a public library despite being owned by what was at one time the greatest enemy of free and open source software in existence. Not saying they haven't changed their tune somewhat, but a library owned by Barnes and Noble is going to have very different incentives than an actual library.

Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month and that's probably overpaying.


>Made all the more silly by the fact that it's Git. You could just host it yourself for five bucks a month

Ah, but GitHub isn't selling git hosting. It's a social network that also does git. That's the cake that Microsoft bought, not the UI and API frosting.


Hmmm I wonder if there is any work on federated github alternatives. Seems like a much more consequential network effect to tackle than social media tbh


check out https://forgefed.org/

> ForgeFed is an upcoming federation protocol for enabling interoperability between version control services. It’s built as an extension to the ActivityPub protocol, allowing users of any ForgeFed-compliant service to interact with the repositories hosted on other instances.


And Dropbox can also be hosted yourself with a oneliner rsync-script…

The value GitHub provides is far from unique in any way, but let's not pretend it's trivial. Especially for an open source project already struggling to get contributors to their main code base, even more so for any ops work.


That's because there's no feedback loop of what you're saying, in the active lives of the people interfacing with GitHub. Consider an example of ingesting poison. If the poison tastes bad, I'll be sure to spit it out immediately, either by involuntary disgust, or because I associate with that negative feeling of being poisoned, something I don't want, so I react. But what if the poison tastes good? And what if it not only tastes good, it actually rewards me for ingesting it, in some way? People tell me it's poison, it might not say so on the label, and many are also ingesting it. Is it even believable that it's poison, given that I don't experience the negatives at all?


To take the analogy further, it also (arguably) wasn't poisonous for many years. The poison would have been added in June of 2018.


your analogy is spot on and aligns with humanity very well.

Drugs, alcohol and sugar all fit this description very neatly.


I saw this a few months ago.

  local> ssh user@example.com
  user@example> git init --bare $DIR
  user@example> exit
  local> git clone user@example.com:$DIR
I've seen VPS services for as low as $4 a month.

I'm with you in camp baffled.


And Dropbox is just rsync with a bit of cute UI, basically worthless. These comments are peak examples of how disconnected some Hacker News users are from real life.


Nah. I'll still charge that it's laziness if this is your job. I get that this kind of practice is common, but I still find it lazy.

I compare it to using gmail as your professional email. It's a bad idea; even if it never bites you, because the cost is so little and the harm is so great if it ever screws up.

Except I can see why average joe user might not think about it, and I don't think developers have that excuse. You should know and understand that tech is risky enough. Same reason there was no excuse for the kik zero padding debacle. The fact that a LOT of people did a lazy dumb thing doesn't make it not dumb or lazy.


That's not a fair comparison. The argument is more like a version of dropbox that's only used by programmers and owned by a company programmers should hate. And the rsync is capable of doing full multi-directional sync. In that scenario the hypothetical dropbox loses a ton of value.


What about backups? Managing access to the repository? Making the repository easy to discover? Can you browse the code in a browser, or read the README without cloning?

Of course you could do all of these things with enough work. But why would the average developer want to? Do you really think most developers care so much about Microsoft owning GitHub?


Our nonprofit host provider has a git repo option on their control panel.

They use this for access:

https://gitlist.org/

(somewhat ironically, code hosted on GitHub. Not sure it's still being updated.)

We currently don't use it, just because it's on the same network as our site, though because it's git and it replicates the repo everywhere, it's less of an issue.


> But why would the average developer want to?

As for myself, I don’t like GitHub UI, which is slow, has low information density, and is generally lollipop-like; and its social media aspect, which often turns the bug tracker into a Twitter equivalent with viral bugs, emojis to cheer and boo people.

So I set up a cgit instance for myself. Patches can be sent over mail as attachments generated with git format-patch (attaching files to mails isn’t hard). Issues can similarly be sent over mail and described in a BUGS file.

That’s sufficient for my single person projects. Probably also for n-person projects for small values of n (let’s say 7). Past that I’d set up a Gerrit instance, which IMO has the best UX of all code review tools available, free or not; and a Redmine for ticket tracking.
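For what it's worth, the mail-based flow described above is just two stock git commands on each side (patch filenames are whatever git derives from the commit subjects):

```shell
# Contributor: export the last two commits as mailable patch files
# (writes 0001-<subject>.patch, 0002-<subject>.patch)
git format-patch -2 HEAD

# Maintainer: apply the received patches, preserving the original
# authorship and commit messages
git am 0001-*.patch 0002-*.patch
```

`git format-patch` can also send the patches directly via `git send-email` if your mail setup allows it, but attaching the files manually works just as well.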


Honestly -- if this is too much for the average developer then we have WAY too many underskilled, and perhaps useless, developers.


I would consider myself pretty knowledgeable and I love going on tangents when setting up projects. I love tinkering and learning.

There is no situation in which I'd want to do all of the above work for every single repository I setup.

I have hundreds of repositories that I own on GitHub for things like school assignments and personal projects. If I start a weekend project as you describe then the first 4 hours are going to be setting my repository up.

Nobody wants to do this. Just because something is difficult or time consuming doesn't mean that it is good or useful. Doing this once would be a fun learning experience. Doing it more than once is a useless chore.


> There is no situation in which I'd want to do all of the above work for every single repository I setup.

What the heck are you talking about?

The only work you'd have to do "for every single repository" is the single git command.

Getting a server, managing access and discoverability, setting it up for browser access, setting up backups, those are things you would do once.

If that's a waste of four hours, valid argument, but it's not a waste of four hundred hours. You're grossly exaggerating the cost.

Oh and you didn't mention keeping the server updated but that's probably ten minutes of effort once a month.

> Doing this once would be a fun learning experience. Doing it more than once is a useless chore.

Then there's no problem.


Heh - a former employer of mine used something similar to this to sync code between laptops and development VMs :) We had a script that made a temporary commit, and pushed to a git repository in the way you describe, then reverted that commit.


Eh, I have no love for Github but this is huge bikeshedding. This can apply to so many pieces of a software project that at some point you just aren't even working on a project. Every tool has tradeoffs and based on use github and gitlab are the kind of tradeoffs developers are willing to make


I'd disagree. I think it's a "Black Swan" esque problem. Software developers shouldn't use Github for their bread and butter, in the same way that I argue real businesses should pay for email and not use Gmail.

Sure, it might work fine forever, but when it doesn't, you're really screwed and you could have avoided that in a relatively simple way. Reminds me of seatbelts and fire extinguishers.


It's been a bit over four years since the acquisition and Microsoft hasn't screwed it up yet.

It's good to have a backup plan, but it's convenient to keep using it for now.


GitHub built goodwill over the years. There were many controversies, but there were also many die-hard fans. That didn't evaporate overnight. Microsoft bought GitHub (and minted 3 billionaires in the process) specifically to acquire that goodwill and monetize it.


It's not goodwill, it's features and comfort. GitHub has the UI that almost every developer is used to, easy-to-use CI/CD, great issue and pull request handling. And more importantly, everything is free.

Even ignoring the value and the features, employers don't ask for your git link, they ask for your GitHub account. And since most projects are on GitHub, having all of your projects there too makes it easier to see all of your commits, making your profile look more active.

That's without mentioning the ease of discovery and issue reporting since everyone has an account.


Agreed. And I think the road to technology hell is paved with convenience and "free."


> acquire that goodwill and monetize it

Embrace

Extend <-- here

Extinguish


Uh, what's the extend that has happened since microsoft bought github?

Do you mean copilot? I would not classify that as extending anything. It's just a thing they made.

"Monetizing goodwill" is not an extend.


Of course that's "extending." It's a very clever attack on open source generally.


But it's not open source specific. Not in creation and definitely not in use.

And the idea of the extend of EEE applying to all of open source at once, the way you could apply to a product or a standard or a protocol, doesn't really make sense.


You're thinking way too narrowly. EEE doesn't require a specific plan -- if you look at the history of Microsoft engaging in EEE, you can see it's been much more experimental than you're suggesting.

It's basically Microsoft saying "Here is a thing that looks like it might in some way eventually be a threat to our business model and/or we can make some money off of it -- let's get our hooks in now and see what happens, we have the money to do it."


But again, copilot does not get their hooks into open source any more than it gets their hooks into code in general. And they're not doing EEE against "code".

EEE doesn't need a specific plan but it does need some kind of target standard. Getting into an emerging market, without specific kinds of integration or malfeasance, is just competition.


> Microsoft <heart> Open Source


That's not what EEE refers to. "Embrace" does not mean "buy".


It can, I think. The EEE concept is useful enough to go slightly outside its original intended use.


So the idea here is

1. Acquire company for tens of billions.

2. Intentionally ruin own investment.

3. ???

4. PROFIT!

?


Well.

1. Acquire company for $7.5bn in stock, so fortunes are joined rather than cash paid

2. Still charge money for it

3. Create an AI product out of the open source bit of it, justifying the price tag, as well as build links to Microsoft dev tools to entice OSS back into the Microsoft ecosystem

4. Maybe profit, but almost certainly not loss


There are a lot of less-than-optimal possible outcomes between "create the best possible free open source library" and "utterly ruin."


I refuse to believe that Microsoft is any different from before (not just wrt to foss but general attitude). Doesn't matter if there's a new CEO. Look at what they did to Minecraft logins. Or this.


> I'm still baffled as to why people treat Github like a public library

Because too many open source projects rely on it. Projects like crates.io force you to have a github account to use it. Most (neo)vim plugin managers give preferential treatment to github over other forges.


I'd like to know more about how MSFT qualifies for the moniker "they're the greatest enemy of free and open source software". From my understanding they've invested in many open web resources previously (like jQuery).


I did say "at one time," not necessarily now.

But if you've been watching, literally the only reason they don't hate it now is because they lost the open source v. proprietary battle.


And they opened sourced .NET


This made me wonder. Is there a not-for-profit SCM host that does act like a library?


> “Your honor, we needed so many works that it was simply not practical to ask permission of the creators.” I don’t find this argument convincing given the ability today to license many content types at scale for TDM, including images, music and yes, journal articles (See “Full disclosure” above), but it is an argument often offered by infringers.

Why is this type of argument even valid? Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

So it is not okay to infringe copyright at a small scale but okay to do it in a large scale? How can such a line of argument be sensible in court? But apparently infringers are using this line of argument. So how? Is it not absurd?


On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).


But the thing is that we explicitly allow humans to learn and develop their own skills by learning from other humans, while we have our own taboos around directly copying people's work without permission and passing it off as your own. The debate is that copilot isn't a human; it's a machine that outputs copied work on a statistical basis.

Humans are allowed to be unoriginal, uncreative, boring, mediocre, and all sorts of things. But they’re not copying whole cloth the way copilot is.


> But they’re not copying whole cloth the way copilot is.

Stack Overflow content is CC-BY-SA 4.0 yet I can bet most corporate codebases include tons of code snippets without a link or citation to the original answer


Don't you have to work pretty hard to get copilot to reproduce snippets verbatim? My understanding is that, while its possible to make copilot reproduce snippets so long as they appear in a large number of files, this basically never happens under normal usage.


This argument doesn't work.

I don't know whether copilot is giving me infringing content or not. I'm always at risk that my question was one of the ones that trigger infringing replies.


Whenever I've used Copilot it never seems to copy whole sections of code. Can you provide examples of this?

From what I've seen it is producing fairly generic boilerplate that has been modified based on the rest of the code in my repo so that it works with the other functions and even incorporates other pieces of my code in the same style that I'm using. The boilerplate aspect makes sense because this would be the most common sequence of tokens that it observed during training. It's somewhat miraculous that it can incorporate code on the fly from my repo. I've never seen anything that looks like a direct copy paste from elsewhere though. If you have a different observation I'd love to see it.


Behold: https://twitter.com/StefanKarpinski/status/14109710611816816...

Probably helps that this is from a codebase that's been forked quite a bit.


Yeah, I wanted an example from a real project, not a one-file demo. The high fork count, and probably also its existence in thousands of other projects, likely results in this behaviour if you have no surrounding context.

This is also easily solved by checking the box in Copilot that says not to produce any code matching public code.


Can I circumvent the new OGL revocation by training an AI on 100 copies of the D&D rulebook, and using its output?


You can't even code search in forked repos so maybe forks were excluded (besides commits on top of the fork)?


Most forks probably happened before GitHub existed.


> On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet

And that would make sense and it would be argued on its own merit. The judge/jury will decide if this argument is correct and legal.

But the article implies that there are lawyers and infringers out there who are arguing that they could not have possibly afforded the cost of not infringing, so they were justified in their infringement. Since when did the massive cost of avoiding infringement become a valid reason to carry on with infringement? This seems just plain absurd by common sense. How do lawyers and infringers make this argument? How is it even entertained in court? What am I missing?


So if I make a script that automatically downloads every torrent in existence it's suddenly ok, since it is infeasible to check the copyright of them all?


We allow humans to do what copilot does because we take into account that the human brain is very limited in this regard. If we could scan all of GitHub in under a week and recall perfectly what we saw, we would already have different laws. Now that machines are able to somewhat learn like humans, but 1,000,000 times faster, we need new laws.

That's why I don't believe "but that's like humans doing X" is a strong argument.


>On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).

It might not be that easy. I think Wine developers are not allowed to read code related to Windows, even if this code is published on GitHub. The fact that you looked at the code was decided to be a risk.

You also have cases of a NN producing an identical output, so you either prove your NN NEVER produces copyrighted code, or you have to have a second process that is 100% correct and double-checks the NN output for plagiarism.

I am against Microsoft in this case because they decided not to put their proprietary code in the NN; it would have been funny to have the AI write an open source Windows re-implementation when you feed it the Win API documentation.


The Wine developers are allowed to read whatever they want. They may choose to have a policy not to, because it makes it easier for them to prove that they didn't make unlawful use of proprietary code: they can't have copied something that they never read. If you reproduce something independent of knowledge of the original then that is a defense against copyright infringement. This is essentially the "Clean Room" tactic: https://en.wikipedia.org/wiki/Clean_room_design


If Copilot is so advanced that we need to grant it the rights a human has, then it has a right to freedom, and owning Copilot is a crime.

I don't think Microsoft wants to go down this path.


It is not a human though. It is a function approximated from inputs and outputs. The laws are different and the licenses call out derivative works.


If enough code is recalled verbatim, I can sue the author of that code. That seems to fit entirely with this case -- they are suing the owner of Copilot, partially because it reproduces chunks of code.


Why are wine developers and similar required to do clean room implementations to not be sued then?

Simply reading the leaked source code of Windows makes you not eligible to contribute to wine.

Why is Windows source code so much more important than mine?

The other thing is that copilot is not a human, so it doesn't matter anyways.

Humans are a special exception with laws, because they are intended to protect and benefit humans while also being fair. I don't think you can just substitute something in and assume that the same rules apply.


> Why are wine developers and similar required to do clean room implementations to not be sued then?

That's the neat part, they don't. It's essentially a self-imposed limitation which contradicts actual court rulings on the matter such as Sony v. Connectix, in which the court commented on clean-room being "inefficient" and the kind of inefficiency that fair use was "designed to prevent".


> Isn't this fundamentally saying, "The cost of not infringing copyright is massive, so we will glibly infringe!"

Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

As such, if those objectives can be provably achieved in a better way without copyright (or rather cutting copyright a bit smaller), there might well be a case for foregoing punishment.

It's an exceptionally-hard argument to make, but it's not illogical.


This seems like a good argument for adjusting copyright law, but seems unhelpful in interpreting it. "This law isn't a good way to achieve the government's objectives" is not the same as "this law wasn't broken". Judges do have some discretionary power in interpretation and that can take into account congress's intent, but here that would be a massive stretch. A judge would simply say it's congress's job to fix copyright if it's not the best way to achieve certain policy goals.


As I said, exceptionally hard in practical terms - just not as baffling as the parent poster painted it. With the right judge anything is possible, and US history is full of controversial "overreaching" judgements.

(and with the deep pockets GH/MS have, it doesn't really matter if the case eventually loses on a big principle - it's just a case of dragging it long enough that, some time through the whole process of appeals, the plaintiff will get broke enough to give up or accept a deal. This line is likely just one of many that defendants will employ.)


And the purpose of copyright is to aid and encourage the progress of science and the useful arts:

>[The Congress shall have power] “To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”


> Copyright is not a natural human right; it's a construct invented and conferred by governments in order to achieve certain objectives. (It's more like a state license than a right, to be honest; using "right" was a historical masterstroke from the original inventors).

I don't know of a better definition for "natural human right" than "a right/privilege/protection given to everyone automatically, even if they don't know about it or claim it, unless they specifically opt out of it." We get to decide what our "natural human rights" are, and we've decided that you automatically get copyright on your creative works even if you don't know what copyright is. Seems like a good thing, and a natural human right.


That's certainly not the conventional definition of "natural right", nor was it historically. Interestingly, Jefferson, when arguing against patents, made a point that regular property ownership is not a natural right:

"It has been pretended by some, (and in England especially,) that inventors have a natural and exclusive right to their inventions, and not merely for their own lives, but inheritable to their heirs. But while it is a moot question whether the origin of any kind of property is derived from nature at all, it would be singular to admit a natural and even an hereditary right to inventors. It is agreed by those who have seriously considered the subject, that no individual has, of natural right, a separate property in an acre of land, for instance. By an universal law, indeed, whatever, whether fixed or movable, belongs to all men equally and in common, is the property for the moment of him who occupies it, but when he relinquishes the occupation, the property goes with it. Stable ownership is the gift of social law, and is given late in the progress of society."

(https://press-pubs.uchicago.edu/founders/documents/v1ch16s25...)


> We get to decide what our "natural human rights" are

Words have meaning. You can’t just make things up. The rights we invent through human constructs are the opposite of natural. Copyright is totally arbitrary and nothing at all like (for example) the right one has to their own life.

https://en.m.wikipedia.org/wiki/Natural_rights_and_legal_rig...


> "a right/privilege/protection given to everyone automatically, even if they don't know about it or claim it

That's not what copyright was, when it entered the legal landscape; and it still isn't in so many countries. And even where your definition is somewhat accepted, people disagree on what exactly it means (How long should it last? Can it be inherited? Does it apply to this or that? Etc etc). Even simply the fact that it can literally be bought and sold would indicate that it is not a human right at all - those are typically unalienable. It is a commercial right at best.


>So how? Is it not absurd?

it is.

"Your honor, I've been drink-driving so many times I honestly can't tell you an accurate estimate any more so in both our interests let's agree it was basically uncountable, or 0."

wait a minute...


Or after a crypto heist:

"Your honor, it would have been onerous to ask each account owner if we could have their tokens so we just took all of them at once"


> So it is not okay to infringe copyright at a small scale but okay to do it in a large scale?

No, I think you're missing the "transformative" part.

The line of argument isn't "we're going to resell millions of codebases as-is for pure profit", which would be undisputed copyright infringement.

The argument is that something highly transformative (e.g. training models) isn't infringement at all, because transformative works are covered by fair use. And that, even if we wanted to explore interpreting/changing the law to force opt-in for highly transformative uses, it would be logistically unreasonable, to such an extent that the transformative thing couldn't occur at all. So it's a waste of time to even be discussing asking for permission as some kind of potential compromise or requirement. If it's transformative and therefore fair use, asking for permission is an irrelevant distraction.

That's why this type of argument is valid. I'm not saying whether the argument will/should win in this particular case, but I'm definitely saying there's nothing absurd whatsoever about it.


Yes, transformative works may be allowed. So I'd guess that creating a model is probably OK (speaking as a non-lawyer!). But using output generated by that model is another matter. The "model" is fundamentally a machine that produces output that is derived from the input it was given. And that output might not be sufficiently transformative to "escape" copyright/licensing restrictions.

In the extreme case, the model's output might be a verbatim copy of a large portion of the original input ("training materials"); but even if it has been extensively modified, e.g. to conform to the coding style of a target repository or to follow a different language standard, this might not be "transformative".

(Compare: A translation of Harry Potter to French looks superficially quite different from the English original, yet it is still a derivative work; and if you're planning to publish one, Ms Rowling (or her publisher) may want a word with you. And that would apply whether you translated it "manually" or pushed it through Google Translate.)


I'm not sure I buy that training a model is transformative in the fair use sense. How is training a model different from lossy compression?


Great answer! Thanks for taking the time to write this answer. Learnt something new!


> But apparently infringers are using this line of argument. So how? Is it not absurd?

You realise that they haven’t actually used that line of argument, right? The article author speculated that it might be part of the defence and then said they didn’t find it compelling. Set up a straw man and then knocked it down in virtually the same breath.

Don’t waste your time complaining about legal arguments that have not been made except in the imagination of one author.


It's the same argument people make about why crypto doesn't have to follow the laws on Know Your Customer. Because someone designed the crypto to break that law, so their hands are tied, it's too technically hard to comply.


I don’t understand the author’s position in this article. It spends a long time talking about details of the licenses, but I can’t see any way the suit will actually be about licenses, because if it’s about licenses then it seems patently obvious to me that GitHub will lose very quickly, because they have undoubtedly violated the terms of the licenses.

As I see it, the only leg GitHub can possibly stand on is the “fair use” exemption of copyright law—that the license is irrelevant, because they weren’t using it under that license.

So then you get to the last paragraph of the article, and the “fair use” claim is finally mentioned—as something the plaintiffs seem to be seeking to avoid bringing into it because that would make things messy. But… GitHub’s defence must be “fair use”, I can see no other response. Yes, the plaintiffs “chose to focus on something that is beyond factual dispute”, but how are GitHub ever going to do anything other than bring fair use into it? So I don’t see how they could expect it to “still provide the same damages” without bringing fair use into it. (And I can’t imagine GitHub will settle for anything other than total vindication here—even settlement would doom Copilot.)

Returning to the title: I cannot imagine any way that We May Learn Something About Creative Commons Licensing from this suit. About the interactions between copyright law and machine learning, maybe. But about CC-*, GPL, Apache-2.0, MIT, whatever? Nah, there’s nothing interesting about them in the suit, because if they were involved, it’d be cut and dried.


That's entirely correct, and is why the suit will most likely [1] fail.

[1] 35% chance of success on https://manifold.markets/JeffKaufman/will-the-github-copilot...


IMO fair use is still not a strong argument for Microsoft. They commercialized the product and made money out of it.

Fair use is only allowed if the work you're doing is purely for the greater good. I might be wrong though, IANAL.


You are wrong, though public interest is certainly the basis of the purpose of fair use doctrine. But the simplest way of demonstrating that fair use still allows commercialisation is probably this example: search engines absolutely depend on fair use if they include any content from the linked pages. (And some countries have even tried to call the act of linking copyright infringement, though they’ve tended to back off at least a little, to requiring at least the title or other content for it to be infringement, and not just the URL.)

https://en.wikipedia.org/wiki/Fair_use, lots of good reading there.


Search engines link to the original website


It might be a problem to treat it as copyright. Copyright applies to reproduction, distribution, public performance... if I go to a library or bookstore and I read books and look at their covers, copyright does not apply. Would an android that walks around learning things be subject to copyright? To what extent does it need a body and mobility to be more like a person and less like a scraper?

It might seem stupid, but I worry that if copyright begins applying to "mining" then the next thing is that it applies to humans watching things.

Of course, if an AI re-creates copyrighted content, copyright should apply. Just like it applies when I redraw and sell the Mona Lisa, but not when I store it in my memory. I would pass on the responsibility to users. I don't fear my use of Github Copilot because it's far from infringing any reasonable copyright... then again, I'm assuming the most likely way to infringe copyright with GPT is to use a prompt that almost explicitly requests it.


Funny thing is that recreations are not necessarily covered by copyright law. The clearest example of this is fonts, where (using imprecise terminology but I think it’ll be clear enough) the US only grants copyright protection to font files, but not the shapes—so tracing a commercial font is perfectly legal, even if you happened to end up with an identical result (though good luck proving to a court that that’s what you did).


I mean, today it's pretty easy to prove that if you're setting out to copy one of these fonts. Just film the whole process and upload it to YouTube.


I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

The current understanding of screen scraping is that it is allowed, despite what is in the website's terms. Effectively, if a human can access the content freely without having to actively agree to a license or terms, you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that. This is how the legal status of all AI training data scraped from the web is being interpreted.

When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

Obviously, in this case, is GitHub "scraping" its own site for the training data? Probably not, and that may come back to bite them.

This then also opens up all sorts of interesting questions of whether you can copy paste code from a website and use it internally (not republishing), despite the license attached to the code. If it is freely visible.

Clearly a test case, this one, is needed to clarify the situation. And just because it's legal, it doesn't mean it's moral or ethical.

We may yet see the outcome of this case change the current interpretation of legal screen scraping; it's going to be an interesting time.

On top of all this there is then the question of an AI model reproducing code (or and image or music) verbatim. That obviously needs to be clarified by the courts too.


> When it comes to open source code I suspect it will be found to be similar, if the code is freely visible on the web by a human without an active agreement to view it, then it will be possible to "scrape" it. I don't think the license the code is under will apply if that is the case.

I don't see the scraping case applying here -- the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain, which seems ridiculous on its face (especially when the author(s) is including an explicit license alongside that code, which the scraper can theoretically also read and account for).


I think we're facing a copyright extinction event. The whole concept is out of touch with the new reality - when you can generate 100 variations for your text, code or image with the click of a button, what does it even mean to hold copyright over the original?

"In the style of" killed copyright in 2022.


This is a pipe dream. There's too much money behind strictly enforcing copyright protections on commercial products. If anything gets killed, it's going to be automatic copyright protection for "little guys". Microsoft will be able to copy your publicly shared code/art/images willy-nilly but will still send their compliance officers to check that your company has a valid Office 365 license if they notice you writing a private letter in Word.

Edit: If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?


> If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?

I hear this argument a lot but I think the answer is actually pretty mundane: the model behind Copilot was trained by OpenAI, not Microsoft. Microsoft has a large investment in OpenAI, but they don't own the company, and AFAICT OpenAI did all of the scraping for Copilot on their own, without any special access to MS code.


>If you doubt this, notice that co-pilot was trained on public, open-source code on Github. Not on Microsoft's/Github's own proprietary code. If co-pilot is truly so transformative that copyright doesn't apply, why not feed it all of Microsoft's code to train it better?

Presumably because there might be trade secrets in there that they don't want to leak. That seems entirely separate from copyright to me.


I am sure they have an in-house model trained on all their code. It would be immensely more useful than a generic model.


Haven't stepped on the right toes yet. Create code that will generate art "in the style of Disney/Nintendo" and hey would you look at that, copyright is back and alive again!


As someone who would vote to repeal copyright entirely, because I think its downsides outweigh its benefits, this is a good thing.


Agreed. I don't understand how people on threads like these turn into copyright hawks just because it seems to undermine their livelihoods, same as in the AI art threads. We should be abolishing copyright, not extending it.


I'd actually cut the term of all existing copyrights/patents by a factor of 10 (i.e. 70 years goes down to 7, plus 1 additional year if you aren't dead yet).

I'd then make a new combined copyright/patent system. You would get 1 year on anything creative. You can double the remaining time if you publish all the info necessary to easily recreate what you did. For software, that would be the source code. For music, the score and source recordings. For paintings, the source material and types of paint used, etc. For toys, the 3d design files, etc.

One or two years of head start is plenty to get your business going. I might make, say, a 10 year attribution requirement. I'd leave trademark law mostly as-is.


How many people want to read AI generated text in the style of Lord of the Rings vs how many people want to read Lord of the Rings?


If an AI can generate an infinite number of stories that take place in the LotR world and do it well and faithfully in the style of JRR Tolkien then I would happily read it. You can only read the trilogy and The Hobbit so many times.


You'd happily read it, but that's an answer to a different question. You've already read LotR. An AI generated novel in a similar style doesn't take money from Tolkien's estate.

Some people will want to read the AI generated knockoff. That number of people will usually be much smaller than the number of people who want to read the original and many of those people interested in the AI generated knockoff will also read the original.


The proportion really depends on 1) the quality of what the AI generates, and 2) expectations imposed by the society - in particular, how consumption of AI-generated stuff is perceived as a matter signaling of one's social class. Both are going to evolve rapidly, so it's really hard to tell where we're going to be in 10 years.


With sufficient quality of the model I can certainly imagine people preferring to read AI generated text in the style of Lord of the Rings instead of the original, as the AI generation can adjust the content and style towards the interests of the particular reader, customizing it to deliver what they want.

For example, one obvious observation from looking at fan fiction is that 'shipping' is popular: certain people would strongly prefer certain characters to have romantic relations. An AI-generated Lord of the Rings lookalike could tell the story with the particular relationships that a particular reader would prefer, based on an AI analysis of their earlier reactions to other books; this is not something we have working today (as far as I know), but we're not that far from it becoming real.


Maybe it's not interesting to see what other people generate with it.

I want to have chatGPT answer my questions, and reference materials when doing that. For example I could paste an article and start asking probing questions, debating it, asking for summary, ELI5, etc.

It would be a research assistant and tutor.


>the idea that all human-readable code accessible on the public internet can be ingested into such a system without regard for its license would effectively mean that anything posted to the public internet is entered into the public domain

It clearly isn't in the public domain. But suppose we define something new: a "public knowledge domain". This would be material that it is legal for a human to look at and learn from. They gain no copyright or IP rights, but they can learn from it and use it within the existing limits of copyright and IP. Saying that anything posted online enters this public knowledge domain seems agreeable. Few things wouldn't be allowed here, generally material agreed to be illegal worldwide (and some countries may have tighter limits, like a theocracy banning learning from material deemed blasphemous).

Then it is a question of if an AI can also learn off of such material as long as it doesn't produce works that violate existing copyright or IP laws, same as a human. This doesn't seem, on its face, inherently ridiculous. There are still corner cases and potential for abuse, but those also exist with copyright law yet we don't throw the whole system away and just ban all forms of copying or selling the right to copy.


I see lots of folks equate "trained on" to "available verbatim" and that simply isn't the case for the vast majority of training data. It becomes hard to have a productive discussion when there is such focus on the examples that are regurgitated verbatim (often by people with explicit knowledge of the expected output, so they would *know* that they are going to infringe if they republished it) to the exclusion of talking about an in-general system that is trained on data and outputs unique data.

IMO that second case is a FAR more interesting question.


The reason why people focus on snippets regurgitated verbatim is because even one such snippet, if sufficiently long and non-trivial, could be sufficient to claim the model itself as a derived work.


I'm not clear how it's particularly different from ingesting it into a browser, and then rendering it as part of the html into pixels


> I think one of the interesting things that will be covered in this lawsuit is whether the licence under which the code is released applies at all in the case of screen scraping.

Screen scraping is essentially a question of whether or not the actions constitute something akin to hacking, which is almost completely orthogonal to copyright. The main intersection you get is that many screen scraping scenarios are about things that aren't copyrightable (the US doesn't recognize "sweat of the brow" doctrine, so databases aren't copyrightable). When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.


> When Google was scraping lyrics off of lyric sites--lyrics being totally and clearly copyrightable--it was dinged pretty hard for that.

Was google dinged for scraping the lyrics or publishing them verbatim? My understanding is that it was the latter.


> Effectively if a human can access the content freely without having to actively agree to a license or terms you can scrape the content. You can't republish verbatim, but you can data mine and perform an analysis and publish that.

Would this apply to books? I can walk into a library or book store and OCR countless pages of countless books without agreeing to any license.


Try publishing those OCRed pages and see what happens.


As I understand it, "fair use" would allow you to do various kinds of analysis of all those pages, and to quote fragments of them in your own work, which is arguably similar to what copilot etc might do.

But republishing large chunks of the content, whether verbatim or in a "derivative" form such as a translation, would generally not be allowed without explicit permission.

How much "fair use" allows when it comes to copilot spitting out copies of functions it saw in somebody's repo.... well, that's a line that hasn't been defined yet, afaik.


Steve Ballmer once called Linux and the GPL License a cancer because to copy a portion of code from a copyleft project, minimal as it may be, would make the whole project require a copyleft license.

If Github Co-Pilot includes GPL code then produced works should have GPL too, right? It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

Alternatives may be to copy CC0-only (which can't be guaranteed, also who'd believe you lol) or target license-only - as in, if my project is MIT, go with MIT projects to source from, if I'm GPL, include GPL and so on.


>It is known that it produces verbatim copies of sections of code, so the 'derivative' explanation doesn't hold water.

For reference, it's been shown to cough up code from Quake verbatim. This code, from John Carmack(?), also includes his profanity-laden comments:

https://twitter.com/mitsuhiko/status/1410886329924194309


To be fair, that's probably one of the most copied pieces of code already.


And it was (apparently) copied from something or someone at Xerox


> It is known that it produces verbatim copies of sections of code

This happens only rarely, under 1% of the time. It happens mostly for widely replicated code and not so much for code that appears only once. It can be filtered out with search and Bloom filters of n-gram hashes.

But the prompter can goad the model into copyright infringement by quoting the start of a copyrighted text verbatim, and asking for completion. The longer and more precise the prompt, the higher the chance of regurgitation. So, when it happens, we're often "asking for it".

Both regurgitation and hallucination seem to be LM problems we can tackle. They are complementary - in one we don't want the model to replicate the training data exactly (be creative), in the other we don't want the model to invent facts out of thin air (be factual). Both can be tackled by using search for reference testing.
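The n-gram/Bloom-filter filtering idea mentioned above can be sketched roughly like this. A minimal, hypothetical Python illustration: the Bloom filter parameters, the 8-token n-gram length, and the 50% overlap threshold are all assumptions for the sketch, not anything Copilot is known to use.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per item over a fixed bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting a SHA-256 hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def ngrams(tokens, n=8):
    """Yield overlapping n-token windows joined into strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_index(corpus_docs, n=8):
    """One-time, offline pass: hash every training n-gram into the filter."""
    bf = BloomFilter()
    for doc in corpus_docs:
        for g in ngrams(doc.split(), n):
            bf.add(g)
    return bf

def looks_regurgitated(output, index, n=8, threshold=0.5):
    """At generation time, flag output whose n-grams mostly hit the index."""
    grams = list(ngrams(output.split(), n))
    if not grams:
        return False
    hits = sum(1 for g in grams if g in index)
    return hits / len(grams) >= threshold
```

A Bloom filter can report false positives but never false negatives, so verbatim training text is always caught, at the cost of occasionally flagging original output; a production system would presumably follow up with an exact search before suppressing a suggestion.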


> I don't see regurgitation as a long term problem, it's just waiting for attention, probably wasn't top priority

The fact that Microsoft is wary of providing it with things like the Windows source code says all we need to know about how much it can be trusted.


It's not just trained on GPL code. It's trained on code under incompatible licenses, which means that the code it produces is potentially unlicensable in general. There's of course also the question of attribution, which many licenses require.


I wonder whose decision it was to train the bots on 'code that is accessible to our scraper' and not on 'code we can sell derivative products of'. I can tell the second group is quite small, so I understand the incentive, at least.

Maybe they chose the Uber strategy of 'What we are doing is bordering on illegal but by the time the bell rings we'll be valuable enough to write the law ourselves'.


Microsoft chose not to use all code that is accessible to their scraper. All the proprietary closed repositories on GitHub that large companies use are excluded. I can easily imagine all the lawsuits that would have happened if paying customers' company secrets hosted at GitHub had been used and leaked by Copilot.

The decision to only train it on "open" repositories was one of self preservation.


But the question, realistically, is whether it is the developer's responsibility, because they are the ones creating the program.

It's not like copilot made you use the code, copilot didn't commit or release the app with the code.

Copilot didn't breach the copyright, you did. It's a tool. You used it. You released it.

Maybe there's a product which you can include to see what code of yours violates copyright in the future?


> If Github Co-Pilot includes GPL code then produced works should have GPL too, right

No.


It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

Innovation in code should be heralded but if in the majority of cases the coder using Copilot and similar tools is just saving time on bog standard functions they could write themselves, it's difficult to understand why that needs to be attributed.


> It will be a real shame if the fantastic achievement of OpenAI with copilot etc is smothered by ego.

I don't know if you watch YouTube, but this is probably how every creator hit with a bullshit DMCA claim for 5s of audio from a song feels. Why does OpenAI's work demand special consideration here? Or to put it another way - if we're going to be ignoring copyright, everyone should be able to do it.


> if we're going to be ignoring copyright, everyone should be able to do it.

Yes, yesss. You're getting to the logical conclusion. Now I don't think Microsoft have become 'based' and want to break the copyright system but I hope they have inadvertently done so through their actions.


Oh I have no problem with protecting genuinely innovative work, including code. That's not the vast majority of code produced by or derived from these tools though.


I don't think it would be a big deal if OpenAI/Copilot get shut down. Honestly it might be a good thing. Then we can generate new versions of these tools that are truly open using data that has been freely contributed, rather than obtained by for profit companies in shady cash grab.


And those tools will be similarly illegal if the court strikes down Copilot. Also it costs hundreds of thousands of dollars to train things like Copilot and GPT-3, so can we really rely on innovation happening in open source without any way to recoup costs? I get that you might not like OpenAI/Copilot for creating these tools in the way that they did, but surely you have to see that this decision goes WAY beyond what you blithely call a "shady cash grab".


You seem to completely miss the point about using data which was freely given. I would say that most of us like the idea of Copilot what OpenAI is accomplishing. The main issue stems from violating licenses which require attribution etc. As the article noted, one can get around attribution by getting express permission from the copyright holder (or by not using their work at all).

The fact of the matter is that some companies have made a paid service by violating the copyright of individuals. That's fundamentally not okay.


So we are going to end up with a less powerful version of Copilot, which would benefit who? Copilot competitors?


This field is moving so fast that copilot is already way behind state of the art. New tools, even with more limited data sets, are going to be more powerful, not less powerful.


It's very easy to be generous with other people's property, intellectual or otherwise.


It's also very easy to invent supposed "intellectual property" rights out of thin air that conveniently last for a century or more (!) for your own work.

But instead of either of these views that focus on selfishness, it's much more productive to instead think about what benefits society as a whole.


Microsoft has been one of the most aggressive enforcers of intellectual property rights ever known, and it can be easily argued that their stranglehold retarded the development of computing and the internet as a whole.

It's so bizarre to see these prima facie bad arguments being used by someone who isn't being paid by Microsoft to make them. If Microsoft puts all of their code in, requires anyone who uses copilot to share their code with the algorithm, and allows anyone to run the model on their own platform complete with updates, you'll see all of the FOSS objections dry up in an instant.


It's also pretty easy to be generous with your own when it comes to copilot. I have not been damaged in any way by copilot learning from my code, and no one else has either.


I'm sure that once you've informed everyone involved that you've declared this to be true, they'll call off the lawsuits and apologize.


You could make this kind of "just" and "bog standard" argument for anything. Just using an image for educational or illustrative purpose, just using a song for a political rally etc etc.

The fact is as a society we have decided to reward creators with copyright as a means to commercialise their creation and get compensation. Who is to say programmers are not creators and the compensation they want for open source licenses is attribution?

Microsoft really should have known better than to touch OpenAI without a 10 feet barge pole.


How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

Most coding is not innovative, and that is the kind of code that these tools are producing and derived from in most cases.


> How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

So your requirements are pseudo-code which you simply have to translate. I see. No creativity required. Yep.

> Most coding is not innovative, and that is the kind of code that these tools are producing and derived from in most cases.

I see what you want to suggest. Then it wouldn't be necessary to learn from these datasets; you could simply build a "fair use" product that covers these cases with a snippet engine.

Don't be naive.


> How are you a "creator" (in an attribution-worthy sense) if you are producing an unoriginal implementation of an old algorithm that thousands of coders have produced before you?

If programming is nothing but translating unoriginal old algorithms, then you should train copilot on those. Nobody would complain. The fact that they don't is an unassailable proof of the unsurprising fact that programmers add value to programs.


I agree with you. I love using Copilot (and similar tools) and I’ve found it is exceedingly good at predicting patterns in my own coding. It has saved me a lot of time, and I would hate for it to go away or be crippled because of lawsuits like this. I couldn’t care less if my own code is used for training. My code isn’t precious; it’s what I do with it that is important.


You're arguing that it should be allowed because you don't feel it harms you personally.

Literally no one objects to people like you donating your code to Microsoft.


Protecting the property rights of the rich is protecting freedom and civilization, protecting the property rights of people who share their work is ego.


In my mind I don't want MS to rehash my code and sell it using "OpenAI" as some laundry machine.

Given how insane copyright laws are I would be pleased if they for once worked in my favour.


Completely agreed. It's crazy that people who claim to value freedom and being open quibble over copyright laws and licenses written for and by lawyers.

After reading the comments here, apparently everyone is a genius with code so special and unique, it would be unfathomable for two or more people to arrive at the same exact outcome.


Nobody is asking for AI in general to be illegal, only training on code you don't own and then emitting it for profit.

Why can't they train on code that they own, such as the windows source code?

Why doesn't copyright law apply to open source code but applies strongly to windows source code?


Tell me you don't know the history of and reasoning for free software (and attribution) licenses, without telling me you don't know the history of... etc.

Mind-bogglingly entitled.


Who's more entitled? The coder who has no issue with their unoriginal code being copied and mixed with millions of other samples and churned out in a helpful way for others, or the one who demands attribution in the most trivial of cases, or denies the access in these forms as it doesn't credit their brilliance in implementing a sort function?

I'm in the first category; I'm guessing by your abusive response you are in the second?


Nice straw-men you're collecting there.

First one: Uses of attribution and copyleft licenses are just ego-boosting, instead of legitimate protection of authorship against corporate piracy.

Second: Criticism of said corporate exploitation of community work is the actual entitled behaviour. Oh, it's also abusive.

Third straw man: that people who oppose CoPilot in its current form just want to defend copyright around boilerplate stack-overflowish type code.

All false.

I can only assume... You're either too young and inexperienced to remember the early days of the copyleft, free software, and open source movements and why these licenses exist (and still need to exist)... or your values are so backwards that you just think it's OK to harvest other people's hard work for your own (or your employer's) profit.

To be clear: there is no heuristic at work in something like CoPilot that can distinguish between boilerplate code and genuine innovation. It has been shown multiple times to just freely copy and paste novel, copyrighted code, without attribution or conforming to license restrictions. That is unacceptable and deserving of legal countermeasures.

I would have no problem with CoPilot copying only the code of people who have opened their code for that kind of use. But that's not what it does.

Notable that Microsoft, its owner, is not training CoPilot on its own massive corpus of code. Just other people's code.


"Second: Criticism of said corporate exploitation of community work is the actual entitled behaviour. Oh, it's also abusive."

No. You are being abusive when you throw out insults to a commenter who argues something you disagree with - please don't try and obfuscate what you were doing, and check the HN guidelines before commenting further, as you are repeating the hostile and condescending tone and should know better.


Looking at this thread, I don't think you have much room to call someone else out for being condescending. I think it'd be useful for you to slow down and write a more considered position, taking time to address the valid concerns others have raised.


There are definitely valid concerns and counter arguments; they will be more powerful and persuasive when presented on their own merits rather than with the assumption and accusation the poster they are responding to is unethical, stupid or commenting in bad faith. I think I've been polite in response.

However, as I started the thread with a comment that I guess was more provocative (and perhaps more personally felt by others here) than I intended, I'll accept your criticism and bow out.


The coder that thinks that making the first decision about their own code entitles them to everyone else's.


[flagged]


Not that it matters but I contribute to open source myself.

Wrt "devious" and "asshole" I see you are a new poster, you may want to check the site guidelines linked at the bottom of this page.


"Copyright" is not a natural right, it's an artificial right we invented ostensibly to benefit society. If innovations like Copilot provide more benefit then they could get exemptions. That's why fair use is an exemption.


The company that owns Copilot believes in the aggressive enforcement of copyright.


AI advancements are shining a light on how warped our society has become. We _should_ invent technology to better our lives, but instead it's become something to fear as it might destroy our livelihoods. I understand the threatening feeling AI brings, but rather than squashing innovation we should be rethinking the role of our economy, copyright, and intellectual property.


I agree with the sentiment. However, given the current political realities, it seems we are happier to let people lose their livelihoods without any clear replacement, training, or economic plan of any kind. See manufacturing in the rust belt for an example of what happened in the last two decades when we moved to the knowledge-worker economy without a plan for the workers.


AI had enormous potential for interesting artistic effects. Instead it's being used to make (for now) shoddy knockoffs, and maybe someday better-than-the-original knockoffs. I was hoping for the invention of a car, but all we got was a faster horse.

Would anyone have been able to sell a car if faster horses were already clogging up the streets?


Copyright covers expression, but not the ideas themselves. So it should be ok to mine ideas from projects, open or not, as long as the model doesn't reproduce expression. And even expression can be copied if it is small enough, trivial, public knowledge, the only obvious way to do something or an API call.

If you want idea protection you need to look at patents.


Google couldn't get this argument (that software APIs are not copyrightable) to fly at SCOTUS. And that was a case where pretty much every computer person except Oracle agreed that Google was right.

Arguing that AI is mining ideas and not expressions is going to be a lot less successful when you've got a large pool of expert witnesses who are going to be able and willing to say that AI is only capable of mimicking the form of what it sees.


Wait, what do you mean Google couldn't get that argument to fly? Google won the case. Are you referring to the fact that the SCOTUS didn't directly address the copyrightability of APIs and instead ruled in favor of Google on the basis that Google's use was fair use *even if* APIs are copyrightable?

https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....


Yes. Google's argument that software APIs weren't copyrightable because of the idea-expression divide didn't find purchase. It won the case on other (fair use) grounds. Indeed, if you read the opinion closely, there are a few places where it looks like there used to be a section in the opinion on software API copyrightability (given Breyer was the author, non-copyrightability is probably more apt) that was ripped out in later drafts, presumably because there weren't the votes for it.


If Sally wrote a program that generated a giant dataset of token frequencies and associations by analyzing Github source code, that doesn't violate copyright and she could sell that.

If Bob wrote a program that took Sally's dataset and produced source code from prompts, that too would in principle not violate copyright and he could sell that.

But you're suggesting that if one person did both at once, that would violate copyright?


AI is not just that. That's a myth.

Yes, when you train on a human generated corpus you're going to get an interpolative AI. But when you train on a dataset generated by AI, you can surpass human level.

AlphaZero started, as the name says, from zero. No examples of how we play the game. In just three days of self-play it surpassed the best human. How was that possible? We had a 2,000-year head start, larger brains, and many players, not just one model. AlphaZero did it by creating its own training data.

Here is an example of using code language models for this kind of dataset-creation by play.

> Evolution through Large Models

https://arxiv.org/abs/2206.08896

By the way, GPT-3 and ChatGPT are not simple interpolative language models. They have also been trained on many tasks, problems, and code, which is the kind of data that will awaken the skills of the model to a whole new level. What I am saying is that we can auto-generate more problem data and increase their skills.


>And even expression can be copied if it is small enough

This is what I would argue if I was Microsoft's lawyer. You can't win a copyright lawsuit over one bar of music, one dance move, or a few words. Similarly, copilot can't be considered to be infringing on anyone's copyright because the snippets it might copy verbatim are too short to be copyrightable.


> You can't win a copyright lawsuit over one bar of music

Actually, Kraftwerk did

https://www.factmag.com/2019/07/30/kraftwerk-sample-lawsuit-...


In Europe - not where this lawsuit is happening


Somebody needs to apologize to Biz Markie and all of hiphop for establishing precedents that say the opposite, and for all the work they have to go through to clear a sampled 50 year old drum fill (usually including credit as coauthors and publishing residuals.)

There are bands that are giving all of the profits of entire albums to the people whose sample they failed to clear on a single song.


> You can't win a copyright lawsuit over one bar of music, one dance move, or a few words.

Actually... The smallest successful copyright claim is I believe over 10 words or so.


Yes, this case: https://fkks.com/news/can-borrowing-ten-words-be-copyright-i...

However, this definitely pushed the boundaries of copyright law. Audi never appealed the $1m verdict, so it's hardly settled case law.


> If you want idea protection you need to look at patents.

No. You can't.

https://www.legalzoom.com/articles/can-you-patent-an-idea

'A machine sucking dust from a surface in order to clean it' is an idea.

Dyson design is an invention.

https://www.dyson.com/vacuum-cleaners/cordless/v15/detect/ye...

Here's a broad description of the differences between copyright and patent (trademark included):

https://copyrightalliance.org/faqs/difference-copyright-pate...

None of them covers the principle of idea.

And if you want an example of a dubious (in my opinion) attempt to patent an idea (patent troll):

https://patents.google.com/patent/EP3811791A1/en

Why is it a patent trolling attempt (in my opinion)?

Because it's basically an idea (adding a natural compound to e-cigarette e-liquids as a sweetener) padded with an extensive list of claims covering general principles of everything related to e-cigarettes and e-liquids, to pretend it's a specific formula (which it is not) or an invention (which it also is not, because it's a natural compound known for its sweetener properties since the 19th century).


If it reaches a decision, I’m curious how it will affect “education”. If you study a hundred repos to learn how to do a thing, and then produce something of your own that happens to be similar (because that’s how you learned), are you under any particular obligations?

One hopes Oracle et al are not further inspired by such an outcome.


Humans aren't machines, so it won't affect education at all. This argument is a distraction.


Humans are machines actually. Whether what Copilot does can actually qualify as "learning" vs. merely generating derivative works might actually be relevant.


Machines aren't humans.


Sure, but can machines learn and create new works the way we do? That's the only relevant question.


Learning from others is copyright infringement, better make sure none of your code contains any sequences longer than 150 characters that match any other code ever written.


> Learning from others is copyright infringement

Isn't this the reason for clean room implementations?


That's specific to reverse engineering a product that has the same features as an existing one though. Part of the concept of open source is allowing others to learn from your implementation. In many cases one doesn't want to include an open source project directly in their code base for a variety of reasons, maintainability being a big factor. I don't see a clean room implementation being a good solution to the problem of wanting to use accepted standards and practices which must be gleaned from experience working with or learning from other people's code. I don't think you can get far arguing that all code should be written in a vacuum with no ability to learn from outside sources.


Wouldn't this kind of ruling effectively put a halt to ChatGPT and other AI's training on publicly accessible data? What's the difference between Copilot creating output based on code on Github, and ChatGPT giving answers based on a NYT article (without attribution)?


IMO it should be treated like a human. Your output is 99% similar to this <code/article>? Copyright infringement: you should have mixed your own thoughts and reasoning into your output. Humans can plagiarize just as easily as ChatGPT/CoPilot can generate verbatim text from its training set.


At what point can one say that the source is unique enough to qualify for protection? Otherwise, I can't use `print "Hello world"` because I didn't mix my own thoughts and reasoning into the output.


It's no different from how current copyright works for us humans. Something is only copyright protected if it's a "sufficiently original" work and "possesses at least a minimal degree of creativity" https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...


So (just thinking out loud), if Copilot suggests something only seen in one codebase, the code owners have a decent copyright case. But if copilot suggests something that's frequent across multiple, there's really no case to be made.


That's at least one of the rules that GH is trying to enforce on CoPilot, but legally I imagine that even repeating code that appears multiple times on the internet could be considered copyright infringement (i.e. if multiple people copied that code from one person).

The problem here ends up being that code, especially in popular languages, will often look similar when you're doing something like finding the best implementation of an algorithm. So if you invoke CoPilot on a common problem, chances are it can pull the exact code it needs from its dataset, but it also could have generated that same code snippet had the solution not existed in its training dataset. And when you start out solving a problem and then ask it to continue writing more code, it just assumes you're solving the exact same problem that the original source code was solving.

This could probably be remedied if CoPilot spat out a "this is X% similar to <x> source code from the internet" notice, so that you could know just how unique CoPilot is being. Legally, copyright is just a mess and was ready for neither the scale of the internet nor the advances in ML, now that there are machines with a 50% chance of infringing on someone's copyright and a 50% chance of creating something new.
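The similarity disclosure imagined above could, in the simplest case, be a sequence-ratio check of each suggestion against known sources. A minimal sketch (the use of difflib and the repo names here are illustrative assumptions, not anything GitHub has described):

```python
import difflib

def best_match(generated: str, known_sources: dict) -> tuple:
    """Return (source_name, similarity) of the closest known snippet."""
    best = ("", 0.0)
    for name, snippet in known_sources.items():
        # Ratio of matching characters between the two strings, in [0, 1].
        ratio = difflib.SequenceMatcher(None, generated, snippet).ratio()
        if ratio > best[1]:
            best = (name, ratio)
    return best

# Hypothetical indexed training snippets.
sources = {
    "repo-a/sort.py": "def bubble_sort(xs):\n    for i in range(len(xs)):",
    "repo-b/io.py": "with open(path) as f:\n    data = f.read()",
}

suggestion = "def bubble_sort(xs):\n    for i in range(len(xs)):"
name, score = best_match(suggestion, sources)
# A Copilot-style UI could then warn: f"{score:.0%} similar to {name}"
```

At GitHub's scale a pairwise scan like this is obviously infeasible; a real system would need some indexed fingerprinting scheme, but the disclosure itself is the point.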


My understanding is that code presented for learning/demonstration purposes is allowed to be copied verbatim.


Good question. I think that's exactly what we need to decide as a society: both whether these things are violating existing copyright laws, and whether the laws should be changed to specifically handle this new situation. I don't know where I land, myself. My gut instinct is on the side of the creators and that these AI tools are illegally infringing on the creators' work. But I think there's reasonable arguments on both sides.


There is a difference between products and research in this case. Frequently, research is allowed by law as an exception, where building a product requires some more extensive agreement.

Unfortunately this is frequently abused where researchers build a model under the exemptions, and then others use that model commercially, even if they wouldn’t be allowed to build that model directly themselves.

Anyways, the scientific progress would continue, but products would halt until product developers get some kind of agreements with content creators (eg maybe people start adopting a new kind of open-ish license).


There is no difference, which is why this lawsuit won't be the only one


I just skimmed through the case and liked how they defined Artificial intelligence

“Artificial Intelligence’ is referred to herein as ‘AI’. AI is defined for the purposes of this Complaint as a computer program that algorithmically simulates human reasoning or inference, often using statistical methods. Machine Learning (‘ML’) is a subset of AI in which the behavior of the program is derived from studying a corpus of material called training data.”


Has there been a lawsuit filed on the image generation side? DALL-E and Stable Diffusion trained on images on the web, many of which weren't even freely licensed. So I would think similar legal arguments would apply there.


Definitely about to happen as soon as lawyers can establish that damages are large enough to be worth the effort, or the companies piss off some rich artist willing to underwrite the cost, à la Thiel with Gawker. There was a ruckus just a few weeks ago about an artist feeling that her style of art has become very easy to rip off.


The big difference there is that you never get a 1 to 1 copy of the source content out of the image models, where you often do with copilot. Whether your use is transformative is a part of the fair use legal test.


> where you often do with copilot

Not at all. You need to bait it really hard and push it into a corner for it to reproduce anything. At that point you might just as well go to the repo and copy-paste the code directly.

I've used Copilot since day one and still haven't seen anything that felt like a 1 to 1 copy of something. It's highly contextual and all about using the code I have already written to craft its suggestions. Even if I ask it explicitly for a known algorithm it will use my code style, patterns and naming conventions to write it.


Except that Stability AI have not only admitted to training on copyrighted images without the permission or attribution of many artists [0], they already run a commercial SaaS API platform that uses the model [1], effectively undermining the sloppy 'fair use' or 'transformative purpose' claims: any artist can see all the digital art that Stability trained on without permission or attribution [2], and the model can even produce famous copyrighted images verbatim.

There is no difference between the two. Another Stability project, Dance Diffusion, was trained on public domain music and audio with the permission of musicians [3]; clearly Stability knew they would be sued into the ground if it had been trained on copyrighted music. They seem to have understood they were trampling over the copyright and watermarks of images, and knowingly avoided doing the same with music.

My point is: use and train only on public domain content and content used with the creator's permission. This applies to all of them: DALL-E, Copilot, Stable Diffusion. Clearly it wasn't a problem to use public domain music with Dance Diffusion, was it?

[0] https://venturebeat.com/ai/stability-ai-to-honor-artist-opt-...

[1] https://platform.stability.ai/

[2] https://twitter.com/EMostaque/status/1603147709229170695

[3] https://techcrunch.com/2022/10/07/ai-music-generator-dance-d...


I've seen quite a few exact copies of watermarks "transformatively" and "artistically" generated in AI images.


An identical/similar blob of pixels across tens of thousands of otherwise unrelated images is exactly the kind of thing I would expect to see mashed up. I'm not sure what your argument is here.


I suspect that a lot of the people lining up on the other side against GitHub/Microsoft won't be so happy if the courts further lock down permissible uses under copyright across the board.

More specifically, if Copilot breaks the "rules," so too does (probably--IANAL) pretty much every generative AI project out there. Restricting training to public domain datasets would be very limiting.


All copyrights and patents slow progress. If the desire for generative models is greater than the desire to hoover up cash, refine or end copyright and patent laws. Free Software people would build a statue to Microsoft if they started campaigning for an end to software copyrights.


>Free Software people would build a statue to Microsoft if they started campaigning for an end to software copyrights.

Some might. But FOSS licenses can exist because of copyright. So if, hypothetically, there weren't software copyrights, anyone could take any code and monetize it however they wanted with no restrictions. That might or might not be a big deal--the general trend has been towards more permissive licenses anyway.


I consider myself to be kind of a Free Software person and I wouldn't cheer for the end to software copyright, I worry that the abolition would harm the cause I care for, which is preserving the freedoms of users. Getting rid of copyright wouldn't stop bad actors from trying to circumvent their users' rights, and with no recourse to copyleft it would become harder to fight that.


I’m surprised it’s taken this long to see something like this on HN.

Maybe something like this could be a propellant for legitimate copyright reform.


Given who has more money for lawyers, I worry that copyright reform will take the form of small creators being given weaker protection than large entities - for example by making formally registering a copyright on a work necessary to get "full" protections, while self-applied licenses are treated as merely suggestions.


There have been plenty of discussions of this subject here.

Eg

https://hn.algolia.com/?q=copilot+copyright


Well, I clearly saw this coming from miles away. [0] [1]

Effectively you can't even use the code since it has trampled on licenses which are incompatible with each other. But it is more of a problem for Copilot.

[0] https://news.ycombinator.com/item?id=27725322

[1] https://news.ycombinator.com/item?id=27772446


I see lots of anti GitHub (really Microsoft) sentiment here, but doesn’t a ruling against GitHub have massive implications for any “all powerful ML trained AI model” period?

Like, we’re all swooning over ChatGPT, but how can ChatGPT be legal if this isn’t? I can literally ask it “write me a song about cryptocurrency in the style of Taylor Swift” and it will. It couldn’t do that if it hadn’t trained on Taylor Swift song lyrics.

Doesn’t this kill ChatGPT? Doesn’t this lawsuit potentially kill lots of training models?


It very much depends on the exact ruling. Judges usually don't like to make rulings broader than they need to be, so it may well be something decided on a technicality. Or the ruling may rely on some particular property of the Copilot model that does not apply to ChatGPT.


Good riddance.



I think MSFT will regret not using some kind of Bloom filter to avoid producing existing code verbatim. If they could show that it will never reproduce a long sequence (unless that sequence appears in more than x independent repos), they would have a much stronger argument for being transformative.
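A minimal sketch of the idea: hash fixed-size token windows of the training corpus into a Bloom filter, then flag any generated output whose windows have all been "seen". The window size, filter size, and hash count below are arbitrary illustrative choices, not anything GitHub has described:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Index every 5-token window of the (stand-in) training corpus...
corpus_filter = BloomFilter()
training_code = "def quicksort ( arr ) : ...".split()
for gram in ngrams(training_code, n=5):
    corpus_filter.add(gram)

# ...then flag a generated sequence all of whose windows are "seen".
candidate = "def quicksort ( arr ) :".split()
verbatim = all(g in corpus_filter for g in ngrams(candidate, n=5))
```

A nice property: Bloom filter false positives can only cause over-blocking, never under-blocking, which is the safe direction for a verbatim-copy guard.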


I've produced a few pieces of software that has user generated content as a feature and every lawyer I've used to draft the ToS has added provisions that gave my organization explicit rights that extend beyond whatever other licensing may be applicable. Both in being able to improve our own features or create new ones with that content and other rights that would indemnify us from a wide range of legal issues.

I'm curious to see how this will be litigated because of that. The differences here seem to be how public Copilot's use of user-generated content has been, the wide range of licensing that content falls under, and the fact that this is an entirely new product produced by a third party through a partnership, versus a first-party tool.


We've discussed the suit here before a few times (ex: https://news.ycombinator.com/item?id=33485544) and also whether something like co-pilot even needs a license (https://news.ycombinator.com/item?id=27736650).

There's a prediction market on this suit's success, which is currently at 35% (https://manifold.markets/JeffKaufman/will-the-github-copilot...).


Here's a link to the case itself so you won't have to guess about anything. https://githubcopilotinvestigation.com/


Would love to see an overlap of the commenters here versus the ones in the "piracy is totally okay!" thread with people saying it's morally just fine to pirate videos and music just because they wanna.


Good, I put all my code on GitHub under the MIT license because I'd like people to be able to use it as they see fit.

That doesn't include not giving me credit.


Isn't it strange how Microsoft's GitHub now finds itself party to source-code licensing lawsuits, after Microsoft's 'association' with the SCO Unix cases of yesteryear.

[1] https://www.cnet.com/tech/tech-industry/fact-and-fiction-in-...


Related. Others?

The lawsuit against Microsoft, GitHub and OpenAI that could change rules of AI - https://news.ycombinator.com/item?id=33546009 - Nov 2022 (5 comments)

An open source lawyer’s view on the copilot class action lawsuit - https://news.ycombinator.com/item?id=33542813 - Nov 2022 (175 comments)

Microsoft sued for open-source piracy through GitHub Copilot - https://news.ycombinator.com/item?id=33485544 - Nov 2022 (288 comments)

We've filed a lawsuit against GitHub Copilot - https://news.ycombinator.com/item?id=33457063 - Nov 2022 (781 comments)

GitHub Copilot may steer Microsoft into a copyright lawsuit - https://news.ycombinator.com/item?id=33278726 - Oct 2022 (11 comments)

GitHub Copi­lot inves­ti­ga­tion - https://news.ycombinator.com/item?id=33240341 - Oct 2022 (1219 comments)

What Copilot means for open source - https://news.ycombinator.com/item?id=31878290 - June 2022 (137 comments)

Should GitHub be sued for training Copilot on GPL code? - https://news.ycombinator.com/item?id=31847931 - June 2022 (300 comments)


This seems to ignore the widely repeated claim that GitHub's terms of service explicitly grant them a license beyond the actual open source license attached to the code, and thus transfer the burden of liability to the uploader when it comes to code whose licensing they cannot control.

So either this is about code authored by people who did not use GitHub (in which case GitHub would be immediately liable, though they could try to sue whoever uploaded that code to GitHub for damages) or it's going to have to argue that the terms of service can't smuggle in a provision that effectively sidesteps even the most permissive open source licenses.


The relevant part of the terms would be https://docs.github.com/en/site-policy/github-terms/github-t..., plus the definition of Service, which would include Copilot. I don’t believe GitHub have ever claimed or suggested that they are relying on this, and I think it would be a very shaky claim in court due to the second paragraph.

Rather, GitHub have consistently cited “fair use”, as also noted in the suit, including in the summary at https://unicourt.com/case/pc-db5-doe-1-et-al-v-github-inc-et.... I also don’t believe GitHub have ever claimed to only use GitHub repositories, though I know of no obvious evidence of them having fetched from other sources, and they may well not have simply because it’s more convenient not to and they’ve got enough already, even if it honestly weakens their position (“if you’re relying on ‘fair use’, why haven’t you added closed-source software like the GitHub backend to show you mean it?”).


I think they're challenging the validity of what's in the user license agreement. Companies can put whatever they want in there, but not everything is enforceable.


I think it wouldn't be enforceable in a consumer service (at least in my jurisdiction, see a German court ruling against WhatsApp banning a user for using a third-party client by claiming doing so violated their ToS).

But given that implicitly or explicitly GitHub users act more like users of a commercial service (remember: commercial doesn't mean paid or b2b), things might be different given that consumer protections don't necessarily apply.

Personally I'd love to see the same "you can't hide surprises in your ToS to obtain 'consent'" yardstick be applied here though, commercial service or not.


I wouldn't be surprised if the result varied by country.


I suggest reading the actual complaint. Your interpretation is wrong.


Is the actual complaint linked or quoted somewhere in the article? I've re-read it twice and it spends most of its wordcount explaining what open source is, mentioning a previous case and describing the implications of the attribution requirement with regard to the DMCA. There are plenty of links but they go in all kinds of places except to the ruling itself.

Your response is not very helpful beyond telling me I lack information I'm unaware of and couldn't find.


Ethics aside, by the time this case is resolved it likely won't matter anymore. I see models getting more granular and specific (and thus smaller) while typical personal computer resources get larger. At some point you'll just download the largest 'javascript inference model' that will run on your hardware and be done with it.

There are a huge number of motivated developers that want this to exist, the techniques themselves are not novel, and the code itself cannot be kept from whomever wants to train a model. It's a lost cause.


Thought exercise: How can GitHub claim ANY license to my code that was uploaded by someone else without my permission or even notification? Lots of my code is there, uploaded by others.


Here is a dedicated website from the plaintiffs team. https://githubcopilotlitigation.com/

Interestingly, Matthew Butterick (of Practical Typography fame) is co-counsel on this. Programmer, lawyer and typesetting expert, damn. https://matthewbutterick.com/


If the model is being trained on the code, and is not copying and pasting it or including it directly from the various repositories, then I would think a blanket attribution covering all material used to train the model (basically a giant list of all the authors), added to the Copilot repo, should satisfy the attribution requirement.

It'll be interesting to see how this plays out. To my understanding, these language models are strictly statistical in nature, so they aren't building a database of code to paste snippets from. They look at all the examples and encode the statistical likelihood that one token follows another; then they take the preamble (the code you wrote) as input and generate the chain of tokens most likely to follow it. It seems like the same process a person goes through when reading a lot of code, identifying patterns (e.g. an <a> tag has an href= attribute, or other more complex configurations), and then writing code based on that understanding. If you can prove that is infringing, then you could potentially prove that reading other people's code and writing your own based on what you learned is infringement, even if it doesn't exactly match the code other people have written! I hope this can be effectively explained in court.
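The statistical process described above can be illustrated with a toy bigram model. This is purely pedagogical: real models like Copilot are neural networks conditioning on far longer contexts, not raw bigram counts:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus_tokens):
    # Count how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def complete(counts, prompt_tokens, length=5):
    # Greedily append the statistically most likely next token.
    out = list(prompt_tokens)
    for _ in range(length):
        successors = counts.get(out[-1])
        if not successors:
            break
        out.append(successors.most_common(1)[0][0])
    return out

# Tiny "training set" echoing the <a href=...> example above.
corpus = "<a href= url > text </a> <a href= url2 > more </a>".split()
model = train_bigrams(corpus)
print(complete(model, ["<a"], length=3))
```

Even this toy reproduces its training text verbatim wherever the statistics are sparse, which is essentially the regurgitation behavior at issue in the suit.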


The main sources of discomfort come from when Copilot really does copy-paste code blocks verbatim. It doesn't happen all the time, but there are still many examples like the following: https://twitter.com/mitsuhiko/status/1410886329924194309


I consider it a derivative work: it’s using statistical models to string tokens together and has no ‘knowledge’ of a given block of code.


The issues stem from Copilot frequently regurgitating code verbatim, as seen in examples like this: https://twitter.com/mitsuhiko/status/1410886329924194309


The article is once again mixing up the production of copyrighted work, which is illegal, and training on copyrighted but publicly available work, which AFAIK isn't illegal. And I don't see how it could be illegal when the code is public (although I don't doubt that lawyers will find a way).


The lawsuit is about attributions and licenses, not about copyright infringement.


It's the same issue. It's only valid if the code produced is the same as the code it learned from.


I don’t think this is necessarily true for derived works. It’s a gray area in any case, so the outcome will be interesting.


Even if it reproduces (proportionally) small sections of code it's been trained on verbatim, its function is transformative and will likely be deemed fair use. This lawsuit is going nowhere other than the HN front page.


The actual class-action lawsuit page, curiously not linked from the article: https://githubcopilotlitigation.com/


So really it boils down to attribution, therefore if GitHub were to disclose attribution to all the copyright owners of the code used to train the model, then this issue will be mute.

It will be a long list, but just a list.


It would be very difficult to track what record in the training set contributed to what weight adjustment, especially after all the tokenization that is done.

s/mute/moot/; moot: having little or no practical relevance, typically because the subject is too uncertain to allow a decision.


And there is a list right?

https://github.com/search?q=license%3Acc0-1.0&type=Repositor...

All repos with a particular license.
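For what it's worth, the same license filter is exposed by GitHub's repository search API, so a list like this could be pulled programmatically (a sketch that only builds the query URL; note the search API caps results at 1,000 per query, so a truly complete list would have to be sliced with additional qualifiers):

```python
import urllib.parse

API = "https://api.github.com/search/repositories"

def license_query_url(license_key, page=1, per_page=100):
    """Build a repository-search URL filtered by SPDX license key."""
    params = urllib.parse.urlencode(
        {"q": f"license:{license_key}", "page": page, "per_page": per_page}
    )
    return f"{API}?{params}"

print(license_query_url("cc0-1.0"))
```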


full screen popup/overlay "Sign up for daily email alerts"

This popup appears within a half second of the page loading. It's so abrupt and disruptive to my mind. My eyes have JUST located where to start reading and then BAM THIS GIANT POPUP TAKES OVER THE SCREEN.

I'm exiting your page immediately. I don't care if you have a free recipe for alchemy. I wish I had an easy way to outright block the domain so I could keep track of the offenders.

Popups and overlays have ruined the modern web.


If I create a news website that generates articles by scraping all the major news websites across the world, and monetize it via ads, would that be legal? Not rhetorical.


I don't understand this case. Is Copilot copying code wholesale and presenting it as its own? Because if not, then it doesn't need to attribute anything, any more than I need to attribute John Go or Edward PHP every time I use a trick I picked up by reading their code. Obviously, if they trained Copilot on private code repos, that's a whole other discussion. But I assume they didn't, so you don't even have the argument that they should have paid the repo authors to be able to train on their code.


Yes, Copilot has had issues with copying entire blocks of code verbatim. There's a decent number of examples like the following: https://twitter.com/mitsuhiko/status/1410886329924194309.


yeah that's great but we still unlocked tons of productivity

who cares man

just build stuff

if you don't want people to learn off your code just don't share it!


Interestingly, the same companies who made the paid service by analyzing all of that open-source code would never, ever consider open-sourcing the code for that service.

> if you don't want people to learn off your code

"Learn" is a strange verb to use here. No one at Microsoft or OpenAI was scraping all of GitHub so that they could learn. They took people's licensed works, fed it into a very sophisticated copy-paste machine, and started making money off of it.

> just don't share it!

It's almost like licenses and copyright exist to protect the rights of their holders or something.

The entire point of licenses is to be able to share your work in a way that respects your wishes. "Just don't share it" is completely non-productive.


the cost of training is exponentially decreasing. It's only a matter of time before a codex-like model is released to the open.


There has to be an option, something like a "repository opt-out from Copilot" button.


This needs to be opt-in, not opt-out.


FOSS licenses need to add provisions that only 100% open and free to download AI models can be trained on works licensed under them. They need to add this yesterday.


That's almost certainly not enforceable though. See any of the cases where people have successfully defended web scraping while violating ToS.


I can see that in like GPLv4, they could consider AI training and such as derivative work.


Do you consider WTFPL a FOSS license? The polite versions are MIT-0, 0BSD and CC0. I guess anyone using them is fine with Copilot being closed and doing whatever it wants. But people using AGPL may have a very different opinion.


Doesn't sound very Free to me...


I hope people hoping this lawsuit succeeds will also accept any ruling favorable to Github.


Is encrypting your code and pushing that to GH an option, or am I missing something obvious?


This is going to have ramifications for things like ChatGPT as well.


This may be of interest to others here. After reviewing the existing OSS licenses, I decided to write my own (SAUCR: Source Available Under Commercial Restriction—pronounced "saucer"). I'm still working on formalizing the details of it so others can use it, but if you're curious there's an example here [1].

tl;dr it gives specific permissions as to what derivative works are and are not permitted while making the source available for others. The key being: you can be as permissive or as limited as you want in how your code is used.

[1] https://github.com/cheatcode/joystick/blob/development/LICEN...


This is why we can't have nice things.


> Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?

> Does the attribution need to be included in the data set at every stage?

The above two questions seem identical. My gut feeling is no; you don't need to intentionally train on attributions nor do you need to ensure that the data set columns have attribution data in them when provided to your training code. The attribution requirement of CC-BY and CC-BY-SA triggers when you do any of the things copyright law says you have to get permission in order to do, and the license further restricts that requirement to public instances of such. So privately shoving Creative Commons data into a neural net trainer is probably fine. Not having attribution on the data the model sees does not foreclose the possibility of providing attribution alongside the model at the time of publication.

> Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

This is an active legal question.

My personal opinion is that if you can draw a line from a particular output to something in the training set, then you're either copying or creating a derivative work, and you need to follow any relevant licenses. Generative models are capable of outputting their training set data, especially if overfit; so using them exposes you to the licensing requirements of anything the model saw that matches its output domain.
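One crude way to "draw a line" from an output back to the training set is a longest-shared-substring check (a sketch; the 40-character threshold is an arbitrary assumption, and overlap detection at real training-set scale would use suffix arrays or hashing rather than pairwise comparison):

```python
from difflib import SequenceMatcher

def longest_shared_run(output, training_doc):
    """Length of the longest verbatim character run the two strings share."""
    m = SequenceMatcher(None, output, training_doc, autojunk=False)
    return m.find_longest_match(0, len(output), 0, len(training_doc)).size

def looks_copied(output, corpus, threshold=40):
    """Flag outputs that reproduce a long verbatim run from any training doc."""
    return any(longest_shared_run(output, doc) >= threshold for doc in corpus)
```

Anything such a check flags would then carry the license obligations of the matched source.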

I'm actually considering this as part of PD-Diffusion; which is my attempt at building an art generator trained on public-domain images. Since I'm scraping Wikimedia Commons to get both images and labels, the labels are CC-BY-SA[0]. This means that my trained models will also need to be CC-BY-SA and ship with a very, very long text file listing attributions for all the labels I used[1]. But, notably, because this is an art generator and not a label generator, I don't need to worry about attributing model outputs. None of the label data will make its way into the final image.

If I did have a reliable way to attribute model outputs, then I could make "CC-Diffusion", trained on all CC-BY and CC-BY-SA images on Wikimedia Commons. But even then there's noncopyright ethical concerns with doing that. I'm not even using the full public domain as-is, just the PD-Old category, because Wikimedia Commons has a lot of uncopyrighted Italian images of living people that I do not want in my model.

> Also sued were a confusing mishmash of for profit and non-profit related entities all using a variation of the name OpenAI (OpenAI, Inc., OpenAI, LLC, OpenAI Startup Fund GP I, L.L.C.; you get the picture). OpenAI received one billion dollars in funding from Microsoft although they seem “officially unrelated.”

OpenAI's ownership structure is hilariously convoluted, even by the standards of, say, Mozilla having separate 501(c)(3) and for-profit arms. The goal of the company is to launder noncommercial research into commercial products, and they even have a laughable "capped profit corporation" explanation for this.

[0] There are two exceptions to this:

- Structured data, i.e. the caption field and Wikidata, is considered to be copyright-free and explicitly has a CC0 license applied to it.

- Some public domain images have contradicting copyright terms applied to their wikitext; i.e. CC-BY-NC-SA. Those will need to be detected and filtered out of the label set, but I haven't written the code to do this yet. That's also why I haven't released any trained models.

[1] Currently this is going to be in the form of Wikimedia Commons usernames, specifically all the users that were in the revision history for the images. I believe there are also some wikitext attributions that I need to write code to find and reproduce.
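For what it's worth, the NC-SA detection described in footnote [0] could start out as a simple wikitext filter (a hypothetical sketch, since that code isn't written yet; the template spellings below are assumptions about how the license tags appear in Commons wikitext, and real template names have many more variants):

```python
import re

# License templates that contradict a public-domain claim; a hypothetical,
# incomplete list of noncommercial Creative Commons tags.
FORBIDDEN = re.compile(r"\{\{\s*cc-by-nc(-sa)?[^}]*\}\}", re.IGNORECASE)

def usable_label(wikitext):
    """Drop labels whose wikitext carries a noncommercial license tag."""
    return FORBIDDEN.search(wikitext) is None

pages = {
    "old_painting.jpg": "{{PD-old-100}} A 17th-century portrait.",
    "photo.jpg": "{{cc-by-nc-sa-3.0}} Photo of a living person.",
}
kept = {k: v for k, v in pages.items() if usable_label(v)}
print(sorted(kept))  # only the PD-tagged page survives
```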


> Some open code carries relatively light requirements, for example: “Don’t use my code commercially (don’t sell it or use it in something you sell)” and, very basically.

How can anyone even enforce this? Why can't I take some code, create a SaaS product for drug dealers, and then go sell it to my Opp Daquavion Marshawn III down the block? Who will ever find out?


The reality is that open source license violation is rampant, even among big, recognized names. Enforcement, as you suspect, is sparse: usually an entity has to notice, and then make the effort to react to the situation, which doesn't happen often. There are entities who work specifically on license violations; for example, you can report them on gnu.org[0]. Given that there are large cases like TikTok using OBS code and not contributing back[1], I'm sure there are lots of cases where the community simply never finds out in the first place, similar to how software piracy is rampant in some parts of the world, even among commercial entities.

In case you're interested in more: https://en.wikipedia.org/wiki/Open_source_license_litigation

[0] https://www.gnu.org/philosophy/enforcing-gpl.en.html

[1] https://www.theverge.com/2021/12/20/22847213/tiktoks-live-st...


[flagged]


Thank you.

After pondering for months on this I have come to the same conclusions as you and reading your link made perfect sense.

We are fighting to work harder because we value currency above humanity. What a silly fight. Most jobs can already be done by an AI and we should work towards that, not the other way.

What is the point of free software again?

The tech is out of the bag. The hard part is done, the part where you are learning the unknown is infinitely harder than copying it afterwards. Nobody will ever control it.

Time to set ideas free.


>What is the point of free software again?

The point of the free software movement, and the licenses, is to make sure that in the future, they will still be free software. It's the Paradox of tolerance[0], expressed in legal language, for software source code. Digital goods, while often posing as tangible goods, actually have near zero replication cost, unlike tangible goods. This, combined with the special circumstances provided by free software licenses, enable a special economy, where the barrier to entry is the lowest possible, and the contributions to it are maximized for further enabling the special economy, for all current, and future participants.

[0] https://en.wikipedia.org/wiki/Paradox_of_tolerance



