Hacker News
GitHub Copilot investigation (githubcopilotinvestigation.com)
1847 points by john-doe on Oct 17, 2022 | 1219 comments



Here are a few thoughts I haven't formulated before:

It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works. It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture (grammar, concepts, facts, cultural tropes, and many others).

A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning (e.g. reading fantasy novels and learning concepts, tropes, or vocabulary that one uses to produce other fantasy novels; reading a newspaper and learning facts that one incorporates into an essay; learning artistic techniques or stylistic conventions from studying existing artworks and using them when producing new artworks). Current AI systems are (amazingly) becoming capable of all of these things and may do them in ways that are somewhat akin to how human beings do them. (although I guess Jaron Lanier would object "that's what they want you to think")

But there are also examples in existing copyright doctrine where people accidentally repeat enough of a prior work to get in trouble for infringement -- most often with song composition (like George Harrison's "My Sweet Lord") because relatively small pieces of melody (which a person might easily memorize) may be considered copyrightable.

If human beings had much more accurate memories, copyright would be quite a bit more intrusive (and/or quite a bit less effective) because, following any exposure to some kinds of works, we could use our own memories to reproduce those entire works from scratch for our own use or pleasure without obtaining authorized copies from elsewhere.

Computers do have such accurate memories, and machine learning systems, which are optimized for things like maximum likelihood estimation, can and do reproduce both copyrightable and non-copyrightable elements of works that they've been trained on. After all, the maximum likelihood continuation of a fragment of a text or a song is ... the complete original work. And the ability to reproduce the complete original work would, other things being equal, reduce loss in training. After all, that's something someone might specifically ask for, and if the system could oblige, it would be doing a better job of providing what the user wanted.
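To make the maximum-likelihood point concrete, here is a toy sketch (a count-based bigram model, nothing like a real LLM, and the training text is just an illustration): once the training text is memorized, greedily following the most likely next word at every step reproduces the original verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count-based maximum likelihood estimate of P(next word | word)."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def greedy_continue(model, prompt, n=10):
    """Follow the single most likely next word at every step."""
    out = prompt.split()
    for _ in range(n):
        nxt = model.get(out[-1])
        if not nxt:
            break  # no continuation seen in training
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

model = train_bigram("four score and seven years ago our fathers brought forth")
print(greedy_continue(model, "four", n=9))
# prints "four score and seven years ago our fathers brought forth"
```

The maximum-likelihood continuation of the prompt is exactly the memorized training text, which is the whole tension: better prediction and verbatim regurgitation are the same behavior here.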

It's relatively foreseeable that machine learning systems would potentially be able to reproduce both copyrightable and non-copyrightable elements of various works, because the distinction between the two isn't especially clear from an algorithmic or mechanical point of view. (For instance, facts aren't copyrightable, but the notion of what constitutes a "fact" for this purpose is a culturally-bound legal notion and not at all straightforward to make precise.)

But if you had a human author or artist or scholar or programmer who was "trained on" exposure to an enormous body of works, and that person had an exceptional eidetic memory, you could imagine that he or she would be perfectly capable of recreating many of those works from memory (and that other people might request such recreations). (Again, in music in particular, it's already routine that someone could have unambiguously copyrightable material memorized and be subject to copyright restrictions on performing songs. Like if a singer or band performs a cover from memory.)

If you wanted to avoid this ability then you might need to build in an explicit notion of copyright that limits the accuracy or level of detail inside of the model in some way. This is tricky because (1) I don't think people have really tried to do this much so far, (2) copyright applies very differently to different categories of work, (3) it obviously wouldn't satisfy critics even if it mitigated the most extreme examples of "regurgitation", and (4) it would be kind of weird because you would be intentionally limiting the quality and extent of learning that the system was allowed to do. (I imagine Jaron Lanier getting mad again about my repeated comparison between human learning and machine learning, and between human memory and machine memory)

Some of the weirdness in point (4) is that accurate prediction is usually cool / great / impressive / accepted as an appropriate goal or capability, but if it's too accurate in certain contexts, it may be deemed a copyright infringement. Like if you said "what word comes next? FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS", there's a clear correct answer and knowing it requires having a certain text memorized. OK, if you said "what word comes next? MR. AND MRS. DURSLEY OF NUMBER FOUR PRIVET DRIVE WERE PROUD TO SAY" ... same thing, but Bloomsbury Publishing may be unhappy if you have a system that can get all such questions right.


I see this basic logic in almost all AI ethics threads, and it starts with a big assumption: "humans learn from copyrighted source material without copyright violation". This then gets tenuously extended to "AI also learns, so it must not be in violation of copyright law".

The first assumption is highly flawed, though. Humans routinely do violate copyright law. Plagiarism is a huge problem in many sectors; un-cited direct copies of people's work, well outside fair use, are an everyday occurrence in the human world. It doesn't matter whether you memorized the source material, transcribed it, or copied and pasted it: if it isn't your material and is someone else's, you've violated the law. Learning to produce original work and reproducing someone else's work are not the same thing. If an AI is ingesting and perfectly reproducing someone else's copyrighted works, it is in violation of copyright law in the same way a human would be if they reproduce someone else's copyrighted works.


+1. And let's not forget too that "AI", that is, ML models, are not "autonomous" in the way that humans are autonomous. Sure, we use the word "learn" to describe what they do, which is one word that we also use to describe what people do. But ML models are always wielded by people or corporations for particular purposes.

If a corporation was to directly publish some copy that appears plagiarized, we'd call that plagiarism. I don't see how adding a piece of code—one that's fully created, owned, and wielded by the corporation—as an intermediary changes anything. If anything, it looks like plagiarism-as-a-service, which seems worse (at least to my eyes).

Of course, this matter is a bit confusing. Because, for example, (1) it's not always plagiarism, (2) defining what exactly is plagiarism even in the purely non-technological realm is difficult (and likely somewhat subjective), and (3) there is a lot of corporate marketing which suggests this "AI" is "autonomous" (presumably to distract from who exactly is autonomous in this picture). And of course ML art is quite useful for many things. But I mean, so are artists.

Not long ago, a lot of Silicon Valley rhetoric was that the purpose of "technology" was to free up time so that people could be more incentivized to "do what people love to do" like, for example, artistic creation. But now it seems that rhetoric was just that: rhetoric, or what was needed to be believed/said at the time.

And now at our present time, when technological "progress" has been followed a bit further (that is, when we've developed our machinery a bit further under the incentives of our present economic system), much rhetoric has conveniently shifted to something else, something largely contradictory, but again precisely to what is needed to be believed/said to continue following the same incentive structure.


A lot of really good points.

>Sure, we use the word "learn" to describe what they do, which is one word that we also use to describe what people do. But ML models are always wielded by people or corporations for particular purposes.

This is extremely important. "Learning" in machine learning is an aspirational label, not a descriptive one. People who claim otherwise either drank too much of their own Kool-Aid or are simply dishonest. This isn't just "wrong" in some taxonomical sense, this is dangerous in a very practical way. Conflating machine "learning" and human learning will inevitably lead to various kinds of sabotage of human learning.


I mean, at what point will this change? When the AI first has to be trained by being in a robot in the physical world for 10 years, learning human concepts, before it can start looking at art with the ultimate goal of learning how to draw?


The main reason AI will end up reproducing copyrighted works whose original license is not trivial to identify is that, in those instances, humans are already violating copyright at a high rate. It has just flown under the radar thus far because the machinery needed to surface violations so easily was not available.

Copilot is capable of going beyond retrieval and is competent at using variables, comments, types and local context to infer intention and generate appropriate code and even comment on it. Whenever copilot correctly predicts code of yours that's a novel combination of concepts, copilot has originated novel code.

For esoteric concepts, you usually already have to know how to prime it, but Copilot is especially useful when it helps you bump into things you didn't know you didn't know (one way to increase the odds of this happening is to write out your thinking so far in markdown or comments; you'd be surprised how helpful and clever Copilot can be in some instances). My point here is GitHub isn't charging $10/month for run-of-the-mill retrieval. My opinion is that code-gen LLMs contribute value and more open versions are worth building.


Indeed, the "learning". To my mind, the simplest (but still speculative) explanation of the "learning" phenomenon we see - both the working examples and the limitations and failures - is that the large models implicitly memorize the training inputs (or some derived features that can be used to approximately reconstruct the inputs) and then do something between interpolation and rather simple non-parametric learning. The effect is that outputs are basically a somewhat sensical agglomeration of "copy-pasted" snippets.

That said I think the results are often useful and sometimes fascinating. We should not fool ourselves about the learning that these large neural nets do, though.


There is a common argument that a human is "just a better neural network". I don't know; do they mean we need to give GPT-3 or DALL-E basic human rights?


I am not sure human rights fit the case, but it is one of the most remarkable developments. It is a self replicating distillation of our culture. If our human-based culture grew up and had a baby-culture...


Yes, this should happen, or we are just building a new slavery system. But before it happens, using this argument to dodge questions about copyright is dishonest.


The law does not concern itself with trifles.[0]

Programmers tend to think of copyright as a Boolean valued function. Either something is infringement or it isn’t.

Judges think of copyright infringement as a real-valued function of many arguments corresponding to the circumstances of the parties (e.g. what actual damage was done?).

A human quoting a human without attribution, without any profit made or identifiable damage, returns an infringement value very close to zero. Such cases, if anyone is petty enough to bring them, are likely to get dismissed.

[0] https://en.m.wikipedia.org/wiki/De_minimis


I'm not convinced by what you wrote. Is it actually true that cases like that are dismissed? Or is it that the infringing party is ordered to stop, but no damages are awarded to the rightsholder? You've offered no examples of dismissal to back up your statement.

I absolutely agree with you that copyright is not a boolean, but I don't buy the idea that a judge will just shrug and allow infringement to continue just because there was no commercial harm.

I also think your example is just irrelevant to the case at hand. Sure, someone "performing" someone else's copyrighted words once may not be a big thing. But if Copilot is actually found to be infringing, these infringements will keep happening, over and over and over.

Bottom line is that none of this has been tested in court. I think it's great that someone is working on doing just that. Maybe the end result will be that Microsoft's use is indeed fair use, and that Copilot users have no further obligations. But I'd like to hear a court decide that, not a bunch of armchair non-lawyers (myself included) on a random web forum.


https://www.lexisnexis.com/community/casebrief/p/casebrief-n...

> CONCLUSION:

> The court of appeals affirmed the district court's judgment. The court held that the Performers' use of the composition, as distinct from the use of the composer's performance, was de minimis and therefore not actionable. Considering only the compositional elements, the brief and relatively simple segment of the composition used by the Performers was neither quantitatively nor qualitatively significant when viewed in relation to the composition as a whole. Thus, despite the high degree of similarity from the actual use of the recorded composition, the scope of the similarity was not sufficiently substantial to support Newton's infringement claim.


Ok, sure, but that's not what I was asking for. I'm not surprised that there is one (or even two or three or a handful) of examples where this happened. But is it common?

Also I'm looking for a case where the judge acknowledged that copyright infringement was occurring, but decided not to do anything about it. From the bit you quoted, it sounds to me like it's implying that the judge believed there was a valid fair-use defense? Or even stronger, that the judge just did not believe the use was infringing at all?


Louisiana Contractors Licensing Serv., Inc. v. Am. Contractors Exam Servs., Inc., 13 F. Supp. 3d 547, 554

(sorry i wrecked the citation and lost where i found it)


Not necessarily where you found it, but sounds like https://willamette.edu/law/resources/journals/wlo/ip/2014/04...


This is pretty interesting, and IMO, really disappointing to see that was the ruling in that case.


> A human quoting a human without attribution

... as opposed to AI. At the heart of the matter lurks a debate over whether AI is an independent phenomenon that behaves in its own right, or just a tool that's created and wielded by humans against a backdrop of clear incentives and motivations.

The argument isn't about whether or not the law deals in absolutes (it's a basic principle that law is tested in courts through interpretation); the argument is that Copilot can be perceived as merely a means to an end, and that GitHub / Microsoft have created a massive mountain of liabilities for themselves.


I disagree with your statement about AI being an independent actor.

Suppose we add a button to a visual studio plugin called 'Copy me a function' and when you click it, it 100% grabs some random code from github and plops it as-is into your code base.

I don't have to argue the ethics of whether the button is "thinking for itself".


Well, I just pointed out that it's an ongoing debate; I didn't attach any particular value judgment to that statement.

> Suppose we add a button to a visual studio plugin called 'Copy me a function' and when you click it, it 100% grabs some random code from github and plops it as-is into your code base.

Personally, that's exactly how I see co-pilot. To my mind, it's a tool that sits in the same category as p2p platforms, copying machines or video recorders. They are just tools.

How, for what purposes and by whom they are leveraged makes all the difference here.


That makes it easier, at least, in the US.

P2P platforms and those who violate copyright are routinely shut down (or targeted for shutdown) in the US. If Copilot sits in that same space, it seems the book's been written already... we know how it ends.


> If an AI is ingesting and perfectly reproducing someone else's copyrighted works, it is in violation of copyright law in the same way a human would be if they reproduce someone else's copyrighted works.

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

Consider the impact on innovation if Microsoft or Oracle were allowed to claim a copyright over utilitarian aspects of their works such as the Java or Windows API!

BTW, Copilot seems to be reproducing copyrightable material when the tool reproduces comments verbatim!


In my opinion this is a key comment in this thread and everyone subject to United States law should read the Abstraction Filtration Comparison (AFC) legal test when refining their opinion. Also, I have no legal background, but as far as I know patent law != copyright law.

Specifically within AFC, note the "idea/expression dichotomy" [1] which clearly states:

"copyright law protects an author's expression, but not the idea behind that expression"

Thus, if this tool spits out someone else's code verbatim, it is a definite copyright infringement. If it outputs code that is similar but not verbatim, then it "could" be an infringement, at your own risk, to be determined by the courts. Simply expressing the same idea in a different way is not a definite infringement.

Program code is naturally copyright-risky because the keyword/grammar space is constrained. It is far more difficult to accidentally duplicate verbatim the expression of one's ideas in a full language, such as English, than in C. And what of two separate programs (or constituent sub-parts such as functions) that by chance emit the same compiled binary?

Personally, for now I won't use this tool due to the risk of accidental plagiarism, and because it is a black box: I can't examine any lineage or attribution metadata to understand the source(s) of what I would then be incorporating into my own body of work. Of course I doubt I could get that type of traceability information for any other trained ML model I might use, so perhaps I need to re-examine my policies heading into the future.

[1] https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...


> Thus, if this tools spits out someone else's code verbatim it is a definite copyright infringement.

That is not true. It can be verbatim and not a copyright violation if it can be shown that the expression in question is strictly utilitarian! I literally provided a quote from that AFC article that says this!

There's even precedent that prior art nullifies a copyright claim, as seen in Johannsongs-Publishing, Ltd. v. Rolf Lovland:

"Johannsongs failed to offer admissible evidence to rebut Ferrara’s analysis, so there is no genuine dispute of material fact as to his conclusions that Söknuður and You Raise Me Up are not substantially similar and most of their similarities are attributable to prior art."

And this was about music, not software, which has always sat uncomfortably between utility and expression, if only because it is some kind of writing. No one is claiming copyrights over Photoshop filter settings or other inputs manipulated by sliders or buttons!


Good point - the wording I chose should have qualified the scope in that first statement. How about something like "as the scope that is reproduced verbatim increases, the likelihood of infringement approaches 100%" (excluding prior art, public domain, etc.)? Obviously no one could reasonably claim copyright on a variable declaration - the scope is too small, and in some languages there is only one way to express it.

However, the statement was only for cases of verbatim copies produced by Copilot. The AFC Wikipedia article states that "Proving copyright infringement requires proving both ownership of the copyright and that copying took place." The three detailed tests developed in that case appear to "expand" the determination of infringement to close potential loopholes where, while there is not a verbatim copy, infringement is still deemed to have occurred because of "substantial similarity" - e.g. someone copies a program but changes the variable names.

So where is the line between infringement and not, in cases where there is an exact copy of a code fragment? Can we still use the utilitarian defense or is that only used by the court to exclude portions of the code in the tests for "substantial similarity"?


Personally I use common sense to determine if something is utilitarian in nature. The real issue is with the “overall shape”, like, class structure, specific data types, etc, which is somewhat arbitrary in nature.

At this point Copilot is awful at this higher-order level of abstraction but I can see a time where this is not the case!

Microsoft will have to put more work into filtering out responses that are indeed copyright violations if they want people to use their tools.

I doubt that MS will ever be held liable for the violations themselves, as there is precedent in their favor and their legal department has plenty of cash to burn.


As for why we should allow verbatim copies of utilitarian features...

First, let's preface this with the substantial similarity of the structure, sequence and organization as established in Whelan v. Jaslow which amongst other things says that you cannot merely change the variable names if the expressive structure of the code remains the same.

Now let's imagine 10,000 software developers who all implement Dijkstra's algorithm in C and then run it through clang-format. Aside from variable names, isn't it safe to assume that many of the implementations are going to be exactly the same?
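To illustrate how little room the algorithm leaves for expressive variation, here is roughly what most of those implementations reduce to (sketched in Python rather than C for brevity; graph representation and names are of course one arbitrary choice among few):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)]}
print(dijkstra(g, "a"))  # prints {'a': 0, 'b': 1, 'c': 3}
```

Beyond identifier names and the choice of priority-queue idiom, nearly every line here is dictated by the algorithm itself, which is the point: the expression is close to "necessary to achieving the idea".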


> It can be verbatim and not a copyright violation if it can be shown that the expression in question is strictly utilitarian!

So if it's identical it might or might not be a copyright violation and even if is totally different it still might or might not be a copyright violation... and only after spending insane amounts of time and money in the court system can you ever be sure if what you (or your AI) created has made you a criminal. This is increasingly sounding like a very very broken system.


I agree it's like people are not aware that clean room design is a thing (https://en.wikipedia.org/wiki/Clean_room_design).


Not really a fix to the problem, since, as your own link shows, you can still be dragged into court to defend against a copyright lawsuit, which can cost tens if not hundreds of thousands of dollars, and there are no guarantees when you're up against a team of lawyers representing an entity with far more money and resources than you have. In the end, you can do everything right and still easily end up screwed, which isn't how anything should work.


Yes, it's a thing but how often is it actually done? I can imagine circumstances where a company might do so--say a cloud provider wants to clone some open source software and offer it as a service. But I'm pretty sure the practice isn't routine.


>Humans routinely do violate copyright law. Plagiarism is a huge problem in many sectors; un-cited direct copies of people's work in violation of fair use is a regular every day occurrence in the human world.

And we need to accept that and get over it, not get better at outlawing it.


I'd say we need to reject it, refuse to get over it, and demand changes to fix copyright law so that humans and AI don't need to risk violating the law in order to learn things and create new works based on what they've learned from what others created before them.


Seems like 'learning' and 'producing something with that learned information' are two different things. I can 'learn' all day long from copyrighted materials. I'm not violating anything, because I've not produced a copy that would be distributed or used. First/knee-jerk reaction to that above - no doubt there's more nuance buried someplace.


Humans do violate copyright if they use copyrighted passages directly in their work and pass them off as their own without any attribution, which is what Copilot has been shown to sometimes do, though not always. Copilot will sometimes offer chunks of code that can be found verbatim in open source code bases and pass them off to users without attribution. I agree it is OK to learn from copyrighted work and produce new, different work, whether by a human or a machine learning algorithm, but it isn't OK to pass along exact copies as your own without attribution. Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.


> Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.

A user once replied to one of my comment[0] about this with the following:

> It's not really an issue when you're a large software corporation; you already have mechanisms in place to check for license compliance in everything that ships, including F/OSS plagiarism checks [1].

IOW, from my understanding, they don't care. Big players do their own checks anyway, and the small fish won't create problems because it's too convenient for them. Classic Microsoft (as I know it from the 90s).

The bigger thread can be seen in [2].

[0]: https://news.ycombinator.com/item?id=32534697

[1]: https://news.ycombinator.com/item?id=32539467

[2]: https://news.ycombinator.com/item?id=32533531


I'm curious about the endgame of copyright with respect to software. At some point, enough people will have written enough code that you can't write code anymore because some fragment of it violates a copyright. Where does the line get drawn? There's only so many ways to do certain algorithms, like DFS or BFS.


>you can't write code anymore because some fragment of it violates a copyright

Copyright (unlike patents in general) allows for independent creation. If I sit down to write a quicksort routine, it is going to look extremely similar to a zillion other quicksort routines out there.

The other question (IANAL) is whether writing a quicksort routine is even a creative act at this point.
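As a concrete case of that convergence, here is the textbook quicksort that countless people have independently arrived at (one hypothetical rendering among many near-identical ones):

```python
def quicksort(xs):
    """The textbook recursive formulation; independent authors converge on it."""
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    less = [x for x in xs[1:] if x < pivot]
    more = [x for x in xs[1:] if x >= pivot]
    return quicksort(less) + [pivot] + quicksort(more)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # prints [1, 1, 2, 3, 4, 5, 6, 9]
```

Two people writing this from memory of the same textbook would likely differ only in variable names, which is exactly why independent creation matters as a doctrine here.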


At some point it'll have to come home to roost that code is a subset of discrete mathematics first, a literary/artistic work second. There is really no way around it.


> Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.

Or they could integrate a way to trace the produced output back to the corpus if it's sufficiently close, and provide a reference/attribution. Basically whatever tool a copyright lawyer would use to track down the original work.

And that's just the engineering solution. The AI researcher solution would be to extend AI learning algorithms to attach attribution metadata to the learned data so that the output could already come annotated with information about the source.

But the latter is much harder to do, so maybe the engineering solution would suffice.
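The engineering solution could be as simple as an n-gram index over the training corpus, consulted before a suggestion is surfaced. A minimal sketch (the corpus, file name, and window size are made up for illustration; a real system would index tokenized code at much larger scale):

```python
def ngrams(tokens, n=8):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus, n=8):
    """Map every n-token window in the corpus to the file it came from.

    First file wins on collisions; a real tool would keep all sources.
    """
    index = {}
    for path, text in corpus.items():
        for gram in ngrams(text.split(), n):
            index.setdefault(gram, path)
    return index

def attribute(output, index, n=8):
    """Return source files sharing a long verbatim window with the output."""
    return {index[g] for g in ngrams(output.split(), n) if g in index}

# Hypothetical one-file corpus and a suggestion overlapping it verbatim.
corpus = {"fast_inverse_sqrt.c": "i = 0x5f3759df - ( i >> 1 ) ; // what the"}
index = build_index(corpus, n=4)
hits = attribute("y = 1 ; i = 0x5f3759df - ( i >> 1 ) ;", index, n=4)
print(hits)  # prints {'fast_inverse_sqrt.c'}
```

Any hit could then trigger an attribution notice or suppress the suggestion, which is far cheaper than teaching the model itself about provenance.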


A Twitter thread linked yesterday showed that a keyword and the name of the original code's author in the Copilot prompt produced an almost exact copy of that developer's code. Copilot already does know the origin sometimes.

Edit: here's the related tweet: https://twitter.com/DocSparse/status/1581632706693079042


> Or they could integrate a way to trace the produced output back to the corpus if it's sufficiently close, and provide a reference/attribution. Basically whatever tool a copyright lawyer would use to track down the original work.

That assumes that the licenses of your code and the original code are compatible which often isn't the case.


No, it doesn't assume that. Ensuring that they are compatible would be the next step. Either manually by the user or automatically by showing a fat warning or retracting the suggested code completion.


> which is what copilot has been shown to sometimes do

In those cases it seems that humans are already copying code without also propagating licenses appropriately. LLMs are more likely to memorize things which occur a lot (and I'd bet rare things that are representative of some conceptual axis).

The main examples presented so far, Davis and Carmack, have the property of having been copied a lot. The generative model is only surfacing an existing pattern of ignoring attribution. Sort of like the code-gen version of generating bigotry if appropriately prompted.

I'll also note that this pattern of retrievable memorizing of copyrighted and sensitive material is present in GPT-3 too and not just for code. As the situation is equivalent, a lawsuit should address the concerns of non-programmers too.


That seems like a weak defense: "sure, we violated copyright but only because many other people do, too".

Kinda the same problem as YouTube. Lots of people copy movies on the high seas, but if you are as big as yt you cannot easily get away with it.


No, this is not meant as a defense. My point is that it's an issue that is already rampant and what Copilot (or any model) does is make it more readily visible.

This is not like youtube because Github is already hosting those violations and people are already inappropriately copying or including such code. It matters not whether the local inclusion was fetched by copilot or a human fetched it using more manual steps through search.


The difference is that if you build a tool that can be used to violate existing copyright laws, your tool will get taken offline by those same corporations. Yet they are selling one that can be used to do so.


Yes, I agree there's a measure of double standards to this. It's why I feel it's important that AI does not remain in the control of just a handful of corporations. The deck is stacked against that, though, given how data- and compute-intensive the state of the art is.

But in defense of copilot, code regurgitation is uncommon in routine use. An editor extension allowing search of github would be at least as easy to use to violate licenses but I do not think it'd be taken down since that would not be its core offering.

Copilot goes far beyond mere search and provides a useful service. GPT-3 can also be prompted into generating copyrighted works of writers but I do not see people talking as if that is its primary utility nor as much clamoring in these forums to end that service.


> An editor extension allowing search of github would be at least as easy to use to violate licenses

Well, try doing that for music or movies or proprietary leaked codebase.

If you think Copilot is merely producing things that are not that different from what humans produce, then surely no one would have any problem with feeding it massive amounts of corporate programs?

I am not aware of what writers are doing, but there has been plenty of uproar regarding Stable Diffusion. I have a feeling that if any tools like this get built for musicians or film-makers, the result will look vastly different from the current situation.


First, the music industry's (and to an extent the movie industry's) enforcement of IP is uniquely pathological. But I am not talking about music or movies. I am contending that a simple search extension, being much less capable than Copilot and thus even more easily characterized as aiding copyright violation, would not be taken down.

> surely no one would have any problem with feeding it massive amounts of corporate programs?

There is a similar gymnastics done by human engineers today due to the issue of patents. I don't think this is a good trend to uphold.

> I am not aware what writers are doing but there have been plenty of uproar regarding stable diffusion

Yes, but mostly in the art community. On HN there were plenty of arguments just the other day about how it is not the same for art, and that programmers have a stronger case. I disagree, but regardless, the case is exactly equivalent for GPT-3 and writers, yet it never became an issue generating about a thousand comments on respecting IP and ceasing deployment of LLMs until Copilot.


> There is a similar gymnastics done by human engineers today due to the issue of patents. I don't think this is a good trend to uphold.

Are you arguing that copyright laws should be abolished? I have no problem with that, as long as it's clearly defined: you can't ignore the copyright of open-source code while enforcing it for proprietary code, just because the value is arguably non-monetary.


Without a sample case I can't say for certain, but couldn't it also be a defense that some code is generic enough that it shouldn't be copyrighted?


One thing I worry about is that the uncertainty around copyright violation cools down activity on open models while raising the price of commercial offerings. Commercial entities can afford to devote resources to mitigating copyright violations, such as eating the cost of maintaining a database of frequently copied code and identifying its most likely origin, combined with a large semantic database of code snippets.

An open equivalent might be wary of being accused of contributing to copyright violations since in that scenario, there is no way to force people to respect it.
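The kind of origin database described above could be sketched roughly like this (a toy, with every name made up): normalize snippets so cosmetic edits don't evade matching, fingerprint overlapping windows of lines, and look a candidate completion up against known origins.

```python
import hashlib
import re

def normalize(code: str) -> list[str]:
    """Strip Python-style comments and collapse whitespace so cosmetic edits don't evade matching."""
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line)          # drop comments
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if line:
            lines.append(line)
    return lines

def fingerprints(code: str, window: int = 2):
    """Hash every `window` consecutive normalized lines (a shingle)."""
    lines = normalize(code)
    for i in range(max(0, len(lines) - window + 1)):
        shingle = "\n".join(lines[i:i + window])
        yield hashlib.sha256(shingle.encode()).hexdigest()

class OriginIndex:
    """Hypothetical index mapping code fingerprints back to (repo, license) origins."""

    def __init__(self):
        self.index = {}  # fingerprint -> set of (repo, license) tuples

    def add(self, code: str, origin: tuple):
        for fp in fingerprints(code):
            self.index.setdefault(fp, set()).add(origin)

    def likely_origins(self, completion: str) -> set:
        hits = set()
        for fp in fingerprints(completion):
            hits |= self.index.get(fp, set())
        return hits
```

An open project could maintain something like this as a post-generation filter; the expensive part is not the lookup but building and hosting the index over the whole training corpus.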


> not just for code

This is quite important, actually, and I don't think enough people realize this. I am a photographer sometimes and it would be really cool if I could share my photos online under a copyright license that forbids their use in training AI.


The problem seems to be that someone could just copy your photo and repost it without that license and we're back to the same spot.


I really appreciated the argument in the book "The New Breed," which is that we should adapt ideas around the governance of animals to governance of ML: You can train your dog to attack random passersby, but if you do, you're a monster and ultimately responsible for the dog's actions.

Likewise, you can tell Copilot to crank out copies of specific algorithms written by specific people, but if you do so, you're still creating infringing code, same as if you'd taken the more direct route of ctrl-c+ctrl-v. The fact that you /can/ make the algorithm misbehave through adversarial input is irrelevant to the primary use cases, which lead to boring, non-infringing code completions.


This just sounds like blaming the researchers to me. How would I ever know if my "boring code completion" was actually copyright infringement?

Your argument just disallows discussing the problem while doing absolutely nothing about it.

If you train your dog to NOT attack random passersby and it still does, that dog is euthanized no matter your intentions.


> If you train your dog to NOT attack random passersby and it still does, that dog is euthanized no matter your intentions.

Of course, but you will not face manslaughter charges in that case.

So, following the same logic, if you train your copilot NOT to infringe on other people's copyright and it still does, it should be destroyed no matter your intentions. But at least you won't be charged with copyright violation yourself.

That said, I don't believe Microsoft's actions to be benign. I think this copyright whitewashing scheme is fully in line with their old MO, purposefully creating a legal quagmire surrounding all open source code.


Personally, I think you are right to distrust MS (as we should any corporation, really). I will admit that this attempt is working, in the sense that it is a lot less clear to a non-computer person as to:

- whether there are any damages
- what the big deal is

In my mind, this entire thread identified a lot of those, but I think someone already said that it will likely be tested in court (and I have zero idea which way it will turn).

For the record, I personally think Copilot is a cool tool (frankly, it is not that different from an automated Stack Exchange in terms of results). If I worry about anything, it is that the overall standards will decline even further.


Tim Davis doesn't actually have any instance of copyright infringement to complain about; he was able to induce Copilot to /mostly/ recreate his code through careful prompting, but no one has actually deployed the code. By the same token, we don't outlaw ctrl-c and ctrl-v buttons on computers.

There is plenty of space here to discuss developing tools to check for unintentional infringement. I would guess, though, that such tools would sweep up a whoooole lot of non-copilot human usage and make it much harder to deploy anything new.

So, maybe a better discussion to have here is how to make the animal safer, not whether to outlaw the animal entirely. Single-line completions (the majority of Copilot usage) aren't infringing. The same is probably true for almost any few-line completion. So capping the amount of consecutive auto-completed code might be a reasonable 'muzzle' to keep the model reasonably safe.
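A 'muzzle' like that could be as simple as a truncation step in the editor plugin. This is purely illustrative, with an arbitrary threshold:

```python
# Cap how many consecutive lines a completion may contribute before the
# human has to write something. Threshold chosen arbitrarily for the sketch.
MAX_CONSECUTIVE_LINES = 4

def muzzle(completion: str, max_lines: int = MAX_CONSECUTIVE_LINES) -> str:
    """Truncate a multi-line suggestion to at most `max_lines` lines."""
    lines = completion.splitlines()
    return "\n".join(lines[:max_lines])
```

The real design question is where the cap lives: in the model's sampling loop, the API, or the client; a client-side cap is trivially removable, which matters if the cap is meant as a legal safeguard rather than a UX choice.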


I think we have an ideological disagreement here. I'm not part of the "open source" movement, I believe in free software. Although I'm not prolific, I have authored some free software and shared it widely. I want people to have it, use it, and share it, so long as they extend the same rights to their users.

Now my software has been assimilated into a proprietary blob. Had that blob been free, like my software within it, I would have accepted it, but it's not. It's controlled exclusively by Microsoft and OpenAI, two entities which I place no trust in.

For me the dog has already bitten. The free software I extended to an audience I believe would show the same generosity has instead been made into a proprietary product.

The "copyright" question for me is not a question of "fairness" or ability of Microsoft or anyone else to make a product. For me it's a tool to protect my contribution from proprietary business.

Basically, I don't want the animal safer, I want it free (according to the FSF freedoms).


> The free software I extended to an audience I believe would show the same generosity has instead been made into a proprietary product.

That's exactly my complaint about Copilot. And since all code hosted on GitHub is now subject to this land-grab, my only recourse is not to use GitHub any more if I want to publish a project of mine.


Code ingestion is not limited to github or copilot. Your best recourse is to make your code publicly inaccessible.


I am not sure your software has been assimilated into a proprietary blob. Rather, it's been sliced into its constituent parts and those parts have been tagged by the proprietary blob.

Your code isn't being run by Copilot, as such; it's been categorized in a way that allows partial retrieval without the license or attribution. This might seem like a distinction without a difference, but it's kind of a ship-of-Theseus problem: probably nobody is running any of your programs in their entirety, but it's very possible that bits of your code have found their way into other people's programs. How do you distinguish between contributions that are uniquely yours, and those which are just helper functions or cobbled together from other example code, e.g. in documentation or from a book or Q&A website?


I am not a lawyer, but I don't think anyone needs to deploy the code in order to infringe copyright: they just need to distribute the code to a third party (hence copyright -- the right to copy). And on the face of it, Microsoft would appear to have distributed Tim Davis's code, in compressed form, as part of the trained language model in Copilot.


But in this case copilot is not equivalent to copy-paste. When doing copy-paste, you are acting with knowledge of the source of the copied code and with intent to copy code.

With Copilot, you are acting neither with knowledge of the source nor with intent to copy; in fact, I'm sure users have a reasonable expectation of the tool not copy-pasting existing code verbatim.

IANAL, but I'm pretty sure that intent matters a lot.

Popcorn time was also just a tool to allow you to stream data from torrents. That didn't seem to help them put up a legal defence (nor should it have, because the intent was pretty clear on that one).

And seriously, if cases exist, where the only thing a tool does (albeit via a VERY complex implementation path) is to strip a license from a piece of code and serve that code up via an API, then that really does sound like the creators of the tool are at fault.


If you build a system that has a high likelihood of breaking the law in normal expected use, and then it's found to break the law, shouldn't we disincentivize that in some way? Is that just blaming the researchers/developers, or is that just making people respect the law?

I think the important thing to note in both dog attack scenarios presented is that the owner is responsible in both cases. Either they purposefully created an unsafe situation or they were negligent in protecting the public from their property. Whether the dog is euthanized is about preventing it from happening again. Preventing it from happening in the first place is done by making the owner liable to disincentivize it.


I'd argue that the law was already broken when my free and viral software was included in a non-free package.

Personally, I don't care about the end users. If you want to read my source, I welcome that. I just want the Copilot model and system open, since it was based (in part) on my work. Otherwise they are free to remove my work.


And you are free to sue them.

What was your plan when someone eventually infringed on your work?

If you want people to abide by your license, you have to enforce it yourself.


This article is about pursuing a class action lawsuit…


"Not knowing" doesn't free you from responsibility.

If you took a bunch of copyrighted and non-copyrighted books, cut them into pieces, shuffled them all together, then picked a passage at random from a hat; "not knowing" what you are going to get doesn't mean you aren't violating copyright.

That's essentially what Copilot is doing: it's taking a bunch of code (some of it copyrighted, without a permissive license) and using it as a dataset. The ML algorithm then pattern-matches against that data to provide the user with something they want. That's just a copyright-violation lottery with extra steps.


What stops humans from reproducing thinly disguised copies of their influences is, essentially, their ethical judgement.

Which amounts to saying, humans are trained with a model that they can use to recognize when something they are thinking of producing is 'too similar' to something they have seen before.

And, of course, some humans choose not to apply that filter and go ahead and plagiarize anyway; some humans try to apply that model but get back a false negative, thinking they're producing something original when they aren't. And we have ways of dealing with humans who do that.

In the case where an AI is coming up with the work, perhaps the mistake is in relying on humans to try and apply their own trained judgement to figuring out if the result is unoriginal. We need an AI that scores work for how likely it is to be infringing on a prior copyright.

Then you use that AI to train the creator AI, and teach it 'originality'.
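A crude version of such a scorer, as a sketch: measure what fraction of a candidate work's token n-grams appear verbatim in the training corpus. Real systems would need far fuzzier matching (paraphrase, variable renaming), but this illustrates the shape of the filter.

```python
def ngrams(text: str, n: int = 5) -> set:
    """All length-n runs of whitespace-separated tokens in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(candidate: str, corpus_docs: list, n: int = 5) -> float:
    """Fraction of the candidate's n-grams appearing verbatim in the corpus (0.0 to 1.0)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(cand & corpus) / len(cand)
```

A score near 1.0 would flag likely regurgitation; the generator could then be penalized during training whenever it produces high-scoring output, which is one way to "teach originality".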


That's a great point about what we naturally do.

> In the case where an AI is coming up with the work, perhaps the mistake is in relying on humans to try and apply their own trained judgement to figuring out if the result is unoriginal. We need an AI that scores work for how likely it is to be infringing on a prior copyright.

Isn't that latter AI going to be more likely to need to contain verbatim copies of original works? Or maybe not?

This in turn (and the SFC's and the law firm's concern about the GPL) makes me think that there are several different things that people may be concerned about machine learning systems doing:

* they could allow you to access verbatim copies for "consumptive" use (like if you asked an AI a question about what the text of a chapter of a Harry Potter novel was, and it answered you correctly)

* they could facilitate intentional or unintentional plagiarism, and, in the case of publicly-available works that are published under a license, intentional or unintentional reproduction or creation of derivative works contrary to that license

* they could contain something like a verbatim representation and allow you to use that in various ways that themselves are not extracting or literally copying that representation, but where the original copyright holder might complain that the existence of an unlicensed copy inside the model is already objectionable

* they could contain representations of uncopyrightable subject matter which was learned through training on copyrighted works, which can then be used to compete with the original creators for jobs, prestige, or attention, or can be used to produce works that the original creators would have found offensive or objectionable (this case isn't supposed to be restricted by copyright at all, but that doesn't necessarily stop people from caring!)

Not only will the same measures not prevent or avoid these cases, but if you wanted to prevent the first and second situations, one of the easiest ways to do it might be to literally include verbatim copies of lots of works inside a machine learning model! (along with software specifically trained or programmed to warn you against unintentionally making uses the user or copyright holder finds objectionable ... to facilitate the exercise of "essentially, their ethical judgement", as you put it)


> copyright holder might complain that the existence of an unlicensed copy inside the model is already objectionable

I'd be wary of this one. There doesn't seem much distance between this and the same claim against a human's memory of a work in their own brain. Yes, that sounds like dystopian fiction. So do some things that have already happened.


We already have such an AI that scores work for how likely it is to infringe on a prior copyright. It is written by Google and operates on YouTube (Content ID).

The big question is whether we think Google did a poor job with that AI, and whether a company with more money and more data can make a better one that teaches originality.


> Here are a few thoughts I haven't formulated before:

> It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works. It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture (grammar, concepts, facts, cultural tropes, and many others).

I said it already in a previous discussion: I would be very careful with comparing ML to how humans learn. To me there are still a lot of examples showing that AIs don't understand prompts (see e.g. the discussions around the "horse riding astronaut" prompts for Stable Diffusion et al.), and it seems like they really are just doing sophisticated pattern matching. If that is what they do, aren't they themselves covered by the licenses/restrictions placed on the "patterns" they "choose" from?


I think any argument against generative AI should not hinge on there being a fundamental difference in how humans and generative models work.

I mean sure, maybe humans aren't "just" doing sophisticated pattern matching, but there are good reasons to suspect this is some part of what we are doing. (Even if it's not implemented with back-prop.)

E.g., consider the work of people like Anil Seth, who proposes that our brain is basically a generative model of the world, which aims to minimize prediction error on perceptual data (see also Karl Friston's free energy principle). What's up for debate is how it is structured, what priors are built in, what the learning algorithm is, etc.

Anyway, for all their limitations, it seems clear that current artificial generative models can: 1. learn hierarchies of abstractions, which 2. explain the observed data in the fewest possible number of bits, and 3. generate new, novel data based on the patterns that have been learned

If you want to describe this as "just sophisticated pattern matching".. then sure I guess? But I think there's a clear qualitative difference between this and searching for code in a discrete database (which imo would not be okay).


https://arxiv.org/abs/2106.06981 - Thinking Like Transformers

The paper suggests that transformers work on a set of select, aggregate, and element-wise operations, which seems pretty close to the SQL statements I write from day to day.
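Toy versions of those two primitives, loosely following the paper's RASP model (my own simplification, not the paper's code): select builds a boolean attention matrix from a predicate, and aggregate averages values over the selected positions, much like a join followed by GROUP BY/AVG.

```python
def select(keys, queries, predicate):
    """For each query position, mark which key positions it attends to (boolean matrix)."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """Average `values` over each row's selected positions (0.0 if nothing selected)."""
    out = []
    for row in selector:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out
```

For example, selecting with `k <= q` over positions and aggregating the token values gives a running mean, one of the paper's basic building blocks.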


AI copyright drama is my favorite gossip these days because it can’t be reconciled until we accept that intelligence is created and held by societies, not individuals. Recent AI is a new way to exercise that intelligence, but it presents a major conflict with capitalism.


Not a conflict with capitalism; it's another tragedy of the commons. The robber barons of old stole the owned commons land and started said 'capitalism'.

Capitalism is still very alive, and will continue to be. It's in conflict with the general welfare of the people...


I assume they meant “capitalism as it is currently implemented”. In any case, not like capitalism is suggested as being under threat - just that an economic model based on competitive markets will have a lot of issues with fairly allocating resources to all individuals, instead accumulating most of it to a few industry leaders.

Something will have to change.


the octopus would like a word with you


What is an octopus if not a society of cells?


Just about two-thirds of an octopus's neurons are found outside its central brain, distributed across its eight arms. An octopus can be likened to a hive mind of sorts.


> It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law

Which laws are considered in this case? I understand that fair use is a US concept. For example how does that apply to my projects, published and licensed by a European living in a European country? I would expect the majority of GitHub contributors to not be based in the US, so what laws should be considered?


From GitHub’s Terms of Service [0]:

> Except to the extent applicable law provides otherwise, this Agreement between you and GitHub and any access to or use of the Website or the Service are governed by the federal laws of the United States of America and the laws of the State of California, without regard to conflict of law provisions. You and GitHub agree to submit to the exclusive jurisdiction and venue of the courts located in the City and County of San Francisco, California.

[0] https://docs.github.com/en/site-policy/github-terms/github-t...


I have code of mine which has been uploaded to GitHub without my permission (other than it being licensed under GPL or MIT, no contributor agreement). I cannot see how that would be covered.

Additionally copyright infringement can be a criminal matter in my country and the Swedish prosecutors have certainly not signed these agreements.


The way GitHub is acting here, it seems to be a case of "if no-one takes us to court and sticks through to the end, then we can do whatever the hell we want". aka "Most people complaining are just making noise".

CoPilot doesn't seem to be a terrible implementation, instead it seems to be relying on it operating in a grey area. So they're going for broke, to try and get wide enough adoption that it becomes a fait accompli.


Anyone can say that, but that doesn’t make it real, especially with regards to European consumer protection.


Is there some EU-USA treaty that would prevent jurisdiction clauses in a normal contract between an EU resident and a California company?

The mechanisms limiting this are mostly about privacy, not about whether you can agree to adjudicate copyright or ToU disputes in California.


It's real because our laws (including those of European countries) make it real.

Some seem to assume there's some general "If it's American it's invalid" law in Europe. This is not the case. With the exception of specific laws, such as GDPR regarding privacy, this is a perfectly valid clause.


Copyright law can be a criminal matter in my country and then it will be handled by a Swedish court. You cannot just write a contract which makes you immune to criminal prosecution.


In Sweden you can set the jurisdiction for civil disputes in a contract.

Which is what everyone is talking about here.

For criminal, there is only jurisdiction in Sweden if the crime happened in Sweden. I would need you to link me a case where Sweden criminally convicted someone for copyright infringement who wasn't Swedish and wasn't in Sweden.

For example, I am not Swedish and do not travel there. Sweden has no power to enforce its laws against me. No matter what I do, I shouldn't be able to be convicted of criminal copyright infringement in Sweden.


As a human, I am likely to have the specific goal of not violating copyright when I write code. That means either deliberately not writing things exactly the same way that I know I've seen them written in other places or else deliberately complying with their license policy if I really want to lift code verbatim. Perhaps Copilot needs some sort of feedback loop to avoid over-similar code by default & an extra "allow (compliant) copying of open source" toggle to make it behave similarly.


> Computers do have such accurate memories, and machine learning systems, which are optimized for things like maximum likelihood estimation, can and do reproduce both copyrightable and non-copyrightable elements of works that they've been trained on. After all, the maximum likelihood continuation of a fragment of a text or a song is ... the complete original work. And the ability to reproduce the complete original work would, other things being equal, reduce loss in training. After all, that's something someone might specifically ask for, and if the system could oblige, it would be doing a better job of providing what the user wanted.

This actually depends on how you train the model. Techniques that use unlikelihood training to penalize plagiarizing models exist. Microsoft/OpenAI are of course aware of those techniques but have chosen not to use them. The reason why is not difficult to figure out: the model hasn't learned how to implement sparse matrix multiplication in C, it has learned how to spit out someone else's code with a few variable names changed. Not unlike how many CS students not cut out for software development try to pass their entry-level programming courses. Professors use anti-cheating software to catch cheating students. Such software would catch Codex too and expose it as incompetent. Hence why it is not used.
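For reference, the unlikelihood idea (Welleck et al., "Neural Text Generation with Unlikelihood Training") adds a -log(1 - p) penalty on tokens the model should NOT produce, alongside the usual negative log-likelihood on the target token. A pure-Python toy on a fixed distribution, where the "negative tokens" stand in for verbatim continuations of training data:

```python
import math

def nll_loss(probs: dict, target: str) -> float:
    """Standard negative log-likelihood of the target token."""
    return -math.log(probs[target])

def unlikelihood_penalty(probs: dict, negative_tokens: list) -> float:
    """Penalize probability mass placed on flagged (e.g. plagiarism-risk) tokens."""
    return -sum(math.log(1.0 - probs[t]) for t in negative_tokens)

def total_loss(probs, target, negative_tokens, alpha=1.0):
    """Combined objective; alpha trades fluency against the anti-copying penalty."""
    return nll_loss(probs, target) + alpha * unlikelihood_penalty(probs, negative_tokens)
```

The hard part in practice isn't the loss term, it's deciding which continuations to flag as negatives, which is itself a near-duplicate-detection problem over the training corpus.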


It seems pretty clear to me that training an AI on copyrighted materials is not fair use. I'm not sure why you seem to think it is fair use.


And it seems pretty clear to me that it is fair use, because it's not merely reproducing or creating a derivative work, but actually extracting patterns and modeling the works in a way that is intended to be used to create new and unrelated works. The fact that an occasional piece of code here and there might be reproduced verbatim is no different than e.g. Cliffs Notes occasionally quoting a passage, and Cliffs Notes are a well established case of fair use that to me, at least, seems even closer to "the line" than Copilot or Stable Diffusion.

FTA: "On the other hand, maybe you're a fan of Copilot who thinks that AI is the future and I'm just yelling at clouds. First, the objection here is not to AI-assisted coding tools generally, but to Microsoft's specific choices with Copilot. We can easily imagine a version of Copilot that's friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus."

This is the same argument that people use about Stable Diffusion, and it's kinda meh to me...I guess it'd be nice to allow people to opt-out, like Stable Diffusion is doing with their next versions, especially since a negligible percentage of people will do so and it won't affect the models at all. But yes, it basically is yelling at clouds. Opt-in would cripple models, and some people would make them anyways and just keep them secret, which is worse for the world. And at the end of the day, this really does just seem to me like a fair use of stuff that you've published on the Internet for anyone with a browser to look at. The AI models of the future are going to gobble the whole net up, and if you don't want them ingesting your stuff and learning from it, then you just shouldn't make it freely available.

If OpenAI/GitHub/MS really wanted to get ahead of this and head off any potential legal conflict, they could always just open source the models and weights, which would be in line with the name "OpenAI"...it would be a minor project to scrape all the correct headers to add to a license file(s), but negligible compared to the many millions of dollars spent on training.


Cliffs Notes adds commentary and critique for educational purposes, they are doing what fair use is intended for. Copilot does not.

Also, it's pointless to say "But X does Y" in copyright discussions. You never know if they license the content properly or if they infringe the rights. In the Cliffs Notes case, they might not need fair use at all, because the old works are already in public domain.


It depends on what the AI is learning.

If the AI is learning to repeat text (e.g. Copilot) or images (e.g. Dall-E), then that makes it possible to reproduce the copyrighted works, so I would agree that that case is not fair use. -- It would be akin to compressing and distributing those works.

If the AI is learning patterns -- such as "muggle" being a noun that relates to Harry Potter, or that the lemma for "muggles" is "muggle" -- then that is less clear. You can avoid the situation by creating your own sentences with those terms in them, and annotating those sentences instead of the copyrighted ones. That way, the AI is still learning the same information.


You actually just convinced me of the exact opposite.

Because Copilot's "use" of the works _was_ the learning.

So it would seem to me that Microsoft needs to apply "fair use" to copy and redistribute _the entire works_ they used for training.

In which case, lack of fair use may well be the least of their problems; they are really crossing into Computer Fraud and Abuse Act territory, similar to when Aaron Swartz "borrowed" MIT's data.


I'm not sure how Copilot works, but I don't believe Dall-E repeats images. From my understanding it creates visual concepts of words and uses them to create entirely new images. If Copilot works in the same way for code, I honestly don't see that there should be any copyright issues here.


It just so happens that, sometimes, parts of these entirely new images are exact copies of those used for training.


Do you have a source for this?

This has been claimed many times, and I've heard that DALL·E 1 & 2, Stable Diffusion, and Midjourney can all create images that are exact copies of the training material.

This doesn't make sense considering the compression ratio of training images to model is about 1:25,000.

Further investigations I have made show that all these cases can be explained via the following:

1) The prompt included an image, so some form of image2image was used. Of course if you use an image as a base, and tell the model to stick closely to that image, the output will largely resemble that image.

2) The example was completely made up.

So far I have seen no evidence, given a text prompt, the output of an image containing some portion of any image from the training set.
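For what it's worth, the ~1:25,000 figure roughly checks out on back-of-envelope numbers; the figures below are assumptions for illustration, not authoritative stats.

```python
# Rough public figures (assumed): ~2 billion training images at ~50 KB each,
# versus a model of roughly 4 GB of weights.
n_images = 2_000_000_000
avg_image_bytes = 50_000
model_bytes = 4 * 10**9

dataset_bytes = n_images * avg_image_bytes
ratio = dataset_bytes / model_bytes          # how many times larger the dataset is
bytes_per_image = model_bytes / n_images     # model capacity "budget" per image

print(f"dataset ~{dataset_bytes / 1e12:.0f} TB, ratio ~1:{ratio:,.0f}")
print(f"~{bytes_per_image:.0f} bytes of model per training image")
```

A couple of bytes of capacity per image makes wholesale memorization implausible, though it doesn't rule out memorizing a small number of heavily duplicated images.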


Your comment includes exact copies of words and phrases which I have also used prior to you, so you are violating my copyright even if you didn't intend that.

Well, I don't really think you are violating my copyright. But by focusing on parts, you go down a rabbit hole of equating an element with the whole. This would render all collage art illegal. Lawyers and art pundits love ruminating on the uncertain legality of collage art (because it's not a binary question, so they can churn out endless articles that boil down to 'it depends'), but this glosses over two important realities:

1. Nobody gets sued over collage art, largely because any case is doomed to end up with lawyers measuring the size of collage elements with rulers and then arguing about what small percentage is too much, an uncertain exercise few law firms wish to gamble their reputation on, and

2. nobody gets sued because collage art isn't worth very much to begin with; collages aren't valued very highly because they aren't as hard to make as painting or other art forms. 'Appropriation artists' like Richard Prince get rich and famous partly because their art is less about the image than the cultivation of notoriety for artistic effect; they are artists of scandal rather than pictures.

In general, bits of things are just not that important, and I'd argue that the same applies to code. If part of your code matches a prompt (excluding highly specific prompts like '# insert Woodson's unique XYZ algorithm here') and is then deployed in another program without alteration, isn't that most likely to be because it performs some generic function?


I've generated thousands of images on Stable Diffusion Dall-e 2 and Midjourney by now, and what you say here simply doesn't make any sense.


> I'm not sure why you seem to think it is fair use

I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.


> I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.

Actually, what the OP said is, "is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works". The rest of the eight paragraphs had nothing to do with fair use.

It's honestly a ridiculous argument to say that learning one non-copyrighted thing means that the regurgitation of another copyrighted thing, after stripping the license, will magically be fair use.


The comment is a quintessential HN comment: all tone, little substance. It just claims that it's fair use because the AI learns things, which is not a criterion for fair use at all. People here just throw around fair use as a catch-all term for everything that should be allowed based on their personal gut feeling.


The concept of fair use applies to small volumes of work.

Clearly, training on large volumes of data is not small volumes in any sense of the word. The argument that it is fair use is itself flawed.


Absolutely incorrect: fair use applies to *reproducing* small volumes of work, not to analyzing it. If I published an article gleaning some conclusion from an analysis of 10,000 issues of the New York Times, that would still 100% be fair use; similarly, Google is absolutely allowed to publish word-count metrics based on its scanned-book repository, even though publishing the books themselves is not fair use. You are trying to read something into the fair use doctrine that is absolutely not there (to the extent that anything is there, which is very little, other than "I'll know it when I see it" and prior case law, unfortunately).


When I fair use a small quote from a book, I may have read the whole book.


Now I'll go the other way and wonder if it should still fall under fair use if I respond to requests for small quotes programmatically and eventually quote the entire book.

Or here is the real analogous question:

Fair use is about more than just the size of the excerpt.

If you write an article about good writing, and quote a choice paragraph from someone else's work to show an example, and credit that quote, that is fair use.

Is it fair use if you read an awesome paragraph, something that really is the result of the authors unique intellect and effort and craftsmanship, and makes you think "damn", and then drop that same jewel into your book?

The difference is, the paragraph isn't being included for examination or comment or transformation; it's being included to directly copy and perform its original function as part of what makes a work a great work, and it's not being credited in any bibliography or footnotes or directly.

The reader reads the paragraph and is impressed by your deep insight, which you never had, and the original author did.

I think, all in all, this sort of copying and re-use should be allowed to happen somehow, because software is more like a machine than a novel, and humanity benefits when machines work well. There just need to be some rules about what gets included in the training sets and how both the input and the output are credited and acknowledged.

Right now, I think GitHub are simply outlaws. 100% of the output is violating the copyright of the code in the training set, because 100% of the input is copyrighted one way or another and none of it is being declared on the output. And it's allowing incompatible sources to mix and the original terms to be stripped. The training set includes both proprietary and open source, and the output is being used in both proprietary and open source.

And there is no way that Github does not have this same understanding that I just described. I refuse to believe I am that special that I can see this and no one at Github did.

So they are not merely possibly inadvertent outlaws; they are deliberate, knowing, intentional outlaws.


I think a key thing here is your identification of a paragraph. Nobody would think to exert copyright over individual words. Phrases and epigrams are considered worthy of attribution, but only in exceptional cases. Copying sentences is starting to get into plagiarism, though single sentences would usually be forgiven, because noting or remembering a single sentence while forgetting the source is an easy mistake to make. Copying a whole paragraph, by contrast, is unlikely to be casual.

I think in programming terms a useful parallel might be copying at the module rather than the statement or function level. For example, if I write some code prompts to do the following:

  - validate my API key with Twitter
  - solicit the input of a Twitter username
  - download the up to 500 of that user's tweets
  - convert the json to a dataframe
  - plot the derivative of the intervals between tweets
...many of those tasks can be fairly described as helper functions, either taken directly from documentation (like interfacing with an API) or being so elementary as to be generic. If any one of these tasks happened to come from your code or mine, and the rest from other programs, it wouldn't feel like much of an infringement. If all of them came from the same body of code, it would.
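To make that concrete, here is a hypothetical sketch of the sort of "so elementary as to be generic" helpers the last two steps describe (function names, timestamps, and structure are all invented for illustration). If a tool emitted something like this, it would be hard to say whose code it came from:

```python
from datetime import datetime

# Hypothetical helpers for the last two steps above: intervals between
# tweet timestamps and their discrete derivative. Any independent
# implementation of something this elementary would look nearly identical.

def intervals_seconds(timestamps):
    """Seconds elapsed between consecutive ISO-8601 timestamps."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def interval_derivative(timestamps):
    """Discrete derivative: how much each gap differs from the one before."""
    gaps = intervals_seconds(timestamps)
    return [b - a for a, b in zip(gaps, gaps[1:])]

tweets = ["2022-10-17T09:00:00", "2022-10-17T09:01:00", "2022-10-17T09:03:00"]
print(intervals_seconds(tweets))    # [60.0, 120.0]
print(interval_derivative(tweets))  # [60.0]
```

If any one of these came from your repo, nobody would notice or care; if a whole module's worth came from the same place, it starts to look like copying.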


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning

Not really.

A human being learns by doing, it takes a lot of time, their knowledge is not transferable, and, above all, they buy the material they learn from (most of the time). It's not fair use; it's "I paid for the entire opera", sometimes multiple times: different editions, movies, TV shows, etc.

Secondly, it's not true that derivative material is automatically copyright free.

It is, in all honesty, the contrary: most derivative work that reached popularity is plagued with plagiarism, lack of attribution, undisclosed ghost authors, etc., all things that get settled with a contract or in court if the publisher thinks it's worth it.

Otherwise the publication simply disappears.

In other cases the work is licensed, so that the publisher can use someone else's IP and literally resell other people's ideas and/or change them the way they like (or the license permits), without having to create new material and take the risk that nobody will notice it.

Case in point (among too many)

https://en.m.wikipedia.org/wiki/Legal_disputes_over_the_Harr...


Copyright no longer (and perhaps never) serves as a tool to further the creative/productive output of the society. It should be demolished and rewritten, and when in doubt, it should allow rather than disallow.


I find that this comment reduces humans to elaborate Markov chains and then uses that misconception to make a point.

Many of humanity's best works (paintings, classical music, golden age of physics) have been created before humans voluntarily reduced themselves to automata.

AI hasn't produced anything apart from mashing together other people's creations, usually with a somewhat creepy result.

In programming this may work because quality does not matter, only LOC and social capital with the rest of the brogrammers. The objection that therefore real programmers do not have to be afraid is false. They either have to join the mediocrity or clean up the mess that the brogrammers make (while being disrespected by them of course).


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning

You have to be very careful with this line of thinking. I remember SCO versus Red Hat began on much smaller premises. I also remember it took years for ReactOS to audit their code after mere suspicions arose about code that seemed to be inspired by something like asm-to-C translation.

The GPL is clear: derived works must be under the GPL too. The license must be respected, it doesn't matter if it was copied "from inspiration because it was learned" by an algorithm or a person.


I think the problem is that a powerful entity is profiting off of other people's work without their consent and gives nothing to the exploited members in return. Sure, an individual human learns from copyrighted works and reincorporates to make something slightly new all the time. And then they may also profit from it and not give anything in return to those that came before.

The problem here is the scale and the power that enables that scale. This is industrial level mining of non-consenting humans, exploiting their life's work in many cases.


Maybe the solution here is to adopt the approach from humans: if you independently produce someone's copyrighted work and then find out about it, you drop it. The same approach could be used here: they can add a check for the similarity between the output and the original training material, and if it is above a threshold, drop the suggestion (maybe they are already doing that).
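A crude version of that similarity check can be sketched with verbatim n-gram overlap. This is a toy illustration only: the function names and the 0.5 threshold are arbitrary, and a real system would need normalization and hashing to work at the scale of a training corpus.

```python
# Reject a suggestion whose verbatim 8-gram overlap with the training
# corpus exceeds a threshold. Toy sketch; not how Copilot actually works.

def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(suggestion, corpus_text, n=8):
    """Fraction of the suggestion's n-grams appearing verbatim in the corpus."""
    sug = ngrams(suggestion.split(), n)
    ref = ngrams(corpus_text.split(), n)
    return len(sug & ref) / len(sug) if sug else 0.0

def filter_suggestion(suggestion, corpus_text, threshold=0.5):
    """Return the suggestion, or None if it parrots the corpus too closely."""
    return None if overlap_ratio(suggestion, corpus_text) > threshold else suggestion
```

The hard part, of course, is that "similarity" to copyrighted work is not the same thing as verbatim overlap, which is where such a filter would fall short.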


That doesn't work, though - the problem is you have automated crime. When you do this, you can no longer handle this on a case by case basis - you have to resort to automating justice.

And at this point, you are getting both attacked and supported by AI, and not really better off in a meaningful way.

There's no good way of solving this issue, without general intelligence, and the problems that will bring (what reason does a generally intelligent AI have for supporting us or not enslaving us).

This is why all AI research, IMO, is unethical. Point me to a single AI use that has not already been abused and maybe I will change my mind; as it stands, though, we should be prosecuting the people misusing this technology, or at least irresponsibly releasing it, as quickly as possible, before we get to the point where we are no longer fighting bad actors but the machines themselves.


AI research is basically just “the history of computer science”. Alan Turing’s Imitation Game being an apt example.

I don’t think AI research in a vacuum is as deeply unethical as you suggest. It’s about current societal context- people won’t like being worse at things; resources will be hoarded rather than distributed.


>If human beings had much more accurate memories, copyright would be quite a bit more intrusive (and/or quite a bit less effective) because, following any exposure to some kinds of works, we could use our own memories to reproduce those entire works from scratch for our own use or pleasure without obtaining authorized copies from elsewhere.

I don't know the name, but I remember some sci-fi story about some academy where humans were trained from birth without exposure to music others had written and had to reinvent it on their own. Some would cheat and access the outside world's music, but they would always be caught by their later compositions all having obvious influence from conventional music.

<Insert obvious joke about how I'd remember the name if I had computer-like memory.>


There is a short story by Orson Scott Card with that theme, called 'Unaccompanied Sonata'.

You can read it here: https://b-ok.cc/book/4395497/b2fb2e


Yes! Thank you! That’s the one!


Plenty of living musicians today have this capability.

It turns out that humans can extrapolate generalisms to a degree we are currently unable to explain clearly enough as a model to imitate.

It turns out that much ML is merely referenced regurgitation.

Marketing and hype are rather advanced skills in 2022, however…


I don't know why we should be concerned with the status quo of copyright law at all with respect to AI. ML is categorically new in how it applies to these domains, and it's not clear to me at all that rules that apply to humans have much to do with rules that should apply to machines.

Imo it is very simple: IP law is intended to incentivize creative work, so that it remains possible to profit from one's creation in an environment where it might be easier to copy than it is to create. We just need to figure out what outcome we want to create: one which incentivizes human creations, or AI "creations" - and build a legal framework to support it.


It is an interesting take, and it reminds me of the thinking that made crypto what it is today, which went something along the lines of:

"Old systems suck and our new system is great and it is new technology. Therefore old rules do not apply to it."

Not surprisingly, the moment crypto started gaining traction, everyone was quickly made to understand that rules do indeed apply, even if it is a new facet of finance regulations (or, in the case of Copilot, copyright law).

For the record, I am sympathetic to your sentiment, but you can't really expect existing interests to accept a major change if it happens to undermine someone's way of life and, possibly, alter the current legal landscape. And this may end up a much bigger change than expected and may finally usher in the era management always wanted.


I'm not sure if I expressed myself well: I'm not making the argument that we should just accept a future where AI has free rein. Quite the contrary: I would argue we should examine how to preserve the spirit of the precedent - which is to protect creators - which may require us to create new laws.


They’re clearly producing derivative works.


I think you're saying any work created by a model trained on copyrighted data is a derivative work of that copyrighted data.

But this can't be right, it is inconsistent with how copyright has worked so far. Artists and musicians and engineers all learn from each other and have seen and learned from, "trained on" many other examples of works from their field. Even when works are clearly inspired by other works we tend not to give them the legal status of derivative work.

You're suggesting we treat models with a much stricter copyright regime than has previously existed.


Courts in the US have repeatedly ruled that humans and machines aren't the same in the eyes of copyright. For example, under current case law, nothing created exclusively by a machine is copyrightable.


This is not sane, sustainable or justifiable. E.g. what about that future when we have actual AI people?


I hope I shall never be forced to call a Microsoft product a person. I find your entire implied world view to be a cheap parody of the rights living beings inherently should have.

If we get to the point that capitalist maximalist-utilitarianism insists upon hijacking the very concept of what a living organism is, I can only compare it to a teddy bear vs an actual bear.

It’s not enough to merely put fur on it and an internal rom for it to regurgitate prefabricated roars upon contextual prodding.

Respect the life you are only one instance of, for hubris has always brought suffering and pain in its wake.


Yes. The copyright law must change. This is different.


why must it change, why is this different? genuinely curious


Because the scale is different.

You have a mechanism that can regurgitate (digest, remix, emit) without attribution all of the world's code and all of the world's art.

With these systems, you're giving everyone the ability to plagiarize everything, effortlessly and unknowingly. No skill, no effort, no time required. No awareness of the sources of the derivative work.

My work is now your work. Everyone can "write" my code, without ever knowing I wrote it, without ever knowing I existed. Everyone can use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation.

It's a new kind of use not known (or imagined?) when the copyright laws were written.

Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music so it can easily be "written" or "painted", in whole or in part, by everyone else, without attribution, then go ahead and opt in.

But if an author or artist does not EXPLICITLY OPT IN, you can't use their creative work to train these systems.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.

That's how the law needs to be.


What would you think if models were bundled with a second model, the "copyright filter". Just as humans know to keep their creations sufficiently far away from copyrighted material, you could distribute models which are trained on copyrighted materials but know well enough not to produce anything so close so something copyrighted that it infringes.

This would prevent anybody from accidentally infringing when using these tools. Does that seem like a reasonable solution, or is your concern greater than accidental infringement?


Explicit opt in, period.


But why? It really seems to me that things like Copilot will save millions of man hours and make the world a better place. The only harms people have come up with are highly speculative and far smaller in magnitude.


That is the underlying theory of copyright law. We make the speculation that if we don't have copyright law then people won't have the incentive to create future works.

A world without copyright would however save more than just a few millions of man hours that copilot might do. Allowing people and companies to freely use the best software available, view the best art, enjoy the most relaxing music, have the best recreational time with the best films. The only harm is the highly speculative claim that people won't be creating the best software, the best art, the best music or the best films.


That analogy doesn't work because unlike an art work, you can't sell a 20 line snippet.


Your arguments actually suggests that AI is making copyright less relevant and we should make it less strict, not the other way around.


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning (e.g. reading fantasy novels and learning concepts, tropes, or vocabulary that one uses to produce other fantasy novels;

Yes, but a human being isn't allowed to copy that work before learning from it, even if they destroy the copy afterward. AIs don't watch Youtube or browse Github. People download copies of content stored there, analyze and categorize it, and then feed it into AIs. Copyright is broken at step 1.


> It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture

This doesn't seem to translate well to code. You can't copyright "Rembrandt's style", which is what DALL-E and co. learn from analysing those paintings. But what Copilot gives you is a sizable chunk of code. That code is (mostly) exactingly precise: it's not like the AI learned a style and recreates the style. It learns what you intended to do and then verbatim copies a code chunk in. I'm pretty sure the AI part comes in to determine what it is you were likely trying to do, so that it knows which code to copy, not to generate the code out of the AI model. At least, that's how I understand it works.

That is the fundamental difference.

> OK, if you said "what word comes next? MR. AND MRS. DURSLEY OF NUMBER FOUR PRIVET DRIVE WERE PROUD TO SAY" ... same thing

Let the AI fill in the rest of that sentence and we can debate whether that is copyright infringement or not. However, if the AI system is capable of finishing that sentence, it can presumably fairly trivially be asked a slightly different question. Instead of "finish the sentence", how about "suggest the next likely sentence"? That system would presumably generate the next exact sentence straight from the book, and keep going, and voila, you've recreated the entire book.

Which is clearly copyright infringement.

AI systems have a 'volume dial' to configure how much they mix and match. Turn it down low enough and asking DALL-E for 'a girl with a blue bandana and an earring in the style of Vermeer' will just give you a copy of Girl with a Pearl Earring, reproduced sufficiently accurately that it's trivially a copyright infringement (leaving aside, of course, that the painting is long out of copyright).

Point is, for Copilot, the volume dial has to be kept extremely low, because you can't just mash 5000 snippets together unless those snippets are identical. Which is its own intriguing copyright infringement conundrum: 500 artists each individually paint the same thing, and they can each prove they weren't influenced by the others, thus no copyright infringement. Then you reproduce the averaging of the 500, which results in yet another painting. Did you just infringe copyright? Surely the answer is 'yes', but whose copyright did you infringe? All of them?
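The "volume dial" intuition maps roughly onto sampling temperature. A toy counts-based next-word model makes the point (the counts, words, and function here are all invented for illustration, not how any real system is built): as the dial goes to zero, sampling collapses onto the single most frequent memorized continuation.

```python
import random

# Toy next-word model with a temperature ("volume") dial.
def sample_next(counts, temperature=1.0, rng=None):
    """Sample the next word; temperature 0 means 'always the most common'."""
    rng = rng or random.Random(0)
    words = list(counts)
    if temperature < 1e-6:
        # Dial turned all the way down: deterministic regurgitation of
        # whatever continuation dominated the training data.
        return max(words, key=lambda w: counts[w])
    weights = [counts[w] ** (1.0 / temperature) for w in words]
    return rng.choices(words, weights=weights)[0]

# As if the training data overwhelmingly continued the famous sentence one way:
counts = {"THAT": 98, "they": 1, "were": 1}
print(sample_next(counts, temperature=0.0))  # "THAT", every time
```

With the dial low, you get the book back; with it high, you get noise; the contested middle is where "style" lives.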


I think the big difference here is that humans have paid for the copyrighted work they learned from, and the AI has not.


Sir, you and this comment section should get a room.


Your comment is full of good and deep analysis, and it uncovers a deep flaw in Copilot: Copilot helps create new code.

All the problems and confusions mentioned above are due to this concrete inherent property of Copilot.

If Copilot were made to help rearrange [1] existing code to satisfy new or changed needs, there would be no need for such a deep and explanatory analysis as yours.

[1] https://www.folklore.org/StoryView.py?story=Negative_2000_Li...

Code is a liability. Less code is less liability. New code is a new liability.

Even tools to create new code are a new and unknown liability, it seems.


It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff. It is genuinely useful. I don't care that it reproduces copyrighted content. The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses.

Luckily, someone will probably come out with a "renegade" version trained on whatever makes it a useful assistant to my coding. I won't be afraid of accidentally violating copyright myself, because I won't be trying to bait it into reproducing heavily copy-and-pasted, cherry-picked examples, and I won't use 20 lines of its output with zero modification.


> It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff. It is genuinely useful. I don't care that it reproduces copyrighted content.

Sure — in the same way that hacking into a competitor's GitHub account and copying their private source code is "genuinely useful" to you. As the person benefitting from unlawfully using their source code, of course you wouldn't care that it reproduces it. But you're not really the person we're trying to help here.


> in the same way that hacking into a competitor's GitHub account

That's like comparing grand-theft auto to someone stealing a pack of gum from a convenience store. It's not a useful analogy. The latter is still a problem, but we don't need to be FUDy about it.

And OPs right, this will keep happening until we come up with better ways of solving this problem.

Whether that's educating companies on the legal (and moral) risks their developers IDE tools are exposing them to, better licensing database/indexing, working with future OSS devs building these tools instead of treating them like criminals, suing the for-profit companies like Microsoft who seek to profit from this until they invest in this problem, etc.


OP can correct me if I'm wrong, but they don't seem particularly interested in solving anything. They literally said "I don't care that it reproduces copyrighted content." So the problem, as I see it, is the people who see the laundering of open source and proprietary code as a draw, rather than a drawback.


OP said they don't care that it can reproduce copyrighted content because they're not going to do so with it. That's roughly the opposite of what you're implying.


No, they are just going to let other people commit crimes while claiming they will never do it themselves...

Do you really believe Microsoft employees aren't going to be using this, illegally or unofficially?

"Yes! We (Microsoft) aren't doing anything illegal, but we are going to turn a blind eye to everyone using it illegally as we directly benefit from it - and here's the kicker! Our employees are legally liable not us evil laughter all the way to the bank"

Of course the legal execs aren't using it, this is classic Microsoft (Embrace, Extend, Extinguish).


If accidental reproduction of copyrighted material by AI systems is illegal under current law then we should change the law immediately so that it's not.

These AI systems are highly novel, transformative, and useful. Their development is exactly the sort of thing copyright law was originally created to encourage. If it's hindering them instead, that's a problem.

(And no, I'm not saying people should be allowed to use AI to intentionally launder stolen code; use some common sense here.)


Why is it so outlandish to expect the people who make money by selling AI systems to only train them using material for which they have a license?

As many commenters have pointed out, no one would have a problem had Microsoft trained Copilot on the Windows source code. The fact that they intentionally left it out of the training set is a huge red flag.


Because AI systems require large amounts of training data, the more the better, and requiring manual review of those datasets to ensure compliance with copyright would consume significant resources and slow down the pace of innovation across the entire AI industry.

Now let me flip that question around on you: what benefit would society gain from forcing AI developers to do all that extra work?


If you are going to use my work for free and without attribution and turn it around to compete with me, then it decreases my incentive to produce anything, and if I do it decreases my incentive to publish it. This goes directly against the intentions behind copyright law.


That's the best argument I've heard so far, but still doesn't make sense to me. It's not like your individual project is going to make any significant difference to the capabilities of the resulting AI that's "competing with you" one way or the other. So really all you'd be doing by not releasing your code is shooting yourself in the foot for no gain.

Granted, people are not necessarily rational actors, so maybe you could argue it still makes sense to have some protections in place to assuage people's irrational fears. Maybe like some kind of robots.txt for determining whether a page can be used in an AI dataset could serve that purpose. I'd be hesitant to support anything more burdensome than that.
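A sketch of what such a robots.txt-style mechanism might look like. To be clear, no such standard exists as of this writing; the file name ("ai.txt") and the "Training:" directive below are invented for illustration only.

```python
# Hypothetical "ai.txt" parser: 'User-agent:' sections containing
# 'Training: allow' or 'Training: disallow' lines. Defaults to allowed,
# i.e. an opt-out scheme, matching the suggestion above.

def allows_ai_training(ai_txt, agent="*", default=True):
    """Return whether the given crawler agent may use the page for training."""
    current, rules = None, {}
    for line in ai_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip().lower()
        if key == "user-agent":
            current = value
        elif key == "training" and current is not None:
            rules[current] = (value == "allow")
    if agent in rules:
        return rules[agent]
    return rules.get("*", default)

policy = "User-agent: *\nTraining: disallow\n\nUser-agent: examplebot\nTraining: allow"
print(allows_ai_training(policy, "examplebot"))  # True
print(allows_ai_training(policy, "anybot"))      # False
```

Whether the default should be allow (opt-out) or disallow (opt-in) is exactly the disagreement running through this thread.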


The benefit is that our collective genius isn’t mined by mega corps and rented back to us. That we exist as more than mindless resources to be tapped for profit.

Again, if (for argument’s sake) we want to maximize the effectiveness of the AI, why are we okay with Microsoft intentionally omitting one of the most important codebases in human history — which it unambiguously has the right to use — from its training set?


> The benefit is that our collective genius isn’t mined by mega corps and rented back to us.

That sounds like a downside to me, not a benefit. You're basically arguing it would be better if Copilot, Stable Diffusion, GPT-3, etc (which all included copyrighted works in their training set) didn't exist. I'm just not seeing that.


They are only using material for which they have a license (at least debatably). Open source software licenses usually require attribution if you reproduce the source code or use the source code in a program.

Some other uses are allowed without attribution. Someone can read and learn from open source software without needing to put an attribution anywhere. You could run an analysis of the code on GitHub to find out what percent of code is written in C++. You wouldn't need to attribute every project on GitHub.

Now the debate is whether this applies to training ML models.


Not sure if they edited their comment, but the end of it contradicts your interpretation:

> I won't be afraid of accidentally violating copyright myself, because I won't be trying to bait it into reproducing heavily copy-and-pasted, cherry-picked examples, and I won't use 20 lines of its output with zero modification.


No, that was there, but it doesn’t contradict my interpretation. Copyright doesn’t only cover reproducing code verbatim. It also includes derivative works.


Maybe in court, which interprets software development very strictly, but in practice a developer automatically copying a single function from some 'freemium'-style licensed library [1] posted publicly on GitHub, autocompleted into a different codebase with many thousands of lines of custom code, isn't the same as going into some proprietary codebase and stealing code to compete with or build the same product as another company.

We could come up with scenarios where there might be some fancy algorithms posted on some public Git repo that's super efficient or unique, and that somehow fits into the size of individual functions that could be auto-inserted into some other person's codebase. But IRL that sort of thing is rarely ever going to be the thing that these IDE tools do. At least in a way that meaningfully contributes to another project.

That is still a concern, yes, but it's still a niche use case, which doesn't justify killing off otherwise extremely useful tools.

Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed by these tools + spreading awareness of the issue is enough to address the licensing problems.

The more accurately we explain the problem, the quicker we'll find good solutions.

[1] usually licensing saying commercial projects need to either pay or not use it at all. Or some attribution clause


"Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed + spreading awareness of the issue is enough to address the licensing problems."

I think you are, though. You have to automate the justice as well, traditional courts can't keep pace. You'll just end up with more automated DMCA-style takedowns, not less.


I think you misunderstood my comment then (or how these tools work IRL)... because I'm not saying that it's even worthy of a court case in the vast majority of cases. So why would you need to automate such a thing?

And I don't even see how an automated DMCA system could exist because I doubt they'd win monetary damages in court over a 'stolen' function or two (or detect it in most commercial applications in the first place).

Regardless, a single class action should be enough to make Microsoft either shut down their project or adapt (via whistleblowers, leaked code, public repos, etc). And even if they don't adapt by investing in the possible solutions here, an OSS project could take its place eventually, and the courts wouldn't even be a useful solution.

Ideally a capital-backed company will help solve this, with the obvious legal incentives that already exist. But even if it doesn't this isn't going away.


>That's like comparing grand-theft auto to someone stealing a pack of gum from a convenience store.

They're both more like grand-theft auto, but one involves the valet driver leaving with your car, and the other involves smashing a window.


I use Copilot all the time and I’ve never once used it to generate a whole prepackaged function that’s more than maybe three lines. So no, I don’t benefit from its reproducing other people’s code at all. Tell me you don’t use Copilot without telling me about it.


> Tell me you don’t use Copilot without telling me about it.

You don't accept arguments against the use of Copilot from people unless they... use it?

That's a nifty way to ignore any and all criticism of Copilot, or indeed any discussion about any ethical issue ever.


I believe the argument is that you shouldn't accept arguments against the use of copilot from people unless they have tried to use it. In a realistic context. That seems reasonable to me. It's the bare minimum to make an informed opinion. I think the wording was perhaps poor, but I think your interpretation is a little reductive/disingenuous.


I don't see how using copilot make the copyright question any less serious.

Because it's useful, then it's not a problem?

Well, it's also useful to send our non-recyclable trash to third-world countries, and every first-world country should try it. It will definitely make the consequences less serious if everyone does it.

Not apples to apples but I guess you get the picture.


As someone who has used copilot since the early beta days, what I think people are saying is that nobody uses Copilot to generate full functions like this. It's more of an intelligent auto complete. It's fantastic for repetitive autocomplete where certain things have to be changed, and for quickly getting out boilerplate code. You can put a list in a comment along with a format and generate data structures quickly and easily. You can solve small problems quickly, allowing you to focus on the bigger picture.

It's sort of like a power tool — sure, you could use a screwdriver, but a drill with a screwdriver attachment will be quicker. Hammers are good, nail guns are quicker. You'd never expect someone to use a drill with a screwdriver attachment if they'd never used a screwdriver before.

There are for sure things to be improved, such as the recent post on how you could put in a very specific prompt and get out a specific function it shouldn't reproduce. The answer here isn't to shut down the project, at a net loss for everyone, but to find ways to improve it.

As others have said, with Copilot gone and the new demand created, the vacuum will bring in community projects that will happily scrape every public repository they can get their hands on.


Now I understand what you mean. It's still pretty crappy, though, because those minor autocomplete suggestions only exist because Microsoft used code without permission, without crediting the original owners, and/or in breach of the licenses on the original corpus.

I use Copilot every day and I love it, but it still leaves a bad taste in my mouth knowing that people out there worked really hard on their code, and harder on building OSS licenses, just for Microsoft to throw all that out of the window.

Feels like licenses don't matter anymore. My own code doesn't matter much, but it's about principles, dude. Licenses are there and they should be respected; if not, then it's just anarchy, and we all know anarchy only works in very specific scenarios. Microsoft is not a part of any of those scenarios.


I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. In order to make an informed decision as to whether you agree with that, you can try copilot to reach an informed conclusion. There is no cost to try it. Unlike your example, where "trying" has an immediate cost -- which is why that example doesn't make too much sense here.


Yep, now it's much clearer, thanks for clarifying.

See your sister comment's child for my reply.


Because it's a net good to the world, it's not a problem. If the benefit is orders of magnitude greater than the harm, then it's good.


It’s a net harm for the programmers whose code is being willfully plagiarized.

It’s a net boon for Microsoft in their efforts to rule the world.

It’s a net loss for society and ethics.

Open up the Copilot code, Microsoft. If you are so sure that everyone must wear transparent underwear, let's see you wearing some. Train Copilot on the Windows 11 code; it's not public domain.

Truth matters. Lies matter.


Expand on the unethical part. So people published code that could be referenced and copied on GitHub. There was no ethical problem, the world, society were happy.

GitHub makes a convenient way to search and contextualise this publicly available code and paste it into your code (adjusting local scope, format, and language along the way). Suddenly we have crossed an ethical line!?

Which ethical line? Are you pretending people never copy and pasted open source code before copilot? Are you pretending open source code never copy and pasted other open source code? That we were in an ethically pure world until copilot came along?


> So people published code that could be referenced and copied on GitHub. There was no ethical problem, the world, society were happy.

This code carries different licenses. You can't just copy code randomly without checking the license first.

Copilot serves it to unaware users stripped of the license. Even if a Copilot user wants to reuse only code licensed in a way that allows it, Copilot will serve them code under restrictive licenses without their being aware.


You can just copy and paste code without checking the license. People do it all the time.

GitHub doesn’t force you to accept the license in the repository before showing you the code.


> It’s a net harm for the programmers whose code is being willfully plagiarized.

What's the harm, specifically?

Say it copies that snippet of workflow scheduling code I made at work yesterday or the greasemonkey script I made in my own time.

How is my life worse?


> I believe the argument is that you shouldn't accept arguments against the use of copilot from people unless they have tried to use it.

I disagree, and this does not hold up generally: We can, and should, argue things we have not tried or experienced, like heroin and murder. What makes it so that this has to be tried?

> It's the bare minimum to make an informed opinion.

Only if the usefulness is what is in question. But it is not.


> We can, and should, argue things we have not tried or experienced, like heroin and murder. What makes it so that this has to be tried?

It's not that you absolutely have to have experience with something, but you'd be foolish to discount the input of people who do. In debates about drug policy I try to be polite to people with zero first hand experience, but their contributions are rarely of interest. Murder is a bit more abstract insofar as anyone who has fully experienced it by definition didn't survive to testify, but I give a lot more weight to the views of people that have first-hand knowledge of violence and crime.

It's not that you shouldn't weigh in on a topic without first hand experience, but that it's a good idea to specify the scope of your understanding, or frame uncertainties as open questions rather than assumptions.


Correct, it doesn't hold up generally. But it doesn't need to. It holds up here. We do not try things when there is exceptional risk or cost in the trying. Here there is no cost to trying, so it does not make sense not to try.

I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. So it's not just about usefulness.


> I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. So it's not just about usefulness.

How would you know though? The burden of proof is on Copilot. Especially now that it has been shown to spit out copyrighted code.


You’re right, trying Copilot is equivalent to committing murder.

/s


>> Tell me you don’t use Copilot without telling me about it.

> You don't accept arguments against the use of Copilot from people unless they... use it?

> That's a nifty way to ignore any and all criticism of Copilot, or indeed any discussion about any ethical issue ever.

"I only listen to people who agree with me, but to make that sound legitimate, I have a somewhat indirect way of saying so."


They should at least try to understand how it’s actually used, not imagining how it’s simply used to steal their largely replaceable code.


It doesn't matter how it's used. Do you think Microsoft would be happy with someone training a model on Windows source code, as long as they didn't use it to reproduce the code?


If Microsoft were confident Copilot doesn't produce infringing code, they would have included the Windows and Office codebases in the training data. I wonder what will come out of discovery.


You think MS's code quality is high enough to train an AI on?


Do you think they audited every open source code base that was used in training for quality?


“Their largely replaceable code”

Smells like: “I stole this lousy apple, but it wasn't any good.” Then why did you steal it?

Put your money where your mouth is, Microsoft: train Copilot on your own code!!!

Don't wanna train it with Windows 11 code? Prefer to hijack others' projects and use their work for your needs, and then pretend that insulting others and calling their code worthless will get you off the hook????

Backfire


> Smells like: “I stole this lousy apple, but it wasn't any good.” Then why did you steal it?

The lousy code trained copilot in what a switch statement looks like so it can autocomplete mine for me


On a different website I argued with a Microsoft employee who said that Copilot is great and so on, and who would not discuss it unless I tried it.

I tried telling him that it requires a credit card number to try, but he didn't believe me… I guess the thought that non-Microsoft employees have to pay for Microsoft products never occurred to him.


That isn't sufficient to get you off the hook. Copyright covers derivative work, not just code that's reproduced verbatim.


A derivative work has to incorporate a major part of the original before copyright comes into play.

A 3-line piece of boilerplate is neither novel nor a major part of the original.


I think if copilot is heavily restricted to three line code samples, perhaps I could agree with this.

The example cited by the OP is not a three line code sample - if you've ever done matrix coding, you know that sparse matrix operations are not simple.
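
(Just to illustrate, this is not the OP's cited code, only a minimal sketch of a CSR matrix-vector product: even the most basic sparse operation already needs a nontrivial index scheme.)

```python
def csr_matvec(data, indices, indptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    data    -- the nonzero values, row by row
    indices -- the column index of each nonzero
    indptr  -- indptr[i]:indptr[i+1] delimits row i's nonzeros
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# [[1, 0, 2],
#  [0, 3, 0]] @ [1, 1, 1] == [3, 3]
assert csr_matvec([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]) == [3.0, 3.0]
```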

Sure you can reduce it to a function call, but then you have library usage instead of code theft.

I think actually perhaps this is a way copilot could ethically move forward - instead of lifting code verbatim if it merely suggested libraries and approaches "here is an example of sparse matrix filtering and some libraries which do it", that would be both useful and ethical, presuming it does not obscure the license.


...But the example cited by the OP isn't how anyone actually uses copilot...

From where I sit, the complainant has found an extremely convoluted (and buggy) way to copy-paste their own code and is very upset about it. By similar logic, we should restrict the use of ctrl-c and ctrl-v, because they allow very simple infringement of open source licenses. Find a sparse matrix multiplication library which uses the copied code without attribution and you can take them to court; the law is already sufficient for this.


"Derivative work" is a very specific thing, and it's contrasted with "transformative work" in a way that matters a lot, and fair use intersects heavily with both.

Even when it comes to stuff that seems reaaaaally close to pure derivative: Googling "How long does it take to boil water?" => "If you're boiling water on the stovetop, in a standard sized saucepan, then it takes around 10 minutes for the correct temp of boiling water to be reached. In a kettle, the boiling point is reached in half this time."

That's a verbatim snippet pulled directly from https://unocasa.com/blogs/tips/how-long-to-boil-water, and yet Google exists and continues to do stuff like this under the fair use doctrine despite massive efforts to attack/monetize their service. [To be fair, Google does link results, which probably insulates them because it's less hurtful to the commercial interests of the source; that said, with open source there generally are no commercial interests to hurt (open source attribution will be a tough sell as an actual commercial interest), and that's specifically called out in the law as a factor]

Copilot is even less explicitly at risk IMO, in that it never even stores the text, nor can it reliably retrieve it. I have no idea what makes anyone think it should be more vulnerable than Google.

From the copyright.gov page on fair use (https://www.copyright.gov/fair-use/, worth reading in detail for anyone who cares about this stuff, also has links to a monumental number of cases with shockingly intelligible summaries): "Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work."

Copilot without any shadow of a doubt does add something new, with a further purpose, and does not substitute for the original use of any codebase on Github (it can't create any of the codebases in full, without manual guidance so extreme that you'd have to be using the actual original codebase as a reference, so it clearly cannot substitute for a single one of them, and that's what a lawyer will argue, likely successfully).

In the Google vs. Oracle case (see https://www.copyright.gov/fair-use/summaries/google-llc-orac...), a big piece of the fair use finding was that "its value in significant part derives from the value that those who do not hold copyrights, namely, computer programmers, invest of their own time and effort to learn." and "further[s] the development of computer programs”. It's hard to see where Copilot wouldn't fall into that category, as well, and that's precedent on (multiple) appeal.

By my reading this should be a slam-dunk fair use ruling, unless precedent gets really upended, and Butterick is wasting a ton of time and effort for absolutely zero potential gain other than some bragging rights, but to each his own...I guess we all have to grind our axes from time to time.


> Copilot is even less explicitly at risk IMO, in that it never even stores the text, nor can it reliably retrieve it. I have no idea what makes anyone think it should be more vulnerable than Google.

That doesn’t matter, IMHO. Once Copilot manages to copy a work, a copy has been created and copyright has been violated. If this occurrence is reasonably likely, then Copilot is wittingly assisting in violating copyright.


How likely that situation is is disputable. People have cherry picked some cases where by specifying a very particular comment you can get it to (unreliably) reproduce well-known pieces of code, sure. But that is neither the intended use of the product nor the actual way that a single person using the product really employs it, and that really, really matters. Judges are not automatons, and the fact that when and if this goes to court the developers will be able to get up and honestly say "this is a tool developed to help developers create new code that is not merely recreating the functionality of the code we trained the tool on, and all our users use it to create entirely new things" is going to matter when they argue that it is fair use/transformative work.

I do completely understand that a lot of people disagree with me morally, and think that extracting insights from scraping the public web should be illegal. You're free to have that opinion, but I'd recommend you start lobbying your congressman to change the law, because though I'm not a lawyer I hang out with a few who do copyright stuff, and I don't think the law as it stands is on your side. That said, who can say, maybe this will end up bubbling up through many layers of appeals and end up at the Supreme Court someday, this stuff is all certainly wildly outside the bounds of what anyone writing the copyright laws was thinking about back in the 70s (which I think was the latest significant iteration?) so it's fair to say it's a complete gray area.


>But that is neither the intended use of the product nor the actual way that a single person using the product really employs it, and that really, really matters

And yet, someone has done so, and found that to be the case. Therefore, it falls under "normal use".

So make up your mind. Is software engineering only about assembling working programs, or assembling working programs + navigating license minutiae?

One of these, Copilot has a place in. The other it does not.


You may hope that a judge considers a really weird, cherry-picked edge case to be "normal use", but opposing counsel will argue the opposite, and they'll have a lot of evidence if they actually look at real use.

"I can use a gun to shoot someone" doesn't make guns illegal, even if people do so with some regularity. "I shot someone to make the point that guns can kill people" is worth even less in the eye of the law, and that is literally what you're pointing to here.


And I don't know what you're even getting at with the "assembling working programs" vs "navigating license minutae" stuff. Copilot helps the former, and you're apparently trying to argue that it should be banned because some people are feeling angsty about the latter?


Google then handily provides you A LINK to the original ATTRIBUTED SOURCE.


Yes, I mentioned that. And I think it would be great if GitHub added some sort of scan to see if it was accidentally returning some verbatim code and gave a link to the repo it came from, but the situation with Google was much more clearly harmful: by scraping and presenting verbatim quotes, they are directly siphoning web traffic away from the sites they scrape from, and many of those sites make money based on page views. That is direct economic harm, and the actual impact was significant for many companies, not merely theoretical.

With open source code, the harm is much less tangible, since negligibly few open source projects make money from people going to their GitHub pages because they're searching for code snippets (not zero, but almost). My guess is that an honest quantification would put the lost revenue due to Copilot's existence in the tens to maaaaaybe hundreds of dollars. Courts look at that type of thing, which is why I don't think this will end up being an issue, at least in the US. Europe is wild, who knows what they'll do there, and that's where activists on this topic should most wisely apply pressure, you can always convince someone in government there to throw a spear at a BigCo. You won't take them down, but you may get them to negotiate, and I don't even necessarily think that's a bad thing.

That said, even in the US, if enough people make noise then things could change, so I encourage you to speak to your congressperson (I will be as well, but arguing the other side, because I really do think this is fair use and I'd like to see it enshrined as such explicitly, because this fight is going to be extremely common over the next few decades).


Ok, so you don't care about the second part of the argument; you don't care about the danger of violating copyright. The first question, though, is whether training on large bodies of OSS is fair use. Are you saying you don't care about the copyright of your own work either? Or you don't care because you don't publish your work as OSS?


I’m sure I’ve done more for OSS than 95% of the commenters here. I publish my code under MIT when possible (and WTFPL for smaller projects), and yes, please train on my work or split out my functions verbatim; they are far less valuable than some people seem to believe. I don’t even care about the attribution part of MIT; it’s simply a nice-to-have when decent people use my code.


Wait... Shouldn't you use a different license if you don't care about attribution?

Creative commons maybe.

Just because you, someone who self-proclaimedly has done more OSS than 95% of the commenters here, don't know how to use OSS licenses, doesn't mean that the copyright question being discussed here is a non-issue.

The issue is that you don't care about what the licenses in your code mean:

> I publish my code under MIT when possible (...) please train on my work or split out my functions verbatim (...) I don’t even care about the attribution part of MIT


I choose MIT because it’s a widely used permissive license most people are confident about using, not because I will personally pursue every clause in it. I will not take action against anyone using my MIT-licensed code without attribution.

Having been a member of a very high profile permissively licensed project and having started a few relatively popular ones of my own, I’d say I don’t need to take licensing advice, or be told I “do not know how to use OSS licenses”, from someone who laughably advises using Creative Commons, when Creative Commons itself advises against using CC licenses for source code, except CC0, which is entirely different from the other CC licenses.


My god, sorry to move away from the topic, but do you realise how childish your comments look?

I was considering not replying, but here goes nothing....

> not because I will personally pursue every clause in it.

Then you are out of the game, because all clauses should be respected; otherwise you are committing something very close to illegal when you violate said clauses. If you don't want a specific clause, consult a lawyer and remove it, or use another license.

> Creative Commons itself advises against using CC licenses for source code

So before you didn't care about respecting clauses, and now you do?

I could argue: there are some bits of text in CC that make it not a good license for code, but I don't care because I don't respect those clauses. I'm not going to make that argument, because it doesn't make any sense. You either use a license and respect it or you don't.


There is MIT-0, a no-attribution variant of the same license.


This isn’t just about OSS. If I have a public repo on GitHub without a license, you can’t rewrite that code and put it in your project. I own the copyright. The issue is that Copilot will still launder it into your project for you.


Oefrha just said that his code, and probably yours, is not worth that much if split into functions and comment blocks. The value of software comes from its whole purpose, not your clever email-validation-regex you are so proud of.


>code is not worth that much if splitted into functions

Some algorithms in scientific computing require lots of effort to implement as nice, reusable, performant function. Those functions often more important than whatever the whole is doing because it's what most other people will be interested in using.


> not your clever email-validation-regex you are so proud of

Really? I challenge you to write a correct email validation regex.
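
(To back that up with a hypothetical example: a naive pattern both rejects valid addresses and accepts invalid ones, and patching every such hole without re-implementing RFC 5322 is the actual challenge.)

```python
import re

# The kind of "simple" email regex people reach for first.
naive = re.compile(r"^\w+@\w+\.\w+$")

# It rejects perfectly valid addresses:
assert not naive.match("first.last@example.com")  # dot in the local part
assert not naive.match("user+tag@example.com")    # plus addressing
assert not naive.match("dev@mail.example.co.uk")  # multi-label domain

# ...and accepts strings that aren't valid addresses at all:
assert naive.match("x@_._")  # underscores aren't legal in domain names
```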


Then why use OSS licenses in the first place?

Let's just copy each other's code without attribution.


I am totally for this! It would bring everyone forward in general without costing society as much as this copyright BS.


Either that or people will stop posting their code publicly.


Maybe I’m the worst programmer in the world. It doesn’t matter. My code is still my code, and if I don’t explicitly license it such that you can copy it, you’re not allowed to.


How many lines of code was the Oracle v Google case? 9?


Just so happens I think the case is bullshit. You would have a point if I supported Oracle.


Your lack of support of them doesn't change the facts.


Says you.

If it’s all just disposable code, WHY ISN’T MICROSOFT TRAINING COPILOT ON WINDOWS AND OFFICE CODE?


Because that code isn't publicly available, just like every private repo on GitHub, and any other company's/person's proprietary/private code?


Again, if I put code in a public repo on GitHub and don’t include a license that allows it, you cannot copy that code. It doesn’t matter that you’re able to access it.


I'm allowed to read it and learn from it. To distill the core concepts and recreate it.


You as a human being, not you as a for profit company building a for profit product.


Cannot is a strong term. Telling me there may be consequences would be more accurate.


I can claim fair use and short of an injunction there's nothing you can do to stop me.


Claiming fair use is not a fix-all card you can play like that; there are a lot of nuances in whether it holds up at trial.


Just because you don't care does not mean others should not care, even if they are less valuable coders than you.

I mostly contribute by finding issues and reporting them; does that make me less of an OSS contributor than you?

And yet I do care if my private project is used by a behemoth like Microsoft without my consent, even if it's only poorly written fizzbuzz. Why? Because if I wanted to share it, I would publish it.


I was answering the questions “Are you saying you don't care about the copyright of your own work either? Or you don't care because you don't publish your work OSS?”

Feel free to care.


I personally don't care about the copyright of my own work when it comes to code. I publish the majority of my work anonymously with a clause that just says something along the lines of "feel free to use some or all of this without any attribution." Denying others from using my code feels like it goes against the essence of OSS itself. It's like if I found the solution to some complicated math problem but didn't let anyone else see it because it's "mine."


You are both a scholar, and a gentleperson. Thank you for treating the field like it is instead of some social value extraction platform.


I've yet to see anyone come up with a harm current CoPilot could do that is within an order of magnitude of the good it does. Why on earth should we care that it resulted in 20 lines of code being copied in some rare circumstances? The transaction costs alone make that code worthless as a saleable thing.


Straw man. That isn't what Github/Copilot is doing.

If open source communities are worried about having their source code copied... then don't open the source. Keep it closed, keep it off GitHub... I mean the genie is already out of the bottle, so doesn't really matter what they do now.

You can't prompt Copilot with something like "# Function that detects spam accurately" and get anything useful/sensitive/competitive out.

Are there really super-sensitive algorithms out there that Copilot is exposing that are otherwise unknown?


> The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses

What? You left out the second line of the quote. It never reproduces copyrighted content for me because I'm not trying to bait it into doing that.


Hi! I'll give some insight into why I benefit tremendously from Copilot.

I have very severe ADHD and, as a result, a terrible memory. I've been working as a dev for almost a decade now, and that work has almost always happened in a browser with a search bar, not an IDE; working with infrastructure doesn't help, as I encounter more than one programming language a day.

Copilot saves me from making dozens of junior-dev-level search queries daily: I can formulate the query right in the IDE and it will fill in the basic algorithms, data types, and language abstractions that I know exist (I use them all the time) but don't remember how to actually invoke, despite having done so just an hour ago. This is the most useful part of Copilot, not complex and very specific code.


> It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff.

It enables the large-scale theft of code. It completely ignores licenses. There are plenty of open source licenses that allow code use with proper attribution, yet Copilot doesn't (and probably can't) figure out a way to comply with all of them.

Copilot, as the article suggests, is a marketing stunt. To me it's more than that: it's Microsoft pushing the boundaries of law using its money muscle again. I have been screaming from the rooftops that VSCode is just M$ spyware, and I get legitimately made fun of for it. Now we have Copilot as well, and they aren't even hiding it.

To address your point more directly: if you do any work for monetary gain, Copilot is a de facto liability. You can't just take 20 lines of completely stolen code, modify a few things, and call it your own. That's why legal reverse engineering has an entire black-box method of development, in which QA, researchers, and developers aren't even allowed to talk to each other directly.

I hope Copilot does get shut down, along with everything like it. It is one thing to have an AI trained on your workplace's code, or on specific code under specific licenses, but the blatant theft of not only licensed open source code but also private code is a terrible precedent to set.


> you can't just take 20 lines of completely stolen code, modify a few things, and call it your own.

Now I hope that Copilot sticks around for this exact reason: cause endless inane lawsuits claiming that actual original code was stolen and laundered through Copilot or reverse engineers going cowboy. Make it enough of a problem clogging the courts that they start dropping copyright cases.


Copyright has a purpose. I don't mean a purpose in the Disney Hegemony sense but a real, actual purpose.

Dropping copyright cases is great if you hate proprietary code. That's fine. Open source is also powered by copyright. If we start dumping copyright cases we don't get the "well bob we may as well open source it!" We get large companies like Google just completely ignoring copyright and using open source code without attribution. What you've described (causing enough a problem in the courts) is the exact purpose of GPL. If we start dropping copyright cases altogether the open source movement may as well be dead in the water.


> You can't just take 20 lines of completely stolen code, modify a few things, and call it your own.

That's exactly what I said in my comment. I wouldn't take 20 lines of code, since it wouldn't actually work. Even if it was able to spit out 20 lines of correct code, they would be tailored to my codebase, and not violating copyright.

The only time you see CoPilot violating copyright is when someone coaxes it into that, in a completely empty codebase with no context. The violation of copyright is not possible when it is used as intended.


So many comments like this, but none list a tangible harm that's anywhere near the millions of man hours systems like this will save. Heck, basically none (including this one!) list a harm at all!


Honestly, kinda fair. How is it different from pirating anything, and the moral qualms that come in there? Some people see it as a massive issue; others don't believe the harm is significant. It probably makes an OSS dev enjoy their work being open source less, I imagine, which may have larger ramifications for the ecosystem it promotes. But outside of that, I'm guessing (I'm also trying to understand) it's just a matter of who's seeing the benefit from this tech. I think that's a reasonable thing to be concerned about, when wealth seems to have a tendency to centralize and the average worker doesn't have the level of power we would ideally like.


You could say this about all software piracy. So Microsoft can work to end software copyright instead of trying to corner the market on pirated software and asking forgiveness after the fact.


No, you can't.

Because most software piracy is of saleable software. 20 line snippets are not for sale, mostly because the transaction costs are higher than the snippet value.


Eventually (not in the next 5 years but probably in the next 15) a system like this is going to lower the market salaries of developers (or make some of them outright unemployable). While using their own code as input to achieve that, without ever obtaining actual permission.

Seems like a pretty tangible harm to me.


If it's going to do that much "good" for humanity, then the dataset should be 0BSD or CC0, not proprietary. Project Gutenberg does good by archiving and allowing access to a lot of books and everything is under Creative Commons to continue to allow people the freedom to use the data as well as attribute the original authors. GitHub isn't doing this at all and is instead selling a proprietary product built on scanning open source that didn't consent.


I don't think most people are concerned that Copilot is going to be reproducing verbatim copyrighted code, it's more that it sucks that a giant corporation is going to make a billion dollars from a tool that is entirely built off of millions of peoples' work who were never asked permission and will never be compensated.


That's hardly a new thing! For instance, Google search makes billions of dollars by indexing content that other people make.


Google Search links to the original content. Copilot doesn't.


Maybe, but it also shows those info cards with a summary so you don't need to navigate to the website that contains the original content. There might be a link there, but it's usually small and practically unnoticeable.


If Copilot provided a “small and practically unnoticeable” attribution to the code used, it would definitely improve the situation, especially for licenses like MIT that require attribution and nothing else.


As I responded to someone else, this isn't always true. Google "when was George Washington born".


George Washington's birthday is hardly copyrightable.


So are most of your functions and methods.


When you ask it a question, it will often simply construct an answer from the pages it indexed, so people don't have to click. Sure, it links it, but for what? Thankfully, the answers are almost always useless.


It should be noted that some jurisdictions are starting to restrict this (e.g. Australia). Also I would argue if Google would randomly display content of full websites and never post links to the original content it would be in a lot more legal trouble.


> Also I would argue if Google would randomly display content of full websites and never post links to the original content

Google does do this though. Just Google an easy-to-answer question, like “when was George Washington born”


The answer to a factual question is not copyrightable, so while you may have moral problems with that, there is no legal argument.

The same applies to all realistic use-cases for copilot by the way. Whatever it produces is not copyrightable.


> The answer to a factual question is not copyrightable, so while you may have moral problems with that, there is no legal argument.

Correct.

>The same applies to all realistic use-cases for copilot by the way. Whatever it produces is not copyrightable.

That's a pretty bold statement to make. How do you know how people use Copilot? Also, IIRC Oracle v. Google essentially determined that 3 lines of code can be copyrightable. So I think your statement fails on two points: you can't really predict how people use Copilot, and you cannot predict what a court would decide is copyrightable (this is much less straightforward than statements of fact).


Google helps you find someone's content. Copilot helps you rip off someone's content.


My guess is that production passed off as original content tends to have more avenues to harm producers than consumption does.

Alarm bells of this magnitude haven't been rung about people torrenting films for decades; it's a given that some people are just going to do it and there's little that can be done to stop it.

Producing new data from original data of questionable lineage makes the questionable acts visible. Copilot and the like actively encourage this creation.

If it were possible to peek into the rooms of everyone who downloaded a torrent to admonish them then maybe pirating would have been made a modicum more taboo. But those consumers never intended to leave their rooms. Copilot forces them to leave their rooms if they want their derivative work to be used.


That's true, and when Google started extracting information from web pages and displaying it in results pages without driving any traffic to the original websites, the authors of those pages were justifiably upset.

Google Search is ethically acceptable because for the most part website creators like being in search results and are "compensated" in the form of more visitors, and if they don't like it they can easily exclude themselves. Website creators famously do NOT like it when Google indexes their content and then serves it up independently.


Google makes money from ads. When you strip those away, purely indexing the web and offering a search engine is probably costing them money, not earning.


I don’t think anyone really cares about that kind of stuff, though. Like, there are loads of examples of similar things which no one (rightly) bats an eye at. Like if I take a photograph of you out in public and sell the photo for $1 million, would you expect compensation? Or if someone compiles a list of the best restaurants in the world and sells that list, do you think the restaurants should be compensated?

The value that is being derived here is in the curation of the material, not the material itself.


Bad example. A better example is I am a vendor across the street giving away free books. However, to comply and get a free book I require you to keep the book's bibliography intact.

You don't do this. You get my books, cut out the bibliography, glue all the pages together, and then sell the book as your own.

It is my book and all you did is derive some work from it.

Curation companies have the same problem and there are plenty of high profile lawsuits about it.


> if I take a photograph of you out in the public and sell the photo for $1 million, would you expect compensation?

I mean I wouldn't expect it, but I think I'd be pretty annoyed if you didn't ask permission and then made a bunch of money off my image. It's easy to find stories from the subjects of famous photographs who feel like they've been exploited. Just off the top of my head there's Afghan Girl, the kid from the Nirvana album, Harvard's collection of photos of enslaved people, and Henrietta Lacks is sort of a similar case.

> if someone compiles a list of the best restaurants in the world and sells that list, do you think the restaurants should be compensated?

No, but here's a better example: you make friends with a bunch of food critics, collect their thoughts and opinions and favorite secret spots, and then publish a book based on that stuff without ever telling them what you were doing or compensating or crediting them.

I'll give a concrete example: I was rock climbing recently and met an old guy who was sort of the local expert, and he told me how some other non-locals had come in and kind of mined him for information about the area, all the routes, etc. and then published a guidebook without crediting him at all. He felt pretty upset and exploited by that, and I felt bad for buying the guidebook because I had assumed it was written by some local climbers and didn't realize they got most of their info from someone else.

It's not illegal, but it is unethical.


If Copilot makes a billion dollars, it is only because it is generating at least a billion dollars worth of value to the community of developers who want to use it.

The people painting Microsoft as a big, greedy trust conveniently ignore that Copilot would actually be empowering the ecosystem of tech companies to develop services that compete with Microsoft faster and more easily.


Isn't that the corporate dream though, to make all your competitors dependent on you?


So what? Anti progressive luddites, it makes my blood boil.


When OSS code gets ripped off and people are mad: "Anti progressive luddites".

When closed source code leaks: "Copyright infringement by criminals".

What's the difference? There's plenty the world could learn from the source code of Windows or GTA6 and having access to the source of these large projects would move society forward faster. So why are OSS contributors protecting their rights "Anti progressive luddites", while the large copyright owners who guard their proprietary code like a dragon guarding gold are let off the hook?


You're assuming the person you replied to holds both those opinions.


GitHub’s free code storage, static site hosting, etc. is compensation

If you aren’t paying for the product, you are the product.


You can't give someone a dime (that they could have easily picked up from any of your competitors too) and then break into their house claiming they had been compensated. In this case they even steal code authored by people who never used GitHub at all, but had someone else mirror it or publish it on GitHub.


Not fair, while giving you the dime I'm pretty sure they quietly whispered something about them getting to live in your walls as compensation


Uh? What about projects that are mirrored on github? Why are their original authors being punished if they don't even own a github account?


GitHub’s business model is simple, they use the free tier to stay the main platform and easily attract paying customers.


> It is genuinely useful. I don't care that it reproduces copyrighted content.

I feel like the title of the article is literally written for you: """Maybe you don’t mind if GitHub Copi­lot used your open-source code with­out ask­ing. But how will you feel if Copi­lot erases your open-source com­mu­nity?"""

If you want to keep having useful tools based on open source code in the future, it is in your interest that people still want to write open source code. It is still too early to say how much of a chilling effect projects like Copilot will have on that. But clearly many (just read this comment section, myself included) are having second thoughts.


But that part of the argument is far more nebulous and bullshitty than the copyright argument. The idea that copilot is killing any substantial open-source project just isn't true right now. Copilot doesn't generate libraries worth of functionality, it generates small functions or less. Open-source projects remain as important as they were before copilot.


Personally I'm not worried about the end user using copyrighted code. That is their responsibility. If you have verbatim GPL code in your commercial closed source code base that is a liability and it might be dangerous to use copilot.

What I have more of a problem with is Microsoft charging for copilot which was trained on copyrighted code without any permission whatsoever which they really have no right to utilize/charge for.


As a human, if I learn how to program by studying copyrighted code, is it unethical for me to use that knowledge to make a living?


I'll repeat something I asked elsewhere here.

From what I understand, it is not proven that the AI uses knowledge of concepts and logic to write new code. It is likely that it instead performs a very optimized stitching of code it previously saw.

Is my understanding outdated here?

From the ethical point of view, I'd say you're making some assumptions here that result in it being ethical when a human does it, and those assumptions might not hold for an AI.

For example, you're assuming a win/win outcome, where your learnings from copyrighted open source code don't harm the original authors ability to find work, or the value of their code.

With an AI I think there are possibilities we're looking at a win/lose situation, where Microsoft wins big, and maybe some other developers that also profit of their use of copilot, but where the original authors of the code that went to train it see their skills be devalued over time as a direct consequence.

In my opinion, a win/lose is unethical. What I'm not convinced is that we're looking at a win/lose, but I think there's a possibility.


> It is likely that

Here's the real problem:

We're on a forum in which most of the participants are in at least the top 0.1% of technical ability, and yet here we are waving our hands and speculating on "what AIs think" and how "they probably/likely compute" things.

Last week I met with a director of a new "AI" research group, chuffed to the nines with a massive research grant he just landed. I'm happy for him, with only one little concern - that he knows nothing whatsoever about machine learning or mathematics and outspokenly doesn't believe that "knowledge is that important in the new reality".

Copyright infringement is an inconvenience for people who worry about that sort of stuff. Sure. But don't you all see a much more serious issue? It's bad enough that code is already so precariously bloated and over-complex nobody bothers to debug critical applications any more. Now we want to add "assistance" from tools that nobody understands.


> a very optimized stitching of code it previously saw.

I have seen this line repeated many times, but I never saw it actually explained. A lookup table is dumb and easy to understand/interpret. Deep models are not that. They are also not a linear interpolation of … something. What exactly is the claim being made here? Yes, deep models don’t generalize too well on out-of-distribution data. How does that make them a “very optimized” lookup table?


I mean an NN is, almost by definition, a weighted lookup table. The process to generate that table is not relevant.


Not at all. That would only be true if there weren't nonlinearities between the layers.
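To make the point concrete, here is a toy sketch (the layers and weights are hand-picked for illustration and have nothing to do with Copilot's actual model): without a nonlinearity, two stacked layers collapse into one fixed matrix, but a single ReLU between them yields a function no single matrix can express.

```python
import numpy as np

# Two hand-picked layers (illustrative only).
W1 = np.array([[1.0], [-1.0]])  # first layer: 1 input -> 2 units
W2 = np.array([[1.0, 1.0]])     # second layer: 2 units -> 1 output

def linear(x):
    # No nonlinearity: the two layers collapse into one fixed matrix
    # (here W2 @ W1 == [[0.0]], so this map is identically zero).
    return W2 @ (W1 @ x)

def mlp(x):
    # With a ReLU between the layers, this computes relu(x) + relu(-x),
    # i.e. the absolute value |x|, which no single matrix can express.
    return W2 @ np.maximum(W1 @ x, 0.0)

a, b = np.array([1.0]), np.array([-1.0])

print(np.allclose(linear(a) + linear(b), linear(a + b)))  # True: additivity holds
print(np.allclose(mlp(a) + mlp(b), mlp(a + b)))           # False: the ReLU breaks it
```

The point isn't that deep models can't memorize; only that a stack of layers with nonlinearities is mathematically more than a weighted table of coefficients.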


If you go around pasting code ad verbatim, a lot of people would be upset too.


The difference is scale. You will get old and die before you ever even reach 10 percent of the corpus.


That only shows that humans are better at learning than machines are right now. I may not have read as much code as Copilot, but I have read quite an amount of it, and it strongly informs the code I write, with respect to style and structure and algorithms. (I don't copy functions verbatim from memory, but neither does Copilot the vast majority of the time.)


Setting aside the very real question of whether AI "learns" the same way we do, it's bad enough we treat corporations as "people" in a legal sense, let's not start extending the same courtesy to software tools.

People have special rights and responsibilities under our legal system: we don't send an airplane with a mechanical fault to prison for a crash, nor do we extend the right to life to a web server. Humans have an implicit right to learn by reading copyrighted material; machines have no such right.


You eventually, at some point, write your own code, which is not reproduced from anywhere. From what I see, Copilot's authors state that Copilot is reproducing code; it is not inventing new code like a human being would. It seems to be a big, efficient database of somebody else's code that it reproduces.


If Copilot had the ability to write a whole application instead of suggesting based on its lookup and context, you might have a point. Unless proven otherwise I'd consider it a giant lookup table with (self-)adjusting weights.


Unless Microsoft is secretly powering Copilot with Mechanical Turk, that isn't what's happening here.


Why not complete that analogy?

What if instead of Copilot, it was a bunch of humans who were searching all the source code they could access and then copying/autocompleting that code, regardless of the license.

Is that still OK? If yes, why?


A paid service where hundreds of people search for publicly available code snippets and send them to you? I believe they call that outsourcing in some circles.


In this case we'd be sending subpoenas to those people to make sure they hadn't been instructed to disregard licenses or copy large pieces of code verbatim.


You're reading too much into my comment. All I'm saying is that Copilot isn't "learning" like a human does.


That wouldn't be ok, I'm assuming that's what you're implying as well? In any case I would say that's not ok.

Copying code, even if it's from a mix of many different places and the results look like a mosaic, would still be copying.

If the MTurk worker just suggested an implementation they came up with, that would be fine.


Why not make the analogy the other way? If I put massive amounts of code into a database and develop some sort of query language that spits out various parts of the database's contents based on the query, am I covered by fair use?


No, as long as you respect the original authors' terms of use of that "knowledge"


The difference is this: you’re likely working for the company whose proprietary code you’re working on and using as a “training model”, while contributing to the greater good of that codebase.

You’ve been authorised to see this code.


I learned to code when a misconfigured CGI server spit out an application's code (it was written in Perl) instead of executing it. While that starts going down the long and complicated road of whether or not the machine is acting definitively, I think for our purposes here we can assume that the intent was for me to not have access to the code.


So you were probably illegally accessing another person's or company's systems.

Nice you got something out of it, I'm not judging you either, but it was probably not the correct way to operate. What you should've done was notify the owners of the incorrectly configured system and left it at that.

You're also not a massive international conglomerate who should know better than to read every ones code and use it to turn a profit without first asking for permission.

I use Github like a bank, not a public library (unless I'm working on open source). I never would've allowed them to read through all my code and use it for profits without at least asking.


> So you were probably illegally accessing another person's or company's systems.

Illegal would imply some kind of intent or malice. I was legitimately trying to access the executed result, which I would have been authorized to do if the service was operating normally.

> What you should've done was notify the owners of the incorrectly configured system and left it at that.

Seems unrealistic to "leave it at that". I had to read the output to understand it wasn't what I expected, and once I read it I knew how to code, at least to a cursory degree. The code was simple and it was a service I used frequently, so it was immediately clear how the code translated to the results I was accustomed to. Maybe that would be harder to do that now in my old age, but I was just a kid so I had neural plasticity on my side.

> I use Github like a bank, not a public library (unless I'm working on open source). I never would've allowed them to read through all my code and use it for profits without at least asking.

I don't know what kind of banks you deal with, but banks normally do read through your banking records and use that information to sell services to their clients – notably loans, which require knowledge of your deposits to offer.


> So you were probably illegally accessing another person's or company's systems.

Misconfigured CGI handlers in Apache were very common in the late 90s, treating Perl as text/plain. There are no laws being broken, just a bad httpd.conf, and no one is getting locked up for malicious intent.


If I leave my door unlocked, is it OK for you to come into my home and have a party? Could I do that at your house or place of business?

What if you intentionally or unintentionally took down a server that controlled important infrastructure which people depended on greatly? A flood warning system, for example?

Grow up.


So I write you a letter asking for information and you accidentally copy me your notes on how to gather the information in your response. Nothing illegal is happening when I read your notes. Maybe I should not read them for ethical reasons, but it's not illegal.


This only applies if you were reading the code, not executing any code on the remote system (which I thought you were doing). It sounds like you were doing something different.

Either way, I still think you're in the wrong, kind of like checking out a naked person getting changed because they accidentally left their blinds open. It was available, maybe it was clever, but it's a strange way to learn how to code. Why didn't you just buy a coding book, or borrow some from the library? Was the code really of good quality if the server was configured so badly?

Obviously we have a difference of opinion and that's ok.


The issue as described by the original poster was that the code was not executed but displayed. They read it and understood how it works. This set them on a trajectory to try it themselves. This is how they started. Maybe a book was involved at a later stage.

Sure, you can argue that they were not supposed to read the code, so they shouldn't have. But without some tangible harm I don't see why we're supposed to disapprove of it. Maybe allow some hacker spirit while posting on Hacker News :-)


I’ve done similar things in the past so I said I’m not judging them, but after some time working with computers myself I’ve become more compassionate and I think the better thing to do is help a fellow sys admin and report the problem. That’s the hacker spirit.


If it was also free and open source, sure. But it's not, it's a paid product that one party reaps the profits from.


Fine, but what happens if Copilot is so successful that it ends up actively harming the very projects that it requires for training material? If you value Copilot then you should be concerned about that, even if you don't care at all about any ethical considerations.

Seems very similar to a "tragedy of the commons" type of situation.


> I won't be afraid of accidently violating copyright myself, because I won't be trying to bait it into reproducing heavily copy&pasted cherrypicked examples, and I won't use 20 lines of its output with zero modification.

Maybe you can exercise some discipline when using Copilot, but what about your coworkers? Many companies might not want their employees to accidentally insert copyrighted code into their projects.


> Many companies might not want their employees to insert copyrighted codes into their projects accidentally.

They can opt not to pay for this entirely voluntary service.


Yeah I think it just means a non-commercial alternative would be made to replace it


> It is genuinely useful. I don't care that it reproduces copyrighted content.

Plagiarizing is already understood to be a genuinely useful practice.


No, it mostly isn't. It's mostly done to cheat in academic contexts (and thus gets a less skilled student/researcher into positions). Copying entire works of fiction doesn't really add anything to the world.

Copying a passage here and there when making a new work? I don't think the courts have ever ruled that plagiarism.


>The only way you can get it to do that is to bait it with the function names of functions

I get that using function names is an obvious way to get Copilot to generate contested code, but has anyone tried to get Copilot to generate contested code in a way that users might sincerely use it for productivity? How do you get to the claim that it is the "only way"?


> I don't care that it reproduces copyrighted content.

Do you also just rip any open source code, violating the licenses? Nice.


>I won't be trying to bait it into reproducing heavily copy&pasted cherrypicked examples

You don't have to bait it.

>I won't use 20 lines of its output with zero modification.

Depending on the changes, that may still be a derivative work. The entire concept reminds me of the "can I copy your homework?" meme.


>The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses.

How do you know that this is the only way for copilot to reproduce copyrighted code?


I call these people open source haters. They selectively choose what they want open source to mean, and are against the fundamental ideas of open source.

Long live Copilot. It’s an amazing product that shows what we are capable of thanks to crowdsourcing and bleeding edge technology. We live in the future, and progress never remembers those who tried to stop it.


> the fundamental ideas of open source

A bit ironic that Copilot itself is not open source.


Really had to laugh at this one...


> I call these people open source haters. They selectively choose what they want open source to mean, and are against the fundamental ideas of open source.

B..but, Copilot isn't open source though?


The plea seems to be against proprietary tech built off open source efforts.

If they open sourced Copilot then it would probably comply with most of the licenses anyways.

Like, at the very least, respect the licenses; that means give attribution, and provide your own source as open source under the same terms.

Open source is what allowed this progress in the first place, and the way I see it, commercial interest is actually simply trying to slow it down by keeping it behind proprietary trade secrets.


The definition of "open-source" will be given by the license included with the software (or lack thereof). It could mean that we adhere to the Open Source Initiative, or it could mean that the source code is freely available even though its use is not permissive. The license will tell, not your pre-defined conception of "open-source".


A sizable, possibly plurality cohort of fully adult tech people is young enough to not know about United States v. Microsoft Corp. This would explain a lot of comments I see on this topic.

If you don't know Microsoft's history, a lot of what more informed people are worried about seems overblown. Copilot was Microsoft's first test of people's trust after the GitHub acquisition. It's going very, very, very poorly. There were ways to do this with consent and collaboration with the people and projects it takes code from, but they're acting like classic Microsoft here.

Too many people are focused on what's legal. It's fine to think of, but law is the last stop before the breakdown of society. Microsoft skipped society and went straight to sparking an inevitable test of and possible reshaping of copyright law.


>If you don't know Microsoft's history, a lot of what more informed people are worried about seems overblown.

Or maybe they do know about it, and don't agree with you. Do you allow for such an option?

https://github.com/features/copilot

"What can I do to reduce GitHub Copilot’s suggestion of code that matches public code?

We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment."


> You have the choice to turn that filter on or off during setup.

Notice that Copilot often gives code that verbatim matches open source software, even when that filter is on. For example: https://twitter.com/DocSparse/status/1581461734665367554?s=2...

Their approach of "matches or near matches (ignoring whitespace)" is clearly inadequate, and it's honestly insulting that they think this is enough. Even if Copilot just changed the case of a single letter, their filter wouldn't catch it.
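As a toy illustration of why this kind of filter is brittle (GitHub hasn't published the exact algorithm; the snippets and the whitespace-stripping normalization below are a guess at the idea, not their implementation):

```python
import re

def normalize(code: str) -> str:
    # Strip all whitespace, mimicking "matches ... (ignoring whitespace)".
    return re.sub(r"\s+", "", code)

original   = "for (int i = 0; i < n; i++) {\n    sum += a[i];\n}"
reindented = "for (int i=0; i<n; i++) { sum += a[i]; }"
one_letter = "for (int i = 0; i < n; i++) {\n    Sum += a[i];\n}"  # sum -> Sum

print(normalize(original) == normalize(reindented))  # True: reformatting is caught
print(normalize(original) == normalize(one_letter))  # False: one letter slips through
```

Any exact comparison after normalization has this property: it catches cosmetic reformatting but misses a single renamed identifier or changed letter.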


>Notice that Copilot often gives code that verbatim matches open source software, even when that filter is on.

I saw a few examples, but I don't see how that extrapolates to often. It's quite possible I've missed something in the article since I kinda skimmed it. :)

>and it's honestly insulting that they think this is enough.

They don't. - "We plan on continuing to evolve this approach and welcome feedback and comment."


>They don't. - "We plan on continuing to evolve this approach and welcome feedback and comment."

That is corporate speak for "we plan to do nothing about this".


I am willing to give people a chance. I don't think everyone at Microsoft is identical in their motivations/intentions.


Note that they opened both in the same VS Code instance. And Copilot uses other files in your VS Code project as context to make predictions, so it could have reproduced this code from the open files without having seen it in training.


I didn't say anything to dismiss or discount that some people just don't care or have a different view. I carefully qualified my comment to only address people who don't know, and they do exist in large numbers.


Or some of us do, and just have our own opinions. I literally don't care if someone steals my code, and I think the current state of digital copyright is nonsense that does not benefit society in any way.


You're free to put your code under something like CC0 if you feel this way. Everyone else who puts their code under a license that requires at least attribution and expects Microsoft to follow it can continue taking Microsoft to task for ignoring that responsibility at scale.


That is certainly your right, but not very relevant to this discussion.

Copyright laws should change if needed, but this is not the process.


Microsoft has owned Github for how many years... and this is the _first_ test?


Embrace, Extend, Extinguish. Takes a while to get to the Extinguish phase.


Yep. Four years might be a long time in Silicon Valley, but not in Redmond. Satya Nadella has worked at Microsoft 30 years. This is a company that can think past the next quarter.


Microsoft forgot that we live in a society.

Bottom text provided by copilot.


> Too many people are focused on what's legal. It's fine to think of, but law is the last stop before the breakdown of society. Microsoft skipped society and went straight to sparking an inevitable test of and possible reshaping of copyright law.

Maybe it's illuminating of a trait of human nature. On the stable diffusion webui repo many people have stated that they would continue to use the code even if it were stolen or unlicensed. These people aren't a part of a corporation; they are average netizens handed a technology essentially indistinguishable from magic with nothing in place to prevent its use.

If the tech is simply so impeccable as to be irresistible then a higher order framework needs to be in place to teach people not to bite because they will be bitten back.


When someone presents some information publicly, copying it is a natural right. Copyright can only be legitimized as the state sponsoring short-term monopolies in select areas to subsidize industries that benefit the public (from receiving the said subsidies). Obviously, a technology that lets the public create the desired output at little cost is the sort of thing that can eliminate the need for such subsidization in the first place.


One issue I see with Copilot is that they get free access to all open-source data on GitHub, but using GitHub APIs to download the data yourself isn't possible (rate limiting). This is an unfair advantage. Copilot is not only making money off of open-source, they are making money off of open-source in a way others can't.

I would love to see a lawsuit which requires GitHub to provide their full Copilot dataset.
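For a rough sense of the asymmetry, a back-of-envelope sketch. The 5,000 requests/hour figure is GitHub's documented authenticated REST API rate limit; the public repository count is an order-of-magnitude assumption, and one request per repository is wildly optimistic:

```python
REQUESTS_PER_HOUR = 5_000        # documented authenticated REST API rate limit
PUBLIC_REPOS = 200_000_000       # assumption: order of magnitude of public repos

hours = PUBLIC_REPOS / REQUESTS_PER_HOUR   # optimistic: one request per repo
years = hours / (24 * 365)

print(f"{hours:,.0f} hours, about {years:.1f} years of nonstop requests")
```

Even if each repo needed only one API call, an outsider would spend years fetching what GitHub, sitting on the data, gets for free.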


GitHub does plenty of stuff that you can't do, the contribution graph being just one example that comes to mind.

That's not unfair, and not the basis for a lawsuit, it's just business.

> Copilot is not only making money off of open-source, they are making money off of open-source in a way others can't.

Of course! That's why MS paid squillions to buy Github.


You're not supposed to be able to use dominance in one market (git hosting) to gain dominance in another (AI powered code suggestions).


You might be confused but that's literally what you do as a business. You leverage your domain area expertise to expand into new areas of business.

For instance, Apple already knew about the portable hardware market and extended their reach into the portable music market via iPod. It used the iPod to reach the music marketplace via iTunes. Used its market dominance to create iPhone and the rest is history.

Maybe that's a personal maxim on your part but there's no law that says if you have dominance (what does that even mean?) in one market you cannot enter another. Think about what you're saying. We'd have no multi-product company if that were the case. What does "dominance" or "market" mean anyway?


That's what you do when you are not at risk of being a monopoly in a sector. Apple could do these things back then because it was still a relatively small company in the tech sector. But nowadays things start to get more complicated.

As you point out, establishing what dominance is in a market is a tricky thing, and it is why governments will investigate any potential signs of it. Now, personally I don't think this is really an issue for GitHub or Microsoft (the parent company), but let's not pretend market dominance and abuse of a dominant position are not real things.


>You might be confused but that's literally what you do as a business. You leverage your domain area expertise to expand into new areas of business.

>For instance, Apple already knew about the portable hardware market and extended their reach into the portable music market via iPod. It used the iPod to reach the music marketplace via iTunes. Used its market dominance to create iPhone and the rest is history.

Look at what you're saying here. Apple had domain experience in hardware and launched a new product (good). Then it used the dominance of that product to muscle into an entirely different market (bad). And the combination of hardware and market has led to Apple being able to extract their tax on half the music market, or whatever they have.

This is exactly what we don't want and why anti-trust laws exist.


I think the parent comment is referring to https://en.wikipedia.org/wiki/Tying_(commerce).

TLDR: If you have a monopoly in a market, you can't use your position to get a heads up in another distinct market by bundling products together. Ex: Microsoft can't use their monopoly position in the OS market to get an advantage in browsers by bundling IE with Windows.

It's my understanding that this doesn't apply if you don't have a monopoly, and it also doesn't apply unless you're actually bundling the sale of multiple things together.

Doesn't seem to be relevant here IMO.


"Git hosting" is too narrow a market definition. Someone might be interested in code repositories more generally, and there GitHub won't have practical dominance. Most probably, GitHub as a whole would fall under "software development tools".


That might be true, if GitHub were a monopoly. But they are not.


In the same vein that Google is not a monopoly in search, Microsoft is not a monopoly in consumer and business OSes, Chrome is not a monopoly in browsers, etc.

Just because there is some existing competition with a few percent market share, and technically it's not a monopoly, doesn't materially change anything besides providing a pro forma excuse. That is why Google has been propping up Mozilla: they want the excuse "but technically there's another browser". For consumers and the market, however, it doesn't matter that technically there's an option that practically nobody uses.


a lot of people confuse popularity with monopoly.


And even fewer know that you don't need to be a monopoly to violate anti-trust laws.


What do you mean "You're not supposed to"? Is there some law that forbids this? From my (potentially naive) POV this seems to be roughly equivalent to asking physics professors to stay away from mathematics since they are likely to have some relevant cross-domain expertise.


> Is there some law that forbids this?

Yes. It's called the Sherman Act, and it's the basis of anti-trust enforcement in the US.

https://www.ftc.gov/advice-guidance/competition-guidance/gui...

> The Sherman Act outlaws "every contract, combination, or conspiracy in restraint of trade," and any "monopolization, attempted monopolization, or conspiracy or combination to monopolize."

I know lots of people here don't like it, but it is the law and that was the question; "this" in parent clearly meant "use dominance in one market to gain dominance in another" in grandparent, regardless of whether that's actually the central issue here or not.


Microsoft has been nailed for this exact kind of thing in the past. This isn't hypothetical.


it only runs afoul of antitrust laws if it solidifies an unfair monopoly and/or uses market position as a monopoly holder to prevent fair competition from emerging.

most monopolies are completely legal. small town with only one gas station or one grocery store? 100% monopoly, 100% legal.

there are plenty of alternatives to github, paid, free, as a service, and self-hosted. GitHub has a large market share, but no monopoly.


This would, of course, need to be determined in court. Microsoft argued it had plenty of competition when the DOJ came for it over Windows and bundling. Like I said, this isn't unexplored territory for Microsoft. This is the company of Embrace, Extend, Extinguish and the Halloween documents.

This isn't a company with plenty of goodwill in its sails launching a hip new product. They're not Do No Evil era Google or Apple riding high on the iPod. We can't just pretend there isn't a history.


Microsoft got done because they were threatening vendors with punitive action if they didn't bundle Windows and IE by default.

Which is to say: Microsoft was taking specific action which wasn't a natural consequence of their software, but was being actively enforced to keep out competition. Vendors didn't organically discover consumers weren't interested in getting a PC without Windows and IE, they were prevented from even offering the option lest they be completely denied the ability to offer that at all.


yep. people forget this (including me)


you're making a big assumption here, that Microsoft did not learn from their mistakes.

I posit that they have indeed learned from their mistakes.

1) They train Copilot with repos hosted on github.com.

2) Users who upload code to github.com grant GitHub an explicit license[0] granting GitHub the right to show that code to others.

3) GitHub do not specify which technologies or techniques they may use to show this code to others, meaning they may use any technique they like.

Microsoft have learned. And they have covered their collective asses. Users who host code on github.com agreed to these terms.

Users who don't like their code showing up in Copilot should not be hosting their code on GitHub, because they agreed to have their code delivered to others when they signed up.

[0]: https://docs.github.com/en/site-policy/github-terms/github-t...


From the linked terms of service:

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service[...]

I imagine the crux of this case will be what constitutes "the Service", and whether that includes Copilot. Also whether licensing Copilot counts as selling Your Content.


GitHub provides the code free of charge, you pay for the GPU time to train the AI model, I would assume, since that is what costs GitHub money.


Github likely does have what would be considered a monopoly marketshare, and strong network effects now keep the other competitors from breaking out.


"Github likely does have what would be considered a monopoly marketshare"

Citation needed. I'm interested to know of examples of how they have stifled competitors. AFAIK Gitlab came of age well after Github had already grown super-large. If Github could have killed Gitlab, wouldn't they have done so?

"GitHub has around 56 million users, whereas GitLab has over 31 million users." - https://radixweb.com/blog/github-vs-gitlab

That is also excluding Azure, Bitbucket, AWS, and the plethora of other git repository hosting companies and services. AFAIK, GitLab has more paid (i.e. private or premium) customers than GitHub does (though I lack a citation there; that was my understanding when researching where to host a private company's code a couple years ago. My takeaway then was that GitLab is actually more popular among private companies and GitHub is more popular for open source projects).

What would be considered monopoly marketshare? I would suggest that Facebook, Twitter, and Amazon have monopolies; if you get shut down as an Amazon seller, that can be 90% of your revenue. The fact that so many people can leave GitHub, and have left voluntarily, is implicit evidence that there are very decent alternatives (if there weren't, you would be forced to stay with GitHub for lack of alternatives). That is not the case though; it's easy to just go over to GitLab. From my perspective, I have no idea how GitHub could kill GitLab. I'm curious if there is a vector where GitHub could use its network effects to diminish GitLab, as a specific example. So, how could GitHub do that?


maybe, but it is only an illegal monopoly if they are using their monopoly position to prevent others from entering the market.

they are not threatening to block access to github.com for all Comcast users if Comcast chooses not to block access to gitlab.com, for example. That would be an illegal activity for a monopoly. simply existing as a monopoly is not itself illegal.


Yeah, it isn't illegal to be a monopoly, but being one can restrict practices like bundling.


Microsoft used its OS dominance to... solidify its OS dominance?

Does the law really say you can't include a free web browser if someone else created a paid one?

I don't want to live in a world where potential improvements for consumers get companies sued for antitrust.


I highly recommend looking into the history if this really is new to you.

https://en.wikipedia.org/wiki/United_States_v._Microsoft_Cor....


Yes, this is the basis of antitrust law


But who gets to draw the lines? Why can't both of these be in a single market "tools for programmers"?


In what sense? Certainly not my understanding of US antitrust law.


It's anticompetitive and illegal (if the government ever decides to start prosecuting people under those laws again).


There's a sizable fully adult cohort in tech who are young enough to, at best, have vague memories of Microsoft's antitrust troubles and not get why this is such a big deal. Tech history education could be better.


Not a supporter of Copilot, but I think it's pretty easy to access the same data through BigQuery:

>The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query.

https://cloud.google.com/blog/topics/public-datasets/github-...
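For anyone who wants to try this, here's a rough sketch of what such a query looks like, driven from Python. The table names come from the public `bigquery-public-data.github_repos` dataset; actually running it assumes a GCP project with BigQuery enabled and the `google-cloud-bigquery` package, so the client call is left commented out.

```python
# Sketch: license breakdown of GitHub-hosted files via the public
# BigQuery snapshot, instead of crawling the GitHub API.
# Table names are from the bigquery-public-data.github_repos dataset.

def license_breakdown_query(use_sample: bool = True) -> str:
    """Build a SQL query counting files per license.

    The full `files` table is very large; the `sample_files` table
    keeps exploratory queries cheap.
    """
    files_table = "sample_files" if use_sample else "files"
    return f"""
        SELECT l.license, COUNT(*) AS n_files
        FROM `bigquery-public-data.github_repos.{files_table}` AS f
        JOIN `bigquery-public-data.github_repos.licenses` AS l
          ON f.repo_name = l.repo_name
        GROUP BY l.license
        ORDER BY n_files DESC
    """

if __name__ == "__main__":
    # from google.cloud import bigquery
    # client = bigquery.Client()  # needs GCP credentials
    # for row in client.query(license_breakdown_query()):
    #     print(row.license, row.n_files)
    print(license_breakdown_query())
```

Note the join against the `licenses` table: the dataset does track per-repo licenses, which is relevant to the attribution debate elsewhere in this thread.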


There is a distinction between being able to access the source code, and a tool giving it to you without any context of the underlying license it is governed by.


GP was saying that GitHub has an unfair advantage in that they have instant access to all GitHub code, whereas everyone else is rate limited.

I'm pointing out that this limitation is not meaningful because everyone can access all GitHub hosted source code through BigQuery, where they won't be rate limited.

I'm not comparing BigQuery to Copilot.


Anyone can start their own GitHub competitor and do whatever they want with the source code that ends up on it. GitHub pays the bills and lets us freely upload whatever we want to their service, so it seems a bit entitled to complain about what features or data they provide or don't provide.


I pay for Github because I thought it was a nice, reputable service who wouldn't go through my stuff without asking me first.


Whew... that username checks out.


You've done nothing to refute my point and so it still stands.

I pay GitHub / Microsoft to host my code, and that's all I expect them to do with it, host it, as securely as possible. It sounds like Microsoft are doing more than this so what's your actual deal...if you have one?


?

Yeah, Microsoft has bamboozled users for years... that's the whole point.


Read the EULA, not the marketing copy, next time.


I did, what did I miss? I just re-read it and it's full of statements which are totally inline with what I expected:

GitHub considers the contents of private repositories to be confidential to you. GitHub will protect the contents of private repositories from unauthorized use, access, or disclosure in the same manner that we would use to protect our own confidential information of a similar nature and in no event with less than a reasonable degree of care.


yup...


Ah sorry -

"You have bested me in this epic battle of the minds."


That’s the problem with big tech. When they own all the roads, it makes organic competition almost impossible, as they will always be 2 steps ahead.


Not to be anti-capitalist, but you’re describing capitalism lol


It’s not making money off of open source, it’s making money off of hosting open source.


Yes. But the people who agree to let GitHub host their code are doing so with the expectation that it's "open source", i.e. freely accessible.

If we define "open source" as "you can't necessarily use this to train an AI", then Copilot itself is illegal because it's using code without permission.

If we define "open source" as "you can use this to train an AI", then Copilot is legal, but GitHub may be illegally misrepresenting itself as a host for "open source", as the policy it hosts code under isn't truly open source.

If we define "open source" as "you can't necessarily use this to train an AI" but then GitHub's policy explicitly states "by using us as a hosting provider, you give us permission to use your code to train AI" then they are in the clear. But I doubt they have that clause or at least had it when Copilot was first revealed.


No? How much money would it make without using open source?


why use the API? why not just use git to get the code? All you need the API for is repository discovery


I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

But of course if the data is already sitting in object storage inside your cloud environment and all you have to do is run some MapReduce jobs to get at it...

Hence: unfair, anticompetitive, intellectual-property-right-abusing behavior. Microsoft GitHub (tm) can prevent anyone else from running the kinds of analysis they do by simple "operational security", while running literally any kind of analysis, model training, etc. they want. Don't like it? Buy their commercial services and products so you can run Microsoft GitHub (tm) on your very own Microsoft Azure (tm) infrastructure, using Microsoft Visual Studio Code (tm) and Microsoft GitHub Codespaces (tm), to work on _your_ code privately.

Best of all, you can still take advantage of the huge library of "free" code offered by Microsoft GitHub Copilot (tm) to ensure your private, proprietary codebase still has all of the advantages of Open Source Software, brought to you exclusively by the Microsoft GitHub Platform (tm).


Actually they don’t. I’ve cloned thousands of repos before (tried to archive conda-forge org for a project).

I’ve also built many parallel repo downloaders for CI reasons. You can clone repos all day pretty much with little rate limiting. I haven’t pushed parallelism past 64 per host though
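A minimal sketch of that kind of parallel cloner, assuming shallow clones and the 64-workers-per-host cap mentioned above. The repo names and mirror directory are illustrative, and the actual subprocess call is left commented out so the sketch stays a dry run.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor
# import subprocess  # uncomment to actually run the clones

def clone_cmd(repo: str, dest_root: str = "mirror") -> list[str]:
    """Build a shallow-clone command for one "owner/name" repo."""
    url = f"https://github.com/{repo}.git"
    return ["git", "clone", "--depth", "1", url, f"{dest_root}/{repo}"]

def clone_all(repos: list[str], workers: int = 64) -> list[list[str]]:
    """Fan the clone commands out over a thread pool."""
    cmds = [clone_cmd(r) for r in repos]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Dry run: just render each command. The real version would be
        # pool.map(lambda c: subprocess.run(c, check=True), cmds)
        list(pool.map(shlex.join, cmds))
    return cmds
```

`--depth 1` matters here: you get the current tree without the full history, which is what you'd want for a training scrape and keeps the per-repo transfer small.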


I don't understand. Your favourite boba joint can email every one of their customers a coupon. That's "unfair" to the other boba joints without access to their mailing list too, right? You're just describing a regular old competitive advantage


> I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

ArchiveTeam has a distributed Github archive project[0]. It's unclear what the status is right now. It seems like a worthwhile idea.

[0] https://wiki.archiveteam.org/index.php/GitHub#Archive_Team_p...


There are accessible code datasets that contain massive scrapes of Github.


Don't forget that GitHub is a closed-source proprietary commercial service.

Their freemium product is useful to many open-source projects and communities, but you do not have any more rights to use Microsoft's GitHub than you have to Microsoft's Windows.


> using GitHub APIs to download the data yourself isn't possible

Is it? Data storage would be prohibitive, but I can see ways to download the entirety of Github in a few weeks/months (assuming my size estimate is accurate).


It's probably the reverse: the data can probably fit on a few commercially available hard drives, but the API calls to discover and download all repositories without running afoul of the rate limiters and whatever other anti-crawling strategies could take years if you're running it single-threaded (a single IP can effectively be considered single-threaded).
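Back-of-envelope numbers for that claim, under loudly stated assumptions: roughly 200M public repos and ~2B files (the order of magnitude of the BigQuery snapshot mentioned elsewhere in the thread), GitHub's authenticated REST limit of about 5,000 requests/hour, and 100 repos per listing page. Discovery alone is tolerable; fetching file contents one API call at a time is what blows up.

```python
# Rough single-client estimate; all figures are assumptions, not
# measurements of GitHub's actual limits or corpus size.
RATE = 5_000            # API requests per hour, one authenticated client
REPOS = 200_000_000     # assumed public repo count
FILES = 2_000_000_000   # assumed file count
PER_PAGE = 100          # repos returned per listing page

discovery_hours = REPOS / PER_PAGE / RATE  # paging through /repositories
contents_hours = FILES / RATE              # one API call per file

print(f"repo discovery: ~{discovery_hours / 24:.0f} days")
print(f"file contents via API: ~{contents_hours / (24 * 365):.0f} years")
```

By these (assumed) numbers, enumerating repos is a couple of weeks, but pulling contents through the API from a single client is on the order of decades, which is why a plain `git clone` per repo, or a pre-built dataset, is the realistic route.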


There are publicly published datasets, updated regularly, and you can stream all the events happening on Github (which is how secret leaks happen) so you can stay abreast of new repositories.

https://www.gharchive.org/ http://ghtorrent.org/

also available on GCP as a dataset, provided by Github itself: https://console.cloud.google.com/marketplace/details/github/...


I think the very least they could do is dump the list of URLs from which they took the source files. The community will figure out the rest.


competitive advantage is not a monopoly, nor is it unfair.


I'm in favor of this. You can't ingest code that says "you cannot use this without attribution", put it through a bunch of if statements that strip the license, and then say it's "AI-generated". I don't care about most of our generic CRUD apps or the 15th rewrite of a sorting algorithm, but I do care about those smart enough to advance the field and come up with novel solutions. If we take away the incentive for attribution and recognition, people won't be as willing to share and we'll all be worse for it.

Like someone else said, there was a version of this where they asked people to opt-in and got community involvement. In true MS fashion, they just did it without asking and people are rightfully pissed.


Totally disagree. Training is fair use. It is akin to learning. Code licenses do not restrict you from reading or learning.

ML training needs to be fair use of copyrighted works, or most machine learning and AI projects will be impossible.


You are assuming that "training" and "learning" in ML means exactly the same thing as "training" and "learning" in humans. It doesn't. The processes are completely different, with only an apparent resemblance.


But the real question is: why should fair use cover both human and machine kinds of learning?


No, they aren't completely different. Learning is learning.


> Learning is learning.

No, learning (by ML) is not learning (by humans). It's just a same word used in different context, and doesn't by itself imply that the meaning is the same. The underlying process is completely different. Neural networks, despite their name, don't share anything in common with human brains.

Do you have any other argument besides "the word is the same"?


The concept is also the same.

> Neural networks, despite their name, don't share anything in common with human brains.

They share enough. Current ANNs are different from human neurons; however, the principle of learning they operate on is very similar.

Machine Learning is learning. You can't look at ImageNet or just about any other model and say otherwise.

I'm informed on how current artificial neural networks work, how they are made, what they are and aren't capable of, etc. Thanks.


The concept is shallowly similar, not nearly the same.

Human learning, at the very least, consists of maintaining an internal model of the world around us and integrating all data into a coherent structure. Current neural networks are extremely limited in their capabilities: each model operates on a strict class of data and is not able to comprehend the world outside of that class, and as such cannot be compared to human understanding. Copilot does not understand how the computer works; it only understands that connections between pieces of text exist.

If you're as informed on how ANNs work as you claim to be, then perhaps you should inform yourself more on how human mind works.


Ok thanks


You're welcome.


Like posts have said, whether training is fair use is not a matter of opinion, it is a matter of law. You can't use an appeal to your authority to make grand statements like this.

Frankly, I don't care that ML/AI _needs_ this to work. That's not my problem. You don't get to circumvent existing agreements (and law) because you believe that ML learning is the same as a human reading a piece of code and then typing it up on the side. Tesla manages just fine by generating their own training data. Other businesses have found partners to acquire data from. The only reason this isn't being immediately addressed is because there is near-zero accountability for license violations in software companies, and ML further obfuscates that.


There isn’t any law yet.

And yes, it is the same thing as learning.

If you have a robot that learns like a human does … you think it should be illegal for that machine to look at GitHub? To watch a Hollywood movie?


A human that watches a Hollywood movie, then goes on to recreate it frame-by-frame with, idk, everyone having cat ears, and says "nah, this is all my original creation" is an idiot. A human that watches a Hollywood movie and then goes on to create new works within the genre, with some homage (say, a specific hat, or a specific framing of a pivotal scene, or a specific lighting choice) to the original movie that inspired them, is learning.


So I think you would agree that the law should address the outputs of learning, rather than the inputs.

It should be illegal to reproduce copyrighted material, but not to “read”, “view”, or “consume” it.

Luckily this is what the law already says.


I think the law should address multiple things, only one of which is the outputs of learning. For example, if both the copycat human and the original director human both watched the movie via stealing it, that's also bad. Especially because copycat human is then going on to create copies of work they never had legal permission to copy! CoPilot effectively cannot tell the difference between an homage and theft.


>If you have a robot that learns like a human does … you think it should be illegal for that machine to look at GitHub? To watch a Hollywood movie?

Yes.


That's absurd, it's like saying you want an army of robot slaves.

Now, wanting to minimize the impact of robotic competition on human wellbeing is understandable. But the means to that end is declining to recognize the property rights of those who try to privatize the commons.


Ideally, I want the "army of robots" prevented from being created at all. But "robot slaves" is close to the second best option.


I agree that training is fair use. I don't agree that when the model spits out verbatim or near-identical copies of copyrighted code, the copyright is somehow stripped or that the usage of it is somehow fair use.

I believe that the vast majority of code that copilot produces is fine. But we have also seen clear examples of copyright violation.

The biggest problem is that it is basically impossible for the user to tell which is which.


I agree there, and I think most rational people would too.

This whole website and "investigation", though, is some sort of sensationalism that seems ultimately aimed at fair-use training, which is very unfortunate. It seems to involve an ego trip and popularity-chasing, with targeting a megacorp as a means to that end.


Most AI projects being impossible sounds better to me than license violations.


Training by reading others works can be fair use.

But the moment you start reproducing more than a few lines of prose without attribution I guess you are in for a nasty letter.


>ML training needs to be fair use of copyrighted works, or most machine learning and AI projects will be impossible.

Good riddance then.


Microsoft has written and acquired plenty of codebases over the years. Train the AI on that.


>> "Like someone else said, there was a version of this where they asked people to opt-in and got community involvement. In true MS fashion, they just did it without asking and people are rightfully pissed."

I'm probably not the first person to say it through this whole debacle, but that might be me: https://news.ycombinator.com/item?id=33242619

>> "Copilot was Microsoft's first test of people's trust after the GitHub acquisition. It's going very, very, very poorly. There were ways to do this with consent and collaboration with the people and projects it takes code from, but they're acting like classic Microsoft here."


That's not what's happening here. Copyrighted work remains copyrighted regardless of how you produced it. If you lived in a secluded cave and completely independently wrote Harry Potter and the Sorcerer's Stone (unlikely, I know, but it's hypothetical), you'd be violating J.K. Rowling's copyright by selling it.

Copilot doesn't help people intentionally launder copyrighted code. It may cause people to accidentally use copyrighted code without realizing it. They're still liable.


> If you lived in a secluded cave and completely independently wrote Harry Potter and the Sorcerer's Stone (unlikely, I know, but it's hypothetical), you'd be violating J.K. Rowling's copyright by selling it.

This is absolutely incorrect. Independent creation is a complete defense to copyright infringement. Funny enough, Learned Hand gives a near identical example to highlight the opposite conclusion ("if by some magic a man who had never known it were to compose anew Keats's Ode on a Grecian Urn, he would be an 'author,' and, if he copyrighted it, others might not copy that poem, though they might of course copy Keats's").


Sorry, you're right. I feel bad.

You wouldn't be able to claim independent creation though by reproducing a work with Copilot.


There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

The latter is obviously a violation of copyright, full stop.

The former, to me, is obviously not a violation. If it were, that would massively tilt the playing field in favor of large corporations. It would become very hard to independently train your own models. Philosophically, I go by the principle that if it's (il)legal to do yourself, then it should be (il)legal to do the same thing with an AI's assistance.

The massive complicating factor is that nobody knows how to do (1) without also doing (2) as a side effect, because we don't understand how deep learning works well enough to control it.


This was a comment made to me in a previous, similar discussion, discussing case law around Google's use of copyrighted books in building a search engine: https://news.ycombinator.com/item?id=32654478

I'm not sure I completely agree w/ the comment (nor do I think it vindicates CoPilot), but I think it does provide insight into why CoPilot is violating copyright.


But fair use, as I understand it, is only about outputs, not inputs. (Licenses can apply to inputs.) A copyright violation occurs when one produces a work that infringes copyright. In this case, Google made digital copies of the books and showed snippets of the books to website visitors. Fair use refers to cases where producing that work is nevertheless legal.

The only analogy I can see is that copying the code internally to use in CoPilot training could be a violation of copyright (like how backing up your own MP3s is a violation of copyright?), but the licenses on these public repositories probably already allow that...


> There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

> The latter is obviously a violation of copyright, full stop.

It's not obvious to me that (2) is a violation of copyright. Unlike patents, copyright violation is not as simple to prove. My understanding is that, at least in the US, independent creation is a valid defense against copyright infringement. For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

The analogue to this does exist without AI, when creating something that looks like copyright infringement, clean room design (don't look at similar things) is often done to ensure that "independent creation" can be used as a valid defense in court. Given that, I think (1) is probably not safe to do at all if you can't prevent (2).


Get Stable Diffusion to output Mickey Mouse and see how far you can use that commercially without Disney stomping down on you hard.

Outputting copyrighted material is a violation of copyright, period. Whether that violation is enforceable depends on your means though.


And why is Mickey Mouse not in the public domain as of 2022? Therein lies the root of all these questions. The system is not designed to benefit people, but rent-seeking.


While I agree that copyright terms are unreasonably long, it's not relevant to this specific case.


> > There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

> > The latter is obviously a violation of copyright, full stop.

> It's not obvious to me that (2) is a violation of copyright. Unlike patents, copyright violation is not as simple to prove. My understanding is that, at least in the US, independent creation is a valid defense against copyright infringement. For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

> The analogue to this does exist without AI, when creating something that looks like copyright infringement, clean room design (don't look at similar things) is often done to ensure that "independent creation" can be used as a valid defense in court. Given that, I think (1) is probably not safe to do at all if you can't prevent (2).

I don't think the analogue holds; the AI has a direct view of the actual code. In the most paranoid clean-room design you have two teams: one analyses the behaviour of some software and writes a specification (without view of the source code), and the other then uses that spec to write the reimplementation.

Copilot turns that on its head, you ask to do something it then looks up the source code how to do it and gives that to you.


> For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

This is the theoretical case but I don't think I've ever seen that actually happen in practice.


> how deep learning works well enough to control it.

I don't think you can control it. Machine Learning models do not create anything, they make a prediction of the expected outcome based on the training/validation data. Similar to how human beings are an outcome of their experiences, so are ML models. Ofc human beings are much more complex than a ML model.


There is no obvious violation or obvious non-violation. It is a matter of fair use and it will be settled in court. Using copyrighted code and not open-sourcing the derivative work (Copilot's model) may very well be a violation.


But is it illegal for AI to provide the said assistance ? That, I believe, is the bigger question.


when you put your code on GitHub.com, you grant GitHub the right to show that code to others.

https://docs.github.com/en/site-policy/github-terms/github-t...

this is separate from the license you specify in the repository and you can't revoke it without removing your code from github.com.


From the text:

>We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use

I would say they would have a pretty hard time justifying using the content for AI training (and selling) based on that license. Copilot didn't exist when many agreed to that license, so an argument that Copilot is part of the Service would be difficult to pull off. Moreover, they don't even provide Copilot to people hosting on GitHub.

Note that MS themselves are not claiming that they are allowed to use the code due to their terms of service. They claim they can do it due to fair use.


I would say they've used code they host to train an AI, and they charge for a fraction of the GPU time required to train and customize their model. They're selling you their model, not the code it produces.

if this is indeed how they charge for Copilot, and I don't know if it is or is not, then they will need to show that they have done their due diligence in making sure that code is not reproduced verbatim when a user requests that it not reproduce code verbatim.

I'm quite sure that GitHub can defend Copilot in court. That's part of the process of offering a new feature to customers; making sure that it is legal and defensible to do so.

All of the armchair attorneys here who think they know better than GitHub's attorneys, when operation of the service puts GitHub's ass on the line ... I wish I had 1 percent of that confidence. I would be a thousand times more confident than I am now.


Does “I give you permission to show this code to others” include “I give you permission to offer this code to others for their use in their code”?


users of github.com are responsible for their own use of any code they find, however they find it.

GitHub shows code to those who wish to see it. it is up to the person using that code to use it according to the license. when I buy a car, it is up to me to use that car according to the law. when I buy a gun, it is up to me to use that gun according to the law. etc.


> when I buy a car, it is up to me to use that car according to the law.

And yet we modified the law to mandate speedometers and seatbelts, to make you more aware of your speed and more secure against failure. We require car companies to perform thousands of crash tests to validate that the tool they give you is safe for when you inevitably push “according to the law” a little too far.

We mandate mirrors and backup cameras because we know that those who intend to follow the law closely still have blind spots and it’s in everyone’s best interest to mitigate and increase awareness.

> when I buy a gun, it is up to me to use that gun according to the law. etc.

And yet few laws have caused the US (and other nations) to question this principle quite like gun laws.

Gun laws are really both a perfect example and the worst example of why we’re having a debate around Copilot. We both expect people to be responsible for their decisions (you need to verify the legality of that code snippet before using it) while also giving them the notion that they can toe the line as much as possible (why regulate the availability of dangerous tools; crime is illegal; users won’t make a mistake).

Guns are used to kill people despite it being illegal. That’s why people want gun control laws. And in a comparison I never expected I would make, perhaps people want AI to be regulated because it will be (is?) used to circumvent copyright.

Edit: I don’t know if I really have a side I stand on in this debate overall, but I think the argument for why it’s copyright violation today is pretty compelling. We wouldn’t make the progress we’ve made without this violation and perhaps the loss of copyright is a worthy sacrifice?


> I think the argument for why it’s copyright violation today is pretty compelling.

I still don't see how there is any footing for a copyright infringement claim here, given that users who put public code on github.com explicitly grant GitHub a license to use that code to provide services to other GitHub users.

that license grant is above and beyond what any specified license terms the repo itself grants to users of the code.

you literally grant GitHub the right to do this when you put your code on github.com.


Actually no you don’t. The ToS is obviously long, but it’s surprisingly human readable and tech friendly (eg they have verbiage on reproduction of your content for search indexing).

Relevant snippet:

> you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking). You may grant further rights if you adopt a license.

The key part is the “through GitHub” portion. GitHub is being careful to not give people rights to your content beyond the right to view it through GitHub. “Perform” refers to multimedia like music and video assets (according to other parts I didn’t reproduce).

No one is gaining a license to use your code through the inclusion on GitHub.

Section D is the relevant section.

https://docs.github.com/en/site-policy/github-terms/github-t...


Fair use is baked into copyright law, "full stop".

The only way to prevent all uses of your code is to keep it secret.

If anyone wants to say that my using Copilot violates their copyright, then sue me. But if you have no loss of reputation or revenue, and I have an innocent-infringer defense, no one can stop me.


> Fair use is baked into copyright law, "full stop".

Someone recently said most statements on HN should automatically get "in the US" appended to them because of how US-centric many of the views are. This is an excellent example. There are plenty of jurisdictions where "fair use" doesn't exist.


> if you have no loss of reputation or revenue

Statutory damages still apply.

> I have an innocent infringer defense

There's no such thing. It's a meaningless phrase, in a legal sense. Claiming that infringement was accidental or unintentional is not a defense. It has no effect on a determination of guilt or innocence. All it affects is the penalty.

Fair use is a defense, but a more limited one than you seem to believe. The usual formulation is "criticism, comment, news reporting, teaching, scholarship, or research" but none of those apply to Copilot. Fair-use claims are also not accepted by default, but only by demonstration that the four factors defining it are all applicable.

Also, since you've brought it up recently, you as the defendant in a copyright case don't get to choose jurisdiction. Usually the plaintiff does, either because it's explicitly defined in the same license that grants anyone rights at all or because it's a place where they do business. If you live in a different jurisdiction that might affect whether the plaintiff or court can collect any penalties, but not whether any are assessed. Having yourself declared persona non grata in multiple jurisdictions doesn't seem like a good long-term choice.

https://copyright.columbia.edu/basics/fair-use.html

Instead of "flooding the zone" with dozens of comments offering nothing but the same few (false) claims - strong echoes of a recently banned user BTW - I strongly suggest you actually read up on copyright and fair use. They're not whatever you want them to be. Courts are unimpressed by your towering intellect.


RSI took away my ability to write any significant amount of code 30 years ago. co-pilot plus speech recognition restored that ability. What impressed me most was that, from a textual description, co-pilot gave me code that could have been written by my mind and pre-injury hands.

from the comments here, if I push copilot into giving me code that I would have written for a given problem and that code violates licenses, then who is responsible for the copyright violation? co-pilot for giving me code that looks like copyrighted code or me for tweaking co-pilot commands to give me the code I envisioned which looks like copyrighted code?

also consider that the very tools used for solving problems in code lead coders to a small number of solutions for a given problem. is it plagiarism or parallel original thought?

also consider that when I wrote code, if I was solving a similar problem to what I solved before, I recreated that previously used code fragment (or larger) and use it to solve the problem at hand. I had zero issues leaving a trail of duplicate code behind me especially if the code was a major part of a software patent.

I didn't care; my code was lauded for its readability and reliability. Reuse the same concepts in multiple variations and you get real good at writing code correctly.

Maybe co-pilot-like programs could scan existing code bases and find examples of code-fragment plagiarism, with the goal of showing that software copyrights are useless.
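To sketch what such a plagiarism scanner might look like (purely illustrative, assuming a MOSS-style shingle-hashing approach; the tokenizer and window size are arbitrary choices, not any real tool's):

```python
import hashlib
import re
from collections import defaultdict

def fingerprints(source: str, k: int = 8) -> set[str]:
    """Hash every overlapping window of k word-tokens in the source."""
    tokens = re.findall(r"\w+", source)
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(len(tokens) - k + 1)
    }

def shared_fragments(files: dict[str, str], k: int = 8) -> dict[frozenset, int]:
    """Map each pair of file names to the number of k-token windows they share."""
    index = defaultdict(set)
    for name, source in files.items():
        for fp in fingerprints(source, k):
            index[fp].add(name)
    pairs = defaultdict(int)
    for names in index.values():
        if len(names) > 1:
            for a in names:
                for b in names:
                    if a < b:
                        pairs[frozenset((a, b))] += 1
    return dict(pairs)
```

Any pair of files sharing many k-token windows is a candidate for a copied fragment; a real tool would also normalize identifiers and use winnowing to shrink the index.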


The issue isn't that something like copilot shouldn't exist. The issue is the disregard for open source. You want a proprietary code completion tool? License your training set properly. If you want to build on open source then open source it and what it produces.


This is a bit off-topic, but I wonder if there are people/teams right now creating git repos, doing the source code equivalent of "SEO" on it, and embedding backdoors in stupidly overoptimized for the training process code?

I wonder when we'll hear about the first big hack that gets traced back to production code pushed live after CoPilot "suggested" eval(base64decode({webshell}))


The less overt version of that is to figure out what mistakes copilot already makes (either things that are common in tutorials but not good in production, or things that are outdated, like hashing passwords with md5), and then systematically looking for software that includes such copilot suggestions.
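As a sketch of what "systematically looking" could mean (illustrative only; the regex signatures and the idea of grepping checked-out repos for MD5 password hashing are my assumptions, not a description of any existing tool):

```python
import re
from pathlib import Path

# Heuristic signatures for one known-bad suggestion: hashing passwords
# with MD5. Purely illustrative; real detection would need AST analysis
# and far more signatures than these two.
PATTERNS = [
    re.compile(r"hashlib\.md5\(.*pass(word|wd)", re.IGNORECASE),
    re.compile(r"md5\(\s*\$?pass(word|wd)", re.IGNORECASE),
]

def scan_tree(root: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every suspicious match."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for i, line in enumerate(lines, start=1):
            if any(p.search(line) for p in PATTERNS):
                hits.append((str(path), i, line.strip()))
    return hits
```

A real detector would parse the code rather than grep it, but even a naive pass like this would surface the md5-for-passwords pattern described above.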


Is there a technique to scan for software that includes copilot suggestions? Or is this just theoretical? Sounds impossible given MS/GH's monopoly on access to the model input data.


Probably not, but I can imagine something similar to hijacking a popular library and publishing a new version that opens a certain port and waits for instructions. All a malicious actor needs to do is increase the amount of exposed servers to be caught while they later scan the internet for anyone with that port open.


What is easier is to probably sneak in a new dependency that points to a malicious fork. Something like an XML to JSON library that does do the thing that is advertised but also additional things.


If that code managed to hit production, then the problem is with the management and engineering leadership, not Copilot.


And none of us here have ever worked under incompetent management or engineering leadership...

(Hell, I've _been_ that incompetent management and engineering leadership at various times over the last few decades...)


It's best to treat copilot like an eager intern who can churn out boring code for you. It still needs to be reviewed.


I'd be rather saddened if Copilot was shut down or neutered because of a vocal few protesting against it.

It's been a massive productivity improvement to our senior devs, and I got so used to it that it's an annoyance when Copilot doesn't respond.


Copilot wouldn't be shut down or neutered because "a few vocal people" protested against it.

It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

You act like Microsoft is trying to do a public service and people are angry about it. The reality is that they're taking billions of hours of work and using it to build a product that only they control.

If they re-released Copilot as FOSS, a lot of the valid criticisms would evaporate.


They should in addition release all code generated as a combo of AGPL, GPL, MIT, etc. and put a comment on every usage. Users would then need to license their code accordingly.

For a commercial version, run it on Microsoft's internal code, the code they actually own!


> If they re-released Copilot as FOSS, a lot of the valid criticisms would evaporate.

That changes nothing at all. A FOSS license is, in a legal sense, no different from a proprietary one. If it’s illegal, it’s illegal regardless of whether it’s FOSS.


I mean, legally, yes, but the point at issue was public outcry, which has almost nothing to do with laws.

If Copilot was FOSS there'd probably be a few absolutists complaining, but they'd be mostly ignored.


If Copilot is deemed a derivative work of its inputs, it would resolve the issue that it does not comply with the AGPL because it currently does not provide source code to its users under an AGPL-compliant license.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

I wonder how many people on HN would be on the side of the creators if we were talking about content created by Walt Disney and whether pirating was ethical?


Equal protection under the law.

I am 100% on the side of content creators, regardless of who they are.

The courts tend to take a dim view of theft. Which is what this is.

The article clearly lays out that multiple requests for a sound legal basis have gone unanswered. It simply doesn’t exist, and Microsoft is operating on a forgiveness-vs-permission model.

Licensing is 100% about permissions: clear and explicit enumeration of the permissions (or lack thereof) for a work.

This class action lawsuit should surprise nobody. It’s a class that is sick and tired of being exploited.

Do not take my work that I contributed with explicit permissions and use it in a way I didn’t grant permission for. Full stop. It isn’t complicated.

You wouldn’t download a car and all that jazz….


It's not "didn't grant permission, full stop."

Fair use is a major part of copyright law. I do not have to ask permission to use your work.

For you to win in court you have to overcome fair use, you have to overcome innocent infringer, you have to overcome no damages.

Anyone leaving comments saying that there's an obvious way a court would rule on a copyright case involving those 3 things is wrong.


If my code is licensed under terms, that’s the permission.

If you use my licensed work, then yes, you do need to follow the terms of the license.

The issue of license / contract / copyright is messy. It doesn’t ever seem (in the USA, anyway) to be definitively answered or “solved.”

I chose AGPL v3 only on purpose.

Copilot and users thereof (so now two levels removed) utilizing that code in whatever work are stealing my work (unless it’s AGPL v3 licensed). The adding of intermediaries, most likely unknown and with no way to know they are infringing, is going to be very difficult to mitigate. It’s like truly unknowingly buying stolen property.


No I don't have to follow your license.

If I use it under fair use, there is nothing you can do about it.

The fourth factor of fair use is the effect on the value of your work. If I'm not affecting the works value because there is no market for it, because it has no commercial value, you are going to have a very hard time defeating this argument in court.


> The courts tend to take a dim view of theft. Which is what this is.

https://docs.github.com/en/site-policy/github-terms/github-t...

So much animosity over rights you gave GitHub when you put your code there. "Theft", gimme a break. You license your code to GitHub so they can show it to others. This is separate from the stated license in your code. Nowhere in that terms-of-service document is the means by which code is shown to users specified.


Give me a break. Did you even read the license? You read that it is restricted to "the Service". That service cannot be Copilot, because Copilot didn't even exist when most people agreed to the license. Moreover, the license explicitly states they cannot use the code to distribute it in any other way or sell it. So if anything, they are violating their own agreement.

Also MS themselves don't even claim that training is covered by their terms of service, they claim it is fair use.


What do you think is more likely: that Microsoft stuffed up their own terms and conditions?

Or that you are mistaken, and "the Service" of GitHub includes all features available on the website, including Copilot?

Even if you're right and a court rules against them, what's to stop them changing the terms to become compliant?


Did you read what I wrote? MS doesn't even claim Copilot is covered by their terms. They claim it is covered by fair use (some people have also claimed there is code not hosted by GitHub in Copilot, which would further confirm that they believe they are covered by something else).

Moreover, the terms have been largely unchanged for years AFAIK. If someone agreed to the license years ago, they can't have agreed to Copilot use. Also, Copilot is not a service on their website; it is a separate service and they charge for it, also contradicting the terms.


If you think I'm mistaken I will gladly reevaluate what I said if you could kindly restate your position, by answering a few questions

What does separate service mean? What would Copilot look like if it was not separate?

Elaborate on "not a service on their website", as it is available and listed as a feature "Github Copilot" on their website.

Is the contradiction related to payment for the service, or just because you think it is separate?

Since I thought you were arguing that Githubs own terms prevent them from using the public repositories in Copilot, this is what I argued against.

If you think fair use is involved, then that's the end of the line. If MS claims fair use, then until a court says otherwise, it is. Anyone who thinks their copyright is being violated can get an injunction tomorrow.


> What would Copilot look like if it was not separate?

Maybe some type of hint shown inside your code when it’s shown at github.com. There is already a text editor.


> They claim it is covered by fair use

they claim the training of their model is covered by fair use, but they did not say that was the justification they were using. They don't need to claim fair use.

It's pretty clear from the Terms of Use that they can use code hosted on github.com to provide any service they like, so long as it is a GitHub service. They don't need fair use, they already have the rights to do what they are doing.


It'd probably turn towards Disney's history of bribing congress for copyright extensions whenever the mouse is about to enter the public domain.


The same copyright laws that the open source proponents are complaining about MS breaking?


There is a certain irony that a likely outcome will be the strengthening of copyright laws and enforcement, at which point people will realize they won the battle but lost the war.


But isn't it copyright upon which open source is founded? The license is based on the right of the owner of the copyright to authorize someone else to create a derivative work?

Without copyright, it would be perfectly legitimate and legal for someone to not follow a license (because it would bear no legal weight because of the lack of copyright).

I would argue that open source is best served by strong copyright protections that allow the people who created the software to make sure that further changes to it are released back to the community. Weakening copyright law makes it that much easier for big companies to co-opt some software and not release the changes back.


The problem with copyright isn't that it exists and provides protections - this incentivises creators.

But life of the creator + 70 years?

Reminds me of this:

https://arstechnica.com/uncategorized/2007/07/research-optim...

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1436186


> The same copyright laws that the open source proponents are complaining about MS breaking?

As long as Microsoft can and will wield those laws against me? Darn tootin'


Steamboat Willie was released in 1928.

People probably would have less of a problem if microsoft breached the license on 100 year old code.


That's a bit of an exaggeration. There have only been two copyright term extensions since Mickey Mouse was created, and only one of those can even remotely be attributed to Disney lobbying.


Even for the term extension that is attributable to Disney, you can also attribute it to Germany and the formation of the EU. "Mickey Mouse Protection Act" is funny, yes, but also an oversimplification.

This is also why I fully expect Steamboat Willie to fall out of copyright protection in January 2024, right on schedule. There are a few countries with supra-EU copyright terms, but none of them are dealbreakers. Nobody is demanding we match Mexico's life+100 terms, for example.


I don’t think anyone here is arguing that an AI trained on a massive corpus of movies that outputs snippets of film based on a prompt would be illegal. Such a thing would literally be the same as Midjourney with images. The fact that you can likely coax any AI to output snippets close to some of the source material is not likely to really matter, any more than if you recreated a copyrighted work using any other tool.


What’s the difference here then?


I don't think many people here would object if Copilot was trained on all the publicly available source code from before 1930.


Wait, I'm not sure what you are trying to say? Can you clarify?


That many people who are criticizing Microsoft for using open source code and claiming “fair use” and saying it’s not fair to creators are the same ones that say pirating digital content is harmless.


Because issues aren't automatically symmetrical, and impact of actions isn't equal in both ways, especially when there's power difference between the sides of the conflict, e.g. it's not hypocritical to punish a bully punching their victim, but not the victim for punching the bully back.

Disney used its power to distort copyright laws in a self-serving manner. As an individual you don't have equal power to oppose them (you were supposed to have in a democracy, but lobbying is legal and corporations are people).

Disney is a huge corporation that won't even notice if you pirate a movie, which you may not even have been able to pay for anyway, because of their region-locked twisted maze of distribution and DRM.

OTOH you may be screwed if you're a creator making a living from your work, and a big corp can just take it without paying, launder it through "… in the style of $YOURNAME" query, and say they own it now, because unlike your copyright, their Terms Of Service apply.

Even people who think copyright shouldn't exist may rely on using copyright — against itself. You can't unilaterally say "I don't believe in copyright", because the law doesn't care, but if you license something as copyleft, then the law does care about your anti-copyright license.


There’s an assumption that hackers are automatically anarcho-communist, and unconditionally, wholeheartedly, vocally supportive of content piracy, which is not correct.


Hackers mostly are, but Hacker News is mostly frequented by bog-standard developers, who mostly aren't.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

Yeah, the vocal few.

Do you think I give a rats ass that Copilot is duplicating my OS code?

I have to imagine most people are completely ambivalent. Of course I have no proof, I just can’t imagine anything else.

The lines probably fall somewhere along the MIT vs GPL camps.


> Of course I have no proof, I just can’t imagine anything else.

"Ambivalent" means "of two minds," but I'm going to assume you meant that you're indifferent.

If people are/were indifferent, their licenses should reflect that. They overwhelmingly don't.

Regardless, Microsoft is legally bound to obey the licenses.


I consider the GPL as meaning, "don't make the same app as me using this very source code". Cribbing one method out of a quarter-million-line code base hardly seems to be redistributing the source code. Literally, it is, but is there no concept of scale? Can we go to an extreme and force anyone with a `catch (Exception e)` line in their Java code to prove they did not take it from a GPL (or similarly licensed) project? I think this indicates there is a line: at some point it is enough code that you are recreating the functionality of the software - to me that is the thing that matters. I don't give a crap if you use my GPL code to learn from and use parts of it to build whatever software, but I do care if you recreate the same software using my GPL code.

I would accept a claim of license violation if someone used copilot to autocomplete so many methods from one specific project that you have recreated that original project.

I still think it is a matter of scope. It can still be the case that a relatively small module is not cool to lift, but in this case we are talking about such small subsets of functionality that they are completely divorced from the original software. I could see it if a specific method were really key in some way to a unique application, a very novel solution to a difficult problem - but if that were the case, how could an AI possibly use that in a training model? The auto-suggestions of an AI are going to be common coding solutions to common coding problems that the AI has seen hundreds of thousands of times. That individual GPL-licensed, unique and novel solution is not really the stuff of an AI suggestion. In other words, the code that co-pilot suggests is going to be non-unique, generic, and not really specific to the overall application at all.


Good point, I meant indifferent.

> If people are/were indifferent, their licenses should reflect that. They overwhelmingly don't.

Apparently, overwhelmingly they do. At least if the licenses used are any indication.

https://github.blog/2015-03-09-open-source-license-usage-on-...


The funny thing is that even the MIT license requires attribution, which of course Copilot doesn't provide.


[flagged]


It definitely does violate licenses -- this is true EVEN WHEN the repo they pulled it from is the original violator.


Well... if you do a search, you'll find that there are actually lots of lawsuits related to open source license violations, mostly around the GPL. Maybe educate yourself first.


Your assertion that copyright doesn’t matter and that people should “get used to it” is… pretty incorrect on the former (surely there are still pending copyright suits on Earth circa 2022) and ill-conceived on the latter (I’ll just rob your house while you sleep and you should just get used to it).


Your comment amounts to "sometimes crime occurs; therefore, laws are pointless."


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus

I'm almost entirely certain you're wrong about the desires bit. 99% of the developers who wrote that code won't mind.


> 99% of the developers who wrote that code won't mind

This is completely unsubstantiated. I for one would mind microsoft profiting from the closed source code I wrote.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

Copilot makes source code much more open, if you think about it. It implements code reuse in a different way than classes and libraries. It offers its skills equally to everyone, skills learned from everyone.

As for the cost of the API - it's expensive to run large language models, I think the price is justified. But there are free models if you like to run your own.


And if it preserves licenses that's fine. Otherwise it's copyright infringement.


Free/libre software ≠ free as beer.


Why should Copilot engineers, and the company that invests in it, not be rewarded for their incredible product and SaaS offering they spend resources on providing?


They built it using some publicly available resources. These resources are available conditionally, subject to licenses (such as the GPL). Which is fine.

The problem is that their product sometimes produces verbatim copies of licensed works, without attaching licensing information. This not only goes against the licenses under which the original authors made these works available. It can also put the product's users in danger of anything from bad publicity to a copyright lawsuit.

CoPilot is a very interesting research project. It's not yet an acceptably mature product though.


Are they following All the licenses they are consuming?

Why haven't they uploaded Windows source to Copilot?

Just how much code reproduced violates copyright?

If, instead of Copilot, Bob was giving me code to copy, and it was an AGPL codebase, am I still subject to the AGPL?


[flagged]


That's not theft. Was the original code deleted? No. Then it's copying. And not even that, is the model replicating the training set like Google search? No. Then it's some kind of derivative work. And especially for Github it's OK because user agreements allow MS to do it.

Is "imagining" the same with "copying"? Does copy-right cover learning-right? Can learning and practicing be restricted by the authors? Can visual styles, algorithms and facts be copyrighted? I say no to all of them.


Humans aren't even allowed to do this. It has to be done clean-room; this is not learning, it's copying, and has been proven so.

The only implementation that would be allowed is if they had to describe the code via a specification and Copilot were able to generate it from scratch. It does not do that; it just reorganizes stuff it's already seen, which would be a copyleft violation for a human, and therefore should be for a machine created by humans.

The code has had its license violated.


> It has to be done clean room

That's for patents, not for copyrights. All you need to do is make it a little different: generate until satisfied for patent-safe code. Another model could also do the patent checking.


I've heard multiple stories from people who can't even look at the Linux source before they do something in their company's kernel because the act of just looking at it compromises them legally. So which is it?


Clean-room implementation only happens for very special code, like codecs or an efficient matrix matmul that took billions of trials to develop with AI. Not for 2-3 line code snippets that do one simple thing and are already covered online in hundreds of places.


One of the canonical examples of software reverse engineering is Phoenix Technologies producing a compatible BIOS for IBM PC. They did exactly what OP described - had one person look at (public but copyrighted) IBM source and produce an extremely detailed design document, then another team went and wrote a new BIOS from scratch following that document. The issue at hand was copyright, not patents.


I think he is using "theft" because currently piracy (which is basically copying files) is legally considered theft.


Larceny and copyright infringement are two distinct legal categories for two distinct kinds of crimes. One involves the taking of property; the other involves the violation of a government-granted monopoly.


It’s not about “learning” at all; Copilot spits out copyrighted, licensed code verbatim, directly copied from the source to your project, in violation of the licenses of specific and multitudinous repositories.

They are making rips of other people’s stuff, selling the contents of people’s “books/movies/songs” sans author attribution or album credits etc., to put it in terms you may be familiar with. Vinegar and salt on open wounds. Bad.


Is there any nuance for scope? For software I've licensed as GPL, I'm concerned about the working software being re-used and re-licensed for something commercial. A given method out of tens of thousands is so un-germane to the overall software that I don't see it as really relevant (but that is my opinion). If someone likes the opening sentence of an encyclopedia (or some other giant work), and an AI says, "this is a good opening sentence", does that really make for "ripping off"? Isn't the covered work the larger contents of the encyclopedia, rather than an arbitrarily well-written opening sentence? Isn't the big part of the work the ensemble?

I'm starting to wonder about these arguments, and whether we've gone into bad faith and hyperbole territory here. Are algorithms subject to copyright? Is it the case that if a GPL work uses a well known algorithm, that GPL work cannot be used as a reference? (Given that algorithms have very limited forms they can take, using an algorithm as a reference is really just copying it. Even translating pseudo-code to code, it's still the same thing).

Can you explain to me how something like using Euler's formula to solve a math problem would not be copyright infringement? A GPL project might use that formula somewhere, but then using it would be a copyright violation?

How about HTML source code, does putting a 'copyright' notice on the webpage make it invalid to then use any of the javascript, even if it has nothing special to do with the domain of the website?

> selling the contents of people’s “books/movies/songs”

Going to this analogy, I don't know if it is really the contents, but more like the first sentence, or even the first few words of that sentence rather than any recognizable subset of that work.

Like, if I have an app that does a spreadsheet, I don't care if you take an implementation of quick sort from my code as reference, but I do care if you use the same and main features of the spreadsheet app that I made.


Whether verbatim copying infringes copyright can only be determined by a court, and only on a case by case basis.

Anyone saying "in violation" without doing a fair use test doesn't know how copyright works.


I’d claim that it’s more than a “few vocal” protestors. If the system is illegal, it needs to become legal or disappear.

If I’m writing code for a query optimizer, the SQL Server solution isn’t going to magically show up.


Is there any evidence that the PostgreSQL or MariaDB solution will though?


It’s not illegal. At worst it’s a fancy code-search tool, and GitHub has the right to show you its results via the license you grant them when you upload code and make it public on GitHub, a license far stronger than what other search engines like Sourcegraph have for showing public code.

It doesn’t mean you have the right to use any of the code it generates but Copilot itself isn’t illegal in any meaningful sense.


This is definitely not true. When your license requires you bundle said license with any reproductions of the code, and Copilot spits out said code sans license, they are breaking the law.


> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

I don’t think it’s accidental that this product is specifically Github Copilot.

But even then I think this is legal overkill. If you use the search box on Github they will display snippets of code from public repositories without the license. Same as what Sourcegraph does same as Copilot does. Nobody here is arguing ripgrep is violating the license by displaying matches without the corresponding license.


The violations are when that code is incorporated into your own codebase, which is happening in none of those examples. If you copy GPL'd code from GH search with a non-compatible license you are still in violation.


Yes but that’s your problem as a user of the tool. It doesn’t make Copilot itself illegal which is what the person I originally replied to was saying.

Yes if you use a tool to violate copyright it’s copyright infringement. If you prod Midjourney into outputting near exact Starry Night that’s on you too.

So far no one has made a compelling case that Copilot itself is violating copyright.


Code search may show snippets, but they're clearly not separated from the rest of the code base, including the license. It may be your problem as a user of that tool if you pirate snippets out of the search results without honoring their license, but GitHub at least didn't distribute the code without its license. A human judge would surely determine that a search-result page showing a snippet and linking back to the project doesn't constitute distributing the code without attribution. Copilot is a different matter: there is no way to know where the code came from, or whether it's novel or a verbatim copy of someone's copyrighted work. Microsoft _is_ distributing code snippets here sans their licenses.


Code search at least provides users the ability to hunt down any licensing concerns. Unless Copilot starts spitting out citations ("this snippet was generated based on repositories x, y, z; here are links"), the users of the tool have no way to verify whether they are in violation of the licenses.


The people protesting aren't a "vocal few"; We're the people who made copilot possible. We are frustrated that our work is being used to profit a massive corporation without any compensation and in a way that we at best did not intend to allow and at worst is in direct violation of the terms we set.


Nah, you're definitely the few. It's not a random sample, but an informal survey of my coworkers found no one who would care, and generally positive sentiment.

The people who comment on something are disproportionately those who care a great deal.


My point was not that those critical of CoPilot are in a majority, it is that our perspective is important because our labour is what makes copilot possible.


our labour is what makes copilot possible

What proportion of its capability is derived from the labor of people who don't like it? I get your point about feeling like an unwilling contributor while github/MS harvests revenue from people who like it. But there's an implication here of being in a critical majority, which I am not convinced is the case.


Perhaps a good compromise is to make it opt-out, if it’s not already. Though even this is just pandering to the developer’s ego. AI writing code is a massive boost in giving users power and thus freedom. Of course, we need to make AI itself FOSS, but I doubt a legal case could be made for that. A more productive path is to clone the model, like Stable Diffusion did with DALL·E.


Careful though: you are trading your (and their) muscle memory and brainpower to be locked into a proprietary solution.

Reread your post. Doesn't it sound scary? You are blocked from even thinking and crafting because a specific web service is down.

Even if Google is down you can go direct to Stackoverflow and MDN, and have a choice of information sources.

Also what is "productivity" ... as in features built / month or lines of code / month?


Correction: even if Google is down, you won’t notice, because DDG works just fine. Last time I went to any of Google websites was 4 years ago.


I tried copilot and found it an excellent way to inject subtle bugs into my code. It always had a seemingly plausible guess, that was never correct, and coding turned into a guessing game trying to spot the bugs it had injected and hoping I’d found them all.


If the tech becomes open (and all indicators point to it being open in the near future) then it will become impossible to shut down. This has already happened with Stable Diffusion and the related model leaks.

People's expectations have already been set by this technology, and they are only going to want more. Also, AI researchers are still publishing their work out in the open for anyone to reproduce.

If there was a Copilot model out in the wild like with Stable Diffusion then this ceases to be a valid question, regardless of the model's legality. All it takes is a single leak or decision by another entity to release their own code generation model.


It's particularly useful to polyglots dealing with multiple codebases in various languages, helping them context-switch fast.

Saves a lot of "hey how do I do this simple thing again?" memory loss issues.


> It's been a massive productivity improvement to our senior devs

I would hate to work at a place where advanced-but-untrustworthy autocomplete would, at all, impact the productivity of a senior engineer.

Not only does this indicate that your senior engineers' productivity is measured poorly (lines of code), but also that your senior engineers are paid to type, rather than to think.


It doesn't indicate either of those things. It's a great tool when you're working on complex code that just reaches the limit of your working memory/attention, and has often suggested good and clean improvements for me. You may dislike it but there are people it helps.


And if the code generated by copilot was attached to a license that you had to obey? Suddenly your propriety solution must be released as open source or rewritten, because copilot is effectively laundering open source code?

Life's a lot easier when you can just copy whoever did the hard work without crediting/paying/etc for it.


My experience is that copilot is shit for anything not super basic.

Incorrect suggestions all the time


There are lots of comments arguing for or against Copilot on a value judgment, and having an opinion on it being ethical or legal, etc isn't going to be the same for everyone. But I think regardless of where you stand, there should be some sort of legal ruling to clarify the gray areas that Butterick breaks down.


Agreed, but I also hate how so much of our substantive law basically has to be created by the courts because (a) many of our legislatures, especially at the federal level, have become more and more non-functional, and (b) IMO legislatures are especially bad at implementing technical legislation.

I think there is a good, fundamental legal/societal question of how copyright should apply to AI output. I just don't think our existing copyright structures handle this question well.

Note there is currently a very important case before the Supreme Court related to this issue, [1] in which the original photographer of a Prince photo is suing Andy Warhol's estate for copyright infringement. The fundamental question is whether the Warhol series of paintings is "transformative" enough of the original photo. While there are always gray lines around what "transformative" means, if there is any chance that Warhol's paintings are legal and non-infringing, I don't see how Copilot could be in the wrong. Copilot's output, even when it contains a substantial amount of the original source, appears to me far more "transformative" than the Warhol paintings are compared to the original photo.

1. https://www.npr.org/2022/10/12/1127508725/prince-andy-warhol...


I agree. Any law that's only clear after a court ruling is, de facto, an ex post facto law. Disgusting.


That's how common law works; it's not disgusting (unless perhaps you're an overzealous adherent of civil law), nor is it ex post facto. Legislation is produced, (claimed) grey areas are challenged in court, and if the outcomes appear unfair then legislators (should) update the law.

Badly written law and poor legislators are a problem in any system.


What's the solution?

Is it really better to only draft laws that are clear without courts?

Is that provable?



Bingo, I feel so uneasy at the thought we could risk a lawsuit because a colleague put unlicensed code in our repos.


Butterick sneakily asserts over and over that Copilot is simply retrieving code from Github ("Copilot's whizzy code-retrieval methods", "Copilot is merely a convenient alternative interface to a large corpus of open-source code", "our work is stashed in a big code library in the sky called Copilot"). This verbiage seems specifically chosen to present a misleading picture of what Copilot is and does.

Copilot is a set of trained weight values in a matrix. There is no source code stored in that matrix. The fact that someone can prompt Copilot with specifically chosen text to generate a short sequence of code that matches a corresponding segment of code used to train the model does not mean that it is somehow "just retrieving" that snippet. It is _generating_ that code, guided by the weight matrix, via pattern-matching based on the chosen textual prompt and surrounding context.
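To make that distinction concrete, here is a toy sketch of my own (illustrative only: Copilot is a large transformer, not a bigram model, and `training_text` is made up). Even in this tiny model, nothing but numbers survives training, yet sampling guided by those numbers can regenerate familiar sequences word by word:

```python
import random
from collections import defaultdict

random.seed(0)

training_text = "the quick brown fox jumps over the lazy dog"

# "Training": the model ends up as nothing but these counts.
# No copy of training_text is stored anywhere after this loop.
counts = defaultdict(lambda: defaultdict(int))
words = training_text.split()
for a, b in zip(words, words[1:]):
    counts[a][b] += 1

def generate(prompt: str, length: int = 5) -> str:
    """Autoregressively sample next words, guided only by the learned weights."""
    out = [prompt]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return " ".join(out)

print(generate("the"))  # generated token by token, not retrieved as a stored block
```

The output is produced by repeated sampling, not by looking up a stored document, which is the generation-versus-retrieval distinction at issue.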

That distinction is significant because one of the primary defenses against copyright infringement in US law is if the derived work is transformative. Copilot is a work derived in part from Github code, but it has unique capabilities far beyond returning short snippets of input code, and the work itself is clearly an extensive transformation of the input data.

This is without even considering whether concrete _outputs_ of the model that happen to match code in a repository used to train it are themselves protected via copyright or not, which is another issue entirely (and not as cut and dried as many folks on here seem to think).


Correct. He's written a great opening argument, as long as you're the sort of person who likes speeches. To me it was full of tricks to prime the reader into accepting his premises as axiomatic, from nuanced rhetoric to pull quotes with attractive color gradients. In my view his actual motivation is the typical 30% cut of any class action settlement that goes to the lawyers, and he sees a lucrative opportunity to combine two skillsets.


And if that's the case, the legal ruling shouldn't stop at code. It should encompass any images or text that aren't in the public domain, don't have copyright owned by the training entity, or otherwise permit this use. Which probably throws a massive spanner into a lot of machine learning. Which may well be fine but would probably be a big setback for ML generally.

Any higher court ruling might well draw some lines between different domains, but be clear that a ruling against GitHub would almost certainly be a ruling for copyright maximization and against fair use in other respects. So be careful what you wish for.


While the moral and legal discussions here are interesting and worth exploring, I find this text hyperbolic. Its premise is that the main way that people currently interact with open-source projects is by digging into their source code, copy-pasting away a snippet of code that solves a particular problem, and then of course giving the authors the required attribution.

This is far from the truth. The main usage of most open-source projects isn't as code, but as a product. The median user of an open-source project wants to think about the project as little as possible. They want to be as unaware as possible of the code that makes up the project. They're happy to add the project to their `requirements.txt`, add a few lines to import and use it and then never think about it again.


I agree with that.

Also, if we agree that GitHub Copilot enables you to be more productive as a developer, can we argue that it could help open-source communities by letting them finish projects faster?


I came to write the same comment as cool-RR. I'm not sure I ever copied a code block from an open-source project. I've copied plenty of code blocks from gists, Stack Overflow, and blogs. Those are the media that could be starved out by massive use of Copilot.

There could be a problem for open source projects (and closed source ones as well) if Copilot could autocomplete with code from private repositories. I can't remember if it looks at them too.


I haven’t used Copilot, but do its samples give links to where they came from? If so, that seems to be a sufficient funnel back to the OSS repo itself, without the community-harming aspects mentioned in the article.


It doesn't, because that's not how it works.

Copilot doesn't recognize what you're trying to do and then paste a code sample from a repo it has in its index. Just like DALL·E 2 doesn't produce images that say "I picked these pixels from this image and this part from this other one and these colors from this third one". When a model is trained, it's effectively a set of hundreds of millions of numbers that when combined in just the right way can produce a specific output. In my experience the vast majority of the time Copilot doesn't write code that already exists. It actually uses the variables you declared, the functions that already exist in your code base, etc.

It's not an index of best matches from GitHub for what you're trying to do.


They don't. It would be a profoundly difficult problem to find the right links for each suggestion.


Not at all. There isn't even a way to get the "source" if you wanted.


The whole point of open-source is about re-using and modifying the source though. Sure it allows using the product, but that's hardly the defining factor of open-source.


The point is access to the source. If that access is mediated through a system that doesn't tell you where the code comes from or how it's licensed, it's failing at open source.

It's not like they don't get it. Even Microsoft will provide the source of its products under certain circumstances for this exact reason.

https://www.microsoft.com/en-us/sharedsource/


It's always interesting to see the buzz that occurs when Copilot is brought up as a topic. This place is called "HackerNews", yet routinely people forget that a "hacker" is somebody using technology to overcome novel problems. Doesn't GitHub Copilot fall into this category? Why is there such an outcry over a technology that has been in the public's hands for less than a year? I'm almost certain that the team responsible for Copilot is going to try to figure out how to avoid spitting out code verbatim, as that's obviously not a good look.

It's most likely the case that in 1, 3, or 5 years, Copilot won't be spitting out code blocks verbatim. It will generate right-sized code, trained on lots of publicly available code, and start reducing the surface area required to code and develop.

Stable Diffusion doesn't get in trouble right now because the artwork looks like permutations of different works; text is easy to copyright, style is more challenging, but artists are facing up against the same reality. There's no rolling this back; ML models are going to remove a ton of cruft from creative/labor based endeavors, and people are going to need to evolve to stay relevant.


Plenty of people approve of individuals doing things they disapprove of large corporations doing.

For example, I wouldn't care if a small YouTuber used a copyright song in the background. I would care if Disney stole a small YouTuber's original song and used it in a movie.

This is entirely consistent within my ethical framework: scale and power matters.


I think it's cool that Copilot exists and it's a worthwhile scientific discovery. However Microsoft is re-selling this to companies and claiming that they can use that and owe nothing to the authors of the code, whose license terms they can ignore, and that is where the problem lies.

When a hacker finds a new way to get into systems or root their phone it's cool. When someone uses that technology to steal money or personal information or encrypt your files it's criminal.

Nothing new in this case.


Personally, all my open-source code is in the public domain (CC0), so it's fair game. However, taking any other code without regard to its license or the author's permission is unethical and undesirable.

At the very least, add a comment in the generated snippet noting where the code originates from. That won't suffice in all cases, but it's disingenuous to profit from others' work without any credit or permission.


Fedora no longer accepts the CC0 license for code. If it's public domain code, you should be using 0BSD.


Nothing is in the public's hands; they have taken the public's data and given back nothing but a black box that you pay money for.

They don't even trust the thing to train it on their own code, yet their boss is over here telling us they are "learning". It's a damned insult.


I won't shed any tears for Microsoft if people liberate or reverse engineer the model weights.


I think the main argument is that it's a proprietary piece of software that piggybacks on hacker (GPL) culture for profit. If they released an open-source Copilot it would be less of an issue, and the theft of GPL code would be less painful. Some people worry that this will kill the GPL and hacker culture, which is already extremely difficult to protect.


That assumes that the open source community is going to be the same in the future...still generating code just to have it frozen, dehumanized and capitalized upon by Copilot:

> Meanwhile, we open-source authors have to watch as our work is stashed in a big code library in the sky called Copilot. The user feedback & contributions we were getting? Soon, all gone


What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models? Where you have to opt into models or can opt out (and many wind up doing so)?

OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

Most of these complaints seem to be extremely emotional and cherry-picked. "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that), "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

It seems to me that the longer-term implications of the outcome of a lawsuit like this are far more interesting, yet almost all the comments I see are nitpicking and whining about how the world isn't the way they want it to be. I wish the conversations around generative AI could be...just better.


> "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out"

https://twitter.com/docsparse/status/1581461734665367554

An English description plus three characters of a function name is enough to coax Copilot into distributing LGPL-licensed code out of context, without a proper license. That's neither "emotional" nor "cherry-picked"; it's a clear-cut license violation.


I mean, he clearly knew about that code in advance and used his prior knowledge to coax Copilot into spitting it out, yeah? Three characters can get you pretty far, that's 1 combination out of 125,580 (considering all english letters, upper and lower, along with most of the numbers and symbols on my keyboard), plus the description of a fairly complex algorithm.

Also, this code is really just executing a mathematical operation in what I would assume is fairly standard, so it may not even fall under copyright. IANAL.

Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

That sounds cherry-picked. Just because some high profile, highly-retweeted person says something doesn't make it not cherry picked.


Of course it is cherry picked. The idea is that it allows you to INTENTIONALLY void any copyright you want.

So let's say I obtain an illegal copy of Microsoft Windows' source code. Under this precedent, what stops me from just training (overfitting) a neural network to produce the source code verbatim, sans any license notice?

But it doesn't end there. What stops me from making a neural network that exactly reproduces the bytes of Illegally_Ripped_Disney_Movie.mp4 that I obtain from the pirate bay? Copyright need not apply.
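A deliberately overfit model of this kind is trivial to build. Here is a toy sketch of my own (the `data` bytes are a made-up stand-in): a linear layer over one-hot position inputs is just a learnable lookup table, and gradient descent drives it to reproduce its "training data" byte for byte:

```python
import math
import random

data = b"Illegally_Ripped_File"  # stand-in for any copyrighted bytes
n, vocab = len(data), 256

# One weight row per byte position: with one-hot position inputs,
# this "network" is nothing but a learnable lookup table.
random.seed(0)
W = [[random.uniform(-0.01, 0.01) for _ in range(vocab)] for _ in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Cross-entropy gradient descent, intentionally run until it memorizes.
for step in range(20):
    for pos, target in enumerate(data):
        probs = softmax(W[pos])
        for b in range(vocab):
            W[pos][b] -= probs[b] - (1.0 if b == target else 0.0)

# The trained weights now "generate" the copyrighted bytes verbatim.
reconstructed = bytes(max(range(vocab), key=lambda b: W[pos][b])
                      for pos in range(n))
print(reconstructed == data)
```

Externally this is "just a set of weights," the same description defenders give of Copilot, which is exactly the point of the question below.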

At what point is the neural network I've described (which is intentionally designed to violate copyright) distinguishable from a neural network like Copilot and others, which violate copyright extrinsically?


> The idea is that it allows you to INTENTIONALLY void any copyright you want.

It doesn't void copyright.

Anyone that uses code that Copilot spits out which infringes on someone else's copyright is still liable. There's no requirement for intent. That may be a mitigating factor in terms of remediation, but it cannot void the copyright itself.

It can, however, produce a plague of completely ignorant copyright infringement, and since the user of Copilot has no idea where the code is coming from, there's no way to check if it was trained on infringing code.

If I used Copilot I would be really worried about the legal implications for me, and that I could easily be accused of copyright infringement or plagiarism.[*] Of course, people seem to not really care these days if they can cheat to get ahead, so this is probably a feature, and 99.9% of users won't have their reputations ruined by using it.

[*] Although I'm personally more worried about the fact that most of the code will be wikipedia/blogs-quality and filled with bugs, edge cases and performance issues.


Large companies don't worry about this because they already have tooling that scans code and matches it against other public code out there (to avoid legal troubles if some dev quietly copies some GPL'd code or something like that). Everybody else basically has to manually check every snippet produced, which makes all the supposed convenience moot.
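Conceptually those scanners work something like this sketch of my own (real tools such as Black Duck use token-level fingerprinting and fuzzy matching; the C snippets here are made up): hash normalized k-line windows of known open-source code, then flag any window of new code that collides.

```python
import hashlib

def fingerprints(code: str, k: int = 3) -> set:
    # Normalize whitespace so trivial reformatting doesn't hide a match.
    lines = [" ".join(l.split()) for l in code.splitlines() if l.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + k]).encode()).hexdigest()
        for i in range(max(0, len(lines) - k + 1))
    }

gpl_snippet = """
int cmp(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
"""

pasted_file = """
static void helper(void) {}
int cmp(const void *a, const void *b) {
        int x = *(const int *)a;
        int y = *(const int *)b;
        return (x > y) - (x < y);
}
"""

index = fingerprints(gpl_snippet)
print(bool(index & fingerprints(pasted_file)))  # the copied window collides
```

Running something like this over every Copilot suggestion is exactly the manual-checking burden that makes the tool's convenience moot for smaller shops.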

Between that and the typical quality of Copilot snippets, it's pretty obvious who the real beneficiary of this technology is: large sweatshops like Infosys.


Buggy code can easily come from copying Stack Overflow or even from GitHub open-source repos, so I doubt Copilot would contribute to it any more than those already do.


It doesn't void anything. If you use Copilot to copy some licensed code illegally, you are the person in breach, not the tool. People using Copilot are possibly littering their code-base with future copyright liabilities, and they'd have no idea about it. Until someone writes an AI to find infringing software and automatically sue them...


Yes, that's the other outcome, which is the industry getting GPL'd to death for using this. But I don't think there is an established precedent when it comes to this.

I don't know much about AI/ML, but I think the logical outcome is that using a transformer to generate source code will produce an over-trained system that emits sections of code verbatim, because of the relative size of the space of all valid programs vs. the space of all possible programs. It's not like art, where you get soft failure if a single pixel or word is wrong: the inclusion, exclusion, or replacement of a single instruction or symbol is enough to introduce fatal bugs into a computer program. If the system doesn't have the capacity to understand programs generally (which it probably won't if it's just a transformer), then you're going to end up with a system that spits out the samples from its training data that do work.


> to copy some licensed [anything] illegally, you are the person in breach, not the tool

ahem napster, pirate bay ...


It’s fairly obvious if Copilot is regurgitating entire blocks of code that someone else wrote.

In 999 out of a 1000 cases it’s just spitting out boilerplate though.


If you told the legal team at any mid-sized or larger company "we're pretty sure only 1 in 1000 lines of code our developers write breaches someone else's copyright" there'd be some serious hell to pay.


No, no. 1 out of every 1000 lines of code has the potential to breach some form of license.

Someone would have to be motivated to search through our entire (proprietary, private) codebase, match it against repositories that are freely available, be properly motivated to make a problem out of it (some Twitter randos?), and, most importantly, gain some benefit from spending hundreds of thousands of dollars engaging with our legal team.

By the time you satisfy all the conditions required for it to be an issue you are talking nation-state actors.


Yes, but if you have a large team you'll be using it hundreds of times a day. I will not be surprised if Copilot indemnity insurance is a thing in M&A in five years.


Yep. It would be useful if more people in these conversations had hands-on familiarity with the tool. I don't blame them; that shouldn't be expected or required. But there is a large gap between how bad this looks and how materially bad it is once you take into account how Copilot is actually used.


A lot of M&A activity uses tools like Black Duck that do exactly this; they flag partial matches.


I don't think there will be a ruling like "anything from a neural net is yours", that'd be a bit ridiculous for very obvious reasons.

I'm no copyright law expert and I'm certainly not a lawyer, but it seems to me that in your examples you're setting out to violate copyright as a goal, which seems like it would be a factor in a court case.

To answer your last question, your examples are pretty clearly distinguishable from Copilot in their final states that you describe. IDK exactly *when* during overfitting that line is crossed, maybe it's crossed the moment you personally decide to knowingly publish copyrighted content and has nothing to do with the neural network itself?


This doesn't make sense: a byte-identical copy of another work is obviously not transformed, so anyone claiming fair use who relies heavily on transformation would fail.

But the only thing that does is make factor 3 of the fair-use test harder to clear. Not impossible.

There are fair uses of copyright that use the entire identical work as is.



> Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

If you, even only once, steal lines of code that you have no license to use, and use them to make money, that's the same exact thing. "Trusting the algo" and saying "whoops, I'm sorry" doesn't make a strong legal defense.

In a company of 1000 programmers, what are the odds that Copilot increases the risk of using improperly licensed code, because "well, Microsoft gave it to us, so it has to be legit!"?

And sure, stackoverflow copying is a thing, but they clearly tell you the license by which you can use said code: https://creativecommons.org/licenses/by-sa/4.0/

If copilot gives you CC-by-sa code, will it tell you so you can properly credit?


> And sure, stackoverflow copying is a thing, but they clearly tell you the license by which you can use said code: https://creativecommons.org/licenses/by-sa/4.0/

There are posts under an earlier license which was CC BY-SA 3.0.

There are people who don't have accounts anymore or haven't logged in to accept an updated license.

Only the changes made to a post after the 3.0-to-4.0 switch are technically licensed under 4.0 (the original post is still under 3.0).

Furthermore, Stack Overflow didn't follow the proper process for updating the license.

https://meta.stackexchange.com/questions/333089/stack-exchan...

For example - https://stackoverflow.com/posts/11574647/timeline

Look at the license and the Aug 22 change and consider if that removing "Hope that helps" was a sufficient change to relicense it.


> what are the odds that copilot increases the risk of using improperly licensed code

In a company of a thousand programmers there are much easier ways to find improperly licensed code.


Not all lines of code are made equal under the law. If they were then Oracle would have a copyright on the Java API. Fortunately they do not.

So, no. You can in fact "steal" several lines of code and use them to make money and be legally clean as a whistle. It isn't that clear cut.


Oracle DOES have a copyright on the Java API. Google's use of it was found to be fair use, but the SCOTUS did rule that it was copyright infringement. https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf


The SCOTUS decision in Oracle v. Google didn't rule on API copyrightability. It merely assumed that the code in question is copyrightable, then showed how it's still fair use even if so, thus making the first question irrelevant to the decision. And this is very much intentional; they spell it all out:

"Google’s petition for certiorari poses two questions. The first asks whether Java’s API is copyrightable. It asks us to examine two of the statutory provisions just mentioned, one that permits copyrighting computer programs and the other that forbids copyrighting, e.g., “process[es],” “system[s],” and “method[s] of operation.” Google believes that the API’s declaring code and organization fall into these latter categories and are expressly excluded from copyright protection. The second question asks us to determine whether Google’s use of the API was a “fair use.” Google believes that it was.

A holding for Google on either question presented would dispense with Oracle’s copyright claims. Given the rapidly changing technological, economic, and business-related circumstances, we believe we should not answer more than is necessary to resolve the parties’ dispute. We shall assume, but purely for argument’s sake, that the entire Sun Java API falls within the definition of that which can be copyrighted. We shall ask instead whether Google’s use of part of that API was a “fair use.” Unlike the Federal Circuit, we conclude that it was."


The tragic irony here is that the understanding of copyright that those that do not like Copilot put forth would indeed make things like Java's API copyrightable and to the obvious detriment of innovation.


I don't see how this is related. The question wrt Copilot can be distilled down to "what constitutes a derived work", but there's no doubt that the original source code that Copilot was trained on is copyrightable. Conversely, with Java APIs, there was no doubt that Google's use of them produced a derived work - the question was whether the original is copyrightable and/or whether that is fair use.


> the SCOTUS did rule that it was copyright infringement.

In the US, "[T]he fair use of a copyrighted work ... is not an infringement of copyright." (17 USC 107, Oracle at 14). Whether there is a difference between a finding of no infringement or infringement with no liability is academic.

Practically, the copyright on the Java API is commercially worthless because after Oracle anyone may freely copy any and all of it and use it to compete with its creator.

Any other software vendor who thinks it has platform "lock in" because its customers built to their API should take notice. (e.g., Amazon AWS).


> That sounds cherry-picked. Just because some high profile, highly-retweeted person says something doesn't make it not cherry picked.

It's his code. This "high profile, highly-retweeted" crap is an appeal to emotion. He has a specific and legitimate interest in protecting his own intellectual property. It's not "cherry-picking" to report a crime being committed on your front lawn.


So why is he complaining about copilot and not the thousands of GitHub repositories redistributing his code with improper license? Typing the little code snippet he showed into copilot is analogous to typing it into the GH search bar and grabbing a properly-licensed result.


> It's not "cherry-picking" to report a crime being committed on your front lawn.

And as soon as he sees someone actually take the chair from his front lawn he can report it as a crime. He cannot report the people walking past because they could potentially steal his chair.

One could even argue that if he didn’t want his chair taken, maybe he should have locked it in his shed.

Of course, these chairs duplicate, so it’s not as if he loses his own chair.


Redistribution of that code, absent its license, is a violation of the license. That's already happened.


I'm done with this thread.

No, it is not.

For it to be a violation you have to lose a court case. To lose a court case a court has to find against your fair-use defense.

A fair use defense is fact specific to the parties involved. What's fair for you might not be fair for me.

Only a court can determine fair use.


Sure, duh, this discussion does not constitute a legal ruling. Nobody here is a lawyer nor a judge presiding over the case. We're potential subjects of a class action suit discussing grievances and merits of the case.


Microsoft Copilot is easily the greatest theft of intellectual property in the history of man.

You want to use my code, without ever knowing I wrote it? You want to use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation? FUCK YOU

There's no need to defend something so obviously harmful, so why do you do it?

The law should be amended to make this kind of theft illegal.

It's not ambiguous.


It is a code laundering tool so corporations can steal open source code rights. It's the plainest thing I've ever seen. And they charge for it, on top of that. The deniers are out of their minds if they can't see what's coming.

This has the potential to severely damage open source - I would not host my open source project on GitHub, especially if it was copyleft. I'm sure many others wouldn't either. Some of these authors make amazing software that we might not see because of this.


Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music, so it can be used ("written", "painted"), in whole or in part, by everyone else, without attribution, then go ahead and opt in.

Otherwise, you can't use the artist's or author's creative work for training.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.


Are you saying the act of training the model itself is theft? Or you’re saying that using it is theft?

You can have a totally legitimate business making hacksaws and bolt cutters.

Now if your customers use these tools to break into homes and steal things, then yes, that’s illegal.

But making the hacksaws and bolt cutters is not.


How is me building a novel application consisting of manually linked libraries and source code any different than building a novel application out of what Copilot generates? The difference is that Microsoft is pretending attribution and source licenses don't apply to the code it generates, even though it would in any other context.


The difference is that you could be unwittingly taking on liability in the copilot case. So that's fun. But also in the copilot case, microsoft distributed the code to you without a license, in violation of the license.


> Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

Are you saying that because there are millions of copyright violations, Copilot is too big to fail? Or are you saying that Copilot is too big to be held accountable for flagrant violations of the law?

I guarantee Copilot’s developers knew it was spitting out verbatim code. It’s too obvious, and probably would result in a perfect rating for the prompt.


I interpreted it as them saying only a tiny, tiny fraction of Copilot completions violate copyright.


Yeah, but that doesn’t matter does it? That’s like saying, “Well, your honor, most of my cars aren’t stolen vehicles.”


Yeah, degree generally matters. Airplane flights rarely kill people, so we allow them. If 50% of flights resulted in death, we would not. Google searches rarely illegally return copyrighted content, so it's allowed. The Pirate Bay searches often return copyrighted content, so regulators keep shutting it down.

Stealing a car is a large degree of crime for an individual. If they had stolen a penny, it would be a small degree of crime and we'd be more willing to let it slide.


I'm sure you see you've been downvoted. But I want to say I agree with you on the millions / billions bit.

While I understand the rub about licenses, the fact is the vast majority of code is not all that original or unique. Some fringe amount is, and those edge cases are worth discussing.

But the rest? All in all, likely not that special. Yes, we get paid good money to do it. But is that a function of what it takes to do the work, or of the demand for the skill (relative to the supply of that skill)?

Frankly, I think some ppl just plain ol' fear Copilot. And either don't want to admit it, or they have buried that fear. I'm not advocating ignoring the law / licenses. But putting a license and lipstick on what is an everyday pig doesn't make that pig a unicorn. Does it?


This example is one of those rare pieces of code that is special though. It's the product of years of deliberate work by professor-level academics. This is exactly the kind of person who would have the least to fear from copilot if it really was just automating the boring plumbing parts and not shamelessly copying high-value, creative, insightful code.


I understand.

But that's not the type to fear Copilot. Yes, they might object to the license violation. I get that. I acknowledge that. But when you're that intelligent and that creative you don't fear being replaced - displaced? - by something like Copilot. Nah. That's a fear for the mundane and the common. That's a fear for the rest of us.


Right, so the fact that this is the person complaining implies that it's not about fear at all, and is likely a far more legitimate concern.


Or that the professor has a legitimate concern, and a lot of people in the comments here don't have any code being stolen and are just afraid of Copilot.

It can be both things. (I'm not endorsing either view, just trying to clarify.)


Fear here on HN.


Suppose you wanted to do what some code does, then you see this LGPL code. What can you do? Adjust variable names and play with line spacing and comments until it feels different?


First off, that's a library of pedagogical implementations, so I wouldn't even want to copy it -- I'd prefer a library focused on performance. Second, it's linear algebra, there are alternative implementations and libraries out there. Third, it's covered by the LGPL, so I'd be perfectly happy to link to the library. Fourth, I'd look up a pseudocode description and go from there. In no case would I sit down with another person's implementation and give it the undergrad treatment to pretend that I'm not copying.


"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


So if it's fine to copy the pseudocode implementation of a non-public-use codebase, then you couldn't possibly object to recreating a codebase in a different language, right?


> Suppose you wanted to do what some code does, then you see this LGPL code.

Suppose you want to express what some other writing does, and then you see that writing? What can you do?

(You write your ideas in your own words, and quote and cite your sources)


But I don't care about the expression of the idea. I just want the idea to work. And I don't know how to do the idea myself. And I've seen how you've done it.

If I want to describe life with a nature metaphor, and then see you do it with a waterfall, I can probably get away with using a waterfall to the same metaphorical effect in my story.

Can I do that in code?


Many open source projects don't allow contributions from people who have worked with similar projects under incompatible licenses. I remember https://github.com/cisco/ChezScheme/pull/376#issuecomment-45... and https://wiki.winehq.org/Developer_FAQ#Copyright_Issues


Write it yourself


I wrote it myself and it came out looking very similar. What do I do?


So, then, there's no way to know whether someone wrote it themself or "copied" it verbatim.


There is legal basis for determining if software is copied.

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...


I would love to see how that applies to Copilot!


It's a violation, but this example doesn't jibe with how I use Copilot, for whatever that is worth. Usually Copilot takes into account context and integrates the various bits into a whole that fits in with the surrounding code. The "empty file, type function signature" use case feels somewhat dubious for trying to understand actual, not theoretical, harm. Though I respect this person's right to defend their rights.


It’s a contrived example. This is not how people use copilot in the real world. Copilot is designed to be used in context. The completions it provides in context are specialized to your project.

Artificially starving copilot of context and then showing that it recites parts of its training dataset is mundane.


Here is this guy's function copy-pasted on a SO question:

https://stackoverflow.com/questions/17913191/using-typedef-i...

Found it after 5 mins and a couple tweaks to the search terms.

Another:

https://vdoc.pub/documents/direct-methods-for-sparse-linear-...

Someone copied this guy's book and put it on scribd: https://www.scribd.com/document/514019650/Direct-Methods-for...

Someone put it on a "personal" edu page:

https://people.sc.fsu.edu/~jburkardt/c_src/csparse/csparse.c

A modified version of it here marked as open-source:

https://github.com/rwl/CSparse.py/blob/master/csparse.py

More:

https://tonus.pages.math.unistra.fr/schnaps/schnaps/csparse_...

Google search used to find them:

https://www.google.com/search?q=Sparse+matrix+addition+%22ch...

Could probably find more if I looked harder.

Side note, looks like in a lot of places people do the ""proper""-ish thing and leave this guy's name on the code.


From what I've seen on the art side of things, the more a certain work has been copied in the real world (and thus in the training set multiple times), the more likely it is you're able to get a very close copy out of the model with the right prompts.

For example, I'm pretty sure this is why some models turn up a near exact version of The Girl With The Pearl Earring.


Other people doing it doesn't make it okay.


Jumping into this but I've honestly lost my train of thought, lol

That said... is that any different from someone copying and pasting into their code vs copilot doing it?

If someone randomly pastes code that has a copyright, and people use it, how are they supposed to know they shouldn't be using it?

I imagine we're talking functions here though. Not sure if Copilot would reproduce entire libraries unprompted if they're not open-source, anyone have an answer?


Nah, mostly what we've learned is that AI has become lazy enough that it goes out to Stack Overflow just like the rest of us.


[flagged]


> Edit: Downvotes mean you disagree with reality.

Alternative theory: you're overconfident and wrong, and haven't even read the tweet I shared. CoPilot is observed to reproduce entire files (less two lines), down to variable names and comments, verbatim, from a repo that's covered by the LGPL.

If verbatim copies are not covered by copyright, then software can effectively never be copyrighted. And that would be an extreme departure from "reality."

Regarding your edit: this is a verbatim copy. The process you're describing is actually strengthening, not weakening, my argument: it allows material dissimilarity in the face of an overall similarity. Whatever process is meant to be applied to the licensed code and the allegedly infringing code, it will produce identical output given identical input. You're attempting to claim that the entire file would be deleted by the process and that the remainder would be an empty comparison, which would (granted) be covered by fair use. And since CoPilot has been shown to redistribute multiple files, your "case" would hinge upon the entire repository not being covered by copyright. This is beyond absurd.

edit 2:

> Or don't! Just keep downvoting in support of your fantasies! Wheeeeee!

Please review the site guidelines. You're both whining about downvotes and sneering at the community with this. Probably time to take a break from the keyboard.

edit 3:

> Care to address the content of my claims?

Probably time to take a break from the keyboard.


From your second link:

> In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program.

Copilot is alleged to be reproducing blocks of code verbatim, which fall into the expression side of the idea/expression distinction, which by your own links and statements appears to show that the allegations against Copilot are not false.

I'm not going to say you're necessarily disagreeing with reality (I'm not really sure what that's supposed to mean) but you're certainly contradicting the evidence you've brought.


This quote:

> In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program.

Seems to be trying to force everything associated with programming into a false “expressive/abstract idea” divide. Abstract ideas are distinct from expression and not subject to copyright, but not everything outside of the scope of copyright is abstract rather than detailed: notably, functional elements.


No, it's an illustration of two ends of spectrum, and only one of three parts of AFC. All I'm saying is that the original comment gave us three links and that they didn't strictly agree with what the comment said without further elaboration.

I will say, though, that the allegations of basically copying and pasting exact instances of substantive portions of copyrighted codebase appear to be at one end of the spectrum.


> If the code was covered by license in the first place

the comment you're replying to stated that it was covered by the LGPL.


Slapping an LGPL on something does not mean that the utilitarian aspects are covered by the license!

---

Yes, I know that comments are expressive. They are clearly not utilitarian.


Things like code comments are clearly expressive and not utilitarian (you don’t need comments to compile code and you can express the ideas of these comments with different verbiage without loss of efficiency).


For what it's worth I've found the links you've provided very interesting and insightful! It's bringing back bits and pieces of some of the software license training I've got at various jobs in the past.


> What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models?

A better future to me. I don't want pictures of my face training ML models, nor do I want my art, or my code. I don't want my face to be more recognizable to AI, and I don't want my work to contribute to the consolidation of power to a few big firms. And for what, what can ML models even bring me besides surveillance? Cool art? Text-to-speech?


If they hire photographers to take photos of people in public and use them for training, there’s no law stopping them really. Your only real hope would be to always walk around in a burqa.


There are laws covering that use case. It just depends on the country. Assuming your country's laws are the law everywhere is a bit of a fallacy.


Good point


IDK how generative models can really be used for surveillance?

Certainly facial recognition models etc. can be, but those would seem to be appropriately covered by the Google Books ruling dealing with discriminative models: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.... They've also been a thing for quite a while, so I think that cat escaped the bag a long time ago.

Remember clip art in Microsoft Word back in the day? Now there is an infinite supply of that. Stock images? Infinite supply. Solo filmmakers are going to have a much easier time creating their own films that can rival the best movie studios in the world. Any text anywhere will be read to you in any voice or voices you like, with tone and setting appropriate sound-effects. Smaller things will just be better too, noise cancellation on microphones? Easy and free. Image editing? Trivial to remove, relight, reposition, etc, etc, etc.

So many other things too. It's going to be magnificent. If you're not into it, then I guess to each their own, but I do think we are looking at something that can be a net good for everyone in the world, so long as it's available and cheap for everyone in the world.


Hear, hear. I've never empathized with the Luddites more than when discussing this.


what if a service could tell you everywhere your photo was on the Internet?


Such a service would be fantastic for stalkers, letting them bypass the usual difficulties in locating someone who has done their best to excise them from their life.


What if a service could scan surveillance videos and show everywhere you have been ever?

Who here is doing a startup to secure licensing rights to every company's surveillance camera videos, to make the AI/surveillance version of Equifax's Work Number? Maybe you offer to give them the surveillance system for free in return for the rights?


You mean google image search?


can’t identify your own face with GIS


>OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

I don't fucking care, I'm not in the business of competing with OpenAI or whatever. If you want to launch an AI startup but you can't, that's your fucking problem, not mine. I just don't want them violating the licenses of the open-source programs I have created.

> "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

It proves that copilot has the capacity to copy existing code without fulfilling the requirements of the license. I don't care if it's "cherrypicked", this shouldn't happen under any circumstances.

> I wish the conversations around generative AI could be...just better.

I wish that these people making all these complicated language-comprehension machine-learning systems could read the fucking license statement at the top of the file and copy that license statement along with the code. This ought to be a solvable problem. I'm pretty sure I could write a bash script that does it if M$ is looking to hire.
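To sketch that idea (purely hypothetical: the marker patterns, the 30-line window, and the file names below are my assumptions, not anything Copilot actually does), a few lines of shell can pull a license header off the top of a file so it could travel with any copied snippet:

```shell
#!/bin/sh
# Hypothetical sketch: extract common license markers from the top of a
# source file so they could be carried along with an emitted snippet.
# Assumes the header sits in the first 30 lines, as is conventional.
extract_license() {
  head -n 30 "$1" | grep -iE 'SPDX-License-Identifier|Copyright \(c\)|Licensed under'
}

# Demo on a throwaway file with a made-up header.
cat > /tmp/demo.c <<'EOF'
/* SPDX-License-Identifier: LGPL-2.1-or-later
 * Copyright (c) 2006, Example Author */
int cs_add(void) { return 0; }
EOF
extract_license /tmp/demo.c
```

A real system would obviously need to handle headers that don't use these exact markers, but the point stands that the easy cases are easy.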


The removal of the license from the code it learned from is the real tidbit that everyone needs to focus on. This is where the laundering comments come from.

The product would be useless if it prompted you with license approvals. They didn't care and removed them. They consciously decided to prioritize their paid-for product over the rights of their users. I'm amazed that MS's lawyers allowed it out the door. That's the even scarier part.


> What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models? Where you have to opt into models or can opt out (and many wind up doing so)?

A future where technology is developed in accordance with longstanding law? Also, maybe a future where my copyrighted works are compensated for when they're being used to automate my job away? If the music industry can deal with royalties, maybe software can, too?

> OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

I'd expect injunctions against Microsoft/OpenAI from further training CoPilot with inappropriately-licensed code. I'd expect damages for all of the instances of copyrighted material that CoPilot regurgitates.

> Most of these complaints seem to be extremely emotional and cherry-picked. "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that), "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

How are these emotional? They're opinions, like all legal claims, supported by facts. It doesn't really matter if they're "cherry-picked" or not, only whether CoPilot actually violated copyright. If it gives seemingly novel results 999 times out of 1000, but in the other case verbatim generates copyrighted material without proper permission, then that is a copyright violation. At scale, that's a copyright violation with some modest damages even.


> If the music industry can deal with royalties, maybe software can, too?

Maybe this is not the industry to emulate: https://www.theatlantic.com/business/archive/2011/11/how-mus...


My 2 cents,

1. I think if you don't want your code re-used in CoPilot you should have that right

2. I think if CoPilot gets smart enough that it can read your open source code and then reproduce the algorithms without copying your code that should be fair use. It's the same thing a human would do. AFAIK CoPilot can not do that but I can certainly imagine it's not too many years away from that.

3. I think I would opt into sharing all my open source code mostly unrestricted with services like CoPilot. I think the group of people that choose to share their code with AI will do better over all than those that lock their code behind licenses.

Note that I'm referring to snippets of code. I don't know what a good definition of snippet is. In other words, if AI helps me write chunks of 10-100 lines at a time I don't see a problem. Effectively, S.O. answer level of snippets. Whereas, if I tell AI "create LibreOffice" and it clones the millions of lines of code, I think that is a problem. I don't know where the cut off is.


> 1. I think if you don't want your code re-used in CoPilot you should have that right

People already have that right - all you have to do is not host your code on GitHub.


Someone could still take your code if it's hosted elsewhere and put it up on GitHub, at which point it gets sucked into the blackbox that is Copilot


No, the only way is to make your code closed source.


What you describe is what Copilot already does most of the time.

The examples where people can show it reproducing snippets of code are more the exception than the rule. And they are usually produced by people who are trying to manipulate it into proving that it can reproduce copyrighted code.

Some people tend to think of it as a search engine, looking through its database for relevant snippets for the current situation and regurgitating them unmodified.

But that's really not what it's doing. It's more like the AI autocomplete on your phone, but for code.

It might not be able to understand the algorithms. But it seems to be able to adapt simple algorithms that it's seen multiple times in its training data to match the surrounding code (in style, naming conventions, and actually using the variables/functions you already have).

I don't have numbers, but from my experience, I would say it generates unique(ish), non-infringing code at least 95% of the time.

The only question is what to do about the other times when it does occasionally output potentially copyright infringing code, either by accident, or when it's forced.


Any law where the penalty is a fine only exists for the poor. Any regulation where the penalty is in the millions only exists for small businesses.


I guess that's true. If you consider laws to be strictly transactional then you can totally do the crime if you're willing to do the time.

I'm just not convinced by the idea that any penalty less than death isn't a penalty.


Lots of distance between small monetary fines and a death penalty. I would settle for executives and board members going to jail when they do crimes.


Any penalty less than the profit isn't an effective penalty and won't act as a deterrent.

The Securities and Exchange Commission (USA) has a history of giving million-dollar fines for crimes that produced billions in profit and/or took billions away from victims. And the lack of deterrence has been reflected in the actions of the US financial industry.


Penalties that scale off the offender's income/revenues work a lot better. They're common in some countries.


It's not popular to say, but I agree to some extent here too. We may need a wholesale reimagining of copyright/patents in many places to accept the new reality, both in building the tools (data to train on) and in accepting the occasional bad output (a copyrighted/patented function appears in the output). I think watching the law evolve with the tech is going to have a lot of ups and downs.


I think you hit the nail on the head. Our laws and rules were created for a cultural context that is quickly becoming outdated. I feel there are many valid criticisms of AI today, but demonizing the technology doesn't allow for fruitful discussions. We need to evolve our thinking and we need to be open minded to do so first.


What makes the rules outdated? The fact that you want to get away with what they were designed to prevent?


Yes, I like doing things that people prevent me from doing.


If they continue that path, the future will be that OpenAI, Microsoft, Google etc. will pay larger and larger fines at least in the EU, until they are blocked entirely.


Which may be entirely justifiable

Earlier HN thread today on a large chunk of OS code pasted almost verbatim by the CoPilot engine into a project, but stripped of any licensing references.

Within the last few days, another thread on artists who have spent decades developing a unique and valuable style are making parallel complaints about Dall-E/SD/etc., where inputting "Xyz in the style of [Artist]" produces exactly a copy of [Artist]'s unique style, barely distinguishable from the original.

These engines are fairly literally giant collage engines, able to parse language inputs and output a collage of the input works. Maybe some are small snippets so it could be fair use, but they are also evidently capable of outputs of a far larger scope, amounting to wholesale ripoff.

Opting out or not posting on Github or whatever prevents nothing, as stuff is posted everywhere by many, and with code, it's totally legit posting a fork under OS licensing.

Is there a solution analogous to a <NoRobots> flag? How do we verify it? Will there soon be HaveIBeenUsedAsTraining adversarial systems to probe these output engines?

Not sure of the solution, but this seems to be rather rapidly overstepping the boundaries of creators.
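For what it's worth, a <NoRobots>-style opt-out could look something like robots.txt does for crawlers. To be clear, the file name and directives below are entirely made up for illustration, and no training pipeline is known to honor any such convention today:

```
# /.ai-training.txt  (hypothetical, by analogy with robots.txt)
User-Agent: *            # applies to all training crawlers
Disallow-Training: /     # opt this whole site/repository out
Allow-Training: /docs/   # except material the author chooses to donate
```

The hard part isn't the format, it's enforcement and verification, which is exactly where the HaveIBeenUsedAsTraining-style probing would come in.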


Huh, I wonder if they decided to remove the licenses and other comments from code before training on it. That would almost be necessary to avoid comments ending up inside of code.


And the EU will continue to fall farther and farther behind in software development.


Yeah, regulation is definitely the only reason Europe is behind. /s


What are some other reasons?


If that’s what it takes to uphold EU citizens’ legal and moral rights, so be it. People said the GDPR would hinder business too.


While continuing to represent individual rights? That sounds like a good compromise to me.


> "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that)

I'm not sure what the argument being made here is. If you make something opaque enough that no one can tell if it's violating legal rights, no one is allowed to say anything about it? This seems uncomfortably close to "it's only a crime if you get caught"


> "look, I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

So what? I’m not being snarky: does that actually make any difference, legally?


Well, what does the future where those materials are free use look like?

You argue that desiring ownership of works that you created is “extremely emotional and cherry-picked”, but you do not provide a compelling argument why artists, photographers, and indeed programmers should be excluded from the conversation when it is their art, photography and code that is being appropriated in the first place.


I think that future has a lot of really cool and cheap tools that will help all those artists, photographers, programmers, etc get even more out of what they love doing. I think there will wind up being some job markets that shrink (not with the tech we have now though) for small-medium things in all of these fields (think logo design, stock photos, client libraries, simple out-of-the-box applications) and my hope is that those job markets that shrink cause others to grow due to the increased levels of productivity that these tools will give us.

Ultimately my argument is that these tools will allow human beings to accomplish more things with less and that these tools should be distributed to as many people as possible for as little cost as possible. Part of that belief comes from the fact that I think these tools are coming no matter what and I'm slightly concerned about the potential (although unlikely-looking) future where a small number of large corporations are the only ones controlling these tools and they just rent-seek on them.


The job market for cheap rehashed garbage will grow and quality productions will suffer.

For programmers, the job market for loud-mouthed posers and plagiarizers will grow and quality will suffer. But those programmers will be fluent in marketing speak.


In a world where the cost of producing cheap rehashed garbage approaches 0 why would the job market for cheap rehashed garbage grow at the expense of the expensive unique gems market?

There will definitely be more cheap rehashed garbage online and we will be forced to invent tools to wade through it. I actually look at that as a bright side because there's already a lot of cheap rehashed garbage, we just don't have good tools for wading through it yet because it hasn't become completely intolerable yet.


It's a bit of a bait and switch (obviously not literally, these things didn't exist so nobody was ever promised they wouldn't be used).

But in terms of user behavior, it's rather the same. I used to make more stuff publicly available online than I do now, and the mass-scale surveillance and data modeling that big companies do off of publicly available stuff is a big part of that.

Generally that's how you get walled gardens - by abusing the commons - but here you'd need not just a walled garden but a TINY TINY invitation-only one if you don't want people doing mass surveillance and data modeling (CoPilot is really more of the latter than the former, but any of this "scrape the whole internet" stuff is just a tiny sidestep away from being used for more blatantly evil surveillance purposes: here we're training a generative model, there we're de-anonymizing everything you've written anywhere...).

Is there a good solution to "BigCos are gonna do whatever they want with the shit you make" other than invite-only, paid-content type models?


I'm expecting to see dual-licensing used as precedent here.

Github/OpenAI should have to pay a licensing fee to use GPL and similarly-licensed code in their closed-source derivative IP (CoPilot).


I think I agree with this comment[1] from the other thread; never previously thought that a process being transformative means input and output datatypes do not coincide, but maybe that is it.

1: https://news.ycombinator.com/item?id=33240681


That's an interesting take, the whole level of indirection thing between Microsoft-OpenAI and StabilityAI and that research group is certainly another dimension to this that sort of muddies the waters.


very interesting indeed


I foresee licenses that contain 'upon training an AI network with this code, you give us an irrevocable license to your source code and IP' clauses.

That's assuming they even get off the ground, given that their Copilot system is already suggesting vulnerabilities and license traps.


I personally expect the law to end up with a "safe harbor"-like situation. Consider YouTube. Occasional copyrighted content does not make YouTube illegal, or able to be held liable for slip-ups. See the DMCA as well, which requires takedowns upon notice, but otherwise provides near total indemnity for user-generated content.

Because of this, if the law looks at GitHub Copilot, I would expect that they would find Copilot to be A-OK despite the occasional regurgitation of copyrighted material that isn't fair-use, as long as it is removed upon request.


There are plenty of regulations that only apply to companies with more than X employees, etc. What makes you think any way to improve the law would necessarily harm individuals?


> that's not how it's going to be used in practice

If I'm writing some code and want the suggestion to be good then why wouldn't I use the name of a top programmer as a prompt?


I don't think that having emotional discussions around a technology that does no less than provoke human emotions on command is low quality.


It creates a body of knowledge that everyone can use and can't be sued over, since it would be the industry-standard way to do things.


> It creates a body of knowledge that everyone can use and can't be sued over, since it would be the industry-standard way to do things.

That's just not true. If the "industry standard way to do things" is to violate other peoples' copyright, then everyone doing that absolutely can be sued.

And while it's not clear if using these AI tools constitutes copyright infringement, it looks to me like there's at least a very strong case that could be made.

And at up to $30,000 per work in statutory damages (register your code with the copyright office if you care about this issue!), that starts to add up very quickly. Even for a company like Microsoft.


You could still train your AI on, for example, Wikimedia Commons and simply add the required license to the output of your model.


There does seem to be more heat and noise than substantive discussion here.

Though that doesn't justify such a dismissive attitude towards ordinary HN commenters. The way it's written implies that most are too stupid and overly emotional, which is more likely to fuel complaints than douse them.


You should read the article before commenting.

The WAY it was done with Copilot is the problem: no attribution, just shoving all legal liability onto the end "programmer" without providing the attribution required TO COMPLY WITH LICENSES, as the diligent programmer tries to clear all the code Copilot handed them without any metadata.

Go read the article before arguing further, please. Otherwise you are wasting all of our time.


I did read the article first. Start-to-finish. As others have pointed out, it's a very visually appealing website.

I hope that when a case on generative models hits the courts that it's found that training on data from the Internet counts as fair use. I hope that for the reasons I laid out in my comment, because I think that if it isn't then we are all in trouble since these tools will STILL EXIST, but they will be in the hands of the few instead of the many. My main reaction is to how short-sighted it seems like the authors and many others are being about this technology in general. They seem to think they can just wish it away.

I also think that training on data from the Internet is fair-use, but I'm not a lawyer and I haven't studied the law extensively, so who cares what I think about that.


The only waste of time in this thread is people making allegations about copyright infringement without applying a fair use test.


What's with the default to "if it's not explicitly legal, it must be illegal"?

Imagine if every new piece of software you wrote had to be tested for legality because you don't know that it's explicitly legal. Oh, there aren't laws for this new thing, so I guess you should challenge yourself all the way to the Supreme Court?

I get the author not liking Copilot, but I don't see that GitHub/Microsoft have any kind of obligation to figure this out just because they're GitHub/Microsoft.

If I as an individual had this obligation placed upon me I'd just never write any more code.

Ultimately I think, like open source, Copilot and the tools that will follow advance human progress in novel ways. Software getting easier to make is a good thing. If you don't like this particular implementation of something helpful, feel free to start an open source alternative without challenging yourself in the supreme court.


> What's with the default to "if it's not explicitly legal, it must be illegal"?

That's not how I interpret what's happening.

People who produce things have rights over their products. Be it artists, craftsmen, inventors, entrepreneurs or coders. There is a legitimate question here as to whether CoPilot has infringed upon those rights. I don't see it being about "making something illegal." I see it about answering a valid question as to whether CoPilot is liable for measurable damages caused to creators under existing laws.


A few snippets of code are not a product. If there was an open-source money-making product and someone built a competing product using considerable help from CoPilot, then that is a stronger case for damages than if someone just used some snippets of code in their own product.

But at that point, it would be just like someone cloning the GitHub code without following the license, and in that case it should become obvious that there is a clear violation harming the creators. But in most use cases of CoPilot, wherein people are just using it to build their own product, I doubt there is a cause for damages.


Music samples are a natural parallel.

You cannot sample music without permission no matter how short the sample may be.

Similarly you cannot steal a snippet of someone else's code without permission or the correct licensing.


Eh music sampling is a unique case because of the dual intellectual property concerns (the composition and the recording).

I can’t use a snippet from a recording no matter how short but I can use a tiny snippet of a composition. You can’t copyright a single note.


So code and mechanical music (i.e. a recording) aren't exactly the same, but code and music compositions are more similar. Can you use a tiny snippet from a composition and still infringe copyright? Yes, you can. Maybe not however short, but there will come a point.


> You cannot sample music without permission no matter how short the sample may be.

Which is a blatantly mistaken court ruling and one which I will not enforce if I am on a jury in such a trial.


I'm using the word "product" to mean "something which was produced." I searched for a few definitions because I thought they might actually be different forms of the same word. Turns out I could be wrong about that, but that was my intent. Something that you produce is a "product" of your time, effort, labour etc. Doesn't matter if it's something that you are selling or not. Doesn't matter if it's a relatively small production. The point is you produced it. It is yours.

The question is whether the courts will find damages. Everything else has nothing to do with my comment. You might have your own ideas and opinions, which is fine. So do I. Both are irrelevant. The point is that there is a legal question here that the courts alone are equipped to answer.


A few snippets of a book also aren't a product, and yet they can absolutely be infringing.


If someone bought a copy of an educational book like "Learning Go" and used some code snippets from it in their own product, that's fair use. But if someone released the book as their own product titled "Learning Go Better" then that is a clear violation.

For open-source projects, CoPilot is in the realm of fair-use for snippets but it can be mis-used just like Github can be mis-used if someone blatantly copies a repository.


Fair use only applies in certain contexts, of which writing commercial software is not one. 'Snippets' are not fair use, and it is shocking how many people here think they are.


People get snippets from Stack Overflow all the time and usually there is no concern whether it is properly attributed or not.

I'd argue that people who open-source code expect other people to learn from it in a small way of snippets and that constitutes fair-use.


Code in Stack Overflow answers is explicitly licensed under a permissive license:

https://stackoverflow.com/help/licensing

Copying code samples from a copyrighted work for use in a commercial product is not fair use.


>CoPilot is in the realm of fair-use for snippets but it can be mis-used just like Github can be mis-used if someone blatantly copies a repository.

Important difference is you don't know where your Copilot snippets come from.


If X and Y, then my point is valid!

But if either of those is not true, then the GENERAL point is true and your point is not.

---

Gais, gais, I downloaded the code using an automation tool called a browser, so it's fair use and not infringing!

yay for technicalities!


As software developers, we deal with specific points rather than generalities. Am I doing something illegal, or is it fair use? That's the important point to keep in mind, and I assume most of the Hacker News audience are software devs.


You also deal with the real world and real people, neither of which operate in binary.

Plenty of people over the years have attempted to skirt the law by putting a proxy in the middle and as such, plenty of clarifications have happened that it doesn't matter.

MS's stance here is specifically that it's the responsibility of the person/company using copilot to ensure the code isn't infringing, it is NOT their stance that the code itself is not infringing.

Their stance is that using it as TRAINING DATA is fair use, so they themselves hold no liability, only their users.

---

It's similar to chicken factories claiming they hold no liability for employing illegal immigrants because those illegal immigrants are the ones who chose to work there. And that they also hold no liability if they go through a 3rd party that exclusively hires illegal immigrants. The law very clearly refutes both stances.


It's better to know, even if you like Copilot and want it to continue.

> I get the author not liking Copilot, but I don't see that GitHub/Microsoft have any kind of obligation to figure this out just because they're GitHub/Microsoft.

Because it's a trillion-dollar company with an infinite amount of lawyers and legal resources?


> What's with the default to "if it's not explicitly legal, it must be illegal"?

Authors have explicitly and deliberately made it illegal for a person (or corporation) to do what Copilot is doing. Doing it through the legal non-entity of an AI changes absolutely nothing; it's still illegal. To say otherwise is to say that "AI-washing" can be used to nullify any law, which is of course totally absurd. The assumption you lead with is not what anyone is actually trying to argue.


Code on GitHub is "all rights reserved" by default, which means that it is illegal to copy by default. Only by adding a license does it get opened up and made legal to copy - though typically, only if you also include that license when you do so.

So if you trained yourself to only regurgitate github code with wanton abandon and careless disregard for licensing, then yeah, you're liable to violate that default copyright, and certainly going to be violating license rules if you're regurgitating large blocks from memory but not their accompanying licenses.

This is the system that GitHub and Microsoft participate in and willingly and purposely perpetuate. They benefit immensely from copyright law and the protection of their code. You can bet that they will damned well avoid letting Copilot anywhere near Windows' source, and they would very much enforce their copyright if Copilot were spitting that code out for the masses to use.


The licenses in question in this issue make it explicitly illegal for Copilot to reproduce their code.


I don't care what your license is, I'm going to use it and I'm going to claim fair use.

What's explicitly illegal about this?


> What's explicitly illegal about this?

The fact that the work is copyrighted, rights withheld in the absence of a license and limited with one. Also, you should know that willful infringement can carry 5x the statutory damages compared to accidental, and spurious claims of fair use would be distinctly unhelpful to your case. Just by posting that comment, you have probably compromised your position in any future copyright case you might be involved in, or you might even have invited one. I really recommend being more careful when anything legal is involved.


> What's with the default to "if it's not explicitly legal, it must be illegal"?

That's what "all rights reserved" means.


This struck me about the article: it's a really bad argument to fault Microsoft for not pointing to a law that explicitly makes Copilot legal. Something not being explicitly legal doesn't make it illegal, although I'm not sure how the courts will rule on Copilot.


If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion. There's no legal argument that would solely target one application of this technology, and you can't build generative AI using current ML tools without relying on a very large corpus of public data. All AI is built on free-riding[0].

While there is no US case law that explicitly says "training AI is fair use", the Second Circuit says that scanning books to make a search engine for them is. And the absolute worst interpretation of AI is that it's just a very well-compressed search engine index for its training set data[1]. So I'm not entirely sure if we can even thread the needle to only ban Copilot or AI training as a whole without also creating harmful precedent for search engines. Actual judges may try, I'm not sure if they'll succeed.

Internationally, the EU already legalized training AI on copyrighted works[2]. So if we do win against Copilot in court, all we've really done is shift AI research over to the EU where laws are already more favorable.

I fully agree that Microsoft is shoving too much liability onto their users, though. And this, again, also applies to all generative AI. My personal opinion with generative AI is that it's a nice curio, but not anywhere close to "production-ready", and Microsoft and OpenAI are trying to sell us on a lie that it's better than it really is.

[0] This also implies that all y'all playing around with image generators are just as much of a freeloader as Microsoft is.

[1] This viewpoint is also called "compressionism".

[2] This was part of the most recent EU Copyright Directive update - the one that added a de facto upload filtering requirement. It also added a copyright exception for museums and historical preservation.


> And the absolute worst interpretation of AI is that it's just a very well-compressed search engine index for its training set data[1].

I don't think this parallel makes sense, because a search engine links to copyrighted works, each of which is still governed by its original copyright, while these AIs create derivative works or reproduce the original works without even attribution.

Indeed, if an AI were just an index for the training set there would be less of a problem, because the origin of a work could be found and its license honored.


> If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion.

I genuinely think that all of them are. But that's not why I'm against them.

We've seen the effects that text and image generators have on the bottom segment of content generation (SEO pages). As the technology matures, it'll displace more and more, in both arts and engineering.


Personally, I draw the line at corporations profiting from the derived works. But if companies want to charge for tools that make these models easier to interact with, that seems pretty reasonable.


>If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion.

Not necessarily. Copilot is a special case because it is using licensed code and the model is a derived function. There is an interpretation where it needs to be open sourced.


Are the examples of stable diffusion exactly reproducing images from its training set?


My view is that Copilot is not stealing open-source code. It is learning from it just as a human reader would. People's disgust is based on the assimilation of what they thought was a human trait being machine derived from their work.

The Copilot service backed by an army of actual humans wouldn't be a story at all. Nor would anyone be angry if an individual offered coding skills as a service, having gone through the exercise of learning a great amount from open-source software to do so.

No open source license was written with this in mind. Because previously learning was something only humans could do and no one had issue with sharing that knowledge. Until licenses take machine learning use into account I see no problems with Copilot.

Source cannot be open if you restrict any viewing of it.


You aren't allowed to just read code and regurgitate it in order to claim it as your own. That is, just because you memorized a great new novel you read, it doesn't mean you can go sit down, hammer it out, and sell new copies. People go to great lengths (see: clean-room reverse engineering [1]) to try to wash themselves of liability.

[1] https://en.wikipedia.org/wiki/Clean_room_design


If the code was purely utilitarian in nature, such as something that was optimized for execution time, there is plenty of precedent stating that the code in question is not covered by copyright.

Do an internet search for “copyright utilitarian” and read up on it if you don’t believe me!

Copyright is about protecting artistic expression which is held in contrast to the useful nature of a work.


Note: In the US, this concept is explicitly in the Copyright Act:

"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." (17 USC 102(b) [0]).

See also the "Useful Articles" doctrine. [1]

[0] https://www.law.cornell.edu/uscode/text/17/102

[1] https://en.wikipedia.org/wiki/Copyright_law_of_the_United_St...


If you think most people pay any attention to licenses or respect them you better think again. Snippets get copied verbatim with no regard to their source all the time. Licenses have no power and are routinely ignored.


The point is not whether the law is actually upheld. It's whether it is legal or not.


Mmm maybe rephrase that as “depending upon which entity’s copyright was violated”

Surely I don’t need to recite the last 50 years of tech legal precedent and case history for you to see that such a blanket generalization cannot be left unaddressed.

Litigants litigate


> It is learning from it just as a human reader would

I don't see how that invalidates the copyright/license argument. So, instead of just a straight-up license violation, it's a license violation via plagiarism.

That argument wouldn't hold up even if it were a human causing the violation. You can't just paraphrase someone's licensed work, then lie about having looked at it and pretend you made it yourself, which is basically what seems to happen with Copilot, as it doesn't automatically reproduce the license of the code it reproduces.


> You can't just paraphrase someone's licensed work

Yes you can. That's exactly why you paraphrased it instead of copying verbatim.

At the fringes, your transformation may not be enough to overcome the requirements, but that's an exception. Nearly all paraphrasing is legal by default.


I like you


It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.

The arguments against my point always assume perfect memory of everything this model has consumed. That is the plagiarism position. In reality, some patterns are more common than others and generate code that looks nearly identical. I can't speak to the reasons for this, as I'm not familiar with all of the methods. However, I don't assume that is the current working state or intent of Codex.


> It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.

It remains to be seen whether ML is true "learning" in the sense of developing a skill the way a human does over time.

It is however irrelevant to the manner in which this model operates today.


> People's disgust is based on the assimilation of what they thought was a human trait being machine derived from their work.

No, people's disgust is with Microsoft violating their legal privileges.

> The copilot service backed by an army of actual humans wouldn’t be a story at all.

Correct, it would be an open-and-shut lawsuit.


It isn't really learning if it's just regurgitating whole function bodies. I use Copilot a lot, and definitely see whole functions being spit out, that were presumably written by a person somewhere.


I also use Copilot a lot, and while it does suggest large function bodies, I'm not sure that it's "regurgitating" them (though it could be...I don't know). I suspect that it's seen so many function bodies that are similar that it generates another similar output. Like autocomplete in a word processor has seen so many similar chunks of text that it reproduces them based on past experience. I don't know this as a fact, of course. I'm just reacting to the word "regurgitating."


It regenerates the comments from the Quake III fast inverse square root function.
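For reference, this is the function in question, distinctive comments and all. A sketch rather than the verbatim `q_math.c` source: the original type-puns through a `long` pointer, which is undefined behavior in C (and the wrong width on 64-bit platforms), so this version substitutes `int32_t` and `memcpy`; the magic constant and the much-quoted comments are unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root as popularized by the Quake III source.
 * Sketch: memcpy/int32_t replace the original's pointer cast so the
 * bit reinterpretation is well-defined; logic otherwise unchanged. */
float Q_rsqrt(float number)
{
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    memcpy(&i, &y, sizeof i);               // evil floating point bit level hacking
    i  = 0x5f3759df - (i >> 1);             // what the fuck?
    memcpy(&y, &i, sizeof y);
    y  = y * (threehalfs - (x2 * y * y));   // 1st iteration
    return y;
}
```

One Newton iteration leaves the result within roughly 0.2% of 1/sqrt(x), so even a single regurgitated comment like "what the fuck?" is an unmistakable fingerprint of this specific copyrighted file rather than independently derived code.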


Just did a quick github.com search on that function (with comments) and found around 131 matches. Many without a license. So yes, I believe that it would produce those comments...because it's seen humans repurpose and reuse that code without attribution or license many times.

Definitely an issue...but not as simple as copy and paste.


The fact you don't know is the problem.

I can't use co-pilot because if I am stealing someone else's copyrighted code I'm in trouble from a legal standpoint.


I definitely agree that we should find out. I've used Copilot almost since its inception, and I've seen nothing like a large copied/pasted function. If anything, it's mostly a single/double line autocomplete based on what I would have written anyway.


The Luddite reaction to Copilot is very hilarious to me. It seems to be a great way to identify low-talent coders, because who else would possibly feel so threatened by an AI?… Watching HN commenters suddenly become ardent defenders of copyright is quite the sight.


Caring about licenses, fair use, and copyright has been deeply ingrained in the open source and hacker community for literally decades.


Caring about defending peoples right to fair use has been. Which is the exact opposite of the Luddite reaction to copilot.


I see your statement as an inversion of consensus reality. What actual coder would use copilot? A beginner or dabbler.

I predict your attempt at tactically “managing” this copilot scandal will not play well on HN to experienced coders, your Microsoft colleagues chiming in next claiming it boosts their productivity notwithstanding.

Yes, I do indeed suggest astroturfing afoot.


I tried it out, but I don’t use it at all on a day-to-day basis. I have no idea if it boosts productivity or not. I also have no idea how you came up with the claim that it’s only for beginners, I’m presuming you just made this up? All of the people that I’ve noticed talking publicly about their experiences using it are highly experienced engineers.

I just think it’s hilarious how fair use is so widely supported on HN when it comes to music, or videos, or interface names, but all of a sudden is a moral crisis when it appears to threaten the value of HN member’s labor.


Goodbye and good riddance. Even just the idea that GitHub should be allowed to train their proprietary AI on other people's work is insane. Much less distribute that AI in a paid package which lets you spit out other people's code verbatim. Anyone who supports open-source and the (ab)use of copyright law to create free works should be vehemently opposed to Copilot.


The fact that it's the first major development started at GitHub after the Microsoft acquisition is such a blow, too. Way to spend their social capital. I can't imagine the money they made from Copilot subscriptions was worth it, since companies have certainly stayed away from this...


> Even just the idea that GitHub should be allowed to train their proprietary AI on other people's work is insane.

You explicitly agree to this when you upload code to GitHub.

FOSS folks shouldn’t have sold their soul to the proprietary devil but they did and now they have to deal with it.


You meant to say "implicitly"? Even then, no, the terms are much more specific.


How can you explicitly agree when anyone can upload your code to github?


I don't understand why GitHub decided to run the project this way. It's a great idea, but they messed the whole thing up. They could have made it opt-in from the very beginning and asked people to waive their rights, and I'm sure lots of people and lots of big projects would still have been interested in joining the initiative. They could have rewarded participants with, say, 3 years of Copilot access after the official launch, and people would have loved that. But instead they just took code without asking or attribution and keep pushing it, and now we are in this situation.


Everything else aside, the design on this site is among the best I've ever seen. Amazing typography, great to read on a phone.


He wrote the book on it. https://practicaltypography.com/


I think it's very hard to skim for some reason.


With this site you see about 50-100 words on a large mobile screen. On HN you see 2-4x that.

Also, the section breaks and headers and boxes lack obvious rhyme or reason. It scans a tiny bit like a classy version of Time Cube. You keep getting hit with different font sizes and font styles and lines and ribbons and colors and you're not quite sure why.


For me I gave up on reading it almost immediately despite being interested in the topic because the damn typography was too exhausting.


Eh, I think the line height is too cramped, and the hyphenation makes it harder to read on screen. Hyphens are good if you're trying to save ink or pages in a novel, but screen real estate is more than free on the internet.


Interesting. For me it was so slow to scroll I had to use archive.ph to read it :-/

This is on a powerful PC with a state-of-the-art graphics card.


That was my first thought as well. Perfect font sizing, clean & elegant design.


Can you elaborate on what makes it so? Changing font sizes, boldness, lines etc...


For me it was the 'magazine' style with proper breaks and editing, formatted for a vertical screen but still reads great on my laptop. The effort in the content, links and emphasis make it feel like journalism I would normally get paywalled on.


Although I'm aware that this tool is a boon to many, particularly those with impediments like RSI, I still have to echo what a number of other comments say: There really is a very large proportion of adult software developers in the market who are simply too young to have lived through the EEE Microsoft era. Add on to that the proportion of old-enough Microsoft-brand "dotnetter" software developers who simply don't care as long as they get to sit comfortably within C#, Visual Studio and Azure.

After that, what are you left with? A proportion of developers small enough (and Microsoft evidently thinks so) who don't know, and/or don't care, and/or don't have the time to fight the Embrace-and-Extend phase of their takeover of GitHub.

One could argue that the purchase of GitHub was Embrace, and that their involvement with OpenAI, Codex, and the potentially illegal use of OSS (subject to the legal investigations) is Extend.

It's my own personal view that Microsoft held back the progress of software development by probably a decade or so with their shady commingling with academia, their blatant crippling of C#/.NET to sell Visual Studio, and so on. So I am, along with many, upset to see a business like this EEE their way into OSS, something which is dear and special to so many.

In the end, and I must state in my own opinion (since there is an element of speculation here), I am just pleased that there are still people out there who are not letting Microsoft continue their old ways.


If this is MS trying to pull off EEE, what does the Extinguish phase look like? That they try to make it so that any codebase that uses Copilot is owned by them, and that there's no way to turn it off because no other editors or code hosting sites will exist? Plausible, I suppose, if they play the game for several decades and somehow no one else produces any innovation in the space.


Extinguish might look like Microsoft or their customers/partners writing proprietary replacements for open source products with the help of copilot. I don't know how likely this is, but what co-pilot provides is a plausible path for leveraging open source code to create closed-source products. Over time this allows the proprietary software industry to contribute back less code while still benefitting enormously.


That would mean that Copilot is, at least in large part, a front: a false-flag operation to test the legal system's tolerance and determine whether they can get away with what they are doing, right?


I'm not too sure yet, but I wouldn't be surprised if we wake up one day, and just like how it went with Facebook buying oculus, we will all of a sudden require some "microsoft account" to log into Github. Then more layers, and more, until there is nothing left.

Either that, or you wake up one day to see that Microsoft has stolen your open source software and Microsoft says "but muh AI".


It's a new approach compared to Amazon taking open-source code and building AWS services that kill off attempts at self-funding (dual licensing/support services) by the people who made it. I hope it's not as successful. Amazon at least abided by the letter of the licenses.

>> "I'm not too sure yet, but I wouldn't be surprised if we wake up one day, and just like how it went with Facebook buying oculus, we will all of a sudden require some "microsoft account" to log into Github."

See: Minecraft. It was already a goldmine when they bought it, but they built it into an even bigger one before forcing millions to have a foot into their ecosystem. Copilot might be their way of making everyone dependent on GitHub before "moving on" from git and offering a Community Edition of their own source control system.

It'll be easy. A lot of people hate Git.


Embrace: How nice that we don't have to worry about GitHub closing for lack of funds. Thanks, Microsoft!

Extend: Make people dependent on GitHub Copilot. Require a Microsoft account (soon).

Extinguish: Sunset git and transition to Microsoft's own source control system.


I think it's important to realize the exact implications here:

- MS absolutely has the authority to copy, use, and even train their models on your GPL-licensed code, because you agreed to let them do that when you signed their EULA when you decided to host your code on GitHub.

- This authority does not extend to CoPilot users, who cannot republish your GPL-licensed code without respecting the license. But remember that people have always had the ability (not authority) to copy and use open source code in violation of the license. This simply makes it embarrassingly easy for a person to do so unknowingly (although, legally, this would probably be considered negligence, not ignorance).

IANAL but I wonder if the extreme facilitation of copyright infringement here could be considered gross negligence on the part of MS, as they're almost entrapping their own customers in a minefield of copyright concerns. Can't wait to find out.

The logical next step in this arms race is for the GPL camp to build tools to automatically search for copyright infringement in large codebases. Copyright holders could set up hotlines for insiders to blow the whistle on infringement in exchange for compensation, since AFAICT all litigation precedent in the US has so far resulted in settlement.
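As a rough sketch of what such an infringement scanner could look like (purely hypothetical; the function names and the example snippets are mine, not from any real product), classic code-clone detectors fingerprint overlapping token k-grams of normalized source and compare the hash sets:

```python
# Hypothetical sketch of a clone scanner: hash every k-gram of
# normalized tokens and measure Jaccard overlap of the hash sets.
# Real tools add winnowing and indexing on top of this basic idea.
import hashlib
import re

def fingerprints(source: str, k: int = 5) -> set[str]:
    """Hash every k-gram of normalized tokens in `source`."""
    # Identifiers become single tokens; every other non-space char is its own token,
    # so whitespace and formatting differences vanish.
    tokens = re.findall(r"[A-Za-z_]\w*|\S", source.lower())
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(max(len(tokens) - k + 1, 1))
    }

def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two fingerprint sets (0.0 to 1.0)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

gpl_snippet = "for (i = 0; i < n; i++) {\n    total += weights[i] * values[i];\n}"
suspect = "for(i=0;i<n;i++){total+=weights[i]*values[i];}"
print(similarity(gpl_snippet, suspect))  # 1.0: reformatting doesn't hide the copy
```

Token-level matching catches reformatted copies, though simple identifier renames already degrade it, which is why real detectors work harder than this.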


> MS absolutely has the authority to copy, use, and even train their models on your GPL-licensed code, because you agreed to let them do that when you signed their EULA when you decided to host your code on GitHub.

What about GPL code which you don't own, but post to Github, Like the gcc mirror repo?


read the terms of service.

you must have the right to publish the code you put on github.com, and by publishing to github.com, you assert that you have the rights to do so. you also grant GitHub the right to show that code to others, no matter what license your code is licensed under.

why does no one read the terms of service or license agreements? these questions are answered there and this "copilot is stealing" stuff won't even make it to court.


Even if it is true that the ToS that no one reads allows GitHub to blatantly violate the license of code that anyone could choose to upload to GitHub without the copyright holder's consent, people are going to do it anyway en masse, and that code is going to get eaten by Copilot. Even if copyright holders constantly play the game of reporting public repos to GitHub for removal, it's not going to be enough.

>this "copilot is stealing" stuff won't even make it to court

IANAL but I am heavily skeptical of your confidence here.


> allows GitHub to blatantly violate the license of your code

it doesn't allow GitHub to violate any license; it explicitly grants GitHub an additional license on top of the license you choose for your code.

the only way to revoke this license grant to GitHub is to remove your code from github.com.

> IANAL but I am heavily skeptical of your confidence here.

ok. it's all spelled out in the terms of use. I'll ask you this, though; who do you think Terms of Usage/Service documents are intended to protect?


Please. If someone uploads a pirate copy of a movie to GitHub, GitHub doesn't get an additional license to anything, on top of whatever license the uploader chose.

Maybe GitHub is saying it's not their fault and that they were misinformed.

Those types of arguments usually don't get all that far with copyright violations.


if someone uploads a pirated movie to github, github won’t get the license if it goes to court, but they won’t get in trouble for the assumption that they had the license, either. the uploader has both pirated and committed wire fraud as far as the github terms of service agreement is concerned.

either way, github is absolved. if a DMCA claim is filed on the movie, then that gets quarantined and removed from the list of things that they can show their users.

the user that uploads stuff to github.com attests that they have the right to upload it. by being uploaded, github can assume that it has the rights it asks of users unless and until they are told that they do not have those rights. so, github are covered until they are formally told that they are not, at which point they must restrict access to that data to themselves and others, which they regularly do.

these types of arguments do indeed work very well if github reacts promptly when they are told that they are using rights they do not have. github is not primarily used for piracy, like thepiratebay, and thus has a valid claim that they were lied to by a user. this is when the user who uploaded gets involved legally if the true copyright holder chooses to involve them.

you being mad at github, microsoft, or me doesn’t make any of us wrong.


The confidence that it won't go to court is misplaced. It will go to court, but the people saying it's obviously infringing are wrong. And even if it is, Microsoft can just change their terms to make Copilot compliant.

> Even if copyright holders are constantly playing the game of reporting public repos to GitHub to remove it's not going to be enough

There is no copyright police outside of criminal infringement.

The DMCA gives you the tools to protect your copyright, if 1000 people infringe on your copyright you need to be ready to sue 1000 people OR attack the platform under safe harbor.

If someone uploads your code to github, you need to find it and ask them to remove it.

If someone uses your code via copilot, infringingly, you need to find it and ask them to remove it.

This is the law as it currently stands.


Okay, but what gives them the right to remove the license from my code when redistributing it?


Ironically, such tools already exist, and most (all?) large software companies use them routinely on all the code that they ship.


those license tools already exist. they're not free, though, so they are mostly invisible and unknown to GPL types.


Unfortunately the tone I'm getting from many of these comments makes me feel that people see open source projects as a resource to be mined rather than as a product to be respected. A very entitled attitude ("I really don't want to lose my lovely tool")

There seems to be -- on the whole -- little respect for the spirit of the GPL and LGPL and it really is quite a change from, say, 20 years ago, when the 'free software' movement was I think more ascendant.

I think we have a generation of software developers who have only known a world where copious quantities of high quality source code has been made available to them under very liberal licenses -- which they in turn make careers and companies out of using / exploiting.

I, too, do this, and I generally open my modest projects under Apache or MIT or Mozilla style licenses. I do this because I want people to use my things, or to be able to use them as resume / portfolio material. Or because my employer at the time has helped fund construction of them.

But I also occasionally use the GPL/LGPL/AGPL, when I want to explicitly avoid corporate entities from exploiting said material without either consulting with me or in turn making their efforts free.

And in turn, I respect the value and power of the GPL for that purpose.

So many of the comments here are trivializing the value of free software and the licenses which make it possible, and acting like there's just this... natural right... to go out there and build on other people's work without recognition / compensation / contribution.

There are too many examples of CoPilot violating the spirit -- if not the actual legal letter -- of the GPL. This is unacceptable. I'm glad that someone is attempting a legal test.

Free software is not your data to mine. It is the blood sweat and tears of thousands of developers who do their work in community spirit, but under explicitly free software principles.

Putting something out under a free software copyleft-style license is not the same as saying "You can do with this what you want." It's "I made this, you can build on it, but what you made also has to be free. Or you negotiate with me."

And what I'm getting from the whole CoPilot fiasco is: GPL / free software does not belong on GitHub. And it might end up having to be put, generally, behind barriers that explicitly (technically and legally) prevent CoPilot & similar systems from getting access to it.

EDIT: I also fully expect a new version of the GPL to be published that includes clauses against this kind of datamining.


Agreed. Something else apparent from the comments is that people seem to think that some things are copyrighted, and some aren't, and that copilot would be better if it could avoid copyrighted code. Actually, essentially ALL code is copyrighted, and was so the moment it was written, and someone owns that copyright [1]. People only start noticing copyright when the terms of how that copyrighted content is licensed affects them. I think people resent reciprocal licenses like LGPL/GPL because the principle of "share and share alike" that they implement comes with real responsibilities and consequences for the user of the code, while they believe that non-reciprocal licenses (BSD) can be ignored with less serious consequences.

But the show-stopping problem is that copilot is sometimes producing code that is more than fair use of other code, and is unable to attribute that code or identify how it is licensed. It is copilot's (Microsoft's) fault that it auto-generates legal minefields, not the fault of the person who made an informed decision about licensing their own code.

In spite of the likely downvoting, I'll say that people should be grateful for reciprocal licenses not just because they were and are the foundation of free software (as you point out), but because they shine a light on what it means to license code, and how we are forced to revisit the difference between copyright and licensing when a reciprocal license is violated.

[1] https://en.wikipedia.org/wiki/Berne_Convention


Copyright is our current framework for rewarding creativity and encouraging innovation. It fundamentally depends on the assumption that protection of creative works is possible and feasible. That assumption is what's under assault by copilot and AI systems in general.

It's easy to forget that protection of creative works is only a means to end, not the goal or ideal state.

I believe AI systems will be able to help us build a new system that can track attribution of ideas and identify predecessor works from derivative products. This attribution could then form the foundation of a reward system. This is just one possible future.


What do you expect when people grant GitHub an extra license to their repos[0]?

0: https://docs.github.com/en/site-policy/github-terms/github-t...


This can't possibly hold up in court, can it? I have multiple repos that are mirrors of various open source projects. Some are not even Free Software! I have no right to grant such things to anyone let alone GitHub.


This particular licensing term doesn't seem relevant to Copilot.


My code is not on GitHub. If it's there, then someone copied it, and GitHub has no right to claim an extra license to that code.


>But how will you feel if Copi­lot erases your open-source com­mu­nity?

Jesus Christ, dramatic much? Are people that stumble upon a piece of code while googling how to do something, and end up copying and pasting the code from the repo, really building the open source community? Because that's essentially what this is. Whether I use Copilot to generate a tedious function or copy it from your open source repo, I'm on the same level of membership in your open source community.

This whole thing feels like artists screaming how AI generated art is horrible, trying to figure out how to sabotage it, or how to start lawsuits - just because their value went down just a bit. Same thing with developers.


Couldn't agree more... It's very depressing that this post is popular, wouldn't want Copilot shut down over some drama queen lawyers that have no connection to the reality of software development and ALL creative fields. Creation requires influence: https://www.youtube.com/watch?v=nJPERZDfyWc&feature=emb_titl...

The entire fucking concept of intellectual property and copyright is flawed from the get-go. The issue people are wrestling with beneath the surface is not copyright but the monetary system itself, which incentivizes this "chisel off one another" and "MINE!" behavior. Otherwise, how will you survive if you can't monetize your actions? But intelligent socioeconomic alternatives exist: https://www.youtube.com/watch?v=lBIdk-fgCeQ

People are trying to solve this problem in an ass-backwards way. Either move to universal basic income or a resource-based economy and make all ideas free, copyable, and remixable. Either way it doesn't matter, since you have access to some resources (under UBI) or all resources for free (in a resource-based economy) and don't need to monetize anything. Instead, people are content with making life shittier.

"We stand on the shoulders of giants" said Newton, but oh no.. this piece of paper called 'the law' knows better!


> The entire fucking concept of intellectual property and copyright is flawed from the get go.

Many people are upset because Microsoft is hiding behind copyright and lawyers to enforce it, while at the same time ignoring the concept of intellectual property when it comes to smaller players. I'd imagine that if Microsoft removed copyright on all their code and released it and Copilot as open source, there would be much less outrage.

The issue here, for me at least, isn't centered around copyright as a concept; it's about the asymmetry of the situation. Microsoft is exploiting those without any recourse in order to sell a product.

Your ideas about a new economy and no copyright are interesting, but they will not happen any time soon. In the meantime, in reality, Microsoft is making millions based on an enormous pile of community code while not offering their code back to that community.


This I can very much understand. I agree that this would've been the way to at least partially settle this. But then people will say "What about all the hard work of the employees / teams at Microsoft/OpenAI? Should they not get a return on their investment of time & money?"

In that case, something like the capped profit model OpenAI has (but with less profits) could work. They decide "Okay after we've reached this amount of money for Copilot, we'll both profit and have enough money to sustain it as a service until the next technological breakthrough makes this obsolete"

Then just make it free for everyone forever.


Totally agree. The fact is that this product assists developers (both experienced and inexperienced) by increasing productivity and aiding in the generation of new code, products, and services. This product is fantastic in my view. Attempts at baiting the service into outputting potentially copyright-violating code (which is synthesised because it is copied thousands and thousands of times in public repositories) only show how desperate those claims are.


Doesn't really explain how co-pilot is stealing your community. I've used co-pilot and it works great until you are past boilerplate; then it falls apart.


Agree, the argument seems pretty threadbare. Millions of programmers use open-source software every day and incorporate it into their own projects without ever engaging the authors of the code upon which they rely.

Perhaps the author means that there’s a possibility that the programmer to whom the code was suggested won’t necessarily know its provenance and how to engage the community from whence it came. If so, that’s a stronger argument, but I don’t know that it’s the best one they can make.


It's just basically a search engine, but the context is the rest of your code.

Writing software by just auto-completing from CoPilot would be like trying to write a whole novel with the predictive text on your phone. You could do it, but the results would be nonsensical and full of semantic errors. I don't think there's any real 'provenance' at play in either case.
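To make the phone-keyboard analogy concrete, here's a toy greedy bigram predictor (a made-up illustration, nothing to do with Copilot's actual model): always choosing the single most frequent next word degenerates into a loop almost immediately.

```python
# Hypothetical toy "autocomplete": train bigram counts on two sentences,
# then greedily pick the most frequent successor of the last word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the cat ."
words = corpus.split()

# Count which word follows which.
nexts: dict[str, Counter] = defaultdict(Counter)
for prev, cur in zip(words, words[1:]):
    nexts[prev][cur] += 1

def autocomplete(start: str, length: int = 10) -> str:
    out = [start]
    for _ in range(length - 1):
        followers = nexts.get(out[-1])
        if not followers:
            break  # dead end: no observed successor
        out.append(followers.most_common(1)[0][0])  # greedy choice
    return " ".join(out)

print(autocomplete("the"))  # "the cat sat on the cat sat on the cat"
```

The greedy chain collapses into "cat sat on the" forever: locally plausible, globally meaningless, which is the failure mode being described.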

This guy is a literal "who?" who is really overestimating the value of his open source contributions relative to a general development tool that can reduce the cognitive load of working in some hairy code bases.

I don't think it's CoPilot that's 'erasing' his Racket community.


> Over time, this process will starve these com­mu­ni­ties. User atten­tion and engage­ment will be shifted into the walled gar­den of Copi­lot and away from the open-source projects them­selves

The author seems to be implying that since Copilot can reproduce the code of open source repository X in certain scenarios there'd be no reason for programmers to learn/use/engage with repository X. But this is silly. Maybe some open source repositories could be tab completed with a little prompting but people will presumably choose to add a dependency instead of tab completing the code of express or something.


It strips GPL, or any license.


It strips the (mandatory, in a lot of cases) licence text. But the licence still applies.

(Or I guess more technically, the original author's copyright still applies, and the rights granted to use the work under the license, as an exception to the strict limitations of copyright, do not apply...)


It also doesn't make any sense. Copilot suggesting the signature of a function from some library to me is not actually the same as executing that library. That library still needs to be downloaded onto my computer to be executed. And who will write new features for a library if not the people who are interested in it?


Open-source licenses have terms that apply to the source code, not "execution", so I really don't understand your point?


They don't really explain in a satisfactory way how Copilot is "stealing communities" even though they themselves complain that Microsoft hasn't provided "solid legal references".

Copilot is an AI stunt, an exploration, trying something new and exciting with very mixed and not-so-useful results.

This lawsuit, however, is just lawyers doing what they do for fun. I guess the retained Microsoft lawyers love it too. Glad to see lawyers having so much fun and profit. But we would all be better off without so much lawyering; can't they do something more worthwhile?


I’d guess it is basically: you’re making a mural people can look at for free, but a business is selling canvas reproductions of it without attribution.

If I solve a problem and I say: Sure, use my code if you want, but be sure to contribute any improvement back to the community - I wouldn’t be happy seeing a tool spewing it out everywhere. And it’s not even free. In this case, microsoft is literally making money on the back of millions of programmers. And without approval.


You are willfully ignoring that the article absolutely DID address this point!

The license and attribution are stripped from regurgitated code snippets copied from code projects. If people don’t know which project the code was taken from, how can they one day contribute to that codebase? If the code projects on GitHub are not at least making the people who use their code aware of the project, that project disappears. Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


The community argument is weak but it’s just content marketing, not a thesis. The goal of the article is to generate leads for potential members of a class action lawsuit.


"It's not a problem that it steals code because it doesn't work that well right now" is not a valid argument, don't pretend that AI doesn't advance on a daily basis.

Reminds me of people who defend AI art with "you can tell it apart from real art" yeah no, at the rate we're moving you won't be able to tell at all in a year or two.


I agree with your starting point, but art actually makes the questions clearer, I think. Having AI art "you can't tell apart" is possibly a bad thing for human producers of art (or not; obviously lots of artists don't make their art just to get paid, but if you do...), but it seems like a great thing for the human consumers of art. Copilot/Tabnine etc. could certainly give you terrible suggestions, but my experience is they often help with boring scaffolding stuff which isn't likely to be full of bad ideas, just time savings. Could they mess you up? Oh, for sure. But mostly they just seem like an autocorrect that doesn't guess wrong every other time.


It doesn't, it's just a very easy to relate to argument.

Generating snippets of code has nothing to do with a fully functional software package/product/service and an organic community around it.

One might argue that a community could be more easily formed thanks to co-pilot, because it increases developer productivity and lowers the effort to contribute, so OSS projects actually benefit from co-pilot. If this sounds far-fetched, then the first claim is probably similarly far-fetched.


IF they attributed the code snippet to the project they ripped it off from!!

But they deliberately don’t tell you that… it’s just a code snippet floating in space, as if they invented it


The license and attribution are stripped from regurgitated copied code snippets from code projects.

If the people don’t know which project the code was taken from, how can they one day contribute to that codebase?

If the code projects on GitHub are not getting the people who use their code at least aware of the project, that project disappears.

Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


Does GitHub not have the right to view and train from your content when you agree to their Terms of Service and upload your code?

People are conflating their open source license with the one they give GitHub when making a GitHub account, but they are two entirely separate and parallel licenses. The former is for other people to use your code, the latter is for GitHub to host your code.

If you don't like it, you are free to host your code on your own servers.

And anyway, as noted the other day about AI, it is often funny to see people not care about (or even enjoy) AI in fields they don't work in, but when it comes to their own field, they are suddenly very worried. See programmers on HN who argue for Stable Diffusion but against Copilot, and vice versa with artists on Twitter. As I commented then, it's an act of cowardice to think our own profession should be immune from AI while we enjoy the fruits of AI in other fields [0]:

> Yes, many of us will turn into cowards when automation starts to touch our work, but that would not prove this sentiment incorrect - only that we're cowards.

>> Dude. What the hell kind of anti-life philosophy are you subscribing to that calls "being unhappy about people trying to automate an entire field of human behavior" being a "coward". Geez.

>>> Because automation is generally good, but making an exemption for specific cases of automation that personally inconvenience you is rooted is cowardice/selfishness. Similar to NIMBYism.

We should want AI. That we then try to use outdated models like copyright to enforce holding back human progress is a true shame. In my view, so what if GitHub uses people's code for training data, we are all getting a better product because of that.

[0] https://news.ycombinator.com/item?id=33226515#33228948


There are quite a few projects that didn't originate on GitHub. Some are mirrors of projects hosted elsewhere, some accept patches through other means, some include code that predates GitHub. If you get your Linux kernel patch accepted by emailing it to the responsible maintainer, it will end up on https://github.com/torvalds/linux. But you never agreed to the GitHub ToS; all you did was agree to publish it under the GPLv2. Linus agreed to the GitHub ToS, but he can't give away rights he doesn't have, so he can't be giving GitHub any rights to your patches that go beyond the GPL.


The current GitHub terms of service don't seem to mention this use when they describe the license granted to GitHub.

https://docs.github.com/en/site-policy/github-terms/github-t...

4. License Grant to Us

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


That mentions everything: Parsing the content, showing it to/sharing it with other users, using it to improve and provide the service. GitHub and all of its features are "the service".


True, but it doesn't mention doing so without the attribution that might be required by the code's licence. If full attribution of where the suggestion was derived from were included¹, there would be no issue IMO²; it is this gap that creates the grey area from which these discussions arise.

--

[1] the practicality³ of this is a different, though related, discussion

[2] because the user is fully informed and can take responsibility for the decision to use the suggestion or not

[3] or impossibility – given the code could be added by someone who doesn't include that attribution/licence information for the system to be able to pass on even if it were designed to


The terms of service are completely independent of the code's license. The code could say "no one but I may use this", but by using GitHub you give them rights to do everything stated in the Terms of Service.


But if the terms of service say nothing in contravention of your licence choice when you agree to them, and then the service does something that you consider to be in contravention of your licence choice, what you have is one party unilaterally changing the agreement. Of course the exact legal meaning of the terms, and any perceived change in them, could and will be debated long, hard, and potentially expensively…

I'll stick to self-hosting instead of using services like GH. Keeps things a little more simple in that regard.


The license agreement is irrelevant. Literally it does not come into play here. Github is not bound by the license; they are bound by the terms of service. The code is co-licensed: once however you declare it, once to Github independently.


Sharing with the license intact. If GH is sharing with the license and attribution stripped, then just punting IP vetting to Copilot users, it seems to exceed their rights.


Why do people refuse to have even a pre-high-school level of understanding of licensing? By uploading your code to GitHub you are granting them their own license to the code under their terms. Your LICENSE file has absolutely nothing to do with it. Your LICENSE file could say "everyone but GitHub" and it wouldn't matter one jot, because that's not the license you licensed it to them under.

And if you didn't have the rights to grant the licenses to Github? Then you are in violation of the copyright holder's rights, not Github.

The only remotely plausible, yes-I-have-graduated-fifth-grade argument against Github is that they ought to and certainly do know that huge portions of their users are in fact granting them licenses without the necessary authority. That's an interesting argument we should be having, and instead we're having this inane screaming match by people who have no clue what they're talking about while some of us are sitting here going WTF is wrong with you?


Calm down. No need for ad hominem attacks.

I am discussing the license grant GH includes in its terms. And that doesn't appear to give them a blank check to do anything they want with code those users have uploaded. Certainly not sell it piecemeal.


IANAL, but it's pretty clear that GH explicitly says they will NOT distribute the code. I'm not sure what else you'd call offering to copy a section of code.

"It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service."

Throughout the license, the Content is treated as an indivisible unit, and it specifically refers to the forking functionality. Notice that forking forks an entire repository, licenses included, etc. You can't fork a single file, and you can't fork a region of a file. Whole-repository forking is the kind of forking GH provides.

Copilot is fine-grained forking.

No significant software company is going to permit copilot to be used and potentially poison their code base in unknown ways, now that this kind of copying is in the open and is clearly a significant danger.

Somebody like Black Duck is going to make a lot of money for trial attorneys by tracing how code was created and finding the "hits". That will be joined with log data indicating who used copilot, when they used it, and exactly what copilot presented as the "hit". This entire process will be performed recursively on the "hit", together with classic source analysis, to find out where something is really from.
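The mechanics of such a provenance scan would presumably resemble the winnowing fingerprint scheme that plagiarism detectors use. A toy sketch, assuming nothing about Black Duck's actual method (all names and parameters here are invented):

```python
import hashlib

def fingerprints(code, k=8, window=4):
    """Winnowing-style fingerprints: hash every k-gram of the
    normalized token stream, then keep the minimum hash in each
    sliding window. Matching fingerprints between a codebase and a
    corpus of licensed code flag candidate "hits" for human review."""
    tokens = code.split()  # crude normalization; real tools use lexers
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [int(hashlib.sha1(g.encode()).hexdigest(), 16) % (1 << 32)
              for g in grams]
    picked = set()
    for i in range(len(hashes) - window + 1):
        w = hashes[i:i + window]
        picked.add((i + w.index(min(w)), min(w)))
    return {h for _, h in picked}

def overlap(a, b):
    """Fraction of a's fingerprints also present in b."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa) if fa else 0.0
```

The recursive tracing described above then amounts to running this overlap score against each candidate ancestor of a "hit" until the origin is found.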

The bigger companies are really, really serious about not copying outside code except under really strict conditions -- these conditions mostly look like "no you may not, unless you have one of these specific situations". It's no-by-default, even when it looks like it could be a yes.


"outside of our provision of the Service"

You've ignored the important words. Copilot is part of the Service.


Once someone uses Copilot, they now have code fragments outside GH, stripped of attribution and license. Looks like GH is trying to say these users need to do their own IP vetting, which seems very impractical for anyone, even the creators of GH Copilot.


There's nothing about "license intact" in those clauses. GitHub is able to do whatever it wants with the data; any users of the service do have to check on licenses, as they should with any source (including copying from Stack Overflow).


Okay, so Copilot isn't illegal, it's just an engine for doing illegal things? That's... not better?


You're right, we should ban the camera and paintbrush as well because people can make illegal materials out of them too.


Cameras and paintbrushes can easily make non-infringing works. Users of them can easily be trained how to avoid taking others' work.

Copilot, on the other hand, basically defaults to infringing behavior. Users would have to go to great lengths to be sure they aren't infringing on others' work.


> show it to you and other users...analyze it on our servers...share it with other users...perform it

I don't know, sounds pretty similar to training ML models on it, even if they don't explicitly say "machine learning" in the ToS.


> This license does not grant GitHub the right to sell Your Content.

This would, at a minimum, preclude charging for Copilot.

This is missing the point though. Microsoft claims their use of source code for Copilot is fair use. If they are correct about that, licenses don't matter, this EULA doesn't matter, etc. Everyone should be focusing on this claim, arguing about any other detail before that is decided is a waste of time.


If anyone asked me to define Copilot I'd refer back to this:

> parse it into a search index or otherwise analyze it on our servers; share it with other users

That is the most succinct and most accurate definition of Copilot I've ever seen.


> If you don't like it, you are free to host your code on your own servers.

From the article:

> “Dude, it’s cool. I took SFC’s advice and moved my code off GitHub.” So did I. Guess what? It doesn’t matter. By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub.

And:

> when it comes for their own field, they are suddenly very worried

From the article:

> First, the objection here is not to AI-assisted coding tools generally, but to Microsoft’s specific choices with Copilot. We can easily imagine a version of Copilot that’s friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus. Despite its professed love for open source, Microsoft chose none of these options. Second, if you find Copilot valuable, it’s largely because of the quality of the underlying open-source training data. As Copilot sucks the life from open-source projects, the proximate effect will be to make Copilot ever worse—a spiraling ouroboros of garbage code.


> As Copilot sucks the life from open-source projects, the proximate effect will be to make Copilot ever worse—a spiraling ouroboros of garbage code.

If I was writing this website, I would delete this sentence, because it is actually really idiotic.

1. This is an opinionated statement, except that there's nothing backing this opinion. This is fearmongering, that GitHub Copilot will get worse unless we sue Microsoft.

2. Lawsuits are not about concerns about a product's creator potentially damaging their own product. Lawyers suing Microsoft don't get a "we're protecting Microsoft from Microsoft's own bad decisions!"

The argument, completely unfounded, is that GitHub Copilot will undermine... GitHub Copilot. So, if you like GitHub Copilot, you should also be on board with suing Microsoft, so that we don't damage GitHub Copilot. What??? Good luck proving that line of argument in a court - you'd get laughed out of the room. Courts don't react well to hazy predictions about mayhem, from the suing lawyers, that have no historical facts to base them on.


Code pushed to GitHub is quite often not pushed by the actual copyright holders, and there is no way to distinguish it, even if there were a clause like this in GitHub's user agreement.


How many millions of accounts were created before Copilot existed at all? It certainly wasn't in the ToS then, except maybe in some extremely vague way.


ToS isn't some all-powerful thing, first of all. A lot of it is unenforceable nonsense. And I'm not sure how it really works with OSS.

For instance, what if I self-host an OSS project but someone puts a mirror on GitHub? Or just uses GH as a remote for their fork? Does that random person accepting the ToS now mean GH has carte blanche to do whatever they want with that IP?


Not all code on GitHub was uploaded by the copyright holder. The entire linux kernel is on GitHub and at least some of those copyright holders have never explicitly granted a license to GitHub beyond the GPL.


> Does GitHub not have the right to view and train from your content when you agree to their Terms of Service and upload your code? [...] If you don't like it, you are free to host your code on your own servers.

I do exactly this, and it does bupkis to prevent someone from downloading my code from my gitea and uploading it to GitHub. In fact, several people have.


Maybe I'm in the minority, but I think the prospect of someone autocompleting a snippet that came from me, finding it useful, and incorporating it is great. It means my thoughts and logic are shaping culture in a memetic feedback loop.


That's not the problem. If you want to license your work in a way that allows that, you are free to do so. The issue is that Microsoft did that with code that was published under licenses that either did not grant that right, or which explicitly forbade them from doing so.


Also charging for it.


"[W]e inquired privately with Friedman and other Microsoft and GitHub representatives in June 2021, asking for solid legal references for GitHub’s public legal positions … They provided none."

Well... DUH. Why would they? You want to possibly sue them. Why in the hell would they, or anyone, provide crucial evidence for your lawsuit before you've sued them, regardless of the case and circumstances? Of course they aren't going to provide evidence, because you are obviously going to then try to prove hypocrisy, whereas you might not have enough to go on if they don't talk. No corporate lawyer in their right mind would ever grant such a request. (Edit: You are quite literally asking what their legal strategy is going to be, before the lawsuit has occurred, and then trying to spin the refusal as a proof of guilt.)

That's like claiming that an alleged drug dealer who didn't talk without a lawyer present is obviously a criminal, because if he wasn't he would have talked. What a nothing of a point.


The problems have almost nothing to do with deep learning stuff. They are on the companies who develop such products.

If a company uses someone's code for a commercial product (a normal app), they need to follow the license accordingly. If a company uses someone's code for a commercial product (model training), they don't need to follow anything.

If a company uses someone's art piece for a commercial product (a normal game), they need to get consent, and pay for the right to use it, whether to the hosting platform or the artists themselves, if it is not royalty free. If a company uses someone's art piece for a commercial product (model training), they don't need to get consent or pay for anything.

All the problems actually happen before the technical details, making the entire pipeline questionable.


I really don't care if my code gets ingested and regurgitated by Copilot, but it seems rather a stretch to imagine that this is fair use, in part because it separates me from the legal protections afforded by the licenses I released my software under. In my ideal world, Copilot would be legally viable, and releasing my software without restriction wouldn't be risky.

As a long-time open source software developer, I have favored the 2-clause BSD and MIT licenses because they are the simplest licenses that provide me some liability protection. I would release code into the public domain if that didn't increase the likelihood of being sued, whether for liability, or for someone else claiming intellectual rights to code I actually wrote.


I still release under CC0; being copied verbatim is of no concern. Yet I don't think reproducing somebody else's code is 'fair use'.


All this discussion of legality is interesting to me, because I'm pretty sure that if GitHub ran a search in the background, found the corresponding license for the code snippet, then showed it to the user in some cookie-banner-like annoyance, it would be completely legal. This is what GitHub already does on their website with a search bar.

Yet somehow I think most people upset about Copilot would not like that outcome.
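Mechanically, the "show the license alongside the snippet" idea is just an exact-match lookup against an index built from the hosted repos. A toy sketch (the index shape and function names are invented here, not anything GitHub ships):

```python
import hashlib

def normalize(code):
    """Collapse whitespace so trivial formatting changes still match."""
    return " ".join(code.split())

def build_index(corpus):
    """corpus: iterable of (snippet, repo, license) triples.
    Maps the hash of each normalized snippet to its origin."""
    return {hashlib.sha256(normalize(s).encode()).hexdigest(): (repo, lic)
            for s, repo, lic in corpus}

def attribute(suggestion, index):
    """Return (repo, license) if the suggestion matches a known
    snippet verbatim (modulo whitespace), else None."""
    key = hashlib.sha256(normalize(suggestion).encode()).hexdigest()
    return index.get(key)
```

Real matching would have to be fuzzier than a whitespace-normalized hash, but the lookup itself is clearly not the hard part.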


By the way, Amazon's CodeWhisperer does that.

> in very rare cases, an independently generated code recommendation may resemble a unique code snippet in the training data. By notifying you when this happens, and providing you the repository and licensing information, CodeWhisperer makes it easier for you to decide whether to use the code in your project and make the relevant source code attributions as you see fit.


It’s interesting and pretty much uncharted territory from a legal perspective, that’s for sure. It very much relates back to the discussion about accountability of machine learning models, in that it’s desirable to be able to explain how/why some output was generated.

I don’t think a banner would be sufficient here, though; perhaps some references to the inputs that were used to generate the output, but that’s often very difficult to pinpoint.

Whatever happens, if this ends up setting some sort of legal precedent it will have a big impact on the industry, and I personally hope it leads to more accountability and transparency of the models, rather than the black boxes they are now.


How could a banner not be sufficient? Under what legal theory is Copilot + banner illegal, but code search is legal?


I’d argue that copilot’s core functionality is distributing code, which is very different from search (pointing to original code). Very different from a copyright / fair use perspective.


How do you run a search on the product of a bunch of neural network weights? Do you just mean like double-checking that the code it produced isn't a direct copy of something copyrighted?

If so, I think that introduces a lot of other issues. There's an interesting phenomenon of different independent comedians suing talk shows for stealing their jokes. It almost always turned out that those jokes weren't actually stolen. Instead, there's only so many jokes you can make about a given news story, so there's bound to be overlap between a whole room of comedians trying to milk every event of any comedic value and random independent comedians doing the same.


You run the search on the text output. Yes, I mean that they would search for the code. It's certainly legal to show LGPL code snippets along with the license (it may also be legal without this, IANAL).


100% correct takes in the piece; this is just ridiculous:

>"Tim Davis gave numerous examples of large chunks of his code being copied verbatim by Copilot, including when he prompted Copilot with the comment / sparse matrix transpose in the style of Tim Davis /."

Copilot regurgitates code and blatantly violates licenses, not even sure what there is to argue about. Not only does it seem straight up illegal and sideline open source communities, I think the next logical step of this is that people who want to avoid having their work vacuumed up and their rights violated simply to move to proprietary software, which would be a huge disaster for open source.
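For context, the function at issue implements the textbook counting-sort transpose of a compressed sparse matrix. A generic CSR sketch of that classic algorithm, written from scratch here rather than taken from Davis's copyrighted CSparse code:

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix via counting sort: count the entries
    destined for each column, prefix-sum to get the new row pointers,
    then scatter each entry into its slot. O(nnz + n_cols) time."""
    nnz = indptr[-1]
    t_indptr = [0] * (n_cols + 1)
    for j in indices:                 # count entries per output row
        t_indptr[j + 1] += 1
    for j in range(n_cols):           # prefix sum -> row pointers
        t_indptr[j + 1] += t_indptr[j]
    t_indices = [0] * nnz
    t_data = [0] * nnz
    next_slot = t_indptr[:-1].copy()  # next free slot per output row
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            q = next_slot[j]
            t_indices[q] = i
            t_data[q] = data[p]
            next_slot[j] += 1
    return t_indptr, t_indices, t_data
```

The algorithm itself is standard textbook material; what Copilot reproduced was Davis's specific expression of it, which is exactly what copyright covers.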


Copilot is trained on and returns AGPL code verbatim. It’s game over. If these licenses are not enforced it defeats the entire purpose.


That's a problem of the licensers, not for Microsoft or the CoPilot users.

If you released AGPL code but never intended to sue anyone, why did you release it like that?

And if you did intend to: someone is able to use your code without any damage to you, without reputation loss, and in a way that gives them access to the innocent-infringer defense (after you overcome fair use, after you sue them).

How is that game over?


So suppose you go out and about and a Microsoft representative punches you in the face. Now, the Microsoft representative has a billion dollar corporation backing him, willing to defend him at all cost through every institution, while you're just John Doe who went on a trip.

If you ever went on a hike but never intended to sue anyone, why did you go out in the first place?

And if you did: someone is able to punch you in the face without any lasting damage, without reputation loss, and in a way that gives them access to the myriad legal defenses you couldn't come up with if you tried, after you sued them.

How is that game over?

Just because some corporation is, because of its sheer size, above the law (as far as a John Doe is concerned, anyway), does that make it right? We could probably do away with laws at that point and just accept getting punched in the face by Microsoft whenever they feel like it as the new reality.


You have gone off the deep end, please return to sanity.

The court system is the method of enforcement for copyright.

If you want the "right" in copyright, you have to sue people.

To sue people, you need to find infringement. That infringement must be above fair use.

However - if the infringement you find is so minor that you have no loss of revenue or reputation, a court will not award you damages, and may even dismiss the case.

Nobody has any copyright without suing people, there is no copyright police in the general case.

Microsoft has no special rights from its size. Its size makes it a target; it's not beneficial. It's why they have so much trouble with internal rules about the GPL. If I infringe on your copyright, the damages will be zero or low; if Microsoft infringes your copyright, the damages could be millions, with the same burden of proof.


Microsoft is using AGPL code, therefore their code is subject to it. A lack of damages doesn't give one the right to ignore licenses. If it does, like I mentioned, it defeats the entire point.


The vast majority of GPL violations are not enforced, because those who would want to enforce them are small and their opponents are big.

For Copilot to blow up, it'd need to be licensed code from a big company demonstrably turning up in a product of a competitor, or some similar event.


It might be the case that it is fair use to train the model on public data, but the code which it produces is covered by AGPL. Github limits liability in its TOS.

(I am not a lawyer).


Folks really should take their GPL code to a platform with similar ideals and stop propping up Microsoft GitHub.


A bit of a controversial opinion: to those who are defending Copilot saying it "boosted my productivity" and would miss it if it were discontinued: maybe you are not a productive developer to begin with. I fail to see how searching for the same snippets on Google or saving commonly used macros in your favorite editor would not yield the same amount of productivity. I have used Copilot for several months and I actively stopped using it, because I was afraid I would become dependent on it and that it would actually reduce my ability to do critical code-building. I'm happy without it - sure, it takes some microseconds more to type out my code instead of autogenerating it, but I feel much more confident in my own coding skills.

CoPilot is a great research work - it is indeed spectacular to see how pre-training can achieve such impressive code completion results. However, in my honest opinion, it should not be a tool for a serious developer.


To set people's expectations: it is likely to take a bunch of lawsuits and a bunch of cases here to get anywhere useful. The problem with lawsuits on copyright is that they are rarely precedential. I get that what people see is the large cases that try to tackle big topics. But for every single one of those, there are probably 10x or 100x equally large cases that did precisely none of that.

This is particularly true of fair use; it is very fact-specific. A court is much more likely to answer a very fact-specific question about Copilot, tied to the very specific facts of the case (i.e., how is this exact thing used, etc.), than broader, abstract questions.

In fact, standard Article III courts in the US are literally not allowed to issue advisory opinions.


Oh man. I want to continue using Copilot. It has improved my productivity and made me excited to do things that previously felt like a chore.

Also, fellow programmers, please do not hinder other programmers' work. If you do, someone higher up the ladder will eat your cake at every opportunity.


Yes, we’d all find our work easier if we could just steal other people’s work.


You're probably stealing other people's code daily, willingly or not. Should we run a plagiarism scanner on all your code?


Yes, I want repositories / libraries that steal code taken down, just as GitHub Copilot should be, so I don't unwillingly steal code.


I've been trained on open source code, and there are likely many algorithms that I've internalized that are very similar to the "standard" way of performing an operation.

Is there a reason why an AI being trained on the same open source code isn't a similar situation? I agree that wholesale pasting of code chunks is an issue, but that hasn't been my experience with Copilot.

I'm not arguing for Copilot here...I'm genuinely curious why this would be considered any different.


>Is there a reason why an AI being trained on the same open source code isn't a similar situation?

You are a human. You know what's right or wrong. You know you can't just copy code 1:1 from public repositories without respecting their license. The AI doesn't know and doesn't care. It's a common problem with creative AIs that they will occasionally regurgitate near 1:1 copies of their training data, and I don't think it's an easy problem to solve.

>I agree that wholesale pasting of code chunks is an issue, but that hasn't been my experience with Copilot.

The article provides several examples of it happening. Just because it hasn't regularly happened to you doesn't mean it doesn't happen.


I don't deny that there are examples that appear to be wholesale copying, and that is definitely an issue to be addressed. No doubt.

What I don't understand is why the rest of the service (where it doesn't appear to be pasting existing code) is being maligned when it behaves like a more powerful version of autocomplete.


Humans can reasonably distinguish between when they are plagiarizing and when they are just applying their experience and knowledge. Copilot presumably isn't able to make that distinction, and, arguably, neither are the consumers of Copilot's output.


Does reading software code count as "using software"? I personally don't consider myself subject to a license when I'm reading public code on GitHub. GitHub Copilot and Codex AI seem to be doing nothing more than reading a bunch of source code, not reusing that code to incorporate its functionality into a different product.


Did you read the part of the article where it states that it has been shown that Copilot copies verbatim large sections of copyrighted code, stripped of all attribution, depending on the prompt provided?


I wonder how much code like that floats around stackoverflow


So happy to learn of this, and I wish them the best of luck in their efforts. And I'm surprised to find so many people clinging to Copilot.

We shouldn't shed any tears for a megacorporation which shows such blatant disregard for the licensed works of people's labour.

Yes, AI is here to stay but we should be able to build AI that respects copyright. Yes, it's easier to just steal data and call it fair use. Whether or not that's stealing will be interesting to try in court.


Foolish take. If ML training is not fair use then all ML progress is dead in the water.

ML training is akin to reading or learning, and licenses do not apply to that.

You’re not thinking past “megacorp = bad”.


AI progress won't be dead in the water if it respects copyright laws. Yes, being free to just freely grab any data is infinitely easier. But having to rely on properly licensed datasets or asking users for consent should be the norm for ML development IMHO.

Also, if we had trained some AI on the Windows codebase and started freely using suggestions given by it, I bet Microsoft would scream copyright infringement in a heartbeat.


Two things.

First, it would be nice to have a copilot variant that searched only my own work, so I wouldn't need to grep through other code I've written to get a reminder of how I solved a problem in the past.
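A "my own code only" assistant could start as nothing fancier than a regex search over your own source tree. A minimal sketch, assuming your projects all live under one directory (the function name and extension list are mine):

```python
import os
import re

def search_my_code(root, pattern, exts=(".py", ".c", ".js")):
    """Walk my own source tree and return (path, line_no, line) for
    every line matching the regex -- a poor man's personal Copilot."""
    rx = re.compile(pattern)
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for no, line in enumerate(f, 1):
                        if rx.search(line):
                            hits.append((path, no, line.rstrip()))
            except OSError:
                continue  # skip unreadable files
    return hits
```

A real tool would add ranking or embedding-based retrieval, but even this covers the "remind me how I solved this before" case without touching anyone else's code.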

And, speaking of the past ...

Second, I am old enough to have seen slide rules being replaced by calculators. This was a great addition to the toolbox, but it also had its downside: I've seen many students who have very clouded notions of significant digits, and many more who get quite confused with where to put a decimal point, when I ask them to compute something simple by hand.

Similarly, coding has been transformed with the advent of Stack Overflow-like systems. There are two communities of coders now: those who learn a language and then can solve problems based on a solid foundation, and those who shorten the learning phase and code by web search. The latter, it seems, are in danger of creating code that is brittle, limited, or downright wrong.

To the extent that copilot amplifies this habit of searching instead of thinking, I think it may lead to unreliable code.

So, sure, there are copyright issues. I think they have been well-discussed here and elsewhere. And courts may weigh in with new ideas. But my concern is with the reduction in code quality that may ensue. I'd love to see a discussion of the groups that are using copilot. If they are working on something I don't care about, then this is just a copyright issue. But if they are working on the "smarts" behind drug discovery, the control of dangerous machines, etc., then we have another issue, besides copyright.


For the first one, you might want to look into Tabnine https://www.tabnine.com/


>Copilot introduces what we might call a more selfish interface to open-source software: just give me what I want! With Copilot, open-source users never have to know who made their software. They never have to interact with a community. They never have to contribute.

>Meanwhile, we open-source authors have to watch as our work is stashed in a big code library in the sky called Copilot. The user feedback & contributions we were getting? Soon, all gone.

I don't see how you square the above complaint with this:

> First, the objection here is not to AI-assisted coding tools generally, but to Microsoft’s specific choices with Copilot. We can easily imagine a version of Copilot that’s friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus.

Is an AI that was trained on opt-in or paid-for training data any less damaging? How would these choices have alleviated the problems described above?


I do have to wonder if Copilot will last. It's going to become a legal minefield, and I can't imagine for a second that Microsoft will want to be in the crosshairs for another antitrust case.


IMO, it will last. It's impossible to roll it back.

IMO, if the lawsuit gets to a point where it's likely to be won by copyright owners, OpenAI could do the following:

- Use less sensitive code from big corps they have partnerships with for training. I bet MS and others have plenty of such code.

- Buy training rights from copyright owners of OSS projects. Many of them have CLAs which allow the owner to do much more than the license allows.

- Buy rights to train on code, and collect code generated with Copilot, from a large number of smaller software companies, likely with exclusions for some sensitive parts. MS has a lot of leverage here (discounts, partnerships, etc.).


Yes, because it turned out so badly the last time: Microsoft went from being one of the three most valuable companies in the US in 2000 to being one of the three most valuable companies in 2022.

Also back then, Microsoft had 90%+ share of the PC operating system market and was bundling IE in its operating system. I’m glad the DOJ forced MS to change its ways.


So they turned around and started buying everyone else. GitHub, Nokia, Activision. They're back to their old shit.


GitHub doesn’t have a monopoly on “hosted git repositories”.

By the time MS bought Nokia, it was already a has-been in mobile, the acquisition was a total failure, and the game market is competitive.


Great, I hope it is tried in court. It should be. But unfortunately I have not a big hope that the courts will come to understand the issue well enough.


Many courts, especially those in the Northern District of California (where a case would likely end up litigated), are very proficient and literate about software and copyright law. See Judge William Alsup’s cases if you want to see some examples that illustrate the court’s competence. And these judges frequently have technical consultants on staff to assist with technological issues.


Ok, well, that sounds nice. I have little to no insight into how courts work in the States, so I was talking about my experience of the court system at home :)


If there were no copyright problems, then why didn't Microsoft train Copilot on its own source code, like Windows, Visual Studio, SQL Server, etc.?


Brilliant point. I wonder if that can be used in any legal arguments.


If you're against Copilot as developer, you're shooting yourself in the foot.

Locking up code under non-permissive licenses stymies the pace of code development and increases the costs of progress dramatically.

We all stand on the shoulders of others before us. Including the organisations that stand to benefit the most from aggressive licensing.


Not my problem.

I put in my time and effort to open-source a program for free, and I want to make sure that my code creates an incentive to create more free software, by using a copyleft license.


Oh, so you're saying Copilot doesn't actually go far enough. It should not only give you code snippets but enforce particular licensing of the code it is helping you create?


If it feeds you code made available under the GPL, it should probably tell you that your code needs to comply with the GPL to use that snippet, yes?


Copy & paste isn't "standing on the shoulders of others". It is more like being an intestinal parasite.


Not trying to overly advocate for copy-pasting here, but isn't copy-pasting just the ugly child of calling a library function? If it's a blind copy-paste it's pretty much the same effect. Surely you wouldn't call using a library being an intestinal parasite?


Isn't copying a poem and releasing it under your own name (i.e, stealing it) the same as referring to it in a footnote, using proper attribution?


If Copilot was just a "copy and paste" tool very few people would find it useful and you wouldn't be here whining. So don't worry, Copilot is not just copying and pasting.


Critics of intellectual property theft are "whining". This is useful information in the next Microsoft IP lawsuit.


Yes. And I disagree there is any theft here.


I wonder if people realize that letting GitHub train Copilot on your open source contributions effectively devalues your own time, which (if repeated at a larger scale) devalues your experience, and eventually reduces the correlation between your experience and your salary.

For example, if an overseas firm can just as easily use Copilot as I can write original code (or use Copilot myself), why would any company hire me locally?


Open source? They used everything on GitHub with no regard for licenses, which would have included plenty of code under conventional copyright. Microsoft is now profiting from that code.


The example given is "sparse matrix trans­pose in the style of Tim Davis", but someone who wanted something with such specificity would be able to just take it from Github anyway, perhaps with a little more searching.


And would therefore have to follow the license of the code they took it from. That's exactly the point. Copilot is reproducing the same code but without the license.


A simple search on GitHub reveals that those functions have been reposted verbatim thousands of times; most people just copy and paste snippets of code they find useful, ignoring licenses. This highlights how all the power a license promises to hold is completely fictional. Any "in the style of Tim Davis" modifier only shows some kind of unwarranted self-importance complex on the part of the guy, thinking his style is widely known and distinctive (it's not). It's not the job of Copilot, the team that builds it, or the programmers that use it, to determine where functions that were reposted thousands of times under all kinds of licenses originated.

This is the same case as with copyrighted photos in newspapers: a paper prints a photo somebody allowed them to use, but then it turns out that person did not have the right to it in the first place. That did not stop newspapers from printing photos.

Here are the search terms: https://github.com/search?q=cs_transpose&type=Code
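For anyone curious what the contested routine actually does: `cs_transpose` transposes a matrix stored in compressed-sparse-column (CSC) form. A minimal sketch of the standard counting approach, written from scratch for illustration rather than taken from SuiteSparse, might look like this:

```python
def csc_transpose(n_rows, n_cols, colptr, rowidx, vals):
    """Transpose a matrix stored in compressed-sparse-column (CSC) form."""
    # Count the entries in each row of A; these become the column sizes of A^T.
    counts = [0] * n_rows
    for r in rowidx:
        counts[r] += 1
    # Prefix sums over the counts give the column pointers of A^T.
    t_colptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        t_colptr[i + 1] = t_colptr[i] + counts[i]
    # Scatter each entry (row r, column j, value v) of A into column r of A^T.
    next_slot = t_colptr[:n_rows]
    t_rowidx = [0] * len(rowidx)
    t_vals = [0] * len(vals)
    for j in range(n_cols):
        for p in range(colptr[j], colptr[j + 1]):
            q = next_slot[rowidx[p]]
            next_slot[rowidx[p]] = q + 1
            t_rowidx[q] = j
            t_vals[q] = vals[p]
    return t_colptr, t_rowidx, t_vals
```

The algorithm itself (a counting sort over the row indices, O(nnz + n) time) is common knowledge; a verbatim reproduction of one particular implementation is what a license governs, which is exactly the distinction the thread is arguing over.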


It does not show an unwarranted sense of self-importance on the part of Tim Davis.

Whether or not his style is widely known, his code is VERY widely used. Just look up SuiteSparse and try to find all of the downstream uses of it. It is one of the most---if not the most---ubiquitously used sets of sparse linear algebra libraries. If you do anything with numerical linear algebra, there's a good chance you at least know what SuiteSparse is, and possibly also know who Tim Davis is.

The bigger issue here is the effect this has on research. Tim Davis not only programmed this library, he did the basic research leading to many of the algorithms in SuiteSparse. He went ahead and released SuiteSparse open source, probably thinking that it would be a good deal for him, provided that its use was properly attributed. Provide a public service in exchange for attribution. This is a reasonable way to get support as an academic. Clearly he has had a large number of industrial collaborations which likely have provided him with a significant amount of funding over the years.

Speaking for myself, if Microsoft has no compunction against behaving this way, I can no longer see the point in publicly releasing research code that I develop using an open source model. Microsoft is clearly telegraphing that they don't give a f** about licensing, although whether that holds if they are litigated against remains to be seen. I think there's an excellent chance many other researchers feel the same way. If you think openness and reproducibility in science is important, this is a problem.


Please look up the license in the source code that Tim Davis points to: it just mentions that it's LGPL but doesn't include the full license text. And none of the C code mentions a license or Tim Davis.

And if you dig further, the whole repo mixes BSD- and LGPL-"licensed" packages together. It's probably best that Copilot not suggest from code that does not have an explicit license stated.

I think Tim Davis was originally complaining about the non-public sources for suggestions, which GitHub Copilot ignored.


I love that Matthew is investigating this and agree that Copilot warrants more scrutiny. His suggestions that Microsoft let developers opt-in to having source used for training purposes, to pay for source it uses, and to attribute or credit it appropriately all seem reasonable.

Can someone help me to imagine a reality in which these points are viable concerns?

> …how will you feel if Copi­lot erases your open-source com­mu­nity?

> …Copi­lot will become not just a sub­sti­tute for open-source code on GitHub, but open-source code every­where.

> …Copi­lot is merely a con­ve­nient alter­na­tive inter­face to a large cor­pus of open-source code.

> With Copi­lot, open-source users never have to know who made their soft­ware. They never have to inter­act with a com­mu­nity. They never have to con­tribute.

Is the author suggesting that Copilot will be used in place of `npm install next react react-dom` or `cargo add tokio --features full` or `raco pkg install pollen` — that developers will be content to use augmented autosuggest in place of large, well-tested, well-documented open source libraries?

Does he see Copilot's final form as some kind of AI package manager that drops a library of untested unattributed undocumented files into our projects?

Or is it more that he thinks those libraries won't exist because open source contributors will grow to feel more abused than they already do, perhaps quitting the scene or developing in private, like certain artists have already done in response to the AI art movement?

There is already such a huge disparity between paid package consumers and unpaid package contributors. I haven't seen that change since Copilot launched in beta or under general availability. I see the same ratio of help/feature requests compared to code and documentation contributions that I always have. And package usage has not declined so far for the open source things I work with.

It would be nice to learn more about the “Copilot will lead to the death of open source communities” line of reasoning — what is the author's perceived timeline to open source's decline and fall as a result of Copilot's current path?


Don't confuse what you want with what the law says.

"Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device" [https://www.copyright.gov/help/faq/faq-general.html]

A copy is made whenever that text is displayed, e.g., in GitHub's UI. Even that copy is subject to copyright.

Is there an excuse/exception? In this case, there is no "fair use" exception, because exceptions have to be litigated case-by-case to be recognized, and there are no remotely similar situations. Don't forget: Lexis is a multi-billion-dollar business built on protecting the copyright to the page numbers in the otherwise public court opinions.

Does the law actually protect people if it's too costly to enforce? Not really; hence the blasé attitude. Congress is considering a "small claims" system for copyright, to remedy the big-firm bias. [https://www.copyright.gov/title17/92appm.html]

In the ML era, data is the new gold. Many, many firms nowadays get a good chunk of their revenues from selling their private view of "public" data: Facebook, LinkedIn, credit reporting companies, ADP, etc. Microsoft has gone all-in on stealing that gold from open-source developers.

It's not just that the code replication reduces any need to get the code from the source. But removing any link to the source destroys the value most-commonly sought in open-source software: recognition.

Salaries are the biggest expense of tech companies. They do everything they can to increase labor competition and reduce reputational rents: outsource, cross-train, promote open-source (for competition) and destroy any reputation networks or systems that justify higher rates. And, of course, standardize on containerized copy-paste or AI-generated software if they can.

So, no: copilot is not legal, it's socially and economically destabilizing, and it presents structural challenges to developers.

It's not good, but most will keep using it because, although the vast, vast majority of developers are wage laborers, they aspire to be founders. They see it can make code fast, and they'll think it makes them better.


If they really don't think that they need to comply with any license, then why not include all private repos in the training set? Could it be that they're worried about legal repercussions, whereas OSS is easier to (ab)use for this purpose because there's much less legal muscle behind it?

It is also very telling that they have not included any of their own proprietary code in the training set. If it's merely suggestions that are generated, why not also train on the NT kernel? Office?


It seems like there's no good license that places absolutely no restrictions or requirements on people using your code (such as attribution and respecting patent rights) worldwide.

I want my code to be used the way people treated text in the old days. There are texts that have been rewritten, added to, and edited by thousands of people over the centuries, and yet they don't come with thousands of pages of attribution notices, because why would they?


You don't need to license it; you can just publish it with a declaration that, as the author, you are releasing your work into the public domain. However, in terms of licensing, I believe MIT is the most permissive.


MIT still requires the license text to be included with the source. Copilot, if it is not fair use, violates the license of MIT code it re-emits.


Okay, so just take "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software" out of MIT and call it the Do Whatever You Want Licence. You're not obliged as an author / copyright holder to impose restrictions on people using your work if you don't want to.

Whether Copilot is breaching MIT depends on what constitutes a substantial portion, which I am not qualified to rule on.


CC0 or unlicense, probably. Although I concede that it's hard to do in all jurisdictions globally.


Just put it in the public domain.


I think that Microsoft should train Copilot on their own code (they certainly own enough lines of code, after all). If they think that would not be fair use, then why should it be fair use to use somebody else's code?


Nobody would want to use Copilot if the quality of the code it produced were like code from Microsoft. Garbage in = garbage out.


I get the impression that many people's grievance with generative AI (text, code, images, etc.) isn't _really_ about the data provenance. Or at least, it feels secondary compared to the general disruptive nature of the tech.

If tomorrow someone released a Stable Diffusion, Copilot, etc. with the same functionality but respecting the provenance of the data (i.e., licensing), what concrete difference would this make? Programmers and other creative professionals would still (reasonably) be nervous about the implications for their livelihoods and communities.

At some point it will be possible to prompt a model for music in the style of <random artist>, and having never heard <random artist>, the model will generate a convincing emulation, based purely on statistical knowledge gleaned from millions of unrelated songs and text pairs. (I give it 5 years).

Now what? <random artist> should still be concerned (or not), but at least we're talking about the correct issue: How do we co-exist with generative models that massively disrupt/alter the process of doing creative or intellectual work?


Sometimes when working on software the goal is not to get a competitive advantage, but to promote some ideas. A copyleft license is a tool that aims to guarantee that derivatives of the work remain available to the public to read and modify (i.e., to prevent the source code of a commercially sold program from being closed).

The concrete difference you ask about is that work derived from copyleft code retains the license, so the source code can't be closed. If you scrap the license, then code written by someone whose explicit goal was that improvements to it never become closed source can end up closed source anyway.


You're missing the big picture: first they let a lot of licensing violations get littered throughout your internal code, and then they can sell you an Azure-hosted open-source licensing-annotation AI to fix it.


It's tragically beautiful how the copyleft crowd is putting so much effort into drastically expanding the scope of copyright.

"I used the copyright to destroy the copyright."

That sort of plot never works in practice.


> drastically expanding the scope of copyright.

I think you need to explain that more. The problem (or at least one problem) being explored here is that by using any code from co-pilot, you are responsible for making sure the licensing is correct. You could unknowingly be using and modifying GPL-licensed code in your non-GPL project, which is a violation if you don't publish your modifications.

We're not talking about expanding copyright, just protecting the existing copyright systems from being trampled by microsoft.


> Arguably, Microsoft is cre­at­ing a new walled gar­den that will inhibit pro­gram­mers from dis­cov­er­ing tra­di­tional open-source com­mu­ni­ties.

This is extremely far fetched.

User bases (let's avoid one of the four dirty C words) are organized around something which builds, executes, and is documented, not around searching for snippets.


I don't like that open-source code is being used in a commercial product. I feel concerned about NNs learning stuff they aren't really "supposed to" learn because somebody published something by mistake a long time ago. But this general argument about reproducing copyrighted code is stupid, and actively trying to shut Copilot down because of it is why lawyers are cancer.

Basically, what Copilot (or anything like it) is supposed to do is speed up your work: ideally, to write exactly what you'd write, but orders of magnitude faster. How do you write code? Well, you may have a solution in mind; if it's something really original, rest assured, Copilot won't guess it. It can only hope to guess something that, in a sense, "has a correct answer". In fact, it does this worse than it should: graph traversals, matrix operations, and the like ought to be guessed flawlessly (in a perfect world every PL would have primitives implementing them in the best possible way, but ours is not perfect). If you don't know how to traverse a graph, you'll go and look for a reference. 15 years ago it was likely a book; later, looking it up on Wikipedia or StackOverflow became way more likely. For the last 5 or so years, literally searching for it on GitHub has been viable because of better search engines and the sheer size of the site.

Now, suppose I find a matrix transpose function in an open-source project which I cannot include as a library for some (usually technical, but maybe not) reason, so I memorize it, close the page, and re-type it in my IDE. Do I have to be restricted by its license? Doing it that way is obviously stupid, so how about just copy-pasting it while renaming some variables so the teacher won't notice? Given that this is not homework, there is no teacher, and the variables are named perfectly as they are, doing that is also really stupid, so I might as well just copy-paste it. So how about now: do I have to publish my code under GPL3? Is this theft? If any lawyers say yes, fuck those lawyers. It is nonsense.


> I don't like that opensource code is being used in a commercial product.

The vast majority of open source code would be almost entirely worthless (or more likely, would straight up not exist) if it couldn't be used in commercial products.

Open source software licenses were a mistake.

Agree about the rest.


First, the author’s book Beautiful Racket is very cool, recommended.

I largely disagree with this article, at least as far as training on MIT-, BSD-, etc.-licensed code examples goes. The small autocompletions, even if they are several lines long, sort of seem like fair use to me.

I do think that Copilot should have an option to use a smaller model trained only on code with very liberal licenses, because I think the use of GPL-, etc., licensed code is problematic - at least for me.

For what it is worth, I have a lot of Apache 2 licensed repos on GitHub (largely examples from my books) and I am pleased if my code contributed a small bit to the Copilot training data. I also publish my recent books under Creative Commons licenses that allow reuse, even commercially: basically, anything I do that might help someone, I am all in for sharing.


Sharing is distinct from attribution. Are you okay with your code being reused without attributing it to you? If yes, then why have you published it under licenses that explicitly require such attribution?


MIT requires attribution, which copilot does not seem to include in the cases where it fully reproduces existing code.


It seems that Copilot could address this issue by searching its source repositories for matches to the strings it generates, with appropriate criteria, and, for cases where a match length exceeds a threshold, giving the user a link describing the origin of the code, who wrote it, and what its license is. So you wouldn't just get the Quake fast inverse square root routine; you'd get a pointer to the Quake repository and the license info it came with. A separate model could be trained up to find the closest match in source code repositories. A user could then use Copilot safely, attribute code correctly, and avoid code with incompatible licenses.

This would be a better approach than "shut it down".
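A rough sketch of how such a match check could work, assuming (hypothetically) a shingle index built over the training corpus. The five-line window and the hash scheme are arbitrary choices for illustration, not a description of anything Copilot actually does:

```python
import hashlib

NGRAM = 5  # flag matches of 5+ consecutive non-blank lines; the threshold is arbitrary

def line_shingles(source, n=NGRAM):
    """Yield a hash for every n-line window, normalized for surrounding whitespace."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    for i in range(len(lines) - n + 1):
        window = "\n".join(lines[i:i + n])
        yield hashlib.sha1(window.encode()).hexdigest()

def build_index(corpus):
    """Map each shingle hash to the set of (repo, license) pairs containing it."""
    index = {}
    for repo, license_name, source in corpus:
        for h in line_shingles(source):
            index.setdefault(h, set()).add((repo, license_name))
    return index

def find_origins(generated, index):
    """Return every (repo, license) sharing an n-line window with the generated code."""
    hits = set()
    for h in line_shingles(generated):
        hits |= index.get(h, set())
    return hits
```

On a hit, the tool could surface the repo and license next to the suggestion rather than suppressing it, which is roughly the attribution flow described above.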


This does more harm than good. If this sets a precedent, then things like Stable Diffusion will also be illegal, since they're trained on public data. OP just wants to make money from Microsoft using fearmongering and a false sense of righteousness.


From all the discussions, it seems people are rooting for an MPAA-like organization and a ContentID-like system for code.


Abolish all copyright. We're all happily pirating movies and music but code is for some reason sacred.


FWIW, not all of us "happily pirate movies and music".

I want there to be more good music and movies. I want to support artists who create entertainment I enjoy. I go out of my way to buy physical copies of music from artists, wherever possible from the merch table at their shows or from their own websites. I pay to go see movies on the big screen (partly because I like the big screen cinema experience, but also because I understand "opening week revenue" is a key performance indicator for the success of a movie).

I think copyright is old, outdated, and probably not really fit for purpose for forms of creative work invented in the last 50 years. But I also think creative workers need to get paid for their effort (just the same as software developers), and absent a FAANG-style set of companies employing teams of songwriters, musicians, authors, and the like on FAANG-style salaries, copyright seems to be the option that is working (however badly).

I'll join your "abolish all copyright" crusade as soon as there's an alternative that's at least likely to work as well as (or better than) the system copyright allows. Just abolishing copyright and erasing the publishing/music/movie/art industries without a transition plan isn't a thing I can support. (At least a transition plan for the artists/editors/producers/writers/etc. I'll admit there's a large chunk of management and legal in the fairly abusive parts of the music industry I wouldn't shed a tear over if they all became homeless and destitute overnight...)


> We're all happily pirating movies and music but code is for some reason sacred.

Speak for yourself. I pay multiple streaming services, music and video, because I prefer creators be able to eat.


As someone who works at a streaming service: thank you.

We are people, we have ambitions and families. We're not just a faceless corporation.


Maybe MSFT should have one instance of Copilot for each common license, and then the user gets to pick which licenses they want to deal with when using Copilot. If you're writing code for a BSD-licensed codebase, you might accept Copilot trained on BSD- and MIT-licensed code, as well as any other license that's compatible with BSD. If you're writing code for a proprietary codebase you might want to exclude Copilot trained on any copyleft licenses. And so on.
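The filtering itself would be trivial once you commit to a compatibility table; the hard part is the table, which is a legal judgment. A sketch, where every entry in the table is an illustrative assumption and not legal advice:

```python
# Hypothetical map: target license -> corpus licenses assumed safe to train on.
# These entries are illustrative assumptions, not legal determinations.
COMPATIBLE = {
    "Proprietary": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "Unlicense"},
    "BSD-3-Clause": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Unlicense"},
    "GPL-3.0": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0",
                "LGPL-3.0", "GPL-3.0", "Unlicense"},
}

def select_corpus(target_license, repos):
    """Keep only the repos whose license is assumed compatible with the target."""
    allowed = COMPATIBLE.get(target_license, set())
    return [name for name, lic in repos if lic in allowed]

repos = [("a", "MIT"), ("b", "GPL-3.0"), ("c", "Unlicense"), ("d", "SSPL-1.0")]
print(select_corpus("Proprietary", repos))  # ['a', 'c']
print(select_corpus("GPL-3.0", repos))      # ['a', 'b', 'c']
```

One model (or one training corpus) per row of that table is essentially the proposal above: the user picks the target license, and only compatible code ever feeds the suggestions.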


Don't give them a more complicated problem. It seems like they're already struggling with any distinction between code they can and can't use for this. :|

Which is ironic given that this is Microsoft. Whatever happened to "don't use programmers' code without paying them" and the whole "proprietary software is better because it sustains the programmer"?


Here's the thing about GitHub that most people do not realize. I find this funny because, for all the talk of following license agreements, very few have taken the time to read GitHub's terms of service.

from their terms of service: "Short version: You own content you create, but you allow us certain rights to it, so that we can display and share the content you post." emphasis mine.

that's what they call the "Short version" of the following paragraphs, which are found here: https://docs.github.com/en/site-policy/github-terms/github-t...

they allow themselves the right to display content you upload to others. GitHub does not seem to really put a cap on that in terms of what intentions it needs to have or for what purposes it needs to share your content.

this seems to me that, by putting your code on github.com, you are granting GitHub license to show it to others. period. IANAL, but it seems like all code anyone puts on github.com is dual-licensed, at least: GitHub gets its own rights to your code.

I read this before I signed up, and while I can't remember if this exact passage was present at the time, I was ok with everything GitHub wanted at the time, and I continue to be.

githubcopilotinvestigation.com doesn't seem to have much hope of doing anything except getting people mad. but you all were already mad anyway, weren't ya?


> but you all were already mad anyway ...

This seems to be the line of argumentation agreed upon by several waffling pro-GitHub posters. Many comments have some variation on that diversion from the issue.

> GitHub gets their own rights to your code.

This is preposterous and false. GitHub has the right to display the entire work, properly attributed and licensed, to others.

No new licenses are given, no dual-licensing takes place, no code-laundering is permitted.


> No new licenses are given, no dual-licensing takes place, no code-laundering is permitted.

I suggest you read the terms of service again.

here, I'll link directly to the license grant: https://docs.github.com/en/site-policy/github-terms/github-t...


I suggest you take your own advice:

    This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


There's been a lot of discussion around licenses but I'm not even sure if they matter for Copilot. I was reading their terms and conditions and there's a paragraph that basically says they have the right to display and share your code with other users. So even in the case where people are directly prompting Copilot with specific function names, I think the terms and conditions still cover them.

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


How can I, as the lead of a small team, make sure none of my code ends up on copilot (or any other submission of our IP to third parties)? We use Devops internally, and IDE decision is up to the developer.

I'm unsure if VS Code etc. submit samples or just interact with GitHub.

Edit: and furthermore, make sure it doesn’t import code from third parties. I don’t want my code being infringed upon, but also don’t want to accidentally infringe on others’ work. Legal or not.


In countries where there is no fair use (most of the world outside the US), it seems quite likely that Copilot is willful, commercial-scale copyright infringement.


Fair use is unusually permissive in the US, but most countries have very complex copyright rules to allow e.g. a televised interview in a room with contemporary paintings, without getting permission from the copyright holders of those paintings. It'd certainly make for interesting cases.


I wish code didn't have any copyright at all. It should just belong to our species for the benefit of our species. If your entire business model depends on having some private code that you lord over, versus, you know, having some expertise in the field you are in and the ability to generate more code to solve ongoing problems, it seems like you are structured on shaky ground to begin with.

For example, there are plenty of academics these days who are at the tops of their fields and open source all their code. They end up considered experts not because of a black-box code base they apply to problems, but because they can think of potential solutions to the problems at all; writing up some code is just one of the tools used. The code is a shovel or a hammer, not the one wielding it. They have competitors too, of course; it's just that the secret sauce isn't the code but what goes on in your actual brain.

It's too bad most business leaders fail to understand this, and think it's a black-box code base that makes a decent business. It's the ability to solve problems that matters.


This Copilot saga is another good reminder of why nothing is free. Developers have been using Github for free for years - now the chickens have come home to roost. The copyright licenses are just a formality - a form of kayfabe. If you aren't hosting your own code (GNU style), you should assume Microsoft owns it, for all intents and purposes.


Those who want to insist there are no instances of infringement or evidence thereof should take a look at this link first.

It's face-saving.

https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...


It makes sense to copyright a book, but it doesn't make sense to copyright a phrase (unless you are using it as a trademark, motto, or something like that); normally phrases are free for anybody to re-use. It makes sense to copyright a program, but it doesn't make sense to copyright a piece of code.


Oh my god, round and round on this topic. Leave it alone. Copilot is an amazing tool and a demo of what AI can do. I will happily pay for good ML products, which are notoriously hard to monetize.

Copilot may produce results from the training set, but if you're letting it do that, that says more about you than about copilot.

All of these claims use the example "Write me a function to foo the bar that takes baz as an argument". If you prompt it to write entire functions and classes for you, then it will lean on its training set.

But if you actually just write code, then it will complete small single lines in exactly the style you've previously written, with code that is unique to your program, because it can synthesize new code.

In this role copilot is no different than a search engine. By prompting it lazily, copilot isn't the one stealing the code, you are.


This reminds me of pirating music. Lawyers tried futilely to stop it, but if something is technically possible people will find a way to keep doing it. Maybe you set some legal precedent on fair use with AI, but it won't prevent the real world usages if there's a benefit to the technology.


Lots and lots and lots and lots of people confusing copyright (an inherent property right granted and protected by the government) and license (a privately granted privilege to use). Butterick—who is no IP fundamentalist, just go look at the license he used for his typefaces—is doing two things: looking at the enforcement of open source licenses so that they are not invalidated by nonenforcement and, related, asking Microsoft to respect the community. I didn’t see him suggest that Copilot is bad or should be shut down, just that they play by the rules. A lot of the reactions here echo a lot of non-developer middle managers who insist that open source code is free and freely usable by anyone for any reason, which simply isn’t the case if FOSS licenses have meaning and value.


1. New player shows up, changes the value chain, and creates abundance.
2. People who benefitted from the old value chain whine.
3. New player throws them a bone with a small fund or maybe a settings checkbox, but doesn't really change.
4. (A few years later) no one cares about the kooks who whined.

I’m not even 30 yet and I’ve seen this happen again and again - it’s frankly boring at this point. We’ve seen this with Spotify and music, newspapers and the internet etc.

The practical truth is that Copilot is a useful tool for humanity to have. It is exceedingly unlikely it will be stopped because a small percentage of programmers - themselves a small percentage of people who benefit from code - feel their interests have been hurt. Change or get left behind (but make sure to enrich some lawyers on a pointless suit in the meantime).


There's a big difference between learning and memorizing.

If the AI is "learning" how it works by studying public code then using its knowledge to create, that's okay.

But if it's just memorizing code and reciting it back, not okay. Just like if a human were doing this.

Of course we don't currently have ways to know the difference [that I know of] since AI is a black box.

Interestingly, current AI is not capable of truly understanding how code works and how it will execute, so it has to learn in its own way. I suspect it can learn what valid syntax is, but I doubt it is aware of how the code will execute.

It's possible this is just a case of Overfitting. https://en.wikipedia.org/wiki/Overfitting
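The memorizing-versus-learning distinction can be made concrete with a toy sketch (purely illustrative, and nothing to do with Copilot's actual architecture): two "models" trained on the same data, one of which memorizes it verbatim and one of which learns the underlying pattern.

```python
train = [(1, 2), (2, 4), (3, 6), (4, 8)]  # samples of y = 2x

# "Memorizer": stores the training set verbatim.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x)  # reproduces training data exactly, knows nothing else

# "Learner": estimates the slope from the data (least squares through the origin).
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def learner(x):
    return slope * x

# Both are perfect on the training set...
assert all(memorizer(x) == y for x, y in train)
assert all(learner(x) == y for x, y in train)
# ...but only the learner generalizes to unseen inputs.
print(memorizer(10))  # None: pure memorization, like an overfit model
print(learner(10))    # 20.0: learned the underlying rule
```

An overfit model sits at the memorizer end of this spectrum: great on the training data, useless (or, in Copilot's case, verbatim-reciting) beyond it.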


BSD 5-Clause

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the organization.

4. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

5. Use of this source code for the research or training of machine learning models is permitted.


The issue with copilot is that it is not respecting clause 1.


Only if you agree it is "redistributing" anything!

Small enough pieces of code can't be copyrighted. No one would support an argument that I violated copyright by using the code "else if {" from some GPL library.

So the question becomes what is the minimal unit of copyrightable code? What if you wrote a nice big function exactly (or almost exactly) the same way as someone else did? Whose copyright are you violating?


So what? You can put on a cowboy hat and larp as one, but that does not mean everyone else around you has to take it seriously. Same with these made-up licenses. If it's on the internet, it belongs to all. Or else keep it to yourself.


So, if you are not compliant with the posted license, you are in violation of copyright.

This is no different whether you are honkler or Microsoft (other than in how vigorously or not someone may enforce it).

If you really believe that anything posted to the internet ‘belongs to all’ then I don’t know what to tell you other than you live in a fantasy land where Oracle Corporation does not exist. We might all prefer it if things were that way, but they simply aren’t, and that’s just tough.


Funnily, for an article all about copying, everywhere the author writes "Copilot" it appears as "Copi lot" in text browsers: the word contains a soft hyphen (U+00AD, whose UTF-8 bytes appear as octal 302 255 in the dump below). The HN title is the same (check it in a hex dump).

For example, from TFA file:

0005e10 o f C o p i 302 255 l o t


Has anyone spotted licenses in the wild that specifically prohibit AI tools like Copilot?


Copyright only covers the expressive parts and not the utilitarian parts:

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

https://en.wikipedia.org/wiki/Idea–expression_distinction

https://h2o.law.harvard.edu/cases/5004

Most of your code is probably not subject to copyright in the first place, regardless of license.


Doesn't Copilot reproduce the exact expression given the right prompt, though?


Even if it does, it may not matter. For example, APIs are not copyrightable (see Google v Oracle), and if there is only one obvious efficient way to make something work, it does not follow that the user must be prohibited from using that way even if someone else did it first.


“Expression” in the creative sense, as opposed to utilitarian in a functional sense.

Copyright is meant to protect “useless” things like poetry and music.


So does a random number generator.


Reality: *GPL licenses are proprietary licenses.

I hope Copilot and similar technologies weakens the copyright establishment.

Do Business WITHOUT Intellectual Property - Stephan Kinsella http://www.stephankinsella.com/wp-content/uploads/publicatio...

Against Intellectual Property - Stephan Kinsella https://mises.org/library/against-intellectual-property-0


A good solution might be to add a new license clause stipulating whether the owner is okay with their code being used to train AI models.

Part of the clause would explain that if you are okay with your code being trained on, then you're also accepting that it may be copied verbatim at some point down the line during code completion.

You do get a bit of tragedy of the commons where everybody wants to use the AI model but nobody wants their own code trained on.

I don't like the idea of a world where licensing and copyright law prevents us from enjoying the progress of AI. Caveat: I am not an expert on open source.


There's no need for the repo owner to do anything: they already indicate the license. GitHub even shows a simple explanation of the license on the repo's main page. GitHub has all the data it needs to respect the license. If their trained model can't reproduce the license for the repo a fragment comes from, then they've failed in their social and legal responsibilities.

I do understand how ML works. I know it's probably not possible with how it's currently done. That doesn't make it legal or ethical.

It would actually be great for everyone if it showed both the license and repo. Imagine you pull up a great function with Copilot and want to explore the source for more insights. You can't with how they've done this.


It does actually ask if you want to use your code to help train it. The problem is that even when people have said no, they're still seeing their code pop up in copilot's auto-complete.

I don't mind it using my code because in my opinion, we as a software industry are way behind on where we should be and copilot is helping a lot of developers finish their projects quicker.

That said, software licenses should 100% be respected. I would hate for FOSS projects to start being sued over code. It's not in the spirit of FOSS, but neither is stealing code. Copilot should be doing a better job excluding code and none of this would be a problem.


There's a "Pictures in Boxes" comic about the internet stealing content on his page. It doesn't name the author or link his site on the image or in text.

But since the comic is not used so the page author can comment on the comic itself, but rather to support his discussion of another misuse of IP, does it constitute fair use?

The page author is going deep on the content misappropriation theme and on what constitutes fair use, so it seems oddly ironic he'd be so seemingly cavalier about using someone else's content on that page.


I am glad all the legal bs didn't stop MS from making the product. Copilot is surprisingly effective, it truly makes life easier for me, as a developer. The fact is that if you give your code away publicly, you cannot finely control what the world does with it. If this is not acceptable to you, keep your IP private.

If these guys manage to shut down or cripple Copilot using legal mechanisms, you can bet there will be a Chinese/Russian alternative that will be even more indifferent to your LICENSE.md, and you won't be able to get it shut down using the courts.


Most of these points can also be raised against DALL-E 2, but software has one extra thorn: patents.

It's common advice to not read software patents[1], because the infringement penalties are lower if you infringed unwittingly, that is, by reinventing the patented technique yourself.

I wonder if using Copilot doesn't push the penalties back again to wilful infringement. Or worse, patent trolls poisoning the training data with patented algorithms.

[1]: https://queue.acm.org/detail.cfm?id=3489047


Has Dall-E 2 yet reproduced 1:1 anything from its training set?


I don't know about DALL-E, but with Stable Diffusion, if you type in "Mona Lisa" or "Van Gogh" you have to fight pretty hard with your prompt to NOT get near-identical reproductions of those respective works


> Why couldn’t Microsoft pro­duce any legal author­ity for its posi­tion?

Absence of proof is not proof of absence.

They don't owe anyone anything beyond what they agree to provide to users of Copilot via its license agreement or to GitHub users whose code it has used in accordance with that license agreement. Those agreements define what they owe. That's it.

The only way those license agreements don't hold up in court is if they are somehow deemed invalid. I do not see Microsoft making that kind of mistake.

This website is designed to get people angry, and that's all it is going to accomplish.


Hey, I despise bait and switch from large corps. But I also find unsustainable this idea that society's legal resources should be wasted fighting over IP.

The code is out there. Millions of people are being trained and are writing code based on what they learned from open data.

Designers have "mood boards". Developers have open source. Right now I don't have sympathy for MS, but in a few years any developer could do what MS is doing with Copilot from their bedroom. Why would you care about the kid in their bedroom training an AI on free (as in public) information?


This is copyright. I put it out there with a government guarantee that I retain ownership of it. Society at large benefits because more people put their stuff out there. Break that deal and you will end up with less sharing. Why are you for information silos? I am for open ideas and sharing. You are for stealing and breaking the moral agreement, because distributing my code in the form of an AI analysis doesn't register with you as the same thing as distributing it in any other manner.


Copyright is not universal. And it will increasingly be less so.


You will then increasingly get LESS sharing and MORE silos. There is a reason copyright laws exist: they are a NET PLUS to society. It's not just 'for the benefit of EVIL corp'.


Maybe I will start writing open source code intended to trick Copilot. Stuff that just about works in the given context, but will fail badly if copy-pasted into another program. Imagine if we all did that.


To stay sane: for myself as a developer, I consider GitHub Copilot a (much) faster google/code search workflow. I can copy or remix code I find in a google search, but it's my responsibility to figure out the copyright situation of that code.

Imagine if something like google didn't exist, and then it suddenly did. People would be saying: "This newfangled computer algorithm is giving everyone copies of my code with a misattributed licence, just by typing the function name and site:github.com !"


It's too bad we can't experiment with interesting things like Copilot without worrying about remuneration and the respecting of rights. But that's the way of the world - we must think of these things. MS/Github should give code copyright holders a simple and easy way to opt-out of contributing their code to the Copilot corpus. Currently the only way to opt-out is to make your repo private. That's not good enough.

It would be better, of course, if Copilot was opt-in, but they'd never go for that.


They have done so already with their license and there's no legal reason for them to have to opt-out.


A bit meta but anyone know why the submission title contains unicode between various characters?

It's hidden on both Chromium/Firefox when viewing the page but when saving the page it reveals them in the text field, eg: `GitHub Copi_lot inves_ti_ga_tion`

Plugging the title into a unicode converter shows they're 'soft hyphen' characters

GitHub Copi [0x00AD] lot inves [0x00AD] ti [0x00AD] ga [0x00AD] tion

Edit: apparently they're for indicating to formatters where character breaks should be, though I can't see any consistent pattern here.
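If you're matching or deduplicating titles like this (say, for an HN search), it's worth normalizing the soft hyphens away first; a minimal sketch:

```python
title = "GitHub Copi\u00adlot inves\u00adti\u00adga\u00adtion"

# U+00AD renders as nothing (or an optional line break) in browsers,
# so naive string comparison against the visible text fails:
assert "Copilot" not in title

# Stripping the soft hyphens restores the expected text:
clean = title.replace("\u00ad", "")
assert clean == "GitHub Copilot investigation"
```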


> Arguably, Microsoft is cre­at­ing a new walled gar­den that will inhibit pro­gram­mers from dis­cov­er­ing tra­di­tional open-source com­mu­ni­ties. Or at the very least, remove any incen­tive to do so.

The walled garden bit I get. But I'm lost making the leap to "remove any incentive to do so." Is Butterick suggesting that someone is going to put aside their code and do a deep dive on GitHub looking for a snippet that might not exist?

I'm not trolling. I'm sincerely trying to grasp the argument being made.


Question for all those who are pro Copilot in this argument and are claiming fair use: do these same rules apply if I manually copy someone else's copyrighted code into my codebase?


It seems to me that in principle it should be possible to maintain attributions through the training process, so that Copilot outputs could come with a list of weighted sources, possibly discarding those that fall below a certain weight threshold. Doing so would likely be much more expensive in terms of the computational power needed for training, and probably also in the size of the model. But it would be great to actually be able to see what went into a specific Copilot output.
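A cheaper post-hoc approximation of this idea (nothing like how Copilot actually works internally) is to score a generated snippet against an index of training snippets and keep the sources above a weight threshold. A toy sketch, with a bag-of-tokens vector standing in for a real learned embedding:

```python
import math
from collections import Counter

def vectorize(code: str) -> Counter:
    # Toy stand-in for a learned embedding: a bag of whitespace tokens.
    return Counter(code.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_sources(output: str, corpus: dict, threshold: float = 0.3):
    """Rank training snippets by similarity to a model output,
    discarding sources that fall below the weight threshold."""
    v_out = vectorize(output)
    scored = ((src, cosine(v_out, vectorize(code)))
              for src, code in corpus.items())
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: -p[1])

# Hypothetical corpus entries, purely for illustration:
corpus = {
    "repo-a/util.py": "def clamp ( x , lo , hi ) : return max ( lo , min ( x , hi ) )",
    "repo-b/circle.py": "def area ( r ) : return 3.14159 * r * r",
}
```

Calling `weighted_sources` on a snippet copied verbatim from `repo-a/util.py` ranks that repo first with weight near 1.0, while loosely related code scores lower; a real system would need proper tokenization and an approximate-nearest-neighbor index to scale.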


All this just shows one thing: copyrighting / licensing "code" is meaningless... but of course that was already known by all those people who think that the US laws on copyright should not have been propagated to the rest of the world. "Code" is merely an algorithm put to work. There should be nothing inherently copyrightable about it, no more than a chocolate cake recipe, which is just a way to put chocolate and a few other ingredients to work.


Do the same copyright issues arise with AI-generated videos learned from Shutterstock?

https://news.ycombinator.com/item?id=33239706

https://waxy.org/2022/09/ai-data-laundering-how-academic-and...


Most of the controversies posted on HN end with a feeling that nothing will change, because only we, the tech community, know the details of the issue, and we are too few to have an impact.

But this one solely affects a product where we are the target audience; if we oppose it, things should change. Now I wonder whether it will show that we actually care enough to act, or whether we are just as much regular consumers as non-tech people are in all the other cases.


An interim update to Copilot could link to where the code was pulled from. Or maybe there's a way for open source devs to add a comment to the code that links to their community/repo. If it was standardised then any data gathering would need to follow the collection rule.

I agree with the article's long-term outlook about community and code quality, though it is a very long-term outlook. It makes me wonder if humans will actually still be writing code.


My biggest concern regarding GitHub Copilot is that it is cloud based and opens up our previously private coding activities to continuous surveillance by third-parties.

It's only a matter of time before intelligence agencies will get their hands on the data. And if use of Copilot becomes an industry wide practice then those who wish to preserve their privacy will become uncompetitive.

I really hope we have some decent offline alternatives eventually.


New tech creates winners and losers, and losers inevitably complain. See looms, VHS, Napster, etc. The more of this complaining I see, the more it falls flat. The only interesting thing is which side different communities end up being on.

To be fair, record companies were not in the least bit sympathetic. Open source contributors are easier to identify with, though imo it doesn't actually make their concerns more valid


What does this comment even mean?

I cannot parse what you are suggesting.


As far as I can tell it's just a more convoluted way of saying "new good, old bad, only old people disagree"


It's more nuanced.

Copilot exists publicly, which also means some copilot-lite thing trained on a smaller subset of repos probably exists privately in many different places. It may not be as good today, but these private instances will improve over time. Since the demand for a copilot-like service exists, eventually a large VC-funded public instance will show up.

In that lens, it is more sensible on the individual level to prepare for a world where copilot thrives than to put all of your eggs in the "ban copilot" basket.


He's saying you're a luddite for not wanting Copilot to steal all your code and use it for furthering the grand vision of AI generated code (which supposedly represents the forward march of progress)


“If I had asked people what they wanted, they would have said faster horses.” Henry Ford


Learned weights should be considered a derived work of all the things the model was trained on.

I think 'training an AI' is actually a distinctly new use of IP, and should probably be considered under a specific kind of 'AI-use' license. Open Source licenses should be updated to indicate whether they allow or do not allow AIs to be trained on covered work as well as the other rights they allow.


For me, as a (granted very minor) contributor to some open source, I couldn't care less about attribution. The ethos of open source is specifically about sharing stuff (probably for free) for the benefit of everyone, take a penny, leave a penny. It's more of an interesting question if Copilot is suggesting code verbatim from source-available rather than open source repos though.


Imagine you are a history or philosophy teacher in 2100.

How cool would it be to discuss these kinds of issues? What do you think of the "erasing the open source community" argument from a historical perspective? What does it have in common with the industrial revolution?

Even though the real life implications are real, I find it fascinating and not so simple to unravel.


This is such a disingenuous use of the word “investigation”.

They are “investigating” whether they should start a lawsuit. So this is not an investigation; it’s somewhere between “due diligence” and a PR stunt.

I very much disagree with the idea of a lawsuit that seeks to establish ML training as not being fair use. It is an utterly foolish thing for them to wish for.


Is GH Copilot only on public repos? My assumption is that code from private repos was also showing up. I feel like I read an HN article about this previously. Don't have evidence but that seems like a much bigger issue and trust violation if that is true.

Kinda like how gmail was reading everyone's emails and showing ads based on them.


I wonder if in court they will rule that this is no different from a human reading open source code to learn how to code. I guess the main difference is that the human cannot be used in parallel, whereas Copilot can be used by millions of people at one time.

It will be interesting to see where this goes.


Well I am "a fan" of Copilot and I do think AI is the future, but I think the author has a valid point.

I think the fair use violation he describes doesn't happen during training. I do think training AI on anything that is publicly accessible is fair use just as in an example of a person learning by reading/watching the same materials.

However, this fair use rule is being violated the moment the resulting AI starts suggesting verbatim copied code from licensed works without attribution.

So one could argue the source code is not being used in a transformative way, and that Copilot is just a more efficient method of retrieval of licensed code. This misses the fact that Copilot actually is capable of writing new code. I've used it as "an autocomplete on steroids", letting it suggest maybe half a line or one line of code at a time (or trivial stuff we automate even without Copilot, like getters/setters in Java). But when actual licensed code is suggested, yes, that is IMO a license violation.

Therefore one way of resolving this would be to pair Copilot with a tool that scanned the resulting code for the presence of licensed code and then made a list of "credits" or references. Also, measures should be taken (perhaps during training) to penalise generation of verbatim (or extremely similar) code. Would this make Copilot less useful? I'm not sure.

One thing that's not going to happen is putting tools like Copilot back "in the bottle". We now have similar models anyone can download (FauxPilot), and I, as well as many others, have found those tools to speed up mundane tasks a lot. This translates into a monetary advantage for users. Therefore there is no way this will disappear, lawsuit or no lawsuit.
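A crude version of such a scanner (winnowing-style shingle matching, not anything Copilot actually ships) could hash k-token windows of the licensed corpus and flag verbatim overlaps in generated code. All names and the corpus below are hypothetical:

```python
import hashlib

K = 8  # shingle length in tokens: smaller catches more, with more noise

def fingerprints(code, k=K):
    # Real code would use a language-aware tokenizer; whitespace
    # splitting keeps this sketch short.
    toks = code.split()
    return {hashlib.sha1(" ".join(toks[i:i + k]).encode()).hexdigest()
            for i in range(max(0, len(toks) - k + 1))}

def build_index(corpus):
    """Map every k-token shingle hash to the licensed source it came from."""
    index = {}
    for source, code in corpus.items():
        for fp in fingerprints(code):
            index.setdefault(fp, source)
    return index

def credits(generated, index):
    """Sources whose k-token runs appear verbatim in the generated code."""
    return sorted({index[fp] for fp in fingerprints(generated) if fp in index})

# Hypothetical licensed corpus:
corpus = {"gpl-repo/foo.py":
          "def foo ( a , b ) : return a + b if a else b - a"}
index = build_index(corpus)
```

Any 8-token run copied verbatim from an indexed repo then shows up in `credits`, at which point a tool could attach the license and attribution automatically; hashing means the index never stores the raw licensed code.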


The narrative here seems to be a David and Goliath story in which Microsoft profits by stomping on defenseless open-source communities. There are two problems with this story.

First, the huge majority of open-source projects are at no real risk because Copilot offers something totally different from what they offer. Open-source projects generally take highly-complex domains and expose them as simple interfaces or executable programs. This encapsulation is where the value lies.

In contrast, Copilot just dumps code. Never once doing front-end work have I thought "if only there was a way to dump verbatim React internals directly into my codebase." In general, Copilot only replaces tasks I would have otherwise done myself.

The second problem is the biggest loser if Copilot gets shut down is not Microsoft, who can easily take the loss in stride. The real loser is the community of developers, many of them bootstrapping their own projects or trying to develop open-source in their precious off-hours, for whom every minute counts, and for whom tools like Copilot can be the difference between success and failure.


I think the test for whether an AI is infringing or not should be:

Can this AI regurgitate the vast majority of the creative aspects of an original/novel piece of software with minimal prompting, to the point where the output code looks mostly and directly cloned to a reasonable person trained in the art?


_Maybe_ software is fundamentally different to other "creative works" which rely on copyright protection, but it's not immediately clear it is, and as far as I know it's certainly not a "special edge case" as defined in copyright law in general.

So "Can this AI regurgitate the vast majority of the creative aspects of an original/novel piece of software" is not the test that, for example, the music industry uses when determining if a sample is infringing. The test there is "is a sample, however small, identifiable as part of a copyright work by a reasonable person trained in the art?"

You can't own copyright in a composition of a single middle C note. But lawsuits have been won for copyright infringement of melodies of two bars (fewer than about 16 consecutive notes). Men At Work lost a copyright case over the flute melody in "Down Under", which is the same as the 90-year-old tune "Kookaburra Sits in the Old Gum Tree" https://www.claytonutz.com/knowledge/2010/february/men-at-wo...

Whether that's done by a flute player or an AI, really doesn't make any difference as far as copyright law sees things.

(Whether copyright law is a "good fit" for source code, and whether it makes sense to apply laws meant for books/literature/music/film to software is a different but very good question. I don't have much in the way of other ideas which take original author's efforts and potential rights to benefit from then though...)


FWIW, in this case I was trying to feel out what I think is a good fit for copyright in general. I think the same test could be applied for books, music, art, etc.


If this becomes illegal, it will pretty much mark the death of free/open ML and its data sets.

If you can't train on data without asking for permission first, the data set becomes sparse. The only people who will be able to afford this will be, you guessed it, established giants who can build their own sets.


Getty Images already handled this issue with graphics. Most of their catalog was scraped early on by AI art generators. 'Errbody knows this because the Getty Images watermark appears in a lot of AI generated art. Getty Images, in turn, banned the sale of AI generated art because it is legally tainted.

The same thing will happen to source code produced by AI code generators. Github itself, or some entrepreneur, will come up with a way to identify and flag projects containing AI generated code based on models constructed from open source projects, so that those derivative works will not inadvertently be incorporated into other software that is concerned with such a flag. (They probably will also come up with an NFT-based mechanism of some sort to allow open source project rights holders to authorize incorporation of their code into AI models such that derivative works containing those fragments would not be subject to flagging.)

Hey YCombinator, give me $10M to make a billion dollar company that "lives at the intersection of" blockchain and open source. (Haha, No.)


> how will you feel if Copi­lot erases your open-source com­mu­nity

How will you feel if the greed of a lawyer erases the progress of your tools?

Lawyers are a detriment to anything they touch. Letting them into software was the biggest mistake we ever made. We should have kept them away, the same way they are kept away from math.


I am feeling very greedy, but...

With all that intelligence, if GitHub Copilot can't yet produce an easy-to-use-and-manage full-stack framework with a distributed database built in, in either an existing programming language or perhaps a new one it creates itself, then it's not useful for me.


I'm against software patents for the most part.

Especially with algorithms.

I was rooting for Google when the JVM case happened, and I'm rooting for GitHub with Copilot.

And yes, there is source code from me on GitHub too — use it! I have used so much other code over the last 15 years.

Copyright on algorithms or basic code should be a no-go.


Open source has trained me and countless others. We have learned from it. Why shouldn’t machines learn from it too? Is co-pilot copy-pasting slabs of code verbatim?

I see Copilot as a net positive. Open source is for sharing and learning. Copilot is sharing and learning on steroids.


Yes, copilot IS copying code verbatim from GitHub hosted repos, right now.

The license and attribution are stripped from regurgitated copied code snippets. Verbatim with no context, no attribution, no citation, no reference to the project it’s part of…

If people don’t know which project the code was taken from, how can they one day contribute to that codebase?

Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


Oh god please no. GitHub Copilot is a wonderful technology. I am not taking anything away from you if Copilot suggests code that is similar or identical to your copyrighted code. You were not going to sell it to me anyway.

The following is supposed to be OK: somebody reads your GPLed code, learns abstract concepts from it, teaches it to me, I write code that uses the same algorithm. But it's not OK to abbreviate the process and reach the same result directly with Copilot. That is some Talmudic level reasoning. In a sane legal system, one would note that it is legal to do when jumping through pointless hoops, so it should be legal per se, and the system should be adjusted.

Copyright is increasingly at odds with technological development — not just since AI applications, but at least since Napster, or since floppy disks. Of course Matthew Butterick, as a lawyer, would disagree: "It is difficult to get a man to understand something, when his salary depends on his not understanding it."


> The following is supposed to be OK: somebody reads your GPLed code, learns abstract concepts from it, teaches it to me, I write code that uses the same algorithm. But it's not OK to abbreviate the process and reach the same result directly with Copilot.

The trouble is that this apparently is not what Copilot is always doing. If it had only "learned abstract concepts" from GPL'd (or any other form of copyright) code, then that would not be a problem, and of course that is kind-of what Copilot purports to be doing, supposedly learning the association between concepts described in comments and corresponding forms of implementation.

However, apparently Copilot is sometimes NOT generating its own code based on the concepts it has learned, but is instead just regurgitating chunks of potentially copyright-protected code verbatim. It'd be interesting to know if it is doing this deliberately (to maintain the coherence of what it is generating) or not - I guess the more of something it has already copied exactly, the more likely it is to continue copying, since that is the best "predict next word" continuation. Of course, while it would be interesting to learn more about the mechanics of Copilot, that doesn't change the legality, or not, of what it is doing, another aspect of which (although IANAL) is how much of the original work is being copied.

At the end of the day it shouldn't matter whether it's you or Copilot either learning from or copying someone else's code - exact same copyright protections apply.


Honestly, Github Copilot seems fine. It's just a tool that you're responsible for using responsibly. If I Google something, and copy and paste that, then Google is not responsible for my infringing. It's just "intelligent autocomplete".


Google search doesn't return random snippets of text without indicating their source.


I wonder what Dictionary companies thought about Autocomplete...


I’m really interested in seeing how this gets litigated. I imagine it will involve a lot of philosophical arguments about attribution and what the software is actually doing.

I’m also curious to see if/how Amazon CodeWhisperer takes advantage of this whole debacle.


Perhaps the only way out of this is to start suing the users of Copilot, much as some jurisdictions target the users of a product (e.g. drugs, prostitution) as a means to shut it down when the providers are too difficult or numerous to challenge effectively.


Sadly, I think this marks the beginning of a winner-takes-all economy fueled by AI.

Just imagine: in a lawsuit like this, OpenAI could use GPT-3 to generate eloquent court arguments, statistically confident that it can defeat human lawyers. It just comes down to TPU power.


I wonder what will happen when a company pays some overseas developers $50 for some code, they copy it from Copilot and it copies a bug from a US developer and that company gets hacked for $10 million.

Will the lawsuit fall on the overseas developer, US developer or Github?


No one? They'd probably stop doing business with the overseas developer and that's it.


Sorry to ask a shallow question. His "photo" is so interesting. It feels exactly like old school Wall Street Journal "photos" from 1990s. Is there a plug-in or service to create this type of image from a photograph?


It’s called a hedcut. WSJ built a generator in 2019, but it's only available to subscribers. [1]. There are artists that offer commissions, including at least one WSJ artist. [2]

1 - https://www.wsj.com/articles/whats-in-a-hedcut-depends-how-i...

2 - http://www.hedcut.com/


Is there a license that explicitly forbids corporations from ingesting my code and making a billion dollars off of my work for free? The AGPL? I've been using the MIT license for more than a decade, but it's time to change that.


Almost every FOSS license requires attribution, and Microsoft already seems perfectly happy to violate that, so I don't see why they'd be any less happy to violate whatever other license you'd come up with.


In this next episode of "[corporation name] seems like a cool corp but is revealed as Selfish and Malicious Inc.", we saw [corporation name] act selfishly and maliciously, as in every other episode. See you next time, kids.


> GitHub Copi-lot inves-ti-ga-tion

lol rarely see such aggressive use of soft hyphens in page titles


This investigation should not stop at GitHub Copilot; large language models trained on huge amounts of data should also be investigated, as I'm sure there are lots of problems to be found there.


Can someone clarify if copyright violation is actually considered "illegal"? As far as I know it's a civil matter and not something a state or federal government would attempt to prosecute.


It’s a crime in certain cases, as piracy site owners can attest.


But this has long been the deal. In order to offer their services gratis, Big Tech makes money on your data, which you've freely provided. Welcome to the last twenty years of the software economy?


I have a badge on GitHub showing that I am an Arctic Code Vault Contributor. Why can't Microsoft do something similar for Copilot training data contributors? That would at least be a start.


Part of me feels like this will help big tech and hurt potential startups that would compete in this space. Microsoft has the resources to make this issue “go away” while smaller challengers do not.


Maybe software engineers are worried they're being made redundant, but it is entirely fair for them not to allow their own work to be used to make them redundant without permission.


Uhm, strangely I get a Connection Reset error in the browser when I try to access the URL from the corporate network, but it works without problems from my phone


> the biggest concern of the decade is that some stupid autocomplete can violate your license which never existed in the first place



Crazy idea.. have the automatic code generator check if the code is too similar to a source it was trained on, and if so, automatically include attribution as well.

Ta da!


I have a badge on GitHub showing that I am an Arctic Code Vault Contributor. Why can't Microsoft do something similar for Copilot contributors?


Let us just work on cool technical things without having to worry about this kind of bullshit.

Knowledge data should be free to copy and do whatever we want with it


> Knowledge data should be free to copy and do whatever we want with it

I'm more of a copyleft fan. Feel free to copy my stuff, but you have to make it open source as well.


I'll scream this at the top of my lungs whenever I get the chance: If you attribute copyright to open source code you are a patent troll.


To me the whole point of open source is selfless giving and sharing. You build something and release the source code in case it's useful for whatever purpose people might have: learning, understanding, contributing, forking, copying, etc. And companies might build on it, train models from it, use it internally, who knows. Great. Other companies can do the same and compete. So can other open source projects.

For some reason when a company benefits from your work instead of some other entity that's bad? Please explain.


Because a company of Microsoft's magnitude will build a walled garden around their ecosystem over time? Haven't we seen this effect in play like a million times?


If by walled garden you mean that their service is better than competitors, that isn't necessarily a bad thing. Codex does nothing to lead to a walled garden, it is just providing a useful service and could be the spark of more competition.


ETA: honestly asking, this is just my heuristic, not something I've thought a lot about


could this be solved by MS brute-force shipping all the licenses (w/ references to their original projects) of all the repos they used to train Copilot, along with Copilot itself?

it wouldn't cover cases where people illegally copy pasted some code into their projects with dubious / not explicit licenses, but this is the same as using any open source project in general.


This is so stupid. I can’t believe how hostile this community has become toward some of the most inspiring new technology I’ve seen in a decade.


Soon machines will step up from being an aid to doing the creative work themselves and copyright will be an artefact of the past.


Would an opt-in system fix this? Where your code is only learned from if you opt into using Copilot to help you develop faster?


It seems a benchmark of a transformative technology is whether or not people attempt to use the legal system to stop it.


Copyright is the problem. The rest of this is just dancing around the legal framework built to support the bullshit.


I wouldn't be surprised if Microsoft lawyers didn't like the word "github" in the domain name...


I think the infringement that is relevant in practice comes from users of Copilot rather than from its authors.


As a joke, I made a webpage where you can do attribution to ALL GitHub repositories:

http://thanksforthecode.com

It scrolls past all the repos movie-credits-style. Doing it that way takes several days! It shows how abstract and absurd giving attribution to such a large body of works is.
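For a sense of scale, a back-of-envelope with assumed figures (tens of millions of public repos, a fast movie-credits scroll showing a couple hundred names per second) does land in the "several days" range:

```python
repos = 50_000_000        # assumed: order of magnitude of public GitHub repos
names_per_second = 200    # assumed: a very fast credits-style scroll

days = repos / names_per_second / 86_400   # 86,400 seconds per day
print(f"about {days:.1f} days of scrolling")  # → about 2.9 days of scrolling
```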


You're missing your <noscript> tag


Time for a new open source license specifically allowing fair use for machine learning?


This would for sure have effects on anything related to AI content generation.


It seems like Copilot is simply a search engine in this context. When I search GH or Google or <insert tool> I can get code snippets without seeing the license.

How is Copilot doing something fundamentally different?


Why can't Tim Davis (or another software author whose code is emitted verbatim by Copilot) demand that Microsoft take down Copilot, or at least the part of Copilot that contains his code?

Microsoft is distributing his software without a license, isn't it?


It could potentially be allowed under "fair use," which completely overrides any and all copyright claims if the conduct is found to be, indeed, "fair use."

Fair use in code is broader than just copying. For example, in Google v Oracle, APIs were found to be not copyrightable. Even if you copied the names of, say, 86,000 different functions in a proprietary library, you did not violate copyright.

Then comes the second problem. Let's say there is a function, say, `AddTwoNumbers(int a, int b)`. Just because John Fitzgerald in 1999 implemented that as `return a + b;` doesn't mean you can't too. There's a degree to which you can copy the code that made a function work, even if that code existed earlier. It's fuzzy but it is legally real.

Finally, there is your third problem, which is that you risk a "safe harbor"-esque judgement. Just because YouTube has occasional copyright-violating content doesn't make YouTube illegal. Similarly, the person suing here risks a finding that GitHub Copilot is legal as long as any occasional long proprietary code regurgitations are removed as needed.

If your code falls under the first two conditions, copyright be damned, license be damned, it's all irrelevant. See also Linux copying Unix.


There must be a line past which copying is no longer "fair use", otherwise no copyrights in code would be enforceable at all. I suppose it is up to a court to decide, but in the Tim Davis thread from yesterday it looked to me like Copilot was emitting entire, nontrivial functions verbatim.


How can I personally and proactively fight against this effort?


Is there an AI system for those dot woodcut prints, i.e. the ones in the WSJ?


I wonder if it emits stable diffusion samples? ;-)


The discussion has been "cleaned up" massively. All Copilot discussions are heavily manipulated.

I don't know why one can freely pile on, e.g., AirBNB here but Copilot is a sacred cow.


This right here is why we can’t have good things.


ITT: armchair lawyers go after GitHub


suing Github for this seems like a neat idea to make money on our open source projects


Big money here. Good luck


I find it hard to see a scenario where MS doesn’t get absolutely wrecked in court.


I feel "stealing your community" is lawyer hyperbole, but people also seem ok with what MS is doing with copilot, and I am not.

If you think what copilot is doing is ok, and there is nothing wrong with it, I'd love it if you could go through this small thought exercise, and see if it impacts your view at all:

Say you write a bunch of code, and release it under GPL. For the sake of argument imagine it is something complicated that you care about.

Now say another person is trying to do what your code does, and they find your code and say "excellent". They then copy and paste it into their project, and release their code under a BSD license instead.

Would you consider this theft of your IP? The law certainly would, and I think most devs would as well.

What would you say if they instead release "their" code as public domain?

Now we'll go a bit further. Another person is trying to solve this problem in some commercial software. They find your code, copy-paste it into their project, then sell their software and don't release the source, or even acknowledge you.

Would you consider _this_ theft? again the law would.

Now, what if instead they found your code through the invalid BSD relicense? or the invalid public domain one?

To me every one of these would be theft, and every one would be required to release the source of projects that made use of my GPL'd code, under the GPL. That is literally the whole point of the GPL.

But let's imagine a different route.

A person is writing some code and can't work out how to solve a problem, so they ask on StackOverflow. Now another person comes along and answers the question by copy-pasting from your project into SO. The first person says "yay!" and then copies that code, and we repeat the above scenarios.

In an even more extreme case, imagine both of the above people work at the same large company - so neither knows or is even aware of the other - how does this impact what is going on? It's two people, but fundamentally the company is copying the original GPL code into SO, then copying it from SO into its proprietary code.

I get that MS and GitHub try to position it as if copilot is "creating code", but it is simply doing a statistical code completion that is demonstrably happy to copy and paste from the original source into the recipient code. To my mind all it is doing is providing a mechanism to launder GPL (or whatever) code into your own without the license, by slapping "ML" and "AI" on the process and requiring more than 3 keys to be involved.


> Another person is trying to solve this problem in some commercial software. They find your code, copy-paste it into their project, then sell their software and don't release the source, or even acknowledge you.

Let's be honest, copy-and-pasting happens all the time. In software, in engineering, in marketing, in everything. Whether people acknowledge it or not.

Everyone looks at Stack Overflow all the time. You do it. I do it. Nobody reads the licensing terms. We all produce software with reskinned and taped together functions. A collage is still unique, creative, work despite being glued together with other people's art.

Most songwriters will write a song with a part like someone else's song.

The products you buy at a store are rip-offs of someone else's product.

Everybody stands on the shoulders of giants before them. Such is learning, such is life. Get over it.


> Let's be honest, copy-and-pasting happens all the time. In software, in engineering, in marketing, in everything

If I copied code without the rights to it into code I have written at any company I would absolutely be fired. It would not be up for debate.

> Everyone looks at Stack Overflow all the time. You do it. I do it.

I don't - it is very infrequently that I would look at SO answers

> Nobody reads the licensing terms.

Yes, they absolutely do, because again OSS or the GPL is meaningless if people are ignoring the license. Moreover, I would suggest you talk to your employer's legal and IP departments to let them know you're copying code you don't have rights to into their product.

> We all produce software with reskinned and taped together functions. A collage is still unique, creative, work despite being glued together with other people's art.

Wow.

Absolutely not.

Competent engineers know how to write code themselves, they aren't copy-pasting their way to a solution. That's why you get paid a lot - if I was happy with copy pasta solutions I would hire a bunch of badly performing uni students.

> Most songwriters will write a song with a part like someone else's song.

If two people write similar songs that does not mean one copied the other. If one person copies part of another person's song they will end up in court, and they will lose all the revenue from the entire work.

> The products you buy at a store are rip-offs of someone else's product.

The ripoffs that stay on the market are not made by copying the entire implementation.

> Everybody stands on the shoulders of giants before them. Such is learning, such is life.

I didn't say anything at all to imply that we didn't. What I said was you don't get to just copy other people's work and pass it off as your own.

> Get over it.

Just because you apparently can't actually write new code yourself doesn't mean that that applies to other people.

Also, you should really get your employer to tell you whether your proposal of ignoring copyright is ok.

What you are saying is that if I google some problem, and copy some code from, say, Gecko, or Linux, or GCC, etc. into my proprietary closed source product, that's perfectly ok, and those silly open source people should keep it to themselves if they don't want me doing so.

But there's also a difference between "I searched for this and copied the code I found" and "I typed some letters


Daaaamn, this post has been at the top of my feed almost since it was published.


TL;DR: GitHub (Microsoft) declared that "training [machine-learning] systems on public data is fair use". When asked for the relevant jurisprudence to support its position, it could not provide any.


Great article!


Training AI on copyrighted works is literally what Google always did.

Look how Google News enraged news orgs.

Now they come for the programmers. So now it’s a problem.


Fair use is about more than just the size of the excerpt, and even open source software still has a copyright and terms.

If you write an article about good writing, and quote a choice paragraph from someone else's work to show an example, and credit that quote, that is fair use.

Is it fair use if you read an awesome paragraph, something that really is the result of the author's unique intellect and effort and craftsmanship, and makes you think "damn", and then drop that same jewel into your book?

You can probably get away with it, because you probably just won't be able to convince a judge that any single paragraph is that big of a theft.

But I don't mean to ask if you can get away with it, I mean to ask if it should be considered fine honorable behavior.

The difference is, the paragraph isn't being included for examination or comment or transformation; it's being included to directly copy and perform its original function as part of what makes a work a great work, and it's not being credited in any bibliography or footnotes or directly.

The reader reads the paragraph and is impressed by your deep insight, which you never had, and the original author did.

How about if your new book has many such uncredited snips from other authors, such that your new work is denser and richer than any of the other individual authors?

This is what copilot is doing, or rather it's facilitating people doing it, as far as I can tell.

The original snippets are functional, not there for examination, copied verbatim, not transformed (sometimes), and not credited.

Most of it comes from open source works anyway and most authors would probably be fine with it if the stuff was simply credited.

I think as a tool, in the context of software vs literature, the tool is probably more good than bad for everyone as a whole. It probably results in the generation of more, and more correct software. Since software is more like a machine than a novel, it benefits all of humanity when machines work well.

But it needs to somehow credit the original authors, or if that's not possible then users do not get to claim credit for any work it was used on. Or, they can only claim a sort of tainted credit.

Maybe it needs a combination of policies that together make a fair system. One element would be, the training set must be composed of strictly open source software (pick some definition). Then another element would be, any work that uses it is tagged as such. You only get to say "I wrote this, with copilot," not merely "I wrote this". And any work that uses it is itself GPL. The individual snips maybe don't have to be credited, because the theory will be that the training set as a whole was credited, and those are all available somewhere. You as a contributor won't get credit for being in someone's mp3 transcoder app, but that app WILL declare that it used the training set, and the training set WILL declare all of your material that is in it.

Maybe there can be a special version that only includes code where the original terms did not require anything at all, not even preserving the authors name or the license that says it's free, and that version's output can be used without credit.

If proprietary software wants to benefit from a tool like that, they can pay for licenses from other proprietary software developers to include their software in their ai's training set, just like with normal software licensing for inclusion and re-sale in a new product.

But right now, as copilot currently exists, as far as I can tell it's blowing past and ignoring ANY considerations like that and Github are simply outlaws.


[deleted]


All class actions are a mix of both


It's hilarious how when I express displeasure about AI image generators looking likely to take a huge bite out of my profession of "artist" and playing extremely fast and loose with fair use, I get told that it's completely inevitable now and I should either retrain as a prompt engineer or go join the buggy whip manufacturers, but now that this is clearly violating programmer copyrights, you folks are starting to get angry.

I'll just leave y'all with my favorite of the things you keep telling me to STFU about art AI with: If you're the kind of programmer who feels threatened by this, then you're not a real programmer.


You're absolutely right.

Copilot and Dall-E (and so on) are all bad in the same way.

Many of us agree with you.


I'll admit it took me longer to connect the dots on this one but when I was tinkering with an image generator and it gave me a clear istockphoto watermark, I knew something was amiss.


Unless the image generators routinely generate specific works produced by you (or other artists) then it’s not a directly comparable situation to Copilot.


Like this? https://news.ycombinator.com/item?id=32573523

> I just got a Dall-E render with a very intact "gettyimages" watermark on it.


Yes, this would be a good example of genuine copyright infringement that shouldn't be tolerated.

Of course, it doesn't mean that all or even most DALL-E output infringes on someone's copyright. The same is true for Copilot. I think both have many legitimate uses if and when the "copyright laundering" issue is solved.


I'm a programmer, I don't really feel threatened by Github copilot. If the world can produce code more cheaply, the world becomes a much better place for everyone and a little worse place for developers.

Overall seems like a good trade. (for the world)


You are totally correct. I am embarrassed by programmers complaining about this.


Why would anyone want to stop Copilot is beyond me.

Reinventing the wheel, millions of times a day, is an atrocity.

Millions of (wo)man hours, wasted, every single day, on writing solutions to problems that have already been solved. There is a partial solution to this, and it's making people angry, it's crazy.

If you put your code publicly on the internet, you should expect that people will reuse your code at some point; no one broke into your private repositories.

Why would anyone waste their time to make other people waste more of their time is really beyond me.

Let go of your egos for once.


You want to use my code, without ever knowing I wrote it? You want to use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation? FUCK YOU!

Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music so it can easily be "written" or "painted", in whole or in part, by everyone else, without attribution, then go ahead and opt in.

But if they don't EXPLICITLY OPT IN, you can't use the artist's or author's creative work for training.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.


Every human is using the hard work of other humans down through the entirety of history, mostly without credit or attribution. None of us exists in a vacuum and we are all copying each other constantly.

Should students need to attribute the copyrighted textbooks and lessons that they learned from for all their future work?

Should artists attribute every reference they've used? Even if they draw stick figures based on the reference? Even if they only use small parts from multiple references?

What's different from a machine learning something and a human learning it?

I think in terms of practical open source/permissive licenses it makes the most sense for new licenses to be made that include no-training clauses for the rights holders that dislike machine learning.

Dall-E's use of training on non-permissive copyrighted web-scraped data seems more complicated and I imagine there will eventually be lawsuits to figure that out.


I just don't understand this at all. I publish my code as open source when I can because I want others to find it useful, either by using the software that I wrote or by reusing the code. If I didn't want that, I wouldn't publish the code. But I do want it, so I'm glad there's a way for people to access it more easily.

I understand the argument from an artist's perspective much more, since they don't really have the option to publish their work in a way that any AI or any other artist can't copy off of.


Simply being public doesn't mean it's in the public domain - this applies to movies, art, code, etc.

One example of restrictive but public licenses include requiring others to share their source code if it's derived from yours, allowing individuals to use a product but not allowing business to use it (businesses can use it under a different - likely paid for license), or requiring attribution or acknowledgement that they used your code.

There is an argument for fair use if it counts as a substantial derivative, which is a different discussion from why people make it publicly viewable without making it flat out public domain.


That's great for you. I hope you choose a license and copyright terms that enable this specific vision.

The vast majority of open source licenses and copyright terms specifically stipulate the legal requirements for reproducing even just parts of the code. Which at a minimum require reproducing the license and copyright with all software including the licensed and copyrighted code.


Do you place your published code in public domain or use something like CC0? Or do you use a license with some strings (e.g. attribution) attached?



Is your code public?


You’re missing the point. It’s not an ego problem: if you put your code on the internet with a license you should expect people to respect the license’s rules…


I think it's a gray area in the license. Much of the code was intended to be used freely and commercially by others, but not for AI training. It follows the license to the letter, but not the intent.

I expect we'll see new licenses appear making it clear whether or not the content can be used for training.


Who's to say the intent? I've published lots of code with very permissive licenses and I did so because I want people to be able to use that code for any reason. That's why I choose those licenses.


I think that's exactly why AI training (allowed or not) should be added to licenses.


There's nothing gray about it. The license requires attribution, and Copilot doesn't provide that attribution.


It's reading the code and generating similar code, not copying it.


Are you saying that's fair use? If so, then we won't see new licenses appear related to it, since a license can only give you more permissions on top of fair use, not take away fair use. If not, then we still won't see new licenses appear related to it, since the existing licenses already don't allow it.


Good point. I'm not a lawyer, but looking it up, the factors for fair use are:

1. the purpose and character of the use;

2. the nature of the copyrighted work;

3. the amount and substantiality of the portion used;

4. the effect of the use upon the potential market for the original work.

All of these are quite debatable, and I'll leave it to someone more familiar with the law.

Though if it's not, I believe there are licenses that allow derivative uses of code and licenses that don't. For many of these, the intention is that they create more code, but not be used to fuel AI behemoths.


Not everyone believes in intellectual property and good luck enforcing that license worldwide.


It doesn't matter what you believe. It matters what the judge and jury say when this goes to trial, and it will go to trial because Microsoft has a lot of money.


So? Most of the developed world have legal systems that does believe in intellectual property. The fact that a few people "don't believe in intellectual property" because they want to torrent movies/games is mostly irrelevant when it comes to the software engineering profession.


Except you're not licensing functions, you're licensing a repository. If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.


> If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

Note that this may not actually be true, and you may need to pay to license even shorter excerpts of creative work. Copyright is a complex topic. It's not always safe to assume that you have the rights you think you have, in terms of reproducing others' work.

For example: "The proportion of a total work is not the only factor, though. If you are including the most crucial aspect of a work, even if it is only a small part, then the question of “substantiality” comes into play." [1]

[1] https://www.dukeupress.edu/getmedia/3363cb6e-04b6-43ec-b004-...


It can be if you fail to give attribution. Plagiarism isn't just unethical, unprofessional and immoral (not to mention evidence that the plagiarist is an uncreative dullard). It's illegal. How many words or sentences it takes to trigger a complaint is mostly governed by what it takes to prove a violation. The more material copied, the easier that can be. In this situation providing attribution (tooltip when you mouse over the code?) would probably satisfy 9/10 of potential complaints. But big companies usually won't make that kind of minimal effort without being hit upside the metaphorical head with a piece of metaphorical lumber (like with an actual lawsuit).


If you take a function from a repository (or a sentence from a book), it is the unlicensed use of copyrighted material. Everything in the repository is covered by the license, functions, files… everything.

Whether or not it is infringement depends on if the use can be considered fair use. This is a more nuanced question and is not always clear.

In this case (Copilot) the real question is how transformative the AI training is. Given how verbatim some of the outputs are makes the argument less clear.


>If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

I'm assuming you're referring to fair use. In that case whether it's copyright infringement or not is very situational (the legal standard consists of a test with various subjective factors) and isn't as simple as "it's less than a paragraph so I can copy whatever I want".


> If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

It is…


then why do people put copyright and license notices on the top of every file in the repo?


> Reinventing the wheel, millions of time a day, is an atrocity.

> Millions of (wo)man hours, wasted, every single day, on writing solutions to problems that have already been solved. There is a partial solution to this, and it's making people angry, it's crazy.

Following this line of thought, do you think that all code from all software should be open source and publicly available (and free to copy and use), in the interest of saving more person hours from reinventing the wheel?


> Following this line of thought, do you think that all code from all software should be open source and publicly available

Let's help shape this thought: Copyright should be abolished entirely. It is one of many monetization schemes and its negative effects greatly outweigh its positives.

We know people won't stop writing software in the absence of copyright. We know they won't stop writing books, singing songs, etc. Copyright is not the primary motivator for either science or art.

Will we need new monetization structures? Of course. But generally speaking we already have them where it matters.

End copyright entirely.


> Let's help shape this thought: Copyright should be abolished entirely. It is one of many monetization schemes and its negative effects greatly outweigh its positives.

Even if you're right in principle (and I would love new monetization structures), this will never happen in reality.

Meanwhile, this idealism will get applied asymmetrically in the real world. If you (or the comment I was replying to) say "Copilot is fine, all code should be publicly available anyway", it downplays the fact that this wish will never happen with big players like Microsoft and will only happen with little players like anyone who used Github to host their code. The big player will typically hide their code behind copyright and lawyers to enforce it, whereas the little players have no similar recourse.

So, I see the issue as an exploitation, as Microsoft is selling a product built on the little players and not the big players. The debate around whether copyright should exist at all, while interesting, is not that relevant to most of the concerns being aired in the context of Copilot.


Yes, I agree the asymmetry needs to be addressed and the rules need to be enforced as they currently stand.

> this will never happen in reality.

Don't be so sure. These kinds of changes start with education.


Agreed, I was a bit pessimistic when I wrote that. Amended: "will not happen soon enough".


> Copyright is not the primary motivator for either science or art

Having a way to own works is probably pretty important to either of those endeavors, right?


No, I really don't think that it is.

There is no such thing as "owning" a work. We use that as a euphemism for owning copyrights, and the only function of a copyright is to prevent others from making copies. To prevent others from sharing.

The question is whether the monetization model presented by copyright is a net positive for the author, after accounting for its chilling effect on communications for all other people in the world.

The answer is almost certainly "no," as empirically demonstrated by entire segments of IP work opting out of copyright. The open source model clearly demonstrates that you do not need to own a work to fund it or monetize it. There are similar models in other areas of art and science which allow for the funding of works without preventing others from copying them.


> Following this line of thought, do you think that all code from all software should be open source and publicly available

Why are you asking this as if the answer might be no?


It says why in the linked post. People aren't doing open source for free; they do it for the community. But Copilot is there to extract value from it, giving nothing back, not even credit.


Can't open-source programmers improve their own open-source code with Copilot? Do the inherent improvements that Copilot offers just not apply to people who write open-source code?

I understand that there is a balance, but as an open-source advocate who would love better tools to make their open-source projects better I'm lost as to why this point doesn't counter the "giving nothing back" we hear so often.


Since there's no way to know how code generated by Copilot might be licensed without expensive code-scanning tools, I don't think OSS can safely derive any substantial improvements from it.


If that's the only issue, I can't see the difference when I search for something on the web, copy the code and paste into my solution. There's no attribution, there's no giving back, nothing. Because I'm the community that you are saying the code is supposed to benefit.


Yeah, that sounds like a good definition of someone who is not part of the community. I copy code off Stack Overflow too, but often provide attribution in a comment. But like piracy, it's easier to hunt the whales than the small offenders.

Stack Overflow facilitates the same thing too, so it's an interesting comparison, but SO makes attribution easy and clear, and it actually made it effortless to contribute back.


No, you're not. The community reads and respects the attached licensing.


You plagiarize code as a professional?


Like everything in life. It's all about extracting value from someone else who has no control over the exploitation. You only notice when you are the one being exploited, though.

As long as some company can improve its bottom line it’s all good though


That's... the exact opposite of a community. Communities are about contributing whatever you can, and taking what you need. There's more joy in giving than taking. Exploitation happens when someone is taking advantage of that tendency to give.

Eventually someone comes in and takes everything that isn't nailed down and then sells it, and that becomes the problem.


No one wants to stop what Copilot is being sold as. They want to stop the company selling it from doing what it is famous for doing.


Copilot is not selling code, they're selling GPU time. If you're ready to buy a hundred GPU/TPUs to train a new Copilot that is just as good, but for free, then go do it please, everyone will thank you


The tagline is "Your AI pair programmer"

That's the pitch. You're renting a pair programmer.

If your pair programmer is stealing code, you're going to have a bad time. This has nothing to do with...whatever you're on about.

Seriously though. Did you click the wrong reply link?


So you are saying copyright does not apply to Spotify when it comes to music, because they aren't selling the music but rather a service to play said music from a catalog?

Also to your other comment about copyright not being an issue if you just use a paragraph from a book - I am not a lawyer but I would think that copyright applies just the same way it applies to musicians who use portions of the melody of other musicians’ songs.


You're asking for people to be okay with potential copyright violations and a removal of attribution because of the common need. Like all things, there must be balance. Open source would not exist if the only use of its output was to train ML models that hide where the code comes from. Part of the allure of open source--maybe the biggest allure, honestly--is the community aspect. I get to find friends, contribute philanthropically, and feel proud of my contributions. Copilot removes any incentive I have to produce code for free.


it's not copyright violation. no one reads...

https://docs.github.com/en/site-policy/github-terms/github-t...

when you put code on github.com you grant GitHub the right to show that code to others, independent of the license you choose for your code. full stop. doesn't matter if it's on a webpage, a git client, or a github-developed plugin to an IDE.


So this doesn't negate the license. Microsoft cannot just roll the code into windows for example, closed source and proprietary. They have to abide by the license, regardless what their ToS says.

Here's a fun way to see it: suppose someone writes code licensed GPL. I take it, fork it, modify a line in it or not, and also license it GPL because I have to by law. I put it on my github account and what, I now just gave Microsoft rights to the code I don't even have? So by putting it on github I'm violating a license? It doesn't add up. The license to the code is the license to the code, no matter what site it's on and no matter what any ToS says. Otherwise what's to stop me from putting a ToS on my personal website pertaining to your use of my eyeballs that says "if your creation becomes viewable by my eyeballs in any way I can use it however I want, publishing your work in such a way that it can be viewed by my eyeballs is consent to this ToS"?


> So by putting it on github I'm violating a license?

yes. if you don't have the rights to upload code to github.com, including all of the rights required of one that uploads that code to github.com, and you do so anyway, then you are in violation of the GitHub terms of service.

fortunately for you, the GPL allows what you are describing: "1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty;..."


While that certainly covers some code on GitHub, much of the code on there is just mirrored from other locations by non-owners: you can find copies of the Linux kernel and SQLite on GitHub, for instance. The users who upload those to GitHub have the right to do so (legally) but do not have any rights that they could grant to GitHub.


again, read the document. all this talk of license violation and almost no one is reading the agreements which say what rights users have given GitHub...

by uploading code you attest that you have the rights necessary to grant that license to GitHub: https://docs.github.com/en/site-policy/github-terms/github-t...

without the right to grant those licenses to GitHub, by uploading that code to GitHub, you are in violation of the terms of service, and the responsibility of acting in compliance with the license is on the shoulders of the user which uploaded that code to github.com.

Said another way, GitHub has no way to know if the person mirroring SQLite (for example) is acting in accordance with their rights, so the terms of service require that you attest that you are acting within your rights, acknowledge that it is solely your responsibility if you are not, and that by uploading you grant license to GitHub and its users.


So I can't fork code on github legally according to the ToS?


Read the terms of use yourself. it's all there.

the right to allow forking is granted by a user who uploads their code to github.com to other users of github.com. those rights are listed here: https://docs.github.com/en/site-policy/github-terms/github-t...


Nice edit.

When you upload code to github you give other people the right to fork it... I knew that already. But you license it. You don't give anyone the right to fork it and not abide by the license. So if I fork it, I'm still giving Microsoft rights I don't have: I'm giving them the right to violate the license. That makes it illegal for me to fork it.

Let's say I am on a git mailing list, following a project, and I upload that project to github one day. It's licensed GPL. Microsoft says I give them the right to violate the license, and in uploading it I implicitly attest that I have the right to do so. I've violated the license? It's illegal for me to upload the code, with the license, to github, because Microsoft demands rights I don't have to give? Then let's say someone else forks it. They've now also violated the law?

It's nonsensical. The license is the binding ToS here, period; it doesn't matter what Microsoft's lawyers argue. Everything else is secondary.


> When you upload code to github you give other people the right to fork it... I knew that already. But you license it. You don't give anyone the right to fork it and not abide by the license.

you are talking multiple separate things here.

when I upload code to github.com I attest that I have the rights required to do so, and the rights required to grant GitHub the licenses I've agreed to grant it by uploading.

> You don't give anyone the right to fork it and not abide by the license.

correct, you can't grant a right to violate the rights granted. users of the code hold the responsibility of acting in accordance with the license.

> So if I fork it, I'm still giving Microsoft rights I don't have, I'm giving them the right to violate the license. That makes it illegal for me to fork it.

no. you did not upload code that you forked from a GitHub.com repository. if you are talking about uploading code that you copied somewhere else, and you're calling that a fork, you have violated the terms by uploading code that you do not have rights to upload. remember, by uploading code to github.com you attest that you have the rights required to do so, according to the terms of service. if you lie, you are responsible for that lie and its consequences.

> Microsoft says I give them the right to violate the license

your premise in this part is flawed. see above.

> Microsoft demands rights I don't have [the right] to give?

by uploading to GitHub.com you attest that you have the ability to grant those rights. If you lied, and you don't have those rights, that's your responsibility and your ass if a lawsuit comes around because of it.

perfectly sensible to me. GitHub gets to say that they require users to grant the rights in order to upload, and that the users necessarily had the rights to give to GitHub. if a user lied, that is not GitHub's fault; the user entered into a legal agreement saying they had the rights needed.


So you've got nothing? Because I'm seriously asking that.

Read the GPL.


If an AI model is allowed to emit copyleft code verbatim in proprietary software, you can effectively create a GPLv3 stripper. I don't think that ultimately serves your goal of intellectual sharing.


I don't want to stop copilot. I put my code publicly on the internet precisely so people can use it, and not just as a user either, they can repurpose it, incorporate it into their software, whatever they want to do. It's called free software for a reason, and i mean it when I say it.

But they have to abide by my fucking license.


If the copyright holders are so difficult, why not restrict the scanning to code of enlightened people like yourself?

My guess is that there wouldn't be much to scan ...


Never forget this is how people who dare to reverse engineer Windows are treated: https://www.theregister.com/2019/07/03/reactos_windows_resea... https://marc.info/?l=ros-dev&m=118775346131654&w=2

I don't use Github, but fuckers upload my code there anyway.

Copyright is evil, but only large corporations having copyright, even more than they already do, is even worse.


This I feel like is one of the better points in the thread.

The asymmetry in copyright law, where large corporations can enforce their copyright to the point of breaking the law themselves (YouTube's Content ID is another non-legal, but still very impactful example), is absolute bullshit.

Unfortunately, I think that if training ML models on Internet data is found not to be fair use, then things will get harder for individuals training models, while corporations will be barely inconvenienced, as they can afford to pay for sources, make deals with other large institutions for data, etc.


They're treated with an email to the mailing list? Was there C&D or lawsuit? You make it sound like the ReactOS devs were thrown into prison.


Either the user-base of HN suddenly became a bunch of unethical folks who don't CARE about copyrights, usage licenses, authorship, or the future of open-source projects,

OR

This place is currently crawling with Micro$oft employees who have been instructed to swamp the place with disingenuous comments basically amounting to:

1) "fair use" is anything I want it to be

2) gimme your code NOW, because I want it, and it's MINE

3) get used to habitual violation of licenses as the new normal

4) you are ruining progress! harming kittens!

I can't see the actual HN crowd all suddenly being copilot users and fans, so that leaves me to conclude the latter.

I find Microsoft's continual business model of evil rather threatening and annoying, and they need to be checked, as they have only gotten worse with the decades. They abuse their market position to stifle any and all tech innovation. Break them up already.


I'm more worried about the status of freedom in software, open source feels like a mirage to divert the attention away from the original issues from the FSF.


One way to fix the problem would be to somehow feed Copilot a corpus of closed source code. This would either force Microsoft to add the necessary copyright protections, or (which is imho more likely) would prove that those protections are already in place, but disabled for open source code.

A good start would be to take leaked Windows code, mechanically adjust all the names, constant values, and code formatting, and then publish it and observe.
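The "mechanically adjust all the names" step is easy enough to sketch. Here's a toy Python illustration using the standard tokenize module; it only renames identifiers and ignores strings, attributes, scoping, and constant values, so treat it as a thought-experiment aid rather than a workable laundering tool:

```python
import io
import keyword
import tokenize

def rename_identifiers(source, prefix="v"):
    """Toy obfuscator: replace every non-keyword identifier with a
    generated name, leaving program structure intact."""
    mapping = {}
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        tok_str = tok.string
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok_str):
            # Reuse the same replacement for repeated identifiers
            tok_str = mapping.setdefault(tok_str, f"{prefix}{len(mapping)}")
        # Emit (type, string) pairs; untokenize's compatibility mode
        # rebuilds valid (if loosely spaced) source from them.
        result.append((tok.type, tok_str))
    return tokenize.untokenize(result)

print(rename_identifiers("def add(a, b):\n    return a + b\n"))
```

The renamed code still runs identically, which is exactly why a naive text-diff wouldn't catch the copying.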


> Microsoft characterizes the output of Copilot as a series of code "suggestions". Microsoft "does not claim any rights" in these suggestions. But neither does Microsoft make any guarantees about the correctness, security, or extenuating intellectual-property entanglements of the code so produced. Once you accept a Copilot suggestion, all that becomes your problem:

> "You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, intellectual property scanning, and tracking for security vulnerabilities."

I can't help but recall:

"Linux is a cancer that attaches itself in an intellectual property sense to everything it touches."

- Steve Ballmer, while CEO of Microsoft


> intellectual property scanning

With "normal" code I can generally see (or figure out) who posted/published it and reach out for explicit permission. It's not uncommon for me to do this.

How is one supposed to do that for the generated stuff? Seems like an awfully hands-off attitude. As challenging as it is, they really ought to be qualifying the input samples of training code before ingesting them.


There are some techniques used mostly to detect when students copy-paste code. I've seen some of the tools in that space and they have varying degrees of accuracy. MOSS is a common one[0].

There are some vendors in this space too (BlackDuck comes to mind) but they're $$$ so only within the scope of large corporations.

If anybody has any ideas relating to this type of analysis, I'd be excited to chat. I am working on a project[1] in this space for "Software Composition Analysis" which could potentially overlap with snippet detection for code like Co-Pilot. (We basically just have a big pipeline of analysis jobs that run on code and store the results. I need to update the docs!)

0: https://yangdanny97.github.io/blog/2019/05/03/MOSS

1: https://github.com/lunasec-io/lunasec/tree/master/lunatrace
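For the curious, the core idea behind MOSS-style snippet detection (winnowing over k-gram hashes, from the original MOSS paper) fits in a few lines. This is a toy sketch, not MOSS itself: real tools also tokenize and normalize identifiers, so simple renames don't defeat them, whereas this version only ignores whitespace.

```python
import hashlib

def fingerprints(text, k=5, window=4):
    """Winnowing: hash all k-grams of the normalized text, then keep
    the minimum hash from each sliding window of hashes. Shared
    fingerprints between two files suggest copied text."""
    s = "".join(text.split())  # ignore whitespace/formatting changes
    hashes = [
        int(hashlib.sha1(s[i:i + k].encode()).hexdigest(), 16)
        for i in range(len(s) - k + 1)
    ]
    return {
        min(hashes[i:i + window])
        for i in range(len(hashes) - window + 1)
    }

def similarity(a, b):
    """Jaccard similarity of the two fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / max(len(fa | fb), 1)

original = "for i in range(10): total += values[i]"
verbatim = "for i in range(10):\n    total += values[i]"
print(similarity(original, verbatim))  # whitespace-only change -> 1.0
```

Keeping only a sample of hashes (rather than all of them) is what makes this scale to large corpora, which is also why exhaustive scanning of Copilot output against all of GitHub is expensive but not impossible in principle.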


I don't think it's right to characterize it as hands off after they had their hands all up in the generated code. It's just malfeasant. They've produced a tool that is fundamentally (legally) unsafe to use and said that's not their problem.


Could you help me understand the link between the two?


It isn't so much a connection as an example of cognitive dissonance from the organisation.

On the one hand stating plainly that mixing in copy-left code and similar can be disastrously dangerous because it is a rampant virus. On the other hand not understanding why people think it might be a problem that their tool could encourage mixing in copy-left code.


Microsoft released a product which gives you cancer the moment you use it.

At least according to the ex-CEO of the company's own opinion of what including open source code in your projects does. That seems a far-fetched conclusion, but then, Ballmer did say it.


The point is not clear, but if I were to guess, it's that GitHub Copilot should come with a California Prop 65 warning, because it can give your code "cancer" (GPL-licensed snippets from sources like the Linux codebase).


Linux is open source, and Ballmer's quote displays the same negative attitude from Microsoft towards open source that the author's arguments about Copilot point to.


> rigorous testing, intellectual property scanning, and tracking for security vulnerabilities

Seems like a best-practice recommendation that everyone should apply when downloading a torrent.


Seems like we need MITpilot.


> Steve Ballmer

They have some really good blow in Redmond.

If anybody could win an award for being coked up and sweaty on stage...

https://www.youtube.com/watch?v=Vhh_GeBPOhs


Fun story: That was my first employee town hall, in 2000. I was concerned for the fellow (and so very glad when he left, Satya has been so so so much better for the company and morale). It was definitely an... interesting introduction to the company.

See also this Domo video that turned it into a song. :) https://www.youtube.com/watch?v=f7ZDH45OAt8


At the time, I was doing Linux, OpenBSD and FreeBSD stuff in Bellingham. The reaction from the local and regional non-Microsoft community was really like "Holy shit what is going on down there?!"


Copilot is great and this is a waste of time.


Being a useful tool doesn't make it legal.


Technical progress takes precedence over pitiful intellectual property discussions. If you don't believe that, I'm not sure what you're doing in a community like this.


> Technical progress takes precedence over pitiful intellectual property discussions.

Let's pretend for a moment that your value judgement is reasonable and the advancement of technology should reign supreme over minor things like the rule of law. Do you really think that letting people ignore copyright is always good for technical progress? Say, letting people use GPL code in proprietary software that they then refuse to share with others? Because that sounds questionable even if we agree with your casual disregard for the law.

> If you don't believe that, I am not sure what you are doing in a community like this.

Being interested in tech without being a fan of breaking the law and running roughshod over other people's work.


> Do you really think that letting people ignore copyright is always good for technical progress?

Only in very, very, very, very specific circumstances would I say it is not good. And they involve thinking about the counterfactual: "would this thing have been created if there weren't intellectual property rights in place?" Code doesn't pass this test, because people enjoy writing and sharing code. Pharmaceuticals, maybe.


You are not the authority on what this (or any other) community is about. Fetishizing "progress" (whatever you think this means) is next level idiocy.


You guys are trying to halt the very thing that will allow us to generate intelligent agents and hugely boost human productivity and I am the idiot.


"Generating intelligent agents" is not a worthwhile (or safe, in any meaning of the latter) goal.

I don't want anybody being able to generate "intelligent agents" and will support all legal changes likely to slow this down or halt it.


Being a hacker?


> No match for domain "GITHUBCOPILOTCLASSACTIONLAWSUITSETTLEMENT.COM".

> Last update of whois database: 2022-10-17T23:07:12Z <<<

Just sayin'...


Sad to see people trying to make copilot illegal

Using it is exactly like using Google. Google scrapes the internet and trains a model that gives you results for search queries on their website. The results may be copyright protected

Copilot scraped the internet to train a model that gives you results for code snippets in your code editor. The results may be copyright protected


Surely the difference is that if you find something via Google/search, you can then read any copyright notice to determine whether the code is OK for your use case (no joke if you're a corporate developer). If you're using Copilot to "generate" (but sometimes regurgitate?) code, then AFAIK Copilot doesn't show you the copyright notices, or any indication of whether what it is giving you is a "fair use" derivative or an exact-copy regurgitation that violates copyright.


MS needs to give up and terminate Copilot.

The potential legal issues are there, but that's not why Copilot should die.

Copilot should die for any (or a combination of all) these reasons (and more which I don't mention):

- the operator has to already understand the emitted code to be able to determine if it is what is needed, or to modify it if it is close but not quite right

- the operator may have a false sense of capability, leading to bugs and other problems that would appear later (in production?)

- wrong suggestions are a distraction from the careful mental structures which one maintains while writing software

- any problem that Copilot can solve with guaranteed correctness is probably trivial or already met by a (battle tested) library

Forgive the analogy, but effective automated code generation is like autonomous driving systems. Anything less than 100% accuracy is a risk, and in these examples risk of incorrect behavior is not acceptable.

Copilot seems like a pointy-haired boss fantasy where they can hire only junior programmers and expect successful software products.



