Hacker News
GitHub Copilot investigation (githubcopilotinvestigation.com)
1847 points by john-doe on Oct 17, 2022 | 1219 comments



Here are a few thoughts I haven't formulated before:

It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works. It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture (grammar, concepts, facts, cultural tropes, and many others).

A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning (e.g. reading fantasy novels and learning concepts, tropes, or vocabulary that one uses to produce other fantasy novels; reading a newspaper and learning facts that one incorporates into an essay; learning artistic techniques or stylistic conventions from studying existing artworks and using them when producing new artworks). Current AI systems are (amazingly) becoming capable of all of these things and may do them in ways that are somewhat akin to how human beings do them. (although I guess Jaron Lanier would object "that's what they want you to think")

But there are also examples in existing copyright doctrine where people accidentally repeat enough of a prior work to get in trouble for infringement -- most often with song composition (like George Harrison's "My Sweet Lord") because relatively small pieces of melody (which a person might easily memorize) may be considered copyrightable.

If human beings had much more accurate memories, copyright would be quite a bit more intrusive (and/or quite a bit less effective) because, following any exposure to some kinds of works, we could use our own memories to reproduce those entire works from scratch for our own use or pleasure without obtaining authorized copies from elsewhere.

Computers do have such accurate memories, and machine learning systems, which are optimized for things like maximum likelihood estimation, can and do reproduce both copyrightable and non-copyrightable elements of works that they've been trained on. After all, the maximum likelihood continuation of a fragment of a text or a song is ... the complete original work. And the ability to reproduce the complete original work would, other things being equal, reduce loss in training. After all, that's something someone might specifically ask for, and if the system could oblige, it would be doing a better job of providing what the user wanted.
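To make the maximum-likelihood point concrete, here is a toy sketch (a count-based bigram model, nothing like a real LLM, and the training text is just an illustration): once the training text is memorized, greedily following the most likely next word at every step reproduces the original verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count-based maximum likelihood estimate of P(next word | word)."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def greedy_continue(model, prompt, n=10):
    """Follow the single most likely next word at every step."""
    out = prompt.split()
    for _ in range(n):
        nxt = model.get(out[-1])
        if not nxt:
            break  # no continuation seen in training
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

model = train_bigram("four score and seven years ago our fathers brought forth")
print(greedy_continue(model, "four", n=9))
# prints "four score and seven years ago our fathers brought forth"
```

The maximum-likelihood continuation of the prompt is exactly the memorized training text, which is the whole tension: better prediction and verbatim regurgitation are the same behavior here.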

It's relatively foreseeable that machine learning systems would potentially be able to reproduce both copyrightable and non-copyrightable elements of various works, because the distinction between the two isn't especially clear from an algorithmic or mechanical point of view. (For instance, facts aren't copyrightable, but the notion of what constitutes a "fact" for this purpose is a culturally-bound legal notion and not at all straightforward to make precise.)

But if you had a human author or artist or scholar or programmer who was "trained on" exposure to an enormous body of works, and that person had an exceptional eidetic memory, you could imagine that he or she would be perfectly capable of recreating many of those works from memory (and that other people might request such recreations). (Again, in music in particular, it's already routine that someone could have unambiguously copyrightable material memorized and be subject to copyright restrictions on performing songs. Like if a singer or band performs a cover from memory.)

If you wanted to avoid this ability then you might need to build in an explicit notion of copyright that limits the accuracy or level of detail inside of the model in some way. This is tricky because (1) I don't think people have really tried to do this much so far, (2) copyright applies very differently to different categories of work, (3) it obviously wouldn't satisfy critics even if it mitigated the most extreme examples of "regurgitation", and (4) it would be kind of weird because you would be intentionally limiting the quality and extent of learning that the system was allowed to do. (I imagine Jaron Lanier getting mad again about my repeated comparison between human learning and machine learning, and between human memory and machine memory)

Some of the weirdness in point (4) is that accurate prediction is usually cool / great / impressive / accepted as an appropriate goal or capability, but if it's too accurate in certain contexts, it may be deemed a copyright infringement. Like if you said "what word comes next? FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS", there's a clear correct answer and knowing it requires having a certain text memorized. OK, if you said "what word comes next? MR. AND MRS. DURSLEY OF NUMBER FOUR PRIVET DRIVE WERE PROUD TO SAY" ... same thing, but Bloomsbury Publishing may be unhappy if you have a system that can get all such questions right.


I see this basic logic in almost all AI ethics threads, and it starts with a big assumption: "humans learn from copyrighted source material without copyright violation". This then gets tenuously extended to "AI also learns, so it must not be in violation of copyright law".

The first assumption is highly flawed, though. Humans routinely do violate copyright law. Plagiarism is a huge problem in many sectors; un-cited direct copies of people's work, well outside fair use, are an everyday occurrence in the human world. It doesn't matter whether you memorized the source material, transcribed it, or copied and pasted it: if it isn't your material and is someone else's, you've violated the law. Learning to produce original work and reproducing someone else's work are not the same thing. If an AI is ingesting and perfectly reproducing someone else's copyrighted works, it is in violation of copyright law in the same way a human would be if they reproduce someone else's copyrighted works.


+1. And let's not forget too that "AI", that is, ML models, are not "autonomous" in the way that humans are autonomous. Sure, we use the word "learn" to describe what they do, which is one word that we also use to describe what people do. But ML models are always wielded by people or corporations for particular purposes.

If a corporation was to directly publish some copy that appears plagiarized, we'd call that plagiarism. I don't see how adding a piece of code—one that's fully created, owned, and wielded by the corporation—as an intermediary changes anything. If anything, it looks like plagiarism-as-a-service, which seems worse (at least to my eyes).

Of course, this matter is a bit confusing. Because, for example, (1) it's not always plagiarism, (2) defining what exactly is plagiarism even in the purely non-technological realm is difficult (and likely somewhat subjective), and (3) there is a lot of corporate marketing which suggests this "AI" is "autonomous" (presumably to distract from who exactly is autonomous in this picture). And of course ML art is quite useful for many things. But I mean, so are artists.

Not long ago, a lot of Silicon Valley rhetoric was that the purpose of "technology" was to free up time so that people could be more incentivized to "do what people love to do" like, for example, artistic creation. But now it seems that rhetoric was just that: rhetoric, or what was needed to be believed/said at the time.

And now at our present time, when technological "progress" has been followed a bit further (that is, when we've developed our machinery a bit further under the incentives of our present economic system), much rhetoric has conveniently shifted to something else, something largely contradictory, but again precisely to what is needed to be believed/said to continue following the same incentive structure.


A lot of really good points.

>Sure, we use the word "learn" to describe what they do, which is one word that we also use to describe what people do. But ML models are always wielded by people or corporations for particular purposes.

This is extremely important. "Learning" in machine learning is an aspirational label, not a descriptive one. People who claim otherwise either drank too much of their own Kool-Aid or are simply dishonest. This isn't just "wrong" in some taxonomical sense, this is dangerous in a very practical way. Conflating machine "learning" and human learning will inevitably lead to various kinds of sabotage of human learning.


I mean, at what point will this change? When the AI first has to be trained by being in a robot in the physical world for 10 years, learning human concepts, before it can start looking at art with the ultimate goal of learning how to draw?


The main reason AI will end up reproducing copyrighted works whose original license is not trivial to identify is that, in those instances, humans are already violating copyright at a high rate. It has just flown under the radar thus far because the machinery needed to surface violations so easily was not available.

Copilot is capable of going beyond retrieval and is competent at using variables, comments, types and local context to infer intention and generate appropriate code and even comment on it. Whenever copilot correctly predicts code of yours that's a novel combination of concepts, copilot has originated novel code.

For esoteric concepts, you usually already have to know how to prime it, but Copilot is especially useful when it helps you bump into things you didn't know you didn't know (one way to increase the odds of this happening is to write out your thinking so far in markdown or comments; you'd be surprised how helpful and clever Copilot can be in some instances). My point here is GitHub isn't charging $10/month for run-of-the-mill retrieval. My opinion is that code-gen LLMs contribute value and more open versions are worth building.


Indeed, the "learning". To my mind, the simplest (but still speculative) explanation of the "learning" phenomenon we see - both the working examples and the limitations and failures - is that the large models implicitly memorize the training inputs (or some derived features that can be used to approximately reconstruct the inputs) and then do something between interpolation and rather simple non-parametric learning. The effect is that outputs are basically a somewhat sensical agglomeration of "copy-pasted" snippets.

That said I think the results are often useful and sometimes fascinating. We should not fool ourselves about the learning that these large neural nets do, though.


There is a common argument that a human is "just a better neural network". I don't know; do they mean we need to give GPT-3 or DALL-E basic human rights?


I am not sure human rights fit the case, but it is one of the most remarkable developments. It is a self replicating distillation of our culture. If our human-based culture grew up and had a baby-culture...


Yes, this should happen, or we are just building a new slavery system. But before it happens, using this argument to dodge questions about copyright is dishonest.


The law does not concern itself with trifles.[0]

Programmers tend to think of copyright as a Boolean valued function. Either something is infringement or it isn’t.

Judges think of copyright infringement as a real-valued function of many arguments corresponding to the circumstances of the parties (e.g. what actual damage was done?).

A human quoting a human without attribution, without any profit made or identifiable damage, returns an infringement value very close to zero. Such cases, if anyone is petty enough to bring them, are likely to get dismissed.

[0] https://en.m.wikipedia.org/wiki/De_minimis


I'm not convinced by what you wrote. Is it actually true that cases like that are dismissed? Or is it that the infringing party is ordered to stop, but no damages are awarded to the rightsholder? You've offered no examples of dismissal to back up your statement.

I absolutely agree with you that copyright is not a boolean, but I don't buy the idea that a judge will just shrug and allow infringement to continue just because there was no commercial harm.

I also think your example is just irrelevant to the case at hand. Sure, someone "performing" someone else's copyrighted words once may not be a big thing. But if Copilot is actually found to be infringing, these infringements will keep happening, over and over and over.

Bottom line is that none of this has been tested in court. I think it's great that someone is working on doing just that. Maybe the end result will be that Microsoft's use is indeed fair use, and that Copilot users have no further obligations. But I'd like to hear a court decide that, not a bunch of armchair non-lawyers (myself included) on a random web forum.


https://www.lexisnexis.com/community/casebrief/p/casebrief-n...

> CONCLUSION:

> The court of appeals affirmed the district court's judgment. The court held that the Performers' use of the composition, as distinct from the use of the composer's performance, was de minimis and therefore not actionable. Considering only the compositional elements, the brief and relatively simple segment of the composition used by the Performers was neither quantitatively nor qualitatively significant when viewed in relation to the composition as a whole. Thus, despite the high degree of similarity from the actual use of the recorded composition, the scope of the similarity was not sufficiently substantial to support Newton's infringement claim.


Ok, sure, but that's not what I was asking for. I'm not surprised that there is one (or even two or three or a handful) of examples where this happened. But is it common?

Also I'm looking for a case where the judge acknowledged that copyright infringement was occurring, but decided not to do anything about it. From the bit you quoted, it sounds to me like it's implying that the judge believed there was a valid fair-use defense? Or even stronger, that the judge just did not believe the use was infringing at all?


Louisiana Contractors Licensing Serv., Inc. v. Am. Contractors Exam Servs., Inc., 13 F. Supp. 3d 547, 554

(sorry i wrecked the citation and lost where i found it)


Not necessarily where you found it, but sounds like https://willamette.edu/law/resources/journals/wlo/ip/2014/04...


This is pretty interesting, and IMO, really disappointing to see that was the ruling in that case.


> A human quoting a human without attribution

... as opposed to AI. At the heart of the matter lurks a debate over whether AI is an independent phenomenon that behaves in its own right, or just a tool that's created and wielded by humans against a backdrop of clear incentives and motivations.

The argument isn't about whether or not the law deals in absolutes (it's a basic principle that law is tested in courts through interpretation); the argument is that Copilot can be perceived as merely a means to an end, and that GitHub / Microsoft have created a massive mountain of liabilities for themselves.


I disagree with your statement about AI being an independent actor.

Suppose we add a button to a visual studio plugin called 'Copy me a function' and when you click it, it 100% grabs some random code from github and plops it as-is into your code base.

I don't have to argue the ethics of whether the button is "thinking for itself".


Well, I just pointed out that it's an ongoing debate; I didn't attach any particular value judgment to that statement.

> Suppose we add a button to a visual studio plugin called 'Copy me a function' and when you click it, it 100% grabs some random code from github and plops it as-is into your code base.

Personally, that's exactly how I see co-pilot. To my mind, it's a tool that sits in the same category as p2p platforms, copying machines or video recorders. They are just tools.

How, for what purposes and by whom they are leveraged makes all the difference here.


That makes it easier, at least, in the US.

P2P platforms and those who violate copyright are routinely shut down (or targeted for shutdown) in the US. If Copilot sits in that same space, it seems the book's been written already... we know how it ends.


> If an AI is ingesting and perfectly reproducing someone else's copyrighted works, it is in violation of copyright law in the same way a human would be if they reproduce someone else's copyrighted works.

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

Consider the impact on innovation if Microsoft or Oracle were allowed to claim a copyright over utilitarian aspects of their works such as the Java or Windows API!

BTW, Copilot seems to be reproducing copyrightable material when the tool reproduces comments verbatim!


In my opinion this is a key comment in this thread and everyone subject to United States law should read the Abstraction Filtration Comparison (AFC) legal test when refining their opinion. Also, I have no legal background, but as far as I know patent law != copyright law.

Specifically within AFC, note the "idea/expression dichotomy" [1] which clearly states:

"copyright law protects an author's expression, but not the idea behind that expression"

Thus, if this tool spits out someone else's code verbatim, it is a definite copyright infringement. If it outputs code that is similar but not verbatim, then it "could" be an infringement, at your own risk, to be determined by the courts. Simply expressing the same idea in a different way is not a definite infringement.

Program code is naturally copyright-risky because the keyword/grammar space is constrained. It is far more difficult to accidentally duplicate verbatim the expression of one's ideas in a full language, such as English, than in C. And what of two separate programs (or constituent sub-parts such as functions) that by chance emit the same compiled binary?

Personally, for now I won't use this tool due to the risk of accidental plagiarism, and because it is a black box: I can't examine any lineage or attribution metadata to understand the source(s) of what I would then be incorporating into my own body of work. Of course I doubt I could get that type of traceability information for any other trained ML model I might use, so perhaps I need to re-examine my policies heading into the future.

[1] https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...


> Thus, if this tools spits out someone else's code verbatim it is a definite copyright infringement.

That is not true. It can be verbatim and not a copyright violation if it can be shown that the expression in question is strictly utilitarian! I literally provided a quote from that AFC article that says this!

There's even precedent that prior art nullifies a copyright claim, as seen in Johannsongs-Publishing, Ltd. v. Rolf Lovland:

"Johannsongs failed to offer admissible evidence to rebut Ferrara’s analysis, so there is no genuine dispute of material fact as to his conclusions that Söknuður and You Raise Me Up are not substantially similar and most of their similarities are attributable to prior art."

And this was about music, not software, which has always sat uncomfortably between utility and expression, if only because it is some kind of writing. No one is claiming copyrights over Photoshop filter settings or other inputs manipulated by sliders or buttons!


Good point - the wording I chose should have qualified the scope in that first statement. How about something like "as the scope that is reproduced verbatim increases, the likelihood of infringement approaches 100%" (excluding prior art, public domain, etc.)? Obviously no one could reasonably claim copyright on a variable declaration - the scope is too small, and in some languages there is only one way to express it.

However, the statement was only for cases of verbatim copies produced by Copilot. The AFC Wikipedia article states that "Proving copyright infringement requires proving both ownership of the copyright and that copying took place." The three detailed tests developed in that case appear to "expand" the determination of infringement to close potential loopholes where, while there is not a verbatim copy, infringement is still deemed to have occurred because of "substantial similarity" - e.g. someone copies a program but changes the variable names.

So where is the line between infringement and not, in cases where there is an exact copy of a code fragment? Can we still use the utilitarian defense or is that only used by the court to exclude portions of the code in the tests for "substantial similarity"?


Personally I use common sense to determine if something is utilitarian in nature. The real issue is with the “overall shape”, like, class structure, specific data types, etc, which is somewhat arbitrary in nature.

At this point Copilot is awful at this higher-order level of abstraction but I can see a time where this is not the case!

Microsoft will have to put more work into filtering out responses that are indeed copyright violations if they want people to use their tools.

I doubt that MS will ever be held liable for the violations themselves, as there is precedent in their favor and their legal department has plenty of cash to burn.


As for why we should allow verbatim copies of utilitarian features...

First, let's preface this with the substantial similarity of the structure, sequence and organization as established in Whelan v. Jaslow which amongst other things says that you cannot merely change the variable names if the expressive structure of the code remains the same.

Now let's imagine 10,000 software developers who all implement Dijkstra's algorithm in C and then run it through clang-format. Aside from variable names, isn't it safe to assume that many of the implementations are going to be exactly the same?
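To illustrate how little room the algorithm leaves for expressive variation, here is roughly what most of those implementations reduce to (sketched in Python rather than C for brevity; graph representation and names are of course one arbitrary choice among few):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)]}
print(dijkstra(g, "a"))  # prints {'a': 0, 'b': 1, 'c': 3}
```

Beyond identifier names and the choice of priority-queue idiom, nearly every line here is dictated by the algorithm itself, which is the point: the expression is close to "necessary to achieving the idea".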


> It can be verbatim and not a copyright violation if it can be shown that the expression in question is strictly utilitarian!

So if it's identical it might or might not be a copyright violation and even if is totally different it still might or might not be a copyright violation... and only after spending insane amounts of time and money in the court system can you ever be sure if what you (or your AI) created has made you a criminal. This is increasingly sounding like a very very broken system.


I agree it's like people are not aware that clean room design is a thing (https://en.wikipedia.org/wiki/Clean_room_design).


Not really a fix to the problem, since, as your own link shows, you can still be dragged into court to defend against a copyright lawsuit, which can cost tens if not hundreds of thousands of dollars, and there are no guarantees when you're up against a team of lawyers representing an entity with far more money and resources than you have. In the end, you can do everything right and still easily end up screwed, which isn't how anything should work.


Yes, it's a thing but how often is it actually done? I can imagine circumstances where a company might do so--say a cloud provider wants to clone some open source software and offer it as a service. But I'm pretty sure the practice isn't routine.


>Humans routinely do violate copyright law. Plagiarism is a huge problem in many sectors; un-cited direct copies of people's work in violation of fair use is a regular every day occurrence in the human world.

And we need to accept that and get over it, not get better at outlawing it.


I'd say we need to reject it, refuse to get over it, and demand changes to fix copyright law so that humans and AI don't need to risk violating the law in order to learn things and create new works based on what they've learned from what others created before them.


Seems like 'learning' and 'producing something with that learned information' are two different things. I can 'learn' all day long from copyrighted materials. I'm not violating anything, because I've not produced a copy that would be distributed or used. First/knee-jerk reaction to that above - no doubt there's more nuance buried someplace.


Humans do violate copyright if they use copyrighted passages directly in their work and pass them off as their own without any attribution, which is what Copilot has been shown to sometimes do, though not always. Copilot will sometimes offer chunks of code that can be found verbatim in open source code bases and pass them off to users without attribution. I agree it is OK to learn from copyrighted work and produce new, different work, whether by a human or a machine learning algorithm, but it isn't OK to pass along exact copies as your own without attribution. Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.


> Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.

A user once replied to one of my comment[0] about this with the following:

> It's not really an issue when you're a large software corporation; you already have mechanisms in place to check for license compliance in everything that ships, including F/OSS plagiarism checks [1].

IOW, from my understanding, they don't care. Big players do their own checks anyway, and the small fish won't create problems because it's too convenient for them. Classic Microsoft (as I know it from the 90s).

The bigger thread can be seen in [2].

[0]: https://news.ycombinator.com/item?id=32534697

[1]: https://news.ycombinator.com/item?id=32539467

[2]: https://news.ycombinator.com/item?id=32533531


I'm curious about the endgame of copyright with respect to software. At some point, enough people will have written enough code that you can't write code anymore because some fragment of it violates a copyright. Where does the line get drawn? There's only so many ways to do certain algorithms, like DFS or BFS.


>you can't write code anymore because some fragment of it violates a copyright

Copyright (unlike patents in general) allows for independent creation. If I sit down to write a quicksort routine, it is going to look extremely similar to a zillion other quicksort routines out there.

The other question (IANAL) is whether writing a quicksort routine is even a creative act at this point.
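As a concrete case of that convergence, here is the textbook quicksort that countless people have independently arrived at (one hypothetical rendering among many near-identical ones):

```python
def quicksort(xs):
    """The textbook recursive formulation; independent authors converge on it."""
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    less = [x for x in xs[1:] if x < pivot]
    more = [x for x in xs[1:] if x >= pivot]
    return quicksort(less) + [pivot] + quicksort(more)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # prints [1, 1, 2, 3, 4, 5, 6, 9]
```

Two people writing this from memory of the same textbook would likely differ only in variable names, which is exactly why independent creation matters as a doctrine here.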


At some point it'll have to come home to roost that code is a subset of discrete mathematics first, a literary/artistic work second. There is really no way around it.


> Microsoft will likely need to add checks to prevent copilot from offering verbatim copies of code going forward to try to avoid copyright violations here.

Or they could integrate a way to trace the produced output back to the corpus if it's sufficiently close, and provide a reference/attribution. Basically whatever tool a copyright lawyer would use to track down the original work.

And that's just the engineering solution. The AI researcher solution would be to extend AI learning algorithms to attach attribution metadata to the learned data so that the output could already come annotated with information about the source.

But the latter is much harder to do, so maybe the engineering solution would suffice.
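The engineering solution could be as simple as an n-gram index over the training corpus, consulted before a suggestion is surfaced. A minimal sketch (the corpus, file name, and window size are made up for illustration; a real system would index tokenized code at much larger scale):

```python
def ngrams(tokens, n=8):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus, n=8):
    """Map every n-token window in the corpus to the file it came from.

    First file wins on collisions; a real tool would keep all sources.
    """
    index = {}
    for path, text in corpus.items():
        for gram in ngrams(text.split(), n):
            index.setdefault(gram, path)
    return index

def attribute(output, index, n=8):
    """Return source files sharing a long verbatim window with the output."""
    return {index[g] for g in ngrams(output.split(), n) if g in index}

# Hypothetical one-file corpus and a suggestion overlapping it verbatim.
corpus = {"fast_inverse_sqrt.c": "i = 0x5f3759df - ( i >> 1 ) ; // what the"}
index = build_index(corpus, n=4)
hits = attribute("y = 1 ; i = 0x5f3759df - ( i >> 1 ) ;", index, n=4)
print(hits)  # prints {'fast_inverse_sqrt.c'}
```

Any hit could then trigger an attribution notice or suppress the suggestion, which is far cheaper than teaching the model itself about provenance.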


A Twitter thread linked yesterday showed that a keyword and the name of the original code's author in the Copilot prompt produced an almost exact copy of that developer's code. Copilot already does know the origin sometimes.

Edit: here's the related tweet: https://twitter.com/DocSparse/status/1581632706693079042


> Or they could integrate a way to trace the produced output back to the corpus if it's sufficiently close, and provide a reference/attribution. Basically whatever tool a copyright lawyer would use to track down the original work.

That assumes that the licenses of your code and the original code are compatible which often isn't the case.


No, it doesn't assume that. Ensuring that they are compatible would be the next step. Either manually by the user or automatically by showing a fat warning or retracting the suggested code completion.


> which is what copilot has been shown to sometimes do

In those cases it seems that humans are already copying code without also propagating licenses appropriately. LLMs are more likely to memorize things which occur a lot (and I'd bet rare things that are representative of some conceptual axis).

The main examples presented so far, Davis and Carmack, have the property of having been copied a lot. The generative model is only surfacing an existing pattern of ignoring attribution. Sort of like the code-gen version of generating bigotry if appropriately prompted.

I'll also note that this pattern of retrievable memorizing of copyrighted and sensitive material is present in GPT-3 too and not just for code. As the situation is equivalent, a lawsuit should address the concerns of non-programmers too.


That seems like a weak defense: "sure, we violated copyright but only because many other people do, too".

Kinda the same problem as YouTube. Lots of people copy movies on the high seas, but if you are as big as yt you cannot easily get away with it.


No, this is not meant as a defense. My point is that it's an issue that is already rampant and what Copilot (or any model) does is make it more readily visible.

This is not like youtube because Github is already hosting those violations and people are already inappropriately copying or including such code. It matters not whether the local inclusion was fetched by copilot or a human fetched it using more manual steps through search.


The difference is that if you build a tool that can be used to violate existing copyright laws, your tool will get taken offline by those same corporations. Yet they are selling one that can be used to do so.


Yes, I agree there's a measure of double standards to this. It's why I feel it's important that AI does not remain in the control of just a handful of corporations. The deck is stacked against that, though, given how data- and compute-intensive the state of the art is.

But in defense of copilot, code regurgitation is uncommon in routine use. An editor extension allowing search of github would be at least as easy to use to violate licenses but I do not think it'd be taken down since that would not be its core offering.

Copilot goes far beyond mere search and provides a useful service. GPT-3 can also be prompted into generating copyrighted works of writers but I do not see people talking as if that is its primary utility nor as much clamoring in these forums to end that service.


> An editor extension allowing search of github would be at least as easy to use to violate licenses

Well, try doing that for music or movies or proprietary leaked codebase.

If you think Copilot is merely producing things that are not that different from what humans produce, then surely no one would have any problem with feeding it massive amounts of corporate programs?

I am not aware of what writers are doing, but there has been plenty of uproar regarding Stable Diffusion. I have a feeling that if any tools like this get built for musicians or film-makers, the result will look vastly different from the current situation.


First, the music industry's (and to an extent the movie industry's) enforcement of IP is uniquely pathological. But I am not talking about music or movies. I am contending that a simple search extension, being much less capable than Copilot and thus even more easily characterized as aiding copyright violation, would not be taken down.

> surely no one would have any problem with feeding it massive amounts of corporate programs?

There is a similar gymnastics done by human engineers today due to the issue of patents. I don't think this is a good trend to uphold.

> I am not aware what writers are doing but there have been plenty of uproar regarding stable diffusion

Yes, but mostly in the art community. On HN there were plenty of arguments just the other day about how it is not the same for art, and that programmers have a stronger case. I disagree, but regardless, the case is exactly equivalent for GPT-3 and writers, yet it never became an issue generating about a thousand comments on respecting IP and ceasing deployment of LLMs until Copilot.


> There is a similar gymnastics done by human engineers today due to the issue of patents. I don't think this is a good trend to uphold.

Are you arguing that copyright laws should be abolished? I have no problem with that, as long as it's clearly defined: you can't ignore the copyright of open-source code while enforcing it for proprietary code, just because the value is arguably non-monetary.


Without a sample case I can't say for certain, but couldn't it also be a defense that some code is generic enough that it shouldn't be copyrighted?


One thing I worry about is that the uncertainty around copyright violation cools down activity on open models while raising the price of commercial offerings. Commercial entities can afford to devote resources to mitigating copyright violations, such as eating the cost of maintaining a database of frequently copied code and identifying its most likely origin, combined with a large semantic database of code snippets.

An open equivalent might be wary of being accused of contributing to copyright violations since in that scenario, there is no way to force people to respect it.
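The kind of origin database described above could be sketched roughly like this (a toy, with every name made up): normalize snippets so cosmetic edits don't evade matching, fingerprint overlapping windows of lines, and look a candidate completion up against known origins.

```python
import hashlib
import re

def normalize(code: str) -> list[str]:
    """Strip Python-style comments and collapse whitespace so cosmetic edits don't evade matching."""
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line)          # drop comments
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if line:
            lines.append(line)
    return lines

def fingerprints(code: str, window: int = 2):
    """Hash every `window` consecutive normalized lines (a shingle)."""
    lines = normalize(code)
    for i in range(max(0, len(lines) - window + 1)):
        shingle = "\n".join(lines[i:i + window])
        yield hashlib.sha256(shingle.encode()).hexdigest()

class OriginIndex:
    """Hypothetical index mapping code fingerprints back to (repo, license) origins."""

    def __init__(self):
        self.index = {}  # fingerprint -> set of (repo, license) tuples

    def add(self, code: str, origin: tuple):
        for fp in fingerprints(code):
            self.index.setdefault(fp, set()).add(origin)

    def likely_origins(self, completion: str) -> set:
        hits = set()
        for fp in fingerprints(completion):
            hits |= self.index.get(fp, set())
        return hits
```

An open project could maintain something like this as a post-generation filter; the expensive part is not the lookup but building and hosting the index over the whole training corpus.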


> not just for code

This is quite important, actually, and I don't think enough people realize this. I am a photographer sometimes and it would be really cool if I could share my photos online under a copyright license that forbids their use in training AI.


The problem seems to be that someone could just copy your photo and repost it without that license and we're back to the same spot.


I really appreciated the argument in the book "The New Breed," which is that we should adapt ideas around the governance of animals to governance of ML: You can train your dog to attack random passersby, but if you do, you're a monster and ultimately responsible for the dog's actions.

Likewise, you can tell Copilot to crank out copies of specific algorithms written by specific people, but if you do so, you're still creating infringing code, same as if you'd taken the more direct route of ctrl-c+ctrl-v. The fact that you /can/ make the algorithm misbehave through adversarial input is irrelevant to the primary use cases, which lead to boring, non-infringing code completions.


This just sounds like blaming the researchers to me. How would I ever know if my "boring code completion" was actually copyright infringement?

Your argument just disallows discussing the problem while doing absolutely nothing about it.

If you train your dog to NOT attack random passersby and it still does, that dog is euthanized no matter your intentions.


> If you train your dog to NOT attack random passersby and it still does, that dog is euthanized no matter your intentions.

Of course, but you will not face manslaughter charges in that case.

So, following the same logic, if you train your copilot NOT to infringe on other people's copyright and it still does, it should be destroyed no matter your intentions. But at least you won't be charged with copyright violation yourself.

That said, I don't believe Microsoft's actions to be benign. I think this copyright whitewashing scheme is fully in line with their old MO, purposefully creating a legal quagmire surrounding all open source code.


Personally, I think you are right to distrust MS (as we should any corporation, really). I will admit that this attempt is working, in the sense that it is a lot less clear to a non-computer person as to:

- whether there are any damages
- what the big deal is

In my mind, this entire thread identified a lot of those, but I think someone already said that it will likely be tested in court (and I have zero idea which way it will turn).

For the record, I personally think Copilot is a cool tool (frankly, it is not that different from an automated Stack Exchange in terms of results). If I worry about anything, it is that the overall standards will decline even further.


Tim Davis doesn't actually have any instance of copyright infringement to complain about; he was able to induce Copilot to /mostly/ recreate his code through careful prompting, but no one has actually deployed the code. By the same token, we don't outlaw ctrl-c and ctrl-v buttons on computers.

There is plenty of space here to discuss developing tools to check for unintentional infringement. I would guess, though, that such tools would sweep up a whoooole lot of non-copilot human usage and make it much harder to deploy anything new.

So, maybe a better discussion to have here is how to make the animal safer, not whether to outlaw the animal entirely. Single-line completions (the majority of Copilot usage) aren't infringing. The same is probably true for almost any few-line completion. So capping the amount of consecutive auto-completed code might be a reasonable 'muzzle' to keep the model reasonably safe.
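A 'muzzle' like that could be as simple as a truncation step in the editor plugin. This is purely illustrative, with an arbitrary threshold:

```python
# Cap how many consecutive lines a completion may contribute before the
# human has to write something. Threshold chosen arbitrarily for the sketch.
MAX_CONSECUTIVE_LINES = 4

def muzzle(completion: str, max_lines: int = MAX_CONSECUTIVE_LINES) -> str:
    """Truncate a multi-line suggestion to at most `max_lines` lines."""
    lines = completion.splitlines()
    return "\n".join(lines[:max_lines])
```

The real design question is where the cap lives: in the model's sampling loop, the API, or the client; a client-side cap is trivially removable, which matters if the cap is meant as a legal safeguard rather than a UX choice.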


I think we have an ideological disagreement here. I'm not part of the "open source" movement, I believe in free software. Although I'm not prolific, I have authored some free software and shared it widely. I want people to have it, use it, and share it, so long as they extend the same rights to their users.

Now my software has been assimilated into a proprietary blob. Had that blob been free, like my software within it, I would have accepted it, but it's not. It's controlled exclusively by Microsoft and OpenAI, two entities which I place no trust in.

For me the dog has already bitten. The free software I extended to an audience I believe would show the same generosity has instead been made into a proprietary product.

The "copyright" question for me is not a question of "fairness" or ability of Microsoft or anyone else to make a product. For me it's a tool to protect my contribution from proprietary business.

Basically, I don't want the animal safer, I want it free (according to the FSF freedoms).


> The free software I extended to an audience I believe would show the same generosity has instead been made into a proprietary product.

That's exactly my complaint about Copilot. And since all code hosted on GitHub is now subject to this land-grab, my only recourse is not to use GitHub any more if I want to publish a project of mine.


Code ingestion is not limited to github or copilot. Your best recourse is to make your code publicly inaccessible.


I am not sure your software has been assimilated into a proprietary blob. Rather, it's been sliced into its constituent parts and those parts have been tagged by the proprietary blob.

Your code isn't being run by Copilot, as such; it's been categorized in a way that allows partial retrieval without the license or attribution. This might seem like a distinction without a difference, but it's kind of a ship-of-Theseus problem: probably nobody is running any of your programs in their entirety, but it's very possible that bits of your code have found their way into other people's programs. How do you distinguish between contributions that are uniquely yours, and those which are just helper functions or cobbled together from other example code, e.g. in documentation or from a book or Q&A website?


I am not a lawyer, but I don't think anyone needs to deploy the code in order to infringe copyright: they just need to distribute the code to a third party (hence copyright -- the right to copy). And on the face of it, Microsoft would appear to have distributed Tim Davis's code, in compressed form, as part of the trained language model in Copilot.


But in this case copilot is not equivalent to copy-paste. When doing copy-paste, you are acting with knowledge of the source of the copied code and with intent to copy code.

With Copilot, you are acting neither with knowledge of the source nor with intent to copy; in fact, I'm sure users have a reasonable expectation of the tool not copy-pasting existing code verbatim.

IANAL, but I'm pretty sure that intent matters a lot.

Popcorn time was also just a tool to allow you to stream data from torrents. That didn't seem to help them put up a legal defence (nor should it have, because the intent was pretty clear on that one).

And seriously, if cases exist, where the only thing a tool does (albeit via a VERY complex implementation path) is to strip a license from a piece of code and serve that code up via an API, then that really does sound like the creators of the tool are at fault.


If you build a system that has a high likelihood of breaking the law in normal expected use, and then it's found to break the law, shouldn't we disincentivize that in some way? Is that just blaming the researchers/developers, or is that just making people respect the law?

I think the important thing to note in both dog attack scenarios presented is that the owner is responsible in both cases. Either they purposefully created an unsafe situation or they were negligent in protecting the public from their property. Whether the dog is euthanized is about preventing it from happening again. Preventing it from happening in the first place is done by making the owner liable to disincentivize it.


I'd argue that the law was already broken when my free and viral software was included in a non-free package.

Personally, I don't care about the end users. If you want to read my source, I welcome that. I just want the Copilot model and system open, since it was based (in part) on my work. Otherwise they are free to remove my work.


And you are free to sue them.

What was your plan when someone eventually infringed on your work?

If you want people to abide by your license, you have to enforce it yourself.


This article is about pursuing a class action lawsuit…


"Not knowing" doesn't free you from responsibility.

If you took a bunch of copyrighted and non-copyrighted books, cut them into pieces, shuffled them all together, then picked a passage at random from a hat; "not knowing" what you are going to get doesn't mean you aren't violating copyright.

That's essentially what Copilot is doing: it's taking a bunch of code (some of it copyrighted, without a permissive license) and using it as a dataset. The ML algorithm then pattern-matches against that data to provide the user with something they want. That's just a copyright-violation lottery with extra steps.


What stops humans from reproducing thinly disguised copies of their influences is, essentially, their ethical judgement.

Which amounts to saying, humans are trained with a model that they can use to recognize when something they are thinking of producing is 'too similar' to something they have seen before.

And, of course, some humans choose not to apply that filter and go ahead and plagiarize anyway; some humans try to apply that model but get back a false negative, thinking they're producing something original when they aren't. And we have ways of dealing with humans who do that.

In the case where an AI is coming up with the work, perhaps the mistake is in relying on humans to try and apply their own trained judgement to figuring out if the result is unoriginal. We need an AI that scores work for how likely it is to be infringing on a prior copyright.

Then you use that AI to train the creator AI, and teach it 'originality'.
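A crude version of such a scorer, as a sketch: measure what fraction of a candidate work's token n-grams appear verbatim in the training corpus. Real systems would need far fuzzier matching (paraphrase, variable renaming), but this illustrates the shape of the filter.

```python
def ngrams(text: str, n: int = 5) -> set:
    """All length-n runs of whitespace-separated tokens in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(candidate: str, corpus_docs: list, n: int = 5) -> float:
    """Fraction of the candidate's n-grams appearing verbatim in the corpus (0.0 to 1.0)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(cand & corpus) / len(cand)
```

A score near 1.0 would flag likely regurgitation; the generator could then be penalized during training whenever it produces high-scoring output, which is one way to "teach originality".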


That's a great point about what we naturally do.

> In the case where an AI is coming up with the work, perhaps the mistake is in relying on humans to try and apply their own trained judgement to figuring out if the result is unoriginal. We need an AI that scores work for how likely it is to be infringing on a prior copyright.

Isn't that latter AI going to be more likely to need to contain verbatim copies of original works? Or maybe not?

This in turn (and the SFC's and the law firm's concern about the GPL) makes me think that there are several different things that people may be concerned about machine learning systems doing:

* they could allow you to access verbatim copies for "consumptive" use (like if you asked an AI a question about what the text of a chapter of a Harry Potter novel was, and it answered you correctly)

* they could facilitate intentional or unintentional plagiarism, and, in the case of publicly-available works that are published under a license, intentional or unintentional reproduction or creation of derivative works contrary to that license

* they could contain something like a verbatim representation and allow you to use that in various ways that themselves are not extracting or literally copying that representation, but where the original copyright holder might complain that the existence of an unlicensed copy inside the model is already objectionable

* they could contain representations of uncopyrightable subject matter which was learned through training on copyrighted works, which can then be used to compete with the original creators for jobs, prestige, or attention, or can be used to produce works that the original creators would have found offensive or objectionable (this case isn't supposed to be restricted by copyright at all, but that doesn't necessarily stop people from caring!)

Not only will the same measures not prevent or avoid these cases, but if you wanted to prevent the first and second situations, one of the easiest ways to do it might be to literally include verbatim copies of lots of works inside a machine learning model! (along with software specifically trained or programmed to warn you against unintentionally making uses the user or copyright holder finds objectionable ... to facilitate the exercise of "essentially, their ethical judgement", as you put it)


> copyright holder might complain that the existence of an unlicensed copy inside the model is already objectionable

I'd be wary of this one. There doesn't seem much distance between this and the same claim against a human's memory of a work in their own brain. Yes, that sounds like dystopian fiction. So do some things that have already happened.


We already have such an AI that scores work for how likely it is to infringe on a prior copyright. It is written by Google and operates on YouTube (Content ID).

The big question is whether we think Google did a poor job with that AI, and whether a company with more money and more data can make a better one that teaches originality.


> Here are a few thoughts I haven't formulated before:

> It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works. It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture (grammar, concepts, facts, cultural tropes, and many others).

I said it already in a previous discussion: I would be very careful with comparing ML to how humans learn. To me there are still a lot of examples showing that AIs don't understand prompts (see e.g. the discussions around the "horse riding astronaut" prompts for Stable Diffusion et al.), and it seems like they really are just doing sophisticated pattern matching. If that is what they do, aren't they themselves covered by the licenses/restrictions placed on the "patterns" they "choose" from?


I think any argument against generative AI should not hinge on there being a fundamental difference in how humans and generative models work.

I mean sure, maybe humans aren't "just" doing sophisticated pattern matching, but there are good reasons to suspect this is some part of what we are doing. (Even if it's not implemented with back-prop.)

E.g., consider the work of people like Anil Seth, who proposes that our brain is basically a generative model of the world, which aims to minimize prediction error on perceptual data (see also Karl Friston's free energy principle). What's up for debate is how it is structured, what priors are built in, what the learning algorithm is, etc.

Anyway, for all their limitations, it seems clear that current artificial generative models can: 1. learn hierarchies of abstractions, which 2. explain the observed data in the fewest possible number of bits, and 3. generate new, novel data based on the patterns that have been learned

If you want to describe this as "just sophisticated pattern matching".. then sure I guess? But I think there's a clear qualitative difference between this and searching for code in a discrete database (which imo would not be okay).


https://arxiv.org/abs/2106.06981 - Thinking Like Transformers

The paper suggests that transformers work on a set of select, aggregate, and element-wise operations, which seems pretty close to the SQL statements I write from day to day.
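Toy versions of those two primitives, loosely following the paper's RASP model (my own simplification, not the paper's code): select builds a boolean attention matrix from a predicate, and aggregate averages values over the selected positions, much like a join followed by GROUP BY/AVG.

```python
def select(keys, queries, predicate):
    """For each query position, mark which key positions it attends to (boolean matrix)."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """Average `values` over each row's selected positions (0.0 if nothing selected)."""
    out = []
    for row in selector:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out
```

For example, selecting with `k <= q` over positions and aggregating the token values gives a running mean, one of the paper's basic building blocks.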


AI copyright drama is my favorite gossip these days because it can’t be reconciled until we accept that intelligence is created and held by societies, not individuals. Recent AI is a new way to exercise that intelligence, but it presents a major conflict with capitalism.


Not a conflict with capitalism; it's another tragedy of the commons. The robber barons of old stole the owned commons land and started said 'capitalism'.

Capitalism is still very alive, and will continue to be. It's in conflict with the general welfare of the people...


I assume they meant “capitalism as it is currently implemented”. In any case, not like capitalism is suggested as being under threat - just that an economic model based on competitive markets will have a lot of issues with fairly allocating resources to all individuals, instead accumulating most of it to a few industry leaders.

Something will have to change.


the octopus would like a word with you


What is an octopus if not a society of cells?


Just about two-thirds of an octopus's neurons are found outside its central brain, distributed across its eight arms. An octopus can be likened to a hive mind of sorts.


> It seems clear enough to me that training AIs on copyrighted works is typically or commonly a fair use under existing law

Which laws are considered in this case? I understand that fair use is a US concept. For example how does that apply to my projects, published and licensed by a European living in a European country? I would expect the majority of GitHub contributors to not be based in the US, so what laws should be considered?


From GitHub’s Terms of Service [0]:

> Except to the extent applicable law provides otherwise, this Agreement between you and GitHub and any access to or use of the Website or the Service are governed by the federal laws of the United States of America and the laws of the State of California, without regard to conflict of law provisions. You and GitHub agree to submit to the exclusive jurisdiction and venue of the courts located in the City and County of San Francisco, California.

[0] https://docs.github.com/en/site-policy/github-terms/github-t...


I have code of mine which has been uploaded to GitHub without my permission (other than it being licensed under GPL or MIT, no contributor agreement). I cannot see how that would be covered.

Additionally copyright infringement can be a criminal matter in my country and the Swedish prosecutors have certainly not signed these agreements.


The way GitHub is acting here, it seems to be a case of "if no-one takes us to court and sticks through to the end, then we can do whatever the hell we want". aka "Most people complaining are just making noise".

CoPilot doesn't seem to be a terrible implementation, instead it seems to be relying on it operating in a grey area. So they're going for broke, to try and get wide enough adoption that it becomes a fait accompli.


Anyone can say that, but that doesn’t make it real, especially with regards to European consumer protection.


Is there some EU-USA treaty that would prevent jurisdiction clauses in a normal contract between an EU resident and a California company?

The mechanisms limiting this are mostly about privacy, not about whether you can agree to adjudicate copyright or ToU disputes in California.


It's real because our laws (including those of European countries) make it real.

Some seem to assume there's some general "If it's American it's invalid" law in Europe. This is not the case. With the exception of specific laws, such as GDPR regarding privacy, this is a perfectly valid clause.


Copyright law can be a criminal matter in my country and then it will be handled by a Swedish court. You cannot just write a contract which makes you immune to criminal prosecution.


In Sweden you can set the jurisdiction for civil disputes in a contract.

Which is what everyone is talking about here.

For criminal, there is only jurisdiction in Sweden if the crime happened in Sweden. I would need you to link me a case where Sweden criminally convicted someone for copyright infringement who wasn't Swedish and wasn't in Sweden.

For example, I am not Swedish and do not travel there. Sweden has no power to enforce its laws against me. No matter what I do, I shouldn't be able to be convicted of criminal copyright infringement in Sweden.


As a human, I am likely to have the specific goal of not violating copyright when I write code. That means either deliberately not writing things exactly the same way that I know I've seen them written in other places or else deliberately complying with their license policy if I really want to lift code verbatim. Perhaps Copilot needs some sort of feedback loop to avoid over-similar code by default & an extra "allow (compliant) copying of open source" toggle to make it behave similarly.


> Computers do have such accurate memories, and machine learning systems, which are optimized for things like maximum likelihood estimation, can and do reproduce both copyrightable and non-copyrightable elements of works that they've been trained on. After all, the maximum likelihood continuation of a fragment of a text or a song is ... the complete original work. And the ability to reproduce the complete original work would, other things being equal, reduce loss in training. After all, that's something someone might specifically ask for, and if the system could oblige, it would be doing a better job of providing what the user wanted.

This actually depends on how you train the model. Techniques that use unlikelihood training to penalize plagiarizing models exist. Microsoft/OpenAI are of course aware of those techniques but have chosen not to use them. The reason why is not difficult to figure out: the model hasn't learned how to implement sparse matrix multiplication in C, it has learned how to spit out someone else's code with a few variable names changed. Not unlike how many CS students not cut out for software development try to pass their entry-level programming courses. Professors use anti-cheating software to catch cheating students. Such software would catch Codex too and expose it as incompetent. Hence why it is not used.
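For reference, the unlikelihood idea (Welleck et al., "Neural Text Generation with Unlikelihood Training") adds a -log(1 - p) penalty on tokens the model should NOT produce, alongside the usual negative log-likelihood on the target token. A pure-Python toy on a fixed distribution, where the "negative tokens" stand in for verbatim continuations of training data:

```python
import math

def nll_loss(probs: dict, target: str) -> float:
    """Standard negative log-likelihood of the target token."""
    return -math.log(probs[target])

def unlikelihood_penalty(probs: dict, negative_tokens: list) -> float:
    """Penalize probability mass placed on flagged (e.g. plagiarism-risk) tokens."""
    return -sum(math.log(1.0 - probs[t]) for t in negative_tokens)

def total_loss(probs, target, negative_tokens, alpha=1.0):
    """Combined objective; alpha trades fluency against the anti-copying penalty."""
    return nll_loss(probs, target) + alpha * unlikelihood_penalty(probs, negative_tokens)
```

The hard part in practice isn't the loss term, it's deciding which continuations to flag as negatives, which is itself a near-duplicate-detection problem over the training corpus.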


It seems pretty clear to me that training an AI on copyrighted materials is not fair use. I'm not sure why you seem to think it is fair use.


And it seems pretty clear to me that it is fair use, because it's not merely reproducing or creating a derivative work, but actually extracting patterns and modeling the works in a way that is intended to be used to create new and unrelated works. The fact that an occasional piece of code here and there might be reproduced verbatim is no different than e.g. Cliffs Notes occasionally quoting a passage, and Cliffs Notes are a well established case of fair use that to me, at least, seems even closer to "the line" than Copilot or Stable Diffusion.

FTA: "On the other hand, maybe you're a fan of Copilot who thinks that AI is the future and I'm just yelling at clouds. First, the objection here is not to AI-assisted coding tools generally, but to Microsoft's specific choices with Copilot. We can easily imagine a version of Copilot that's friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus."

This is the same argument that people use about Stable Diffusion, and it's kinda meh to me...I guess it'd be nice to allow people to opt-out, like Stable Diffusion is doing with their next versions, especially since a negligible percentage of people will do so and it won't affect the models at all. But yes, it basically is yelling at clouds. Opt-in would cripple models, and some people would make them anyways and just keep them secret, which is worse for the world. And at the end of the day, this really does just seem to me like a fair use of stuff that you've published on the Internet for anyone with a browser to look at. The AI models of the future are going to gobble the whole net up, and if you don't want them ingesting your stuff and learning from it, then you just shouldn't make it freely available.

If OpenAI/GitHub/MS really wanted to get ahead of this and head off any potential legal conflict, they could always just open source the models and weights, which would be in line with the name "OpenAI"...it would be a minor project to scrape all the correct headers to add to a license file(s), but negligible compared to the many millions of dollars spent on training.


Cliffs Notes adds commentary and critique for educational purposes, they are doing what fair use is intended for. Copilot does not.

Also, it's pointless to say "But X does Y" in copyright discussions. You never know if they license the content properly or if they infringe the rights. In the Cliffs Notes case, they might not need fair use at all, because the old works are already in public domain.


It depends on what the AI is learning.

If the AI is learning to repeat text (e.g. Copilot) or images (e.g. Dall-E), then that makes it possible to reproduce the copyrighted works, so I would agree that that case is not fair use. -- It would be akin to compressing and distributing those works.

If the AI is learning patterns -- such as "muggle" being a noun that relates to Harry Potter, or that the lemma for "muggles" is "muggle" -- then that is less clear. You can avoid the situation by creating your own sentences with those terms in them, and annotating those sentences instead of the copyrighted ones. That way, the AI is still learning the same information.


You actually just convinced me of the exact opposite.

Because Copilot's "use" of the works _was_ the learning.

So it would seem to me that Microsoft needs to apply "fair use" to copy and redistribute _the entire works_ they used for training.

In which case, lack of fair use may well be the least of their problems; they are really crossing into Computer Fraud and Abuse Act territory, similar to when Aaron Swartz "borrowed" MIT's data.


I'm not sure how Copilot works, but I don't believe Dall-E repeats images. From my understanding it creates visual concepts of words and uses them to create entirely new images. If Copilot works in the same way for code, I honestly don't see that there should be any copyright issues here.


It just so happens that, sometimes, parts of these entirely new images are exact copies of those used for training.


Do you have a source for this?

This has been claimed many times, and I've heard that DALL·E 1 & 2, Stable Diffusion, and Midjourney can all create images that are exact copies of the training material.

This doesn't make sense considering the compression ratio of training images to model is about 1:25,000.

Further investigations I have made show that all these cases can be explained via the following:

1) The prompt included an image, so some form of image2image was used. Of course if you use an image as a base, and tell the model to stick closely to that image, the output will largely resemble that image.

2) The example was completely made up.

So far I have seen no evidence, given a text prompt, the output of an image containing some portion of any image from the training set.
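For what it's worth, the ~1:25,000 figure roughly checks out on back-of-envelope numbers; the figures below are assumptions for illustration, not authoritative stats.

```python
# Rough public figures (assumed): ~2 billion training images at ~50 KB each,
# versus a model of roughly 4 GB of weights.
n_images = 2_000_000_000
avg_image_bytes = 50_000
model_bytes = 4 * 10**9

dataset_bytes = n_images * avg_image_bytes
ratio = dataset_bytes / model_bytes          # how many times larger the dataset is
bytes_per_image = model_bytes / n_images     # model capacity "budget" per image

print(f"dataset ~{dataset_bytes / 1e12:.0f} TB, ratio ~1:{ratio:,.0f}")
print(f"~{bytes_per_image:.0f} bytes of model per training image")
```

A couple of bytes of capacity per image makes wholesale memorization implausible, though it doesn't rule out memorizing a small number of heavily duplicated images.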


Your comment includes exact copies of words and phrases which I have also used prior to you, so you are violating my copyright even if you didn't intend that.

Well, I don't really think you are violating my copyright. But by focusing on parts, you go down a rabbit hole of equating an element with the whole. This would render all collage art illegal. Lawyers and art pundits love ruminating on the uncertain legality of collage art (because it's not a binary question, so they can churn out endless articles that boil down to 'it depends'), but this glosses over two important realities:

1. Nobody gets sued over collage art, largely because any case is doomed to end up with lawyers measuring the size of collage elements with rulers and then arguing about what small percentage is too much, an uncertain exercise few law firms wish to gamble their reputation on, and

2. nobody gets sued because collage art isn't worth very much to begin with; collages aren't valued very highly because they aren't as hard to make as painting or other art forms. 'Appropriation artists' like Richard Prince get rich and famous partly because their art is less about the image than the cultivation of notoriety for artistic effect; they are artists of scandal rather than pictures.

In general, bits of things are just not that important, and I'd argue that the same applies to code. If part of your code matches a prompt (excluding highly specific prompts like '# insert Woodson's unique XYZ algorithm here') and is then deployed in another program without alteration, isn't that most likely to be because it performs some generic function?


I've generated thousands of images on Stable Diffusion Dall-e 2 and Midjourney by now, and what you say here simply doesn't make any sense.


> I'm not sure why you seem to think it is fair use

I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.


> I think OP explains clearly, in many paragraphs, why it's fair use. That's literally what their whole post is about.

Actually, what the OP said is, "is typically or commonly a fair use under existing law, because the AIs can and commonly do learn non-copyrightable elements and aspects of those works". The rest of the eight paragraphs had nothing to do with fair use.

It's honestly a ridiculous argument to say that learning one non-copyrighted thing means that the regurgitation of another copyrighted thing, after stripping the license, will magically be fair use.


The comment is a quintessential HN comment: all tone, little substance. It just claims that it's fair use because the AI learns things, which is not a criterion for fair use at all. People here just throw around fair use as a catch-all term for everything that should be allowed based on their personal gut feeling.


The concept of fair use applies to small volumes of work.

Clearly, training on large volumes of data is not small volumes in any sense of the word. The argument that it is fair use is itself flawed.


Absolutely incorrect: fair use applies to *reproducing* small volumes of work, not to analyzing it. If I published an article gleaning some conclusion from an analysis of 10,000 issues of the New York Times, that would still 100% be fair use; similarly, Google is absolutely allowed to publish word-count metrics based on its scanned-book repository, even though publishing the books themselves is not fair use. You are trying to read something into the fair use doctrine that is absolutely not there (to the extent that anything is there, which is very little, other than "I'll know it when I see it" and prior case law, unfortunately).


When I fair use a small quote from a book, I may have read the whole book.


Now I'll go the other way and wonder if it should still fall under fair use if I respond to requests for small quotes programmatically and eventually quote the entire book.

Or here is the real analogous question:

Fair use is about more than just the size of the excerpt.

If you write an article about good writing, and quote a choice paragraph from someone else's work to show an example, and credit that quote, that is fair use.

Is it fair use if you read an awesome paragraph, something that really is the result of the authors unique intellect and effort and craftsmanship, and makes you think "damn", and then drop that same jewel into your book?

The difference is, the paragraph isn't being included for examination or comment or transformation; it's being included to directly copy and perform its original function as part of what makes a work a great work, and it's not being credited in any bibliography or footnotes or directly.

The reader reads the paragraph and is impressed by your deep insight, which you never had, and the original author did.

I think, all in all, this sort of copying and re-use should be allowed to happen somehow, because software is more like a machine than a novel, and humanity benefits when machines work well. There just need to be some rules about what gets included in the training sets and how both the input and the output are credited and acknowledged.

Right now, I think GitHub are simply outlaws. 100% of the output is violating the copyright of the code in the training set, because 100% of the input is copyrighted one way or another and none of it is being declared on the output. And it's allowing incompatible sources to mix and the original terms to be stripped. The training set includes both proprietary and open source, and the output is being used in both proprietary and open source.

And there is no way that Github does not have this same understanding that I just described. I refuse to believe I am that special that I can see this and no one at Github did.

So they are not merely possibly inadvertent outlaws; they are deliberate, knowing, intentional outlaws.


I think a key thing here is your identification of a paragraph. Nobody would think to exert copyright over individual words. Phrases and epigrams are considered worthy of attribution, but only in exceptional cases. Copying sentences is starting to get into plagiarism, though single sentences would usually be forgiven, because noting or remembering a single sentence while forgetting the source is an easy mistake to make. Copying a whole paragraph, by contrast, is unlikely to be casual.

I think in programming terms a useful parallel might be copying at the module rather than the statement or function level. For example, if I write some code prompts to do the following:

  - validate my API key with Twitter
  - solicit the input of a Twitter username
  - download the up to 500 of that user's tweets
  - convert the json to a dataframe
  - plot the derivative of the intervals between tweets
...many of those tasks can be fairly described as helper functions, either taken directly from documentation (like interfacing with an API) or being so elementary as to be generic. If any one of these tasks happened to come from your code or mine, and the rest from other programs, it wouldn't feel like much of an infringement. If all of them came from the same body of code, it would.
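To make that concrete, here is a hypothetical sketch of the sort of "so elementary as to be generic" helpers the last two steps describe (function names, timestamps, and structure are all invented for illustration). If a tool emitted something like this, it would be hard to say whose code it came from:

```python
from datetime import datetime

# Hypothetical helpers for the last two steps above: intervals between
# tweet timestamps and their discrete derivative. Any independent
# implementation of something this elementary would look nearly identical.

def intervals_seconds(timestamps):
    """Seconds elapsed between consecutive ISO-8601 timestamps."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def interval_derivative(timestamps):
    """Discrete derivative: how much each gap differs from the one before."""
    gaps = intervals_seconds(timestamps)
    return [b - a for a, b in zip(gaps, gaps[1:])]

tweets = ["2022-10-17T09:00:00", "2022-10-17T09:01:00", "2022-10-17T09:03:00"]
print(intervals_seconds(tweets))    # [60.0, 120.0]
print(interval_derivative(tweets))  # [60.0]
```

If any one of these came from your repo, nobody would notice or care; if a whole module's worth came from the same place, it starts to look like copying.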


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning

Not really.

A human being learns by doing, it takes a lot of time, their knowledge is not transferable, and, above all, they buy the material they learn from (most of the time). It's not fair use; it's "I paid for the entire opera", sometimes multiple times: different editions, movies, TV shows, etc.

Secondly, it's not true that derivative material is automatically copyright free.

It is, in all honesty, the contrary: most derivative work that reached popularity is plagued with plagiarism, lack of attribution, undisclosed ghost authors, etc., all things that get settled with a contract or in court if the publisher thinks it's worth it.

Otherwise the publication simply disappears.

In other cases the work is licensed, so that the publisher can use someone else's IP and literally resell other people's ideas and/or change them the way they like (or the license permits), without having to create new material and take the risk that nobody will notice it.

Case in point (among too many)

https://en.m.wikipedia.org/wiki/Legal_disputes_over_the_Harr...


Copyright no longer (and perhaps never) serves as a tool to further the creative/productive output of the society. It should be demolished and rewritten, and when in doubt, it should allow rather than disallow.


I find that this comment reduces humans to elaborate Markov chains and then uses that misconception to make a point.

Many of humanity's best works (paintings, classical music, golden age of physics) have been created before humans voluntarily reduced themselves to automata.

AI hasn't produced anything apart from mashing together other people's creations, usually with a somewhat creepy result.

In programming this may work because quality does not matter, only LOC and social capital with the rest of the brogrammers. The objection that therefore real programmers do not have to be afraid is false. They either have to join the mediocrity or clean up the mess that the brogrammers make (while being disrespected by them of course).


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning

You have to be very careful with this line of thinking. I remember SCO versus Red Hat began on much smaller premises. I also remember it took years for ReactOS to audit their code after mere suspicions arose about code that seemed to be inspired by something like asm-to-C translation.

The GPL is clear: derived works must be under the GPL too. The license must be respected, it doesn't matter if it was copied "from inspiration because it was learned" by an algorithm or a person.


I think the problem is that a powerful entity is profiting off of other people's work without their consent and gives nothing to the exploited members in return. Sure, an individual human learns from copyrighted works and reincorporates to make something slightly new all the time. And then they may also profit from it and not give anything in return to those that came before.

The problem here is the scale and the power that enables that scale. This is industrial level mining of non-consenting humans, exploiting their life's work in many cases.


Maybe the solution here is to adopt the approach from humans: if you independently produce someone's copyrighted work and then find out about it, you drop it. The same approach could be used here: they can add a check for the similarity between the output and the original training material, and if it is above a threshold, drop the suggestion (maybe they are already doing that).
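A crude version of that similarity check can be sketched with verbatim n-gram overlap. This is a toy illustration only: the function names and the 0.5 threshold are arbitrary, and a real system would need normalization and hashing to work at the scale of a training corpus.

```python
# Reject a suggestion whose verbatim 8-gram overlap with the training
# corpus exceeds a threshold. Toy sketch; not how Copilot actually works.

def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(suggestion, corpus_text, n=8):
    """Fraction of the suggestion's n-grams appearing verbatim in the corpus."""
    sug = ngrams(suggestion.split(), n)
    ref = ngrams(corpus_text.split(), n)
    return len(sug & ref) / len(sug) if sug else 0.0

def filter_suggestion(suggestion, corpus_text, threshold=0.5):
    """Return the suggestion, or None if it parrots the corpus too closely."""
    return None if overlap_ratio(suggestion, corpus_text) > threshold else suggestion
```

The hard part, of course, is that "similarity" to copyrighted work is not the same thing as verbatim overlap, which is where such a filter would fall short.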


That doesn't work, though - the problem is you have automated crime. When you do this, you can no longer handle this on a case by case basis - you have to resort to automating justice.

And at this point, you are getting both attacked and supported by AI, and not really better off in a meaningful way.

There's no good way of solving this issue, without general intelligence, and the problems that will bring (what reason does a generally intelligent AI have for supporting us or not enslaving us).

This is why all AI research, IMO, is unethical. Point me to a single AI use that has not already been abused and maybe I will change my mind; as it stands, though, we should be prosecuting the people misusing this technology, or at least irresponsibly releasing it, as quickly as possible, before we get to the point where we are no longer fighting bad actors but the machines themselves.


AI research is basically just “the history of computer science”. Alan Turing’s Imitation Game being an apt example.

I don’t think AI research in a vacuum is as deeply unethical as you suggest. It’s about current societal context- people won’t like being worse at things; resources will be hoarded rather than distributed.


>If human beings had much more accurate memories, copyright would be quite a bit more intrusive (and/or quite a bit less effective) because, following any exposure to some kinds of works, we could use our own memories to reproduce those entire works from scratch for our own use or pleasure without obtaining authorized copies from elsewhere.

I don't know the name, but I remember some sci-fi story about some academy where humans were trained from birth without exposure to music others had written and had to reinvent it on their own. Some would cheat and access the outside world's music, but they would always be caught by their later compositions all having obvious influence from conventional music.

<Insert obvious joke about how I'd remember the name if I had computer-like memory.>


There is a short story by Orson Scott Card with that theme, called 'Unaccompanied Sonata'.

You can read it here: https://b-ok.cc/book/4395497/b2fb2e


Yes! Thank you! That’s the one!


Plenty of living musicians today have this capability.

It turns out that humans can extrapolate generalisms to a degree we are currently unable to explain clearly enough as a model to imitate.

It turns out that much ML is merely referenced regurgitation.

Marketing and hype are rather advanced skills in 2022, however…


I don't know why we should be concerned with the status quo of copyright law at all with respect to AI. ML is categorically new in how it applies to these domains, and it's not clear to me at all that rules that apply to humans have much to do with rules that should apply to machines.

Imo it is very simple: IP law is intended to incentivize creative work, so that it remains possible to profit from one's creation in an environment where it might be easier to copy than it is to create. We just need to figure out what outcome we want to create: one which incentivizes human creations, or AI "creations" - and build a legal framework to support it.


It is an interesting take, and it reminds me of the thinking that made crypto what it is today, which went something along the lines of:

"Old systems suck and our new system is great and it is new technology. Therefore old rules do not apply to it."

Not surprisingly, the moment crypto started gaining traction, everyone was quickly made to understand that rules do indeed apply, even if it is a new facet of finance regulations (or, in the case of Copilot, copyright law).

For the record, I am sympathetic to your sentiment, but you can't really expect existing interests to accept a major change if it happens to undermine someone's way of life and, possibly, alter the current legal landscape. And this may end up a much bigger change than expected and may finally usher in the era management always wanted.


I'm not sure if I expressed myself well: I'm not making the argument that we should just accept a future where AI has free rein. Quite the contrary: I would argue we should examine how to preserve the spirit of the precedent - which is to protect creators - which may require us to create new laws.


They’re clearly producing derivative works.


I think you're saying any work created by a model trained on copyrighted data is a derivative work of that copyrighted data.

But this can't be right, it is inconsistent with how copyright has worked so far. Artists and musicians and engineers all learn from each other and have seen and learned from, "trained on" many other examples of works from their field. Even when works are clearly inspired by other works we tend not to give them the legal status of derivative work.

You're suggesting we treat models with a much stricter copyright regime than has previously existed.


Courts in the US have repeatedly ruled that humans and machines aren't the same in the eyes of copyright. For example, under current case law, nothing created exclusively by a machine is copyrightable.


This is not sane, sustainable or justifiable. E.g. what about that future when we have actual AI people?


I hope I shall never be forced to call a Microsoft product a person. I find your entire implied world view to be a cheap parody of the rights living beings inherently should have.

If we get to the point that capitalist maximalist-utilitarianism insists upon hijacking the very concept of what a living organism is, I can only compare it to a teddy bear vs an actual bear.

It’s not enough to merely put fur on it and an internal rom for it to regurgitate prefabricated roars upon contextual prodding.

Respect the life you are only one instance of, for hubris has always brought suffering and pain in its wake.


Yes. The copyright law must change. This is different.


why must it change, why is this different? genuinely curious


Because the scale is different.

You have a mechanism that can regurgitate (digest, remix, emit) without attribution all of the world's code and all of the world's art.

With these systems, you're giving everyone the ability to plagiarize everything, effortlessly and unknowingly. No skill, no effort, no time required. No awareness of the sources of the derivative work.

My work is now your work. Everyone can "write" my code, without ever knowing I wrote it, without ever knowing I existed. Everyone can use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation.

It's a new kind of use not known (or imagined?) when the copyright laws were written.

Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music so it can easily be "written" or "painted", in whole or in part, by everyone else, without attribution, then go ahead and opt in.

But if an author or artist does not EXPLICITLY OPT IN, you can't use their creative work to train these systems.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.

That's how the law needs to be.


What would you think if models were bundled with a second model, the "copyright filter". Just as humans know to keep their creations sufficiently far away from copyrighted material, you could distribute models which are trained on copyrighted materials but know well enough not to produce anything so close so something copyrighted that it infringes.

This would prevent anybody from accidentally infringing when using these tools. Does that seem like a reasonable solution, or is your concern greater than accidental infringement?


Explicit opt in, period.


But why? It really seems to me that things like Copilot will save millions of man hours and make the world a better place. The only harms people have come up with are highly speculative and far smaller in magnitude.


That is the underlying theory of copyright law. We make the speculation that if we don't have copyright law then people won't have the incentive to create future works.

A world without copyright would however save more than just a few millions of man hours that copilot might do. Allowing people and companies to freely use the best software available, view the best art, enjoy the most relaxing music, have the best recreational time with the best films. The only harm is the highly speculative claim that people won't be creating the best software, the best art, the best music or the best films.


That analogy doesn't work because unlike an art work, you can't sell a 20 line snippet.


Your arguments actually suggests that AI is making copyright less relevant and we should make it less strict, not the other way around.


> A human being doesn't violate copyright in learning from a copyrighted work, including when that human being is later more able to produce other works based on that learning (e.g. reading fantasy novels and learning concepts, tropes, or vocabulary that one uses to produce other fantasy novels;

Yes, but a human being isn't allowed to copy that work before learning from it, even if they destroy the copy afterward. AIs don't watch Youtube or browse Github. People download copies of content stored there, analyze and categorize it, and then feed it into AIs. Copyright is broken at step 1.


> It's very obvious from enormous numbers of examples that current AI systems are capable of learning much more abstract features of human culture

This doesn't seem to translate well to code. You can't copyright "Rembrandt's style", which is what DALL-E and co. learn from analysing those paintings. But what Copilot gives you is a sizable chunk of code. That code is (mostly) exactingly precise: it's not like the AI learned a style and recreates the style. It learns what you intended to do and then verbatim copies a code chunk in. I'm pretty sure the AI part comes in to determine what it is you were likely trying to do, so that it knows which code to copy, not to generate the code out of the AI model. At least, that's how I understand it works.

That is the fundamental difference.

> OK, if you said "what word comes next? MR. AND MRS. DURSLEY OF NUMBER FOUR PRIVET DRIVE WERE PROUD TO SAY" ... same thing

Let the AI fill in the rest of that sentence and we can debate whether that is copyright infringement or not. However, if the AI system is capable of finishing that sentence, it can presumably fairly trivially be asked a slightly different question. Instead of "finish the sentence", how about "suggest the next likely sentence"? That system would presumably generate the next exact sentence straight from the book, and keep going, and voila, you've recreated the entire book.

Which is clearly copyright infringement.

AI systems have a 'volume dial' to configure how much they mix and match. Turn it down low enough and asking DALL-E for 'a girl with a blue bandana and an earring in the style of Vermeer' will just give you a copy of Girl with a Pearl Earring, reproduced sufficiently accurately that it's trivially a copyright infringement (leaving aside, of course, that the painting is long out of copyright).

Point is, for Copilot, the volume dial has to be kept extremely low, because you can't just mash 5000 snippets together unless those snippets are identical. Which is its own intriguing copyright infringement conundrum: 500 artists each individually paint the same thing, and they can each prove they weren't influenced by the others, thus no copyright infringement. Then you reproduce the averaging of the 500, which results in yet another painting. Did you just infringe copyright? Surely the answer is 'yes', but whose copyright did you infringe? All of them?
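The "volume dial" intuition maps roughly onto sampling temperature. A toy counts-based next-word model makes the point (the counts, words, and function here are all invented for illustration, not how any real system is built): as the dial goes to zero, sampling collapses onto the single most frequent memorized continuation.

```python
import random

# Toy next-word model with a temperature ("volume") dial.
def sample_next(counts, temperature=1.0, rng=None):
    """Sample the next word; temperature 0 means 'always the most common'."""
    rng = rng or random.Random(0)
    words = list(counts)
    if temperature < 1e-6:
        # Dial turned all the way down: deterministic regurgitation of
        # whatever continuation dominated the training data.
        return max(words, key=lambda w: counts[w])
    weights = [counts[w] ** (1.0 / temperature) for w in words]
    return rng.choices(words, weights=weights)[0]

# As if the training data overwhelmingly continued the famous sentence one way:
counts = {"THAT": 98, "they": 1, "were": 1}
print(sample_next(counts, temperature=0.0))  # "THAT", every time
```

With the dial low, you get the book back; with it high, you get noise; the contested middle is where "style" lives.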


I think the big difference here is that humans have paid for the copyrighted work they learned from, and the AI has not.


Sir, you and this comment section should get a room.


Your comment is full of good and deep analysis, and it uncovers a deep flaw in Copilot: Copilot helps create new code.

All the problems and confusions mentioned above are due to this concrete inherent property of Copilot.

If Copilot were made to help rearrange [1] existing code to satisfy new or changed needs, there would be no need for such a deep and explanatory analysis as yours.

[1] https://www.folklore.org/StoryView.py?story=Negative_2000_Li...

Code is a liability. Less code is less liability. New code is a new liability.

Even tools to create new code are a new and unknown liability, it seems.


It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff. It is genuinely useful. I don't care that it reproduces copyrighted content. The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses.

Luckily, someone will probably come out with a "renegade" version trained on whatever makes it a useful assistant to my coding. I won't be afraid of accidentally violating copyright myself, because I won't be trying to bait it into reproducing heavily copy-and-pasted, cherry-picked examples, and I won't use 20 lines of its output with zero modification.


> It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff. It is genuinely useful. I don't care that it reproduces copyrighted content.

Sure — in the same way that hacking into a competitor's GitHub account and copying their private source code is "genuinely useful" to you. As the person benefitting from unlawfully using their source code, of course you wouldn't care that it reproduces it. But you're not really the person we're trying to help here.


> in the same way that hacking into a competitor's GitHub account

That's like comparing grand-theft auto to someone stealing a pack of gum from a convenience store. It's not a useful analogy. The latter is still a problem, but we don't need to be FUDy about it.

And OPs right, this will keep happening until we come up with better ways of solving this problem.

Whether that's educating companies on the legal (and moral) risks their developers IDE tools are exposing them to, better licensing database/indexing, working with future OSS devs building these tools instead of treating them like criminals, suing the for-profit companies like Microsoft who seek to profit from this until they invest in this problem, etc.


OP can correct me if I'm wrong, but they don't seem particularly interested in solving anything. They literally said "I don't care that it reproduces copyrighted content." So the problem, as I see it, is the people who see the laundering of open source and proprietary code as a draw, rather than a drawback.


OP said they don't care that it can reproduce copyrighted content because they're not going to do so with it. That's roughly the opposite of what you're implying.


No, they are just going to let other people commit crimes while claiming they will never do it themselves...

Do you really believe Microsoft employees aren't going to be using this, illegally or unofficially?

"Yes! We (Microsoft) aren't doing anything illegal, but we are going to turn a blind eye to everyone using it illegally as we directly benefit from it - and here's the kicker! Our employees are legally liable not us evil laughter all the way to the bank"

Of course the legal execs aren't using it, this is classic Microsoft (Embrace, Extend, Extinguish).


If accidental reproduction of copyrighted material by AI systems is illegal under current law then we should change the law immediately so that it's not.

These AI systems are highly novel, transformative, and useful. Their development is exactly the sort of thing copyright law was originally created to encourage. If it's hindering them instead, that's a problem.

(And no, I'm not saying people should be allowed to use AI to intentionally launder stolen code; use some common sense here.)


Why is it so outlandish to expect the people who make money by selling AI systems to only train them using material for which they have a license?

As many commenters have pointed out, no one would have a problem had Microsoft trained Copilot on the Windows source code. The fact that they intentionally left it out of the training set is a huge red flag.


Because AI systems require large amounts of training data, the more the better, and requiring manual review of those datasets to ensure compliance with copyright would consume significant resources and slow down the pace of innovation across the entire AI industry.

Now let me flip that question around on you: what benefit would society gain from forcing AI developers to do all that extra work?


If you are going to use my work for free and without attribution and turn it around to compete with me, then it decreases my incentive to produce anything, and if I do it decreases my incentive to publish it. This goes directly against the intentions behind copyright law.


That's the best argument I've heard so far, but still doesn't make sense to me. It's not like your individual project is going to make any significant difference to the capabilities of the resulting AI that's "competing with you" one way or the other. So really all you'd be doing by not releasing your code is shooting yourself in the foot for no gain.

Granted, people are not necessarily rational actors, so maybe you could argue it still makes sense to have some protections in place to assuage people's irrational fears. Maybe like some kind of robots.txt for determining whether a page can be used in an AI dataset could serve that purpose. I'd be hesitant to support anything more burdensome than that.
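A sketch of what such a robots.txt-style mechanism might look like. To be clear, no such standard exists as of this writing; the file name ("ai.txt") and the "Training:" directive below are invented for illustration only.

```python
# Hypothetical "ai.txt" parser: 'User-agent:' sections containing
# 'Training: allow' or 'Training: disallow' lines. Defaults to allowed,
# i.e. an opt-out scheme, matching the suggestion above.

def allows_ai_training(ai_txt, agent="*", default=True):
    """Return whether the given crawler agent may use the page for training."""
    current, rules = None, {}
    for line in ai_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip().lower()
        if key == "user-agent":
            current = value
        elif key == "training" and current is not None:
            rules[current] = (value == "allow")
    if agent in rules:
        return rules[agent]
    return rules.get("*", default)

policy = "User-agent: *\nTraining: disallow\n\nUser-agent: examplebot\nTraining: allow"
print(allows_ai_training(policy, "examplebot"))  # True
print(allows_ai_training(policy, "anybot"))      # False
```

Whether the default should be allow (opt-out) or disallow (opt-in) is exactly the disagreement running through this thread.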


The benefit is that our collective genius isn’t mined by mega corps and rented back to us. That we exist as more than mindless resources to be tapped for profit.

Again, if (for argument’s sake) we want to maximize the effectiveness of the AI, why are we okay with Microsoft intentionally omitting one of the most important codebases in human history — which it unambiguously has the right to use — from its training set?


> The benefit is that our collective genius isn’t mined by mega corps and rented back to us.

That sounds like a downside to me, not a benefit. You're basically arguing it would be better if Copilot, Stable Diffusion, GPT-3, etc (which all included copyrighted works in their training set) didn't exist. I'm just not seeing that.


They are only using material for which they have a license (at least debatably). Open source software licenses usually require attribution if you reproduce the source code or use the source code in a program.

Some other uses are allowed without attribution. Someone can read and learn from open source software without needing to put an attribution anywhere. You could run an analysis of the code on GitHub to find out what percent of code is written in C++. You wouldn't need to attribute every project on GitHub.

Now the debate is whether this applies to training ML models.


Not sure if they edited their comment, but the end of it contradicts your interpretation:

> I won't be afraid of accidentally violating copyright myself, because I won't be trying to bait it into reproducing heavily copy-and-pasted, cherry-picked examples, and I won't use 20 lines of its output with zero modification.


No, that was there, but it doesn’t contradict my interpretation. Copyright doesn’t only cover reproducing code verbatim. It also includes derivative works.


Maybe in court, which interprets software development very strictly, but in practice a developer automatically copying a single function from some 'freemium'-style licensed library [1] posted publicly on GitHub, autocompleted into a different codebase with many thousands of lines of custom code, isn't the same as going into some proprietary codebase and stealing code to compete with or build the same product as another company.

We could come up with scenarios where there might be some fancy algorithms posted on some public Git repo that's super efficient or unique, and that somehow fits into the size of individual functions that could be auto-inserted into some other person's codebase. But IRL that sort of thing is rarely ever going to be the thing that these IDE tools do. At least in a way that meaningfully contributes to another project.

That is still a concern, yes, but it's still a niche use case, which doesn't justify killing off otherwise extremely useful tools.

Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed by these tools + spreading awareness of the issue is enough to address the licensing problems.

The more accurately we explain the problem, the quicker we'll find good solutions.

[1] usually licensing saying commercial projects need to either pay or not use it at all. Or some attribution clause


"Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed + spreading awareness of the issue is enough to address the licensing problems."

I think you are, though. You have to automate the justice as well, traditional courts can't keep pace. You'll just end up with more automated DMCA-style takedowns, not less.


I think you misunderstood my comment then (or how these tools work IRL)... because I'm not saying that it's even worthy of a court case in the vast majority of cases. So why would you need to automate such a thing?

And I don't even see how an automated DMCA system could exist because I doubt they'd win monetary damages in court over a 'stolen' function or two (or detect it in most commercial applications in the first place).

Regardless, a single class action should be enough to make Microsoft either shut down their project or adapt (via whistleblowers, leaked code, public repos, etc). And even if they don't adapt by investing in the possible solutions here, an OSS project could take its place eventually, and the courts wouldn't even be a useful solution.

Ideally a capital-backed company will help solve this, with the obvious legal incentives that already exist. But even if it doesn't this isn't going away.


>That's like comparing grand-theft auto to someone stealing a pack of gum from a convenience store.

They're both more like grand-theft auto, but one involves the valet driver leaving with your car, and the other involves smashing a window.


I use Copilot all the time and I’ve never once used it to generate a whole prepackaged function that’s more than maybe three lines. So no, I don’t benefit from its reproducing other people’s code at all. Tell me you don’t use Copilot without telling me about it.


> Tell me you don’t use Copilot without telling me about it.

You don't accept arguments against the use of Copilot from people unless they... use it?

That's a nifty way to ignore any and all criticism of Copilot, or indeed any discussion about any ethical issue ever.


I believe the argument is that you shouldn't accept arguments against the use of copilot from people unless they have tried to use it. In a realistic context. That seems reasonable to me. It's the bare minimum to make an informed opinion. I think the wording was perhaps poor, but I think your interpretation is a little reductive/disingenuous.


I don't see how using copilot make the copyright question any less serious.

Because it's useful, then it's not a problem?

Well, it's also useful to send our non-recyclable trash to third-world countries, and every first-world country should try it. It will definitely make the consequences less serious if everyone does it.

Not apples to apples but I guess you get the picture.


As someone who has used copilot since the early beta days, what I think people are saying is that nobody uses Copilot to generate full functions like this. It's more of an intelligent auto complete. It's fantastic for repetitive autocomplete where certain things have to be changed, and for quickly getting out boilerplate code. You can put a list in a comment along with a format and generate data structures quickly and easily. You can solve small problems quickly, allowing you to focus on the bigger picture.

It's sort of like a power tool — sure, you could use a screwdriver, but a drill with a screwdriver attachment will be quicker. Hammers are good, nail guns are quicker. You'd never expect someone to use a drill with a screwdriver attachment if they'd never used a screwdriver before.

There are for sure things to be improved, such as the recent post on how you could put in a very specific prompt and get out a specific function it shouldn't reproduce. The answer here isn't to shut down the project, at a net loss for everyone, but to find ways to improve it.

As others have said, with Copilot gone and the new demand created, the vacuum will bring in community projects that will happily scrape every public repository they can get their hands on.


Now I understand what you mean. It's still pretty crappy, though, because those minor autocomplete suggestions only exist because Microsoft used code without permission, without crediting the original owners, and/or in breach of the licenses on the original corpus.

I use Copilot every day and I love it, but it still leaves a bad taste in my mouth knowing that people out there worked really hard on their code, and harder on building OSS licenses, just for Microsoft to throw all that out of the window.

Feels like licenses don't matter anymore. My own code doesn't matter much, but it's about principles, dude. Licenses are there and they should be respected; if not, then it's just anarchy, and we all know anarchy only works in very specific scenarios. Microsoft is not a part of any of those scenarios.


I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. In order to make an informed decision as to whether you agree with that, you can try copilot to reach an informed conclusion. There is no cost to try it. Unlike your example, where "trying" has an immediate cost -- which is why that example doesn't make too much sense here.


Yep, now it's much clearer, thanks for clarifying.

See your sister comment's child for my reply.


Because it's a net good to the world, it's not a problem. If the benefit is orders of magnitude greater than the harm, then it's good.


It’s a net harm for the programmers whose code is being willfully plagiarized.

It’s a net boon for Microsoft in their efforts to rule the world.

It’s a net loss for society and ethics.

Open up the Copilot code, Microsoft. If you are so sure that everyone must wear transparent underwear, let's see you wearing some. Train Copilot on the Windows 11 code; it's not public domain.

Truth matters. Lies matter.


Expand on the unethical part. So people published code that could be referenced and copied on GitHub. There was no ethical problem, the world, society were happy.

GitHub makes a convenient way to search and contextualise this publicly available code and paste it into your code (adjusting local scope, format, and language along the way). Suddenly we have crossed an ethical line!?

Which ethical line? Are you pretending people never copy and pasted open source code before copilot? Are you pretending open source code never copy and pasted other open source code? That we were in an ethically pure world until copilot came along?


> So people published code that could be referenced and copied on GitHub. There was no ethical problem, the world, society were happy.

This code carries different licenses. You can't just copy code randomly without checking the license first.

Copilot serves it to unaware users stripped of the license. Even if a Copilot user wants to reuse only code licensed in a way that allows it, Copilot will serve them code under restrictive licenses without their being aware.


You can just copy and paste code without checking the license. People do it all the time.

GitHub doesn’t force you to accept the license in the repository before showing you the code.


> It’s a net harm for the programmers whose code is being willfully plagiarized.

What's the harm, specifically?

Say it copies that snippet of workflow scheduling code I made at work yesterday or the greasemonkey script I made in my own time.

How is my life worse?


> I believe the argument is that you shouldn't accept arguments against the use of copilot from people unless they have tried to use it.

I disagree, and this does not hold up generally: We can, and should, argue things we have not tried or experienced, like heroin and murder. What makes it so that this has to be tried?

> It's the bare minimum to make an informed opinion.

Only if the usefulness is what is in question. But it is not.


> We can, and should, argue things we have not tried or experienced, like heroin and murder. What makes it so that this has to be tried?

It's not that you absolutely have to have experience with something, but you'd be foolish to discount the input of people who do. In debates about drug policy I try to be polite to people with zero first hand experience, but their contributions are rarely of interest. Murder is a bit more abstract insofar as anyone who has fully experienced it by definition didn't survive to testify, but I give a lot more weight to the views of people that have first-hand knowledge of violence and crime.

It's not that you shouldn't weigh in on a topic without first hand experience, but that it's a good idea to specify the scope of your understanding, or frame uncertainties as open questions rather than assumptions.


Correct, it doesn't hold up generally. But it doesn't need to. It holds up here. We do not try things when there is exceptional risk or cost in the trying. Here there is no cost to trying, so it does not make sense not to try.

I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. So it's not just about usefulness.


> I believe the argument being made is that in _actual, real-world_ use of copilot, no copyright infringement happens. So it's not just about usefulness.

How would you know though? The burden of proof is on Copilot. Especially now that it has been shown to spit out copyrighted code.


You’re right, trying Copilot is equivalent to committing murder.

/s


>> Tell me you don’t use Copilot without telling me about it.

> You don't accept arguments against the use of Copilot from people unless they... use it?

> That's a nifty way to ignore any and all criticism of Copilot, or indeed any discussion about any ethical issue ever.

"I only listen to people who agree with me, but to make that sound legitimate, I have a somewhat indirect way of saying so."


They should at least try to understand how it’s actually used, not imagining how it’s simply used to steal their largely replaceable code.


It doesn't matter how it's used. Do you think Microsoft would be happy with someone training a model on Windows source code, as long as they didn't use it to reproduce the code?


If Microsoft were confident Copilot doesn't produce infringing code, they would have included the Windows and Office codebases in the training data. I wonder what will come out of discovery.


You think MS's code quality is high enough to train an AI on?


Do you think they audited every open source code base that was used in training for quality?


“Their largely replaceable code”

Smells like: “I stole this lousy apple, but it wasn't any good.” Then why did you steal it?

Put your money where your mouth is, Microsoft: train Copilot on your own code!!!

Don't wanna train it with Windows 11 code? Prefer to hijack others' projects and use their work for your needs, and then pretend that insulting others and calling their code worthless will get you off the hook????

Backfire


> Smells like: “I stole this lousy apple, but it wasn't any good.” Then why did you steal it?

The lousy code trained copilot in what a switch statement looks like so it can autocomplete mine for me


On a different website I argued with a Microsoft employee who said that Copilot is great and so on, and who would not discuss it unless I tried it.

I tried telling him that it requires a credit card number to try, but he didn't believe me… I guess the thought that non-Microsoft employees have to pay for Microsoft products never occurred to him.


That isn't sufficient to get you off the hook. Copyright covers derivative work, not just code that's reproduced verbatim.


A derivative work has to incorporate a major part of the original before copyright comes into play.

A 3-line piece of boilerplate is neither novel nor a major part of the original.


I think if copilot is heavily restricted to three line code samples, perhaps I could agree with this.

The example cited by the OP is not a three line code sample - if you've ever done matrix coding, you know that sparse matrix operations are not simple.
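
(Just to illustrate, this is not the OP's cited code, only a minimal sketch of a CSR matrix-vector product: even the most basic sparse operation already needs a nontrivial index scheme.)

```python
def csr_matvec(data, indices, indptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    data    -- the nonzero values, row by row
    indices -- the column index of each nonzero
    indptr  -- indptr[i]:indptr[i+1] delimits row i's nonzeros
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# [[1, 0, 2],
#  [0, 3, 0]] @ [1, 1, 1] == [3, 3]
assert csr_matvec([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]) == [3.0, 3.0]
```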

Sure you can reduce it to a function call, but then you have library usage instead of code theft.

I think actually perhaps this is a way copilot could ethically move forward - instead of lifting code verbatim if it merely suggested libraries and approaches "here is an example of sparse matrix filtering and some libraries which do it", that would be both useful and ethical, presuming it does not obscure the license.


...But the example cited by the OP isn't how anyone actually uses copilot...

From where I sit, the complainant has found an extremely convoluted (and buggy) way to copy-paste their own code and is very upset about it. By similar logic, we should restrict the use of ctrl-c and ctrl-v, because they allow very simple infringement of open source licenses. Find a sparse matrix multiplication library which uses the copied code without attribution and you can take them to court; the law is already sufficient for this.


"Derivative work" is a very specific thing, and it's contrasted with "transformative work" in a way that matters a lot, and fair use intersects heavily with both.

Even when it comes to stuff that seems reaaaaally close to pure derivative: Googling "How long does it take to boil water?" => "If you're boiling water on the stovetop, in a standard sized saucepan, then it takes around 10 minutes for the correct temp of boiling water to be reached. In a kettle, the boiling point is reached in half this time."

That's a verbatim snippet pulled directly from https://unocasa.com/blogs/tips/how-long-to-boil-water, and yet Google exists and continues to do stuff like this under the fair use doctrine despite massive efforts to attack/monetize their service. [To be fair, Google does link results, which probably insulates them because it's less hurtful to the commercial interests of the source; that said, with open source there generally are no commercial interests to hurt (open source attribution will be a tough sell as an actual commercial interest), and that's specifically called out in the law as a factor]

Copilot is even less explicitly at risk IMO, in that it never even stores the text, nor can it reliably retrieve it. I have no idea what makes anyone think it should be more vulnerable than Google.

From the copyright.gov page on fair use (https://www.copyright.gov/fair-use/, worth reading in detail for anyone who cares about this stuff, also has links to a monumental number of cases with shockingly intelligible summaries): "Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work."

Copilot without any shadow of a doubt does add something new, with a further purpose, and does not substitute for the original use of any codebase on Github (it can't create any of the codebases in full, without manual guidance so extreme that you'd have to be using the actual original codebase as a reference, so it clearly cannot substitute for a single one of them, and that's what a lawyer will argue, likely successfully).

In the Google vs. Oracle case (see https://www.copyright.gov/fair-use/summaries/google-llc-orac...), a big piece of the fair use finding was that "its value in significant part derives from the value that those who do not hold copyrights, namely, computer programmers, invest of their own time and effort to learn." and "further[s] the development of computer programs”. It's hard to see where Copilot wouldn't fall into that category, as well, and that's precedent on (multiple) appeal.

By my reading this should be a slam-dunk fair use ruling, unless precedent gets really upended, and Butterick is wasting a ton of time and effort for absolutely zero potential gain other than some bragging rights, but to each his own...I guess we all have to grind our axes from time to time.


> Copilot is even less explicitly at risk IMO, in that it never even stores the text, nor can it reliably retrieve it. I have no idea what makes anyone think it should be more vulnerable than Google.

That doesn’t matter, IMHO. Once Copilot manages to copy a work, a copy has been created and copyright has been violated. If this occurrence is reasonably likely, then Copilot is wittingly assisting in violating copyright.


How likely that situation is is disputable. People have cherry picked some cases where by specifying a very particular comment you can get it to (unreliably) reproduce well-known pieces of code, sure. But that is neither the intended use of the product nor the actual way that a single person using the product really employs it, and that really, really matters. Judges are not automatons, and the fact that when and if this goes to court the developers will be able to get up and honestly say "this is a tool developed to help developers create new code that is not merely recreating the functionality of the code we trained the tool on, and all our users use it to create entirely new things" is going to matter when they argue that it is fair use/transformative work.

I do completely understand that a lot of people disagree with me morally, and think that extracting insights from scraping the public web should be illegal. You're free to have that opinion, but I'd recommend you start lobbying your congressman to change the law, because though I'm not a lawyer I hang out with a few who do copyright stuff, and I don't think the law as it stands is on your side. That said, who can say, maybe this will end up bubbling up through many layers of appeals and end up at the Supreme Court someday, this stuff is all certainly wildly outside the bounds of what anyone writing the copyright laws was thinking about back in the 70s (which I think was the latest significant iteration?) so it's fair to say it's a complete gray area.


>But that is neither the intended use of the product nor the actual way that a single person using the product really employs it, and that really, really matters

And yet, someone has done so, and found that to be the case. Therefore, it falls under "normal use".

So make up your mind. Is software engineering only about assembling working programs, or assembling working programs + navigating license minutiae?

One of these, Copilot has a place in. The other it does not.


You may hope that a judge considers a really weird, cherry-picked edge case to be "normal use", but opposing counsel will argue the opposite, and they'll have a lot of evidence if they actually look at real use.

"I can use a gun to shoot someone" doesn't make guns illegal, even if people do so with some regularity. "I shot someone to make the point that guns can kill people" is worth even less in the eye of the law, and that is literally what you're pointing to here.


And I don't know what you're even getting at with the "assembling working programs" vs "navigating license minutae" stuff. Copilot helps the former, and you're apparently trying to argue that it should be banned because some people are feeling angsty about the latter?


Google then handily provides you A LINK to the original ATTRIBUTED SOURCE.


Yes, I mentioned that. And I think it would be great if GitHub added some sort of scan to see if it was accidentally returning some verbatim code and gave a link to the repo it came from, but the situation with Google was much more clearly harmful: by scraping and presenting verbatim quotes, they are directly siphoning web traffic away from the sites they scrape from, and many of those sites make money based on page views. That is direct economic harm, and the actual impact was significant for many companies, not merely theoretical.

With open source code, the harm is much less tangible, since negligibly few open source projects make money from people going to their GitHub pages because they're searching for code snippets (not zero, but almost). My guess is that an honest quantification would put the lost revenue due to Copilot's existence in the tens to maaaaaybe hundreds of dollars. Courts look at that type of thing, which is why I don't think this will end up being an issue, at least in the US. Europe is wild, who knows what they'll do there, and that's where activists on this topic should most wisely apply pressure, you can always convince someone in government there to throw a spear at a BigCo. You won't take them down, but you may get them to negotiate, and I don't even necessarily think that's a bad thing.

That said, even in the US, if enough people make noise then things could change, so I encourage you to speak to your congressperson (I will be as well, but arguing the other side, because I really do think this is fair use and I'd like to see it enshrined as such explicitly, because this fight is going to be extremely common over the next few decades).


Ok, so you don't care about the second part of the argument; you don't care about the danger of violating copyright. The first question, though, is whether training on large bodies of OSS is fair use. Are you saying you don't care about the copyright of your own work either? Or you don't care because you don't publish your work as OSS?


I’m sure I’ve done more for OSS than 95% of the commenters here. I publish my code under MIT when possible (and WTFPL for smaller projects), and yes, please train on my work or split out my functions verbatim; they are far less valuable than some people seem to believe. I don’t even care about the attribution part of MIT; it’s simply a nice-to-have when decent people use my code.


Wait... Shouldn't you use a different license if you don't care about attribution?

Creative commons maybe.

Just because you, someone who self-proclaimedly has done more OSS than 95% of the commenters here, don't know how to use OSS licenses, doesn't mean that the copyright question being discussed here is a non-issue.

The issue is that you don't care about what the licenses in your code mean:

> I publish my code under MIT when possible (...) please train on my work or split out my functions verbatim (...) I don’t even care about the attribution part of MIT


I choose MIT because it’s a widely used permissive license most people are confident about using, not because I will personally pursue every clause in it. I will not take action against anyone using my MIT-licensed code without attribution.

Having been a member of a very high profile permissively licensed project and having started a few relatively popular ones of my own, I’d say I don’t need to take licensing advice, or be told I “do not know how to use OSS licenses”, from someone who laughably advises using Creative Commons, when Creative Commons itself advises against using CC licenses for source code, except CC0, which is entirely different from the other CC licenses.


My god, sorry to move away from the topic, but do you realise how childish your comments look?

I was considering not replying, but here goes nothing....

> not because I will personally pursue every clause in it.

Then you are out of the game, because all clauses should be respected; otherwise you are committing something very close to illegal when you violate said clauses. If you don't want a specific clause, consult a lawyer and remove it, or use another license.

> Creative Commons itself advises against using CC licenses for source code

So before you didn't care about respecting clauses, and now you do?

I could argue: there are some bits of text in CC that make it not a good license for code, but I don't care because I don't respect those clauses. I'm not going to make that argument, because it doesn't make any sense. You either use a license and respect it or you don't.


There is MIT-0, a no-attribution variant of the same license.


This isn’t just about OSS. If I have a public repo on GitHub without a license, you can’t rewrite that code and put it in your project. I own the copyright. The issue is that Copilot will still launder it into your project for you.


Oefrha just said that his code, and probably yours, is not worth that much if split into functions and comment blocks. The value of software comes from its whole purpose, not your clever email-validation-regex you are so proud of.


>code is not worth that much if splitted into functions

Some algorithms in scientific computing require lots of effort to implement as nice, reusable, performant function. Those functions often more important than whatever the whole is doing because it's what most other people will be interested in using.


> not your clever email-validation-regex you are so proud of

Really? I challenge you to write a correct email validation regex.
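
(To back that up with a hypothetical example: a naive pattern both rejects valid addresses and accepts invalid ones, and patching every such hole without re-implementing RFC 5322 is the actual challenge.)

```python
import re

# The kind of "simple" email regex people reach for first.
naive = re.compile(r"^\w+@\w+\.\w+$")

# It rejects perfectly valid addresses:
assert not naive.match("first.last@example.com")  # dot in the local part
assert not naive.match("user+tag@example.com")    # plus addressing
assert not naive.match("dev@mail.example.co.uk")  # multi-label domain

# ...and accepts strings that aren't valid addresses at all:
assert naive.match("x@_._")  # underscores aren't legal in domain names
```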


Then why use OSS licenses in the first place?

Let's just copy each other's code without attribution.


I am totally for this! It would bring everyone forward in general without costing society as much as this copyright BS.


Either that or people will stop posting their code publicly.


Maybe I’m the worst programmer in the world. It doesn’t matter. My code is still my code, and if I don’t explicitly license it such that you can copy it, you’re not allowed to.


How many lines of code was the Oracle v Google case? 9?


Just so happens I think the case is bullshit. You would have a point if I supported Oracle.


Your lack of support of them doesn't change the facts.


Says you.

If it’s all just disposable code, WHY ISN’T MICROSOFT TRAINING COPILOT ON WINDOWS AND OFFICE CODE?


Because that code isn't publicly available, just like every private repo on GitHub, and any other company's/person's proprietary/private code?


Again, if I put code in a public repo on GitHub and don’t include a license that allows it, you cannot copy that code. It doesn’t matter that you’re able to access it.


I'm allowed to read it and learn from it. To distill the core concepts and recreate it.


You as a human being, not you as a for profit company building a for profit product.


Cannot is a strong term. Telling me there may be consequences would be more accurate.


I can claim fair use and short of an injunction there's nothing you can do to stop me.


Claiming fair use is not a fix-all card you can play like that; there are a lot of nuances in whether it holds up at trial.


Just because you don't care does not mean others should not care, even if they are less valuable coders than you.

I mostly contribute by finding issues and reporting them; does that make me less of an OSS contributor than you?

And yet I do care if my private project is used by a behemoth like Microsoft without my consent, even if it's only poorly written fizzbuzz. Why? Because if I wanted to share it, I would publish it.


I was answering the questions “Are you saying you don't care about the copyright of your own work either? Or you don't care because you don't publish your work OSS?”

Feel free to care.


I personally don't care about the copyright of my own work when it comes to code. I publish the majority of my work anonymously with a clause that just says something along the lines of "feel free to use some or all of this without any attribution." Denying others from using my code feels like it goes against the essence of OSS itself. It's like if I found the solution to some complicated math problem but didn't let anyone else see it because it's "mine."


You are both a scholar, and a gentleperson. Thank you for treating the field like it is instead of some social value extraction platform.


I've yet to see anyone come up with a harm current CoPilot could do that is within an order of magnitude of the good it does. Why on earth should we care that it resulted in 20 lines of code being copied in some rare circumstances? The transaction costs alone make that code worthless as a saleable thing.


Straw man. That isn't what Github/Copilot is doing.

If open source communities are worried about having their source code copied... then don't open the source. Keep it closed, keep it off GitHub... I mean the genie is already out of the bottle, so doesn't really matter what they do now.

You can't prompt Copilot with something like "# Function that detects spam accurately" and get anything useful/sensitive/competitive out.

Are there really super-sensitive algorithms out there that Copilot is exposing that are otherwise unknown?


> The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses

What? You left out the second line of the quote. It never reproduces copyrighted content for me because I'm not trying to bait it into doing that.


Hi! I'll give some insight into why I benefit tremendously from Copilot.

I have very severe ADHD and, as a result, a terrible memory. I've been working as a dev for almost a decade now, and that work has almost always happened in a browser with a search bar, not an IDE; working with infrastructure doesn't help, as I encounter more than one programming language a day.

Copilot saves me from making dozens of junior-dev-level search queries daily: I can formulate the query right in the IDE and it will fill in the basic algorithms, data types, and language abstractions that I know exist (I use them all the time) but don't remember how to actually invoke, despite having done so just an hour ago. This is the most useful part of Copilot, not complex and very specific code.


> It would be sad if someone succeeded in shutting down CoPilot for this kind of copyright stuff.

It enables the large-scale theft of code. It completely ignores licenses. There are plenty of open source licenses that allow code use with proper attribution, yet Copilot doesn't (and probably can't) figure out a way to comply with all of them.

Copilot, as the article suggests, is a marketing stunt. To me it's more than that: it's Microsoft pushing the boundaries of law using its money muscle again. I have been screaming from the rooftops that VSCode is just M$ spyware, and I get legitimately made fun of for it. Now we have Copilot as well, and they aren't even hiding it.

To address your point more directly: if you do any work for monetary gain, Copilot is a de facto liability. You can't just take 20 lines of completely stolen code, modify a few things, and call it your own. That's why legal reverse engineering has an entire black-box method of development, in which QA, researchers, and developers aren't even allowed to talk to each other directly.

I hope Copilot does get shut down, along with everything like it. It is one thing to have an AI trained on your workplace's code, or on specific code under specific licenses, but the blatant theft of not only licensed open source code but also private code is a terrible precedent to set.


> you can't just take 20 lines of completely stolen code, modify a few things, and call it your own.

Now I hope that Copilot sticks around for this exact reason: cause endless inane lawsuits claiming that actual original code was stolen and laundered through Copilot or reverse engineers going cowboy. Make it enough of a problem clogging the courts that they start dropping copyright cases.


Copyright has a purpose. I don't mean a purpose in the Disney Hegemony sense but a real, actual purpose.

Dropping copyright cases is great if you hate proprietary code. That's fine. Open source is also powered by copyright. If we start dumping copyright cases we don't get the "well bob we may as well open source it!" We get large companies like Google just completely ignoring copyright and using open source code without attribution. What you've described (causing enough a problem in the courts) is the exact purpose of GPL. If we start dropping copyright cases altogether the open source movement may as well be dead in the water.


> You can't just take 20 lines of completely stolen code, modify a few things, and call it your own.

That's exactly what I said in my comment. I wouldn't take 20 lines of code, since it wouldn't actually work. Even if it was able to spit out 20 lines of correct code, they would be tailored to my codebase, and not violating copyright.

The only time you see CoPilot violating copyright is when someone coaxes it into that, in a completely empty codebase with no context. The violation of copyright is not possible when it is used as intended.


So many comments like this, but none list a tangible harm that's anywhere near the millions of man hours systems like this will save. Heck, basically none (including this one!) list a harm at all!


Honestly, kinda fair. How is it different from pirating anything, and the moral qualms that come in there? Some people see it as a massive issue; others don't believe the harm is significant. It probably makes an OSS dev enjoy their work being open source less, I imagine, which may have larger ramifications for the ecosystem it promotes. But outside of that, I'm guessing (I'm also trying to understand) it's just a matter of who's seeing the benefit from this tech. I think that's a reasonable thing to be concerned about, when wealth seems to have a tendency to centralize and the average worker doesn't have the level of power we would ideally like.


You could say this about all software piracy. So Microsoft can work to end software copyright instead of trying to corner the market on pirated software and asking forgiveness after the fact.


No, you can't.

Because most software piracy is of saleable software. 20 line snippets are not for sale, mostly because the transaction costs are higher than the snippet value.


Eventually (not in the next 5 years but probably in the next 15) a system like this is going to lower the market salaries of developers (or make some of them outright unemployable). While using their own code as input to achieve that, without ever obtaining actual permission.

Seems like a pretty tangible harm to me.


If it's going to do that much "good" for humanity, then the dataset should be 0BSD or CC0, not proprietary. Project Gutenberg does good by archiving and allowing access to a lot of books and everything is under Creative Commons to continue to allow people the freedom to use the data as well as attribute the original authors. GitHub isn't doing this at all and is instead selling a proprietary product built on scanning open source that didn't consent.


I don't think most people are concerned that Copilot is going to be reproducing verbatim copyrighted code, it's more that it sucks that a giant corporation is going to make a billion dollars from a tool that is entirely built off of millions of peoples' work who were never asked permission and will never be compensated.


That's hardly a new thing! For instance, Google search makes billions of dollars by indexing content that other people make.


Google Search links to the original content. Copilot doesn't.


Maybe, but it also shows those info cards with a summary so you don't need to navigate to the website that contains the original content. There might be a link there, but it's usually small and practically unnoticeable.


If Copilot provided a “small and practically unnoticeable” attribution to the code used, it would definitely improve the situation, especially for licenses like MIT that require attribution and nothing else.


As I responded to someone else, this isn't always true. Google "when was George Washington born".


George Washington's birthday is hardly copyrightable.


So are most of your functions and methods.


When you ask it a question, it will often simply construct an answer from the pages it indexed, so people don't have to click. Sure, it links it, but for what? Thankfully, the answers are almost always useless.


It should be noted that some jurisdictions are starting to restrict this (e.g. Australia). Also I would argue if Google would randomly display content of full websites and never post links to the original content it would be in a lot more legal trouble.


> Also I would argue if Google would randomly display content of full websites and never post links to the original content

Google does do this though. Just Google an easy-to-answer question, like “when was George Washington born”


The answer to a factual question is not copyrightable, so while you may have moral problems with that, there is no legal argument.

The same applies to all realistic use-cases for copilot by the way. Whatever it produces is not copyrightable.


> The answer to a factual question is not copyrightable, so while you may have moral problems with that, there is no legal argument.

Correct.

>The same applies to all realistic use-cases for copilot by the way. Whatever it produces is not copyrightable.

That's a pretty bold statement to make. How do you know how people use Copilot? Also, IIRC Oracle v. Google essentially determined that 3 lines of code can be copyrightable. So I think your statement fails on two points: you can't really predict how people use Copilot, and you cannot predict what a court would decide is copyrightable (this is much less straightforward than statements of fact).


Google helps you find someone's content. Copilot helps you rip off someone's content.


My guess is that production passed off as original content tends to have more avenues to harm producers than consumption does.

Alarm bells of this magnitude haven't been rung about people torrenting films for decades; it's a given that some people are just going to do it and there's little that can be done to stop it.

Producing new data from original data of questionable lineage makes the questionable acts visible. Copilot and the like actively encourage this creation.

If it were possible to peek into the rooms of everyone who downloaded a torrent to admonish them then maybe pirating would have been made a modicum more taboo. But those consumers never intended to leave their rooms. Copilot forces them to leave their rooms if they want their derivative work to be used.


That's true, and when Google started extracting information from web pages and displaying it in results pages without driving any traffic to the original websites, the authors of those pages were justifiably upset.

Google Search is ethically acceptable because for the most part website creators like being in search results and are "compensated" in the form of more visitors, and if they don't like it they can easily exclude themselves. Website creators famously do NOT like it when Google indexes their content and then serves it up independently.


Google makes money from ads. When you strip those away, purely indexing the web and offering a search engine is probably costing them money, not earning.


I don’t think anyone really cares about that kind of stuff, though. Like, there are loads of examples of similar things which no one (rightly) bats an eye at. Like if I take a photograph of you out in public and sell the photo for $1 million, would you expect compensation? Or if someone compiles a list of the best restaurants in the world and sells that list, do you think the restaurants should be compensated?

The value that is being derived here is in the curation of the material, not the material itself.


Bad example. A better example is I am a vendor across the street giving away free books. However, to comply and get a free book I require you to keep the book's bibliography intact.

You don't do this. You get my books, cut out the bibliography, glue all the pages together, and then sell the book as your own.

It is my book and all you did is derive some work from it.

Curation companies have the same problem and there are plenty of high profile lawsuits about it.


> if I take a photograph of you out in the public and sell the photo for $1 million, would you expect compensation?

I mean I wouldn't expect it, but I think I'd be pretty annoyed if you didn't ask permission and then made a bunch of money off my image. It's easy to find stories from the subjects of famous photographs who feel like they've been exploited. Just off the top of my head there's Afghan Girl, the kid from the Nirvana album, Harvard's collection of photos of enslaved people, and Henrietta Lacks is sort of a similar case.

> if someone compiles a list of the best restaurants in the world and sells that list, do you think the restaurants should be compensated?

No, but here's a better example: you make friends with a bunch of food critics, collect their thoughts and opinions and favorite secret spots, and then publish a book based on that stuff without ever telling them what you were doing or compensating or crediting them.

I'll give a concrete example: I was rock climbing recently and met an old guy who was sort of the local expert, and he told me how some other non-locals had come in and kind of mined him for information about the area, all the routes, etc. and then published a guidebook without crediting him at all. He felt pretty upset and exploited by that, and I felt bad for buying the guidebook because I had assumed it was written by some local climbers and didn't realize they got most of their info from someone else.

It's not illegal, but it is unethical.


If Copilot makes a billion dollars, it is only because it is generating at least a billion dollars worth of value to the community of developers who want to use it.

The people painting Microsoft as a big, greedy trust conveniently ignore that Copilot would actually be empowering the ecosystem of tech companies to develop services that compete with Microsoft faster and more easily.


Isn't that the corporate dream though, to make all your competitors dependent on you?


So what? Anti progressive luddites, it makes my blood boil.


When OSS code gets ripped off and people are mad: "Anti progressive luddites".

When closed source code leaks: "Copyright infringement by criminals".

What's the difference? There's plenty the world could learn from the source code of Windows or GTA6 and having access to the source of these large projects would move society forward faster. So why are OSS contributors protecting their rights "Anti progressive luddites", while the large copyright owners who guard their proprietary code like a dragon guarding gold are let off the hook?


You're assuming the person you replied to holds both those opinions.


GitHub’s free code storage, static site hosting, etc. is compensation

If you aren’t paying for the product, you are the product.


You can't give someone a dime (that they could have easily picked up from any of your competitors too) and then break into their house claiming they had been compensated. In this case they even steal code authored by people who never used GitHub at all, but had someone else mirror it or publish it on GitHub.


Not fair, while giving you the dime I'm pretty sure they quietly whispered something about them getting to live in your walls as compensation


Uh? What about projects that are mirrored on github? Why are their original authors being punished if they don't even own a github account?


GitHub’s business model is simple, they use the free tier to stay the main platform and easily attract paying customers.


> It is genuinely useful. I don't care that it reproduces copyrighted content.

I feel like the title of the article is literally written for you: """Maybe you don’t mind if GitHub Copi­lot used your open-source code with­out ask­ing. But how will you feel if Copi­lot erases your open-source com­mu­nity?"""

If you want to keep having useful tools based on open source code in the future, it is in your interest that people still want to write open source code. It is still too early to say how much of a chilling effect projects like Copilot will have on that. But clearly many (just read this comment section, myself included) are having second thoughts.


But that part of the argument is far more nebulous and bullshitty than the copyright argument. The idea that copilot is killing any substantial open-source project just isn't true right now. Copilot doesn't generate libraries worth of functionality, it generates small functions or less. Open-source projects remain as important as they were before copilot.


Personally I'm not worried about the end user using copyrighted code. That is their responsibility. If you have verbatim GPL code in your commercial closed source code base that is a liability and it might be dangerous to use copilot.

What I have more of a problem with is Microsoft charging for copilot which was trained on copyrighted code without any permission whatsoever which they really have no right to utilize/charge for.


As a human, if I learn how to program by studying copyrighted code, is it unethical for me to use that knowledge to make a living?


I'll repeat something I asked elsewhere here.

From what I understand, it is not proven that the AI uses knowledge of concepts and logic to write new code. It is likely that it instead performs a very optimized stitching of code it previously saw.

Is my understanding outdated here?

From the ethical point of view, I'd say you're making some assumptions here that result in it being ethical when a human does it, and those assumptions might not hold for an AI.

For example, you're assuming a win/win outcome, where your learnings from copyrighted open source code don't harm the original authors ability to find work, or the value of their code.

With an AI I think there are possibilities we're looking at a win/lose situation, where Microsoft wins big, and maybe some other developers that also profit of their use of copilot, but where the original authors of the code that went to train it see their skills be devalued over time as a direct consequence.

In my opinion, a win/lose is unethical. What I'm not convinced is that we're looking at a win/lose, but I think there's a possibility.


> It is likely that

Here's the real problem:

We're on a forum in which most of the participants are in at least the top 0.1% of technical ability, and yet here we are waving our hands and speculating on "what AIs think" and how "they probably/likely compute" things.

Last week I met with a director of a new "AI" research group, chuffed to the nines with a massive research grant he just landed. I'm happy for him, with only one little concern - that he knows nothing whatsoever about machine learning or mathematics and outspokenly doesn't believe that "knowledge is that important in the new reality".

Copyright infringement is an inconvenience for people who worry about that sort of stuff. Sure. But don't you all see a much more serious issue? It's bad enough that code is already so precariously bloated and over-complex nobody bothers to debug critical applications any more. Now we want to add "assistance" from tools that nobody understands.


> a very optimized stitching of code it previously saw.

I have seen this line repeated many times, but I never saw it actually explained. A lookup table is dumb and easy to understand/interpret. Deep models are not that. They are also not a linear interpolation of … something. What exactly is the claim being made here? Yes, deep models don’t generalize too well on out-of-distribution data. How does that make them a “very optimized” lookup table?


I mean an NN is, almost by definition, a weighted lookup table. The process to generate that table is not relevant.


Not at all. That would only be true if there weren't nonlinearities between the layers.
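To make the point concrete, here is a toy sketch (the layers and weights are hand-picked for illustration and have nothing to do with Copilot's actual model): without a nonlinearity, two stacked layers collapse into one fixed matrix, but a single ReLU between them yields a function no single matrix can express.

```python
import numpy as np

# Two hand-picked layers (illustrative only).
W1 = np.array([[1.0], [-1.0]])  # first layer: 1 input -> 2 units
W2 = np.array([[1.0, 1.0]])     # second layer: 2 units -> 1 output

def linear(x):
    # No nonlinearity: the two layers collapse into one fixed matrix
    # (here W2 @ W1 == [[0.0]], so this map is identically zero).
    return W2 @ (W1 @ x)

def mlp(x):
    # With a ReLU between the layers, this computes relu(x) + relu(-x),
    # i.e. the absolute value |x|, which no single matrix can express.
    return W2 @ np.maximum(W1 @ x, 0.0)

a, b = np.array([1.0]), np.array([-1.0])

print(np.allclose(linear(a) + linear(b), linear(a + b)))  # True: additivity holds
print(np.allclose(mlp(a) + mlp(b), mlp(a + b)))           # False: the ReLU breaks it
```

The point isn't that deep models can't memorize; only that a stack of layers with nonlinearities is mathematically more than a weighted table of coefficients.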


If you go around pasting code ad verbatim, a lot of people would be upset too.


The difference is scale. You will get old and die before you ever even reach 10 percent of the corpus.


That only shows that humans are better at learning than machines are right now. I may not have read as much code as Copilot, but I have read quite an amount of it, and it strongly informs the code I write, with respect to style and structure and algorithms. (I don't copy functions verbatim from memory, but neither does Copilot the vast majority of the time.)


Setting aside the very real question of whether AI "learns" the same way we do, it's bad enough we treat corporations as "people" in a legal sense, let's not start extending the same courtesy to software tools.

People have special rights and responsibilities under our legal system: we don't send an airplane with a mechanical fault to prison for a crash, nor do we extend the right to life to a web server. Humans have an implicit right to learn by reading copyrighted material; machines have no such right.


You eventually, at some point, write your own code, which is not reproduced from anywhere. From what I see, Copilot's authors state that Copilot is reproducing code; it is not inventing new code like a human being would. It seems to be a big, efficient database of somebody else's code that it reproduces.


If Copilot had the ability to write a whole application instead of suggesting based on its lookup and context, you might have a point. Unless proven otherwise I'd consider it a giant lookup table with (self-)adjusting weights.


Unless Microsoft is secretly powering Copilot with Mechanical Turk, that isn't what's happening here.


Why not complete that analogy?

What if instead of Copilot, it was a bunch of humans who were searching all the source code they could access and then copying/autocompleting that code, regardless of the license.

Is that still OK? If yes, why?


A paid service where hundreds of people search for publicly available code snippets and send them to you? I believe they call that outsourcing in some circles.


In this case we'd be sending subpoenas to those people to make sure they hadn't been instructed to disregard licenses or copy large pieces of code verbatim.


You're reading too much into my comment. All I'm saying is that Copilot isn't "learning" like a human does.


That wouldn't be ok, I'm assuming that's what you're implying as well? In any case I would say that's not ok.

Copying code, even if it's from a mix of many different places and the results look like a mosaic, would still be copying.

If the MTurk worker just suggested an implementation they came up with, that would be fine.


Why not make the analogy the other way? If I put massive amounts of code into a database and develop some sort of query language that spits out various parts of the database's contents based on the query, am I covered by fair use?


No, as long as you respect the original authors' terms of use of that "knowledge"


The difference is this: you’re likely working for the company whose proprietary code you’re working on and using as a “training model”, while contributing to the greater good of that codebase.

You’ve been authorised to see this code.


I learned to code when a misconfigured CGI server spit out an application's code (it was written in Perl) instead of executing it. While that starts going down the long and complicated road of whether or not the machine is acting definitively, I think for our purposes here we can assume that the intent was for me to not have access to the code.


So you were probably illegally accessing another person's or company's systems.

Nice you got something out of it, I'm not judging you either, but it was probably not the correct way to operate. What you should've done was notify the owners of the incorrectly configured system and left it at that.

You're also not a massive international conglomerate who should know better than to read every ones code and use it to turn a profit without first asking for permission.

I use Github like a bank, not a public library (unless I'm working on open source). I never would've allowed them to read through all my code and use it for profits without at least asking.


> So you were probably illegally accessing another person's or company's systems.

Illegal would imply some kind of intent or malice. I was legitimately trying to access the executed result, which I would have been authorized to do if the service was operating normally.

> What you should've done was notify the owners of the incorrectly configured system and left it at that.

Seems unrealistic to "leave it at that". I had to read the output to understand it wasn't what I expected, and once I read it I knew how to code, at least to a cursory degree. The code was simple and it was a service I used frequently, so it was immediately clear how the code translated to the results I was accustomed to. Maybe that would be harder to do that now in my old age, but I was just a kid so I had neural plasticity on my side.

> I use Github like a bank, not a public library (unless I'm working on open source). I never would've allowed them to read through all my code and use it for profits without at least asking.

I don't know what kind of banks you deal with, but banks normally do read through your banking records and use that information to sell services to their clients – notably loans, which require knowledge of your deposits to offer.


> So you were probably illegally accessing another person's or company's systems.

Misconfigured CGI handlers in Apache were very common in the late 90s, treating Perl as text/plain. There are no laws being broken, just a bad httpd.conf, and no one is getting locked up for malicious intent.


If I leave my door unlocked, is it OK for you to come into my home and have a party? Could I do that at your house or place of business?

What if you intentionally or unintentionally took down a server that controlled important infrastructure which people depended on greatly? A flood warning system, for example?

Grow up.


So I write you a letter asking for information and you accidentally copy me your notes on how to gather the information in your response. Nothing illegal is happening when I read your notes. Maybe I should not read them for ethical reasons, but it's not illegal.


This only applies if you were reading the code, not executing any code on the remote system (which I thought you were doing). It sounds like you were doing something different.

Either way, I still think you're in the wrong, kind of like checking out a naked person getting changed because they accidentally left their blinds open. It was available, maybe it was clever, but it's a strange way to learn how to code. Why didn't you just buy a coding book, or borrow some from the library? Was the code really of good quality if the server was configured so badly?

Obviously we have a difference of opinion and that's ok.


The issue as described by the original poster was that the code was not executed but displayed. They read it and understood how it works. This set them on a trajectory to try it themselves. This is how they started. Maybe a book was involved at a later stage.

Sure, you can argue that they were not supposed to read the code, so they shouldn't have. But without some tangible harm I don't see why we're supposed to disapprove of it. Maybe allow some hacker spirit while posting on Hacker News :-)


I’ve done similar things in the past so I said I’m not judging them, but after some time working with computers myself I’ve become more compassionate and I think the better thing to do is help a fellow sys admin and report the problem. That’s the hacker spirit.


If it was also free and open source, sure. But it's not, it's a paid product that one party reaps the profits from.


Fine, but what happens if Copilot is so successful that it ends up actively harming the very projects that it requires for training material? If you value Copilot then you should be concerned about that, even if you don't care at all about any ethical considerations.

Seems very similar to a "tragedy of the commons" type of situation.


> I won't be afraid of accidently violating copyright myself, because I won't be trying to bait it into reproducing heavily copy&pasted cherrypicked examples, and I won't use 20 lines of its output with zero modification.

Maybe you can exercise some discipline when using Copilot, but what about your coworkers? Many companies might not want their employees to accidentally insert copyrighted code into their projects.


> Many companies might not want their employees to insert copyrighted codes into their projects accidentally.

They can opt not to pay for this entirely voluntary service.


Yeah I think it just means a non-commercial alternative would be made to replace it


> It is genuinely useful. I don't care that it reproduces copyrighted content.

Plagiarizing is already understood to be a genuinely useful practice.


No, it mostly isn't. It's mostly done to cheat in academic contexts (and thus gets a less skilled student/researcher into positions). Copying entire works of fiction doesn't really add anything to the world.

Copying a passage here and there when making a new work? I don't think the courts have ever ruled that plagiarism.


>The only way you can get it to do that is to bait it with the function names of functions

I get that using function names is an obvious way to get Copilot to generate contested code, but has anyone tried to get Copilot to generate contested code in a way that users might sincerely use it for productivity? How do you get to the claim that it is the "only way"?


> I don't care that it reproduces copyrighted content.

Do you also just rip any open source code, violating the licenses? Nice.


>I won't be trying to bait it into reproducing heavily copy&pasted cherrypicked examples

You don't have to bait it.

>I won't use 20 lines of its output with zero modification.

Depending on the changes, that may still be a derivative work. The entire concept reminds me of the "can I copy your homework?" meme.


>The only way you can get it to do that is to bait it with the function names of functions that have already been copy and pasted thousands of times onto GitHub without proper licenses.

How do you know that this is the only way for copilot to reproduce copyrighted code?


I call these people open source haters. They selectively choose what they want open source to mean, and are against the fundamental ideas of open source.

Long live Copilot. It’s an amazing product that shows what we are capable of thanks to crowdsourcing and bleeding edge technology. We live in the future, and progress never remembers those who tried to stop it.


> the fundamental ideas of open source

A bit ironic that Copilot itself is not open source.


Really had to laugh at this one...


> I call these people open source haters. They selectively choose what they want open source to mean, and are against the fundamental ideas of open source.

B..but, Copilot isn't open source though?


The plea seems to be against proprietary tech built off open source efforts.

If they open sourced Copilot then it would probably comply with most of the licenses anyways.

Like, at the very least, respect the licenses; that means give attribution, and provide your own source as open source under the same terms.

Open source is what allowed this progress in the first place, and the way I see it, commercial interest is actually simply trying to slow it down by keeping it behind proprietary trade secrets.


The definition of "open-source" will be given by the license included with the software (or lack thereof). It could mean that we adhere to the Open Source Initiative, or it could mean that the source code is freely available even though its use is not permissive. The license will tell, not your pre-defined conception of "open-source".


A sizable, possibly plurality cohort of fully adult tech people is young enough to not know about United States v. Microsoft Corp. This would explain a lot of comments I see on this topic.

If you don't know Microsoft's history, a lot of what more informed people are worried about seems overblown. Copilot was Microsoft's first test of people's trust after the GitHub acquisition. It's going very, very, very poorly. There were ways to do this with consent and collaboration with the people and projects it takes code from, but they're acting like classic Microsoft here.

Too many people are focused on what's legal. It's fine to think of, but law is the last stop before the breakdown of society. Microsoft skipped society and went straight to sparking an inevitable test of and possible reshaping of copyright law.


>If you don't know Microsoft's history, a lot of what more informed people are worried about seems overblown.

Or maybe they do know about it, and don't agree with you. Do you allow for such an option?

https://github.com/features/copilot

"What can I do to reduce GitHub Copilot’s suggestion of code that matches public code?

We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment."


> You have the choice to turn that filter on or off during setup.

Notice that Copilot often gives code that verbatim matches open source software, even when that filter is on. For example: https://twitter.com/DocSparse/status/1581461734665367554?s=2...

Their approach of "matches or near matches (ignoring whitespace)" is clearly inadequate, and it's honestly insulting that they think this is enough. Even if Copilot just changed the case of a single letter, their filter wouldn't catch it.
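As a toy illustration of why this kind of filter is brittle (GitHub hasn't published the exact algorithm; the snippets and the whitespace-stripping normalization below are a guess at the idea, not their implementation):

```python
import re

def normalize(code: str) -> str:
    # Strip all whitespace, mimicking "matches ... (ignoring whitespace)".
    return re.sub(r"\s+", "", code)

original   = "for (int i = 0; i < n; i++) {\n    sum += a[i];\n}"
reindented = "for (int i=0; i<n; i++) { sum += a[i]; }"
one_letter = "for (int i = 0; i < n; i++) {\n    Sum += a[i];\n}"  # sum -> Sum

print(normalize(original) == normalize(reindented))  # True: reformatting is caught
print(normalize(original) == normalize(one_letter))  # False: one letter slips through
```

Any exact comparison after normalization has this property: it catches cosmetic reformatting but misses a single renamed identifier or changed letter.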


>Notice that Copilot often gives code that verbatim matches open source software, even when that filter is on.

I saw a few examples, but I don't see how that extrapolates to often. It's quite possible I've missed something in the article since I kinda skimmed it. :)

>and it's honestly insulting that they think this is enough.

They don't. - "We plan on continuing to evolve this approach and welcome feedback and comment."


>They don't. - "We plan on continuing to evolve this approach and welcome feedback and comment."

That is corporate speak for "we plan to do nothing about this".


I am willing to give people a chance. I don't think everyone at Microsoft is identical in their motivations/intentions.


Note that they opened both in the same VS Code instance. And Copilot uses other files in your VS Code project as context to make predictions, so it could have reproduced this code from the open files without having seen it in training.


I didn't say anything to dismiss or discount that some people just don't care or have a different view. I carefully qualified my comment to only address people who don't know, and they do exist in large numbers.


Or some of us do, and just have our own opinions. I literally don't care if someone steals my code, and I think the current state of digital copyright is nonsense that does not benefit society in any way.


You're free to put your code under something like CC0 if you feel this way. Everyone else who puts their code under a license that requires at least attribution and expects Microsoft to follow it can continue taking Microsoft to task for ignoring that responsibility at scale.


That is certainly your right, but not very relevant to this discussion.

Copyright laws should change if needed, but this is not the process.


Microsoft has owned Github for how many years... and this is the _first_ test?


Embrace, Extend, Extinguish. Takes a while to get to the Extinguish phase.


Yep. Four years might be a long time in Silicon Valley, but not in Redmond. Satya Nadella has worked at Microsoft 30 years. This is a company that can think past the next quarter.


Microsoft forgot that we live in a society.

Bottom text provided by copilot.


> Too many people are focused on what's legal. It's fine to think of, but law is the last stop before the breakdown of society. Microsoft skipped society and went straight to sparking an inevitable test of and possible reshaping of copyright law.

Maybe it's illuminating of a trait of human nature. On the stable diffusion webui repo many people have stated that they would continue to use the code even if it were stolen or unlicensed. These people aren't a part of a corporation; they are average netizens handed a technology essentially indistinguishable from magic with nothing in place to prevent its use.

If the tech is simply so impeccable as to be irresistible then a higher order framework needs to be in place to teach people not to bite because they will be bitten back.


When someone presents some information publicly, copying it is a natural right. Copyright can only be legitimized as the state sponsoring short-term monopolies in select areas to subsidize industries that benefit the public (from receiving the said subsidies). Obviously, a technology that lets the public create the desired output at little cost is the sort of thing that can eliminate the need for such subsidization in the first place.


One issue I see with Copilot is that they get free access to all open-source data on GitHub, but using GitHub APIs to download the data yourself isn't possible (rate limiting). This is an unfair advantage. Copilot is not only making money off of open-source, they are making money off of open-source in a way others can't.

I would love to see a lawsuit which requires GitHub to provide their full Copilot dataset.
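For a rough sense of the asymmetry, a back-of-envelope sketch. The 5,000 requests/hour figure is GitHub's documented authenticated REST API rate limit; the public repository count is an order-of-magnitude assumption, and one request per repository is wildly optimistic:

```python
REQUESTS_PER_HOUR = 5_000        # documented authenticated REST API rate limit
PUBLIC_REPOS = 200_000_000       # assumption: order of magnitude of public repos

hours = PUBLIC_REPOS / REQUESTS_PER_HOUR   # optimistic: one request per repo
years = hours / (24 * 365)

print(f"{hours:,.0f} hours, about {years:.1f} years of nonstop requests")
```

Even if each repo needed only one API call, an outsider would spend years fetching what GitHub, sitting on the data, gets for free.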


GitHub does plenty of stuff that you can't do, the contribution graph being just one example that comes to mind.

That's not unfair, and not the basis for a lawsuit, it's just business.

> Copilot is not only making money off of open-source, they are making money off of open-source in a way others can't.

Of course! That's why MS paid squillions to buy Github.


You're not supposed to be able to use dominance in one market (git hosting) to gain dominance in another (AI powered code suggestions).


You might be confused but that's literally what you do as a business. You leverage your domain area expertise to expand into new areas of business.

For instance, Apple already knew about the portable hardware market and extended their reach into the portable music market via iPod. It used the iPod to reach the music marketplace via iTunes. Used its market dominance to create iPhone and the rest is history.

Maybe that's a personal maxim on your part but there's no law that says if you have dominance (what does that even mean?) in one market you cannot enter another. Think about what you're saying. We'd have no multi-product company if that were the case. What does "dominance" or "market" mean anyway?


That's what you do when you are not at risk of being a monopoly in a sector. Apple could do these things back then because it was still a relatively small company in the tech sector. But nowadays things start to get more complicated.

As you point out, establishing what dominance is in a market is a tricky thing, and it is why governments will investigate any potential signs of it. Now, personally I don't think this is really an issue for GitHub or Microsoft (the parent company), but let's not pretend market dominance and abuse of a dominant position are not real things.


>You might be confused but that's literally what you do as a business. You leverage your domain area expertise to expand into new areas of business.

>For instance, Apple already knew about the portable hardware market and extended their reach into the portable music market via iPod. It used the iPod to reach the music marketplace via iTunes. Used its market dominance to create iPhone and the rest is history.

Look at what you're saying here. Apple had domain experience in hardware and launched a new product (good). Then it used the dominance of that product to muscle into an entirely different market (bad). And the combination of hardware and market has led to Apple being able to extract their tax on half the music market, or whatever they have.

This is exactly what we don't want and why anti-trust laws exist.


I think the parent comment is referring to https://en.wikipedia.org/wiki/Tying_(commerce).

TLDR: If you have a monopoly in a market, you can't use your position to get a heads up in another distinct market by bundling products together. Ex: Microsoft can't use their monopoly position in the OS market to get an advantage in browsers by bundling IE with Windows.

It's my understanding that this doesn't apply if you don't have a monopoly, and it also doesn't apply unless you're actually bundling the sale of multiple things together.

Doesn't seem to be relevant here IMO.


"Git hosting" is too narrow a market definition. Someone might be interested in code repositories more generally, and there GitHub won't have practical dominance. Most probably, GitHub as a whole would fall under "software development tools".


That might be true, if GitHub were a monopoly. But they are not.


In the same vein that Google is not a monopoly in search, Microsoft is not a monopoly in consumer and business OSes, Chrome is not a monopoly in browsers, etc.

Just because there is some existing competition with a few percent market share, and technically it's not a monopoly, doesn't materially change anything besides providing a pro forma excuse. That is why Google has been propping up Mozilla: they want the excuse "but technically there's another browser". For consumers and the market, however, it doesn't matter that technically there's an option that practically nobody uses.


a lot of people confuse popularity with monopoly.


And even fewer know that you don't need to be a monopoly to violate anti-trust laws.


What do you mean "You're not supposed to"? Is there some law that forbids this? From my (potentially naive) POV this seems to be roughly equivalent to asking physics professors to stay away from mathematics since they are likely to have some relevant cross-domain expertise.


> Is there some law that forbids this?

Yes. It's called the Sherman Act, and it's the basis of anti-trust enforcement in the US.

https://www.ftc.gov/advice-guidance/competition-guidance/gui...

> The Sherman Act outlaws "every contract, combination, or conspiracy in restraint of trade," and any "monopolization, attempted monopolization, or conspiracy or combination to monopolize."

I know lots of people here don't like it, but it is the law and that was the question; "this" in parent clearly meant "use dominance in one market to gain dominance in another" in grandparent, regardless of whether that's actually the central issue here or not.


Microsoft has been nailed for this exact kind of thing in the past. This isn't hypothetical.


it only runs afoul of antitrust laws if it solidifies an unfair monopoly and/or uses market position as a monopoly holder to prevent fair competition from emerging.

most monopolies are completely legal. small town with only one gas station or one grocery store? 100% monopoly, 100% legal.

there are plenty of alternatives to github, paid, free, as a service, and self-hosted. GitHub has a large market share, but no monopoly.


This would, of course, need to be determined in court. Microsoft argued it had plenty of competition when the DOJ came for it over Windows and bundling. Like I said, this isn't unexplored territory for Microsoft. This is the company of Embrace, Extend, Extinguish and the Halloween documents.

This isn't a company with plenty of goodwill in its sails launching a hip new product. They're not Do No Evil era Google or Apple riding high on the iPod. We can't just pretend there isn't a history.


Microsoft got done because they were threatening vendors with punitive action if they didn't bundle Windows and IE by default.

Which is to say: Microsoft was taking specific action which wasn't a natural consequence of their software, but was being actively enforced to keep out competition. Vendors didn't organically discover consumers weren't interested in getting a PC without Windows and IE, they were prevented from even offering the option lest they be completely denied the ability to offer that at all.


yep. people forget this (including me)


you're making a big assumption here, that Microsoft did not learn from their mistakes.

I posit that they have indeed learned from their mistakes.

1) They train Copilot with repos hosted on github.com.

2) Users who upload code to github.com grant GitHub an explicit license[0] granting GitHub the right to show that code to others.

3) GitHub do not specify which technologies or techniques they may use to show this code to others, meaning they may use any technique they like.

Microsoft have learned. And they have covered their collective asses. Users who host code on github.com agreed to these terms.

Users who don't like their code showing up in Copilot should not be hosting their code on GitHub, because they agreed to have their code delivered to others when they signed up.

[0]: https://docs.github.com/en/site-policy/github-terms/github-t...


From the linked terms of service:

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service[...]

I imagine the crux of this case will be what constitutes "the Service", and whether that includes Copilot. Also whether licensing Copilot counts as selling Your Content.


GitHub provides the code free of charge, you pay for the GPU time to train the AI model, I would assume, since that is what costs GitHub money.


Github likely does have what would be considered a monopoly marketshare, and strong network effects now keep the other competitors from breaking out.


"Github likely does have what would be considered a monopoly marketshare"

Citation needed. I'm interested to know of examples of how they have stifled competitors. AFAIK Gitlab came of age well after Github had already grown super-large. If Github could have killed Gitlab, wouldn't they have done so?

"GitHub has around 56 million users, whereas GitLab has over 31 million users." - https://radixweb.com/blog/github-vs-gitlab

That is also excluding Azure, Bitbucket, AWS, and the plethora of other git repository hosting companies and services. AFAIK, GitLab has more paid (i.e. private or premium) customers than GitHub does (though I lack a citation there; that was my understanding when researching where to host a private company's code a couple years ago. My takeaway then was that GitLab is actually more popular among private companies and GitHub is more popular for open source projects).

What would be considered monopoly marketshare? I would suggest that Facebook, Twitter, and Amazon have monopolies; if you get shut down as an Amazon seller, that can be 90% of your revenue. The fact that so many people can leave GitHub, and have left voluntarily, is implicit evidence that there are very decent alternatives (if there weren't, you would be forced to stay with GitHub for lack of alternatives). That is not the case though; it's easy to just go over to GitLab. From my perspective, I have no idea how GitHub could kill GitLab. I'm curious if there is a vector where GitHub could use its network effects to diminish GitLab, as a specific example. So, how could GitHub do that?


maybe, but it is only an illegal monopoly if they are using their monopoly position to prevent others from entering the market.

they are not threatening to block access to github.com for all Comcast users if Comcast chooses not to block access to gitlab.com, for example. That would be an illegal activity for a monopoly. simply existing as a monopoly is not itself illegal.


Yeah, it isn't illegal to be a monopoly, but being one can restrict practices like bundling.


Microsoft used its OS dominance to... solidify its OS dominance?

Does the law really say you can't include a free web browser if someone else created a paid one?

I don't want to live in a world where potential improvements for consumers get companies sued for antitrust.


I highly recommend looking into the history if this really is new to you.

https://en.wikipedia.org/wiki/United_States_v._Microsoft_Cor....


Yes, this is the basis of antitrust law


But who gets to draw the lines? Why can't both of these be in a single market "tools for programmers"?


In what sense? Certainly not my understanding of US antitrust law.


It's anticompetitive and illegal (if the government ever decides to start prosecuting people under those laws again).


There's a sizable fully adult cohort in tech who are young enough to, at best, have vague memories of Microsoft's antitrust troubles and not get why this is such a big deal. Tech history education could be better.


Not a supporter of Copilot, but I think it's pretty easy to access the same data through BigQuery:

>The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query.

https://cloud.google.com/blog/topics/public-datasets/github-...
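For anyone who wants to try this, here's a rough sketch of what such a query looks like, driven from Python. The table names come from the public `bigquery-public-data.github_repos` dataset; actually running it assumes a GCP project with BigQuery enabled and the `google-cloud-bigquery` package, so the client call is left commented out.

```python
# Sketch: license breakdown of GitHub-hosted files via the public
# BigQuery snapshot, instead of crawling the GitHub API.
# Table names are from the bigquery-public-data.github_repos dataset.

def license_breakdown_query(use_sample: bool = True) -> str:
    """Build a SQL query counting files per license.

    The full `files` table is very large; the `sample_files` table
    keeps exploratory queries cheap.
    """
    files_table = "sample_files" if use_sample else "files"
    return f"""
        SELECT l.license, COUNT(*) AS n_files
        FROM `bigquery-public-data.github_repos.{files_table}` AS f
        JOIN `bigquery-public-data.github_repos.licenses` AS l
          ON f.repo_name = l.repo_name
        GROUP BY l.license
        ORDER BY n_files DESC
    """

if __name__ == "__main__":
    # from google.cloud import bigquery
    # client = bigquery.Client()  # needs GCP credentials
    # for row in client.query(license_breakdown_query()):
    #     print(row.license, row.n_files)
    print(license_breakdown_query())
```

Note the join against the `licenses` table: the dataset does track per-repo licenses, which is relevant to the attribution debate elsewhere in this thread.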


There is a distinction between being able to access the source code, and a tool giving it to you without any context of the underlying license it is governed by.


GP was saying that GitHub has an unfair advantage in that they have instant access to all GitHub code, whereas everyone else is rate limited.

I'm pointing out that this limitation is not meaningful because everyone can access all GitHub hosted source code through BigQuery, where they won't be rate limited.

I'm not comparing BigQuery to Copilot.


Anyone can start their own GitHub competitor and do whatever they want with the source code that ends up on it. GitHub pays the bills and lets us freely upload whatever we want to their service, so it seems a bit entitled to complain about what features or data they provide or don't provide.


I pay for Github because I thought it was a nice, reputable service who wouldn't go through my stuff without asking me first.


Whew... that username checks out.


You've done nothing to refute my point and so it still stands.

I pay GitHub / Microsoft to host my code, and that's all I expect them to do with it, host it, as securely as possible. It sounds like Microsoft are doing more than this so what's your actual deal...if you have one?


?

Yeah, Microsoft has bamboozled users for years... that's the whole point.


Read the EULA, not the marketing copy, next time.


I did, what did I miss? I just re-read it and it's full of statements which are totally inline with what I expected:

GitHub considers the contents of private repositories to be confidential to you. GitHub will protect the contents of private repositories from unauthorized use, access, or disclosure in the same manner that we would use to protect our own confidential information of a similar nature and in no event with less than a reasonable degree of care.


yup...


Ah sorry -

"You have bested me in this epic battle of the minds."


That’s the problem with big tech. When they own all the roads, it makes organic competition almost impossible, as they will always be 2 steps ahead.


Not to be anti-capitalist, but you’re describing capitalism lol


It’s not making money off of open source, it’s making money off of hosting open source.


Yes. But the people who agree to let GitHub host their code are doing so with the expectation that it's "open source", i.e. freely accessible.

If we define "open source" as "you can't necessarily use this to train an AI", then Copilot itself is illegal because it's using code without permission.

If we define "open source" as "you can use this to train an AI", then Copilot is legal, but GitHub may be illegally misrepresenting itself as a host for "open source", as the policy it hosts code under isn't truly open source.

If we define "open source" as "you can't necessarily use this to train an AI" but then GitHub's policy explicitly states "by using us as a hosting provider, you give us permission to use your code to train AI" then they are in the clear. But I doubt they have that clause or at least had it when Copilot was first revealed.


No? How much money would it make without using open source?


why use the API? why not just use git to get the code? All you need the API for is repository discovery


I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

But of course if the data is already sitting in object storage inside your cloud environment and all you have to do is run some MapReduce jobs to get at it...

Hence: unfair, anticompetitive, intellectual-property-right-abusing behavior. Microsoft GitHub (tm) can prevent anyone else from running the kinds of analysis they do by simple "operational security", while running literally any kind of analysis, model training, etc. they want. Don't like it? Buy their commercial services and products so you can run Microsoft GitHub (tm) on your very own Microsoft Azure (tm) infrastructure, using Microsoft Visual Studio Code (tm) and Microsoft GitHub Codespaces (tm), to work on _your_ code privately.

Best of all, you can still take advantage of the huge library of "free" code offered by Microsoft GitHub Copilot (tm) to ensure your private, proprietary codebase still has all of the advantages of Open Source Software, brought to you exclusively by the Microsoft GitHub Platform (tm).


Actually they don’t. I’ve cloned thousands of repos before (tried to archive conda-forge org for a project).

I’ve also built many parallel repo downloaders for CI reasons. You can clone repos all day pretty much with little rate limiting. I haven’t pushed parallelism past 64 per host though
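A minimal sketch of that kind of parallel cloner, assuming shallow clones and the 64-workers-per-host cap mentioned above. The repo names and mirror directory are illustrative, and the actual subprocess call is left commented out so the sketch stays a dry run.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor
# import subprocess  # uncomment to actually run the clones

def clone_cmd(repo: str, dest_root: str = "mirror") -> list[str]:
    """Build a shallow-clone command for one "owner/name" repo."""
    url = f"https://github.com/{repo}.git"
    return ["git", "clone", "--depth", "1", url, f"{dest_root}/{repo}"]

def clone_all(repos: list[str], workers: int = 64) -> list[list[str]]:
    """Fan the clone commands out over a thread pool."""
    cmds = [clone_cmd(r) for r in repos]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Dry run: just render each command. The real version would be
        # pool.map(lambda c: subprocess.run(c, check=True), cmds)
        list(pool.map(shlex.join, cmds))
    return cmds
```

`--depth 1` matters here: you get the current tree without the full history, which is what you'd want for a training scrape and keeps the per-repo transfer small.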


I don't understand. Your favourite boba joint can email every one of their customers a coupon. That's "unfair" to the other boba joints without access to their mailing list too, right? You're just describing a regular old competitive advantage


> I'm gonna guess that Microsoft GitHub (tm) would shut you down pretty quickly if you tried to clone tens or hundreds of thousands of repos in a short window of time, b/c of course that's sketchy/abusive use of their infrastructure, right?

ArchiveTeam has a distributed Github archive project[0]. It's unclear what the status is right now. It seems like a worthwhile idea.

[0] https://wiki.archiveteam.org/index.php/GitHub#Archive_Team_p...


There are accessible code datasets that contain massive scrapes of Github.


Don't forget that GitHub is a closed-source proprietary commercial service.

Their freemium product is useful to many open-source projects and communities, but you do not have any more rights to use Microsoft's GitHub than you have to Microsoft's Windows.


> using GitHub APIs to download the data yourself isn't possible

Is it? Data storage would be prohibitive, but I can see ways to download the entirety of Github in a few weeks/months (assuming my size estimate is accurate).


It's probably the reverse: the data can probably fit on a few commercially available hard drives, but the API calls to discover and download all repositories without running afoul of the rate limiters and whatever other anti-crawling strategies could take years if you're running it single-threaded (a single IP can effectively be considered single-threaded).
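Back-of-envelope numbers for that claim, under loudly stated assumptions: roughly 200M public repos and ~2B files (the order of magnitude of the BigQuery snapshot mentioned elsewhere in the thread), GitHub's authenticated REST limit of about 5,000 requests/hour, and 100 repos per listing page. Discovery alone is tolerable; fetching file contents one API call at a time is what blows up.

```python
# Rough single-client estimate; all figures are assumptions, not
# measurements of GitHub's actual limits or corpus size.
RATE = 5_000            # API requests per hour, one authenticated client
REPOS = 200_000_000     # assumed public repo count
FILES = 2_000_000_000   # assumed file count
PER_PAGE = 100          # repos returned per listing page

discovery_hours = REPOS / PER_PAGE / RATE  # paging through /repositories
contents_hours = FILES / RATE              # one API call per file

print(f"repo discovery: ~{discovery_hours / 24:.0f} days")
print(f"file contents via API: ~{contents_hours / (24 * 365):.0f} years")
```

By these (assumed) numbers, enumerating repos is a couple of weeks, but pulling contents through the API from a single client is on the order of decades, which is why a plain `git clone` per repo, or a pre-built dataset, is the realistic route.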


There are publicly published datasets, updated regularly, and you can stream all the events happening on Github (which is how secret leaks happen) so you can stay abreast of new repositories.

https://www.gharchive.org/ http://ghtorrent.org/

also available on GCP as a dataset, provided by Github itself: https://console.cloud.google.com/marketplace/details/github/...


I think the very least they could do is dump the list of URLs from which they took the source files. The community will figure out the rest.


competitive advantage is not a monopoly, nor is it unfair.


I'm in favor of this. You can't ingest code that says "you cannot use this without attribution", put it through a bunch of if statements that strip the license, and then say it's "AI-generated". I don't care about most of our generic CRUD apps or the 15th rewrite of a sorting algorithm, but I do care about those smart enough to advance the field and come up with novel solutions. If we take away the incentive for attribution and recognition, people won't be as willing to share and we'll all be worse for it.

Like someone else said, there was a version of this where they asked people to opt-in and got community involvement. In true MS fashion, they just did it without asking and people are rightfully pissed.


Totally disagree. Training is fair use. It is akin to learning. Code licenses do not restrict you from reading or learning.

ML training needs to be fair use of copyrighted works, or most machine learning and AI projects will be impossible.


You are assuming that "training" and "learning" in ML means exactly the same thing as "training" and "learning" in humans. It doesn't. The processes are completely different, with only an apparent resemblance.


But the real question is: why should fair use cover both human and machine kinds of learning?


No, they aren't completely different. Learning is learning.


> Learning is learning.

No, learning (by ML) is not learning (by humans). It's just a same word used in different context, and doesn't by itself imply that the meaning is the same. The underlying process is completely different. Neural networks, despite their name, don't share anything in common with human brains.

Do you have any other argument besides "the word is the same"?


The concept is also the same.

> Neural networks, despite their name, don't share anything in common with human brains.

They share enough. Current ANNs are different from human neurons; however, the principle of learning they operate on is very similar.

Machine Learning is learning. You can't look at ImageNet or just about any other model and say otherwise.

I'm informed on how current artificial neural networks work, how they are made, what they are and aren't capable of, etc. Thanks.


The concept is shallowly similar, not nearly the same.

Human learning, at the very least, consists of maintaining an internal model of the world around us and integrating all data into a coherent structure. Current neural networks are extremely limited in their capabilities: each model operates on a strict class of data and is not able to comprehend the world outside of that class, and as such cannot be compared to human understanding. Copilot does not understand how the computer works; it only understands that connections between pieces of text exist.

If you're as informed on how ANNs work as you claim to be, then perhaps you should inform yourself more on how human mind works.


Ok thanks


You're welcome.


Like posts have said, whether training is fair use is not a matter of opinion, it is a matter of law. You can't use an appeal to your authority to make grand statements like this.

Frankly, I don't care that ML/AI _needs_ this to work. That's not my problem. You don't get to circumvent existing agreements (and law) because you believe that ML learning is the same as a human reading a piece of code and then typing it up on the side. Tesla manages just fine by generating their own training data. Other businesses have found partners to acquire data from. The only reason this isn't being immediately addressed is because there is near-zero accountability for license violations in software companies, and ML further obfuscates that.


There isn’t any law yet.

And yes, it is the same thing as learning.

If you have a robot that learns like a human does … you think it should be illegal for that machine to look at GitHub? To watch a Hollywood movie?


A human that watches a Hollywood movie, then goes on to recreate it frame-by-frame with, idk, everyone having cat ears, and says "nah, this is all my original creation" is an idiot. A human that watches a Hollywood movie and then goes on to create new works within the genre, with some homage (say, a specific hat, or a specific framing of a pivotal scene, or a specific lighting choice) to the original movie that inspired them, is learning.


So I think you would agree that the law should address the outputs of learning, rather than the inputs.

It should be illegal to reproduce copyrighted material, but not to “read”, “view”, or “consume” it.

Luckily this is what the law already says.


I think the law should address multiple things, only one of which is the outputs of learning. For example, if both the copycat human and the original director human both watched the movie via stealing it, that's also bad. Especially because copycat human is then going on to create copies of work they never had legal permission to copy! CoPilot effectively cannot tell the difference between an homage and theft.


>If you have a robot that learns like a human does … you think it should be illegal for that machine to look at GitHub? To watch a Hollywood movie?

Yes.


That's absurd, it's like saying you want an army of robot slaves.

Now, wanting to minimize the impact of robotic competition on human wellbeing is understandable. But the means to that end is declining to recognize the property rights of those who try to privatize the commons.


Ideally, I want the "army of robots" prevented from being created at all. But "robot slaves" is close to the second best option.


I agree that training is fair use. I don't agree that when the model spits out verbatim or near-identical copies of copyrighted code, the copyright is somehow stripped or that the usage of it is somehow fair use.

I believe that the vast majority of code that copilot produces is fine. But we have also seen clear examples of copyright violation.

The biggest problem is that it is basically impossible for the user to tell which is which.


I agree there, and I think most rational people would too.

This whole website and "investigation", though, is some sort of sensationalism that seems ultimately aimed at fair-use training, which is very unfortunate. It seems to involve an ego trip and popularity-chasing, with targeting a megacorp as a means to that end.


Most AI projects being impossible sounds better to me than license violations.


Training by reading others works can be fair use.

But the moment you start reproducing more than a few lines of prose without attribution I guess you are in for a nasty letter.


>ML training needs to be fair use of copyrighted works, or most machine learning and AI projects will be impossible.

Good riddance then.


Microsoft has written and acquired plenty of codebases over the years. Train the AI on that.


>> "Like someone else said, there was a version of this where they asked people to opt-in and got community involvement. In true MS fashion, they just did it without asking and people are rightfully pissed."

I'm probably not the first person to say it through this whole debacle, but that might be me: https://news.ycombinator.com/item?id=33242619

>> "Copilot was Microsoft's first test of people's trust after the GitHub acquisition. It's going very, very, very poorly. There were ways to do this with consent and collaboration with the people and projects it takes code from, but they're acting like classic Microsoft here."


That's not what's happening here. Copyrighted work remains copyrighted regardless of how you produced it. If you lived in a secluded cave and completely independently wrote Harry Potter and the Sorcerer's Stone (unlikely, I know, but it's hypothetical), you'd be violating J.K. Rowling's copyright by selling it.

Copilot doesn't help people intentionally launder copyrighted code. It may cause people to accidentally use copyrighted code without realizing it. They're still liable.


> If you lived in a secluded cave and completely independently wrote Harry Potter and the Sorcerer's Stone (unlikely, I know, but it's hypothetical), you'd be violating J.K. Rowling's copyright by selling it.

This is absolutely incorrect. Independent creation is a complete defense to copyright infringement. Funny enough, Learned Hand gives a near identical example to highlight the opposite conclusion ("if by some magic a man who had never known it were to compose anew Keats's Ode on a Grecian Urn, he would be an 'author,' and, if he copyrighted it, others might not copy that poem, though they might of course copy Keats's").


Sorry, you're right. I feel bad.

You wouldn't be able to claim independent creation though by reproducing a work with Copilot.


There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

The latter is obviously a violation of copyright, full stop.

The former, to me, is obviously not a violation. If it were, that would massively tilt the playing field in favor of large corporations. It would become very hard to independently train your own models. Philosophically, I go by the principle that if it's (il)legal to do yourself, then it should be (il)legal to do the same thing with an AI's assistance.

The massive complicating factor is that nobody knows how to do (1) without also doing (2) as a side effect, because we don't understand how deep learning works well enough to control it.


This was a comment made to me in a previous, similar discussion, discussing case law around Google's use of copyrighted books in building a search engine: https://news.ycombinator.com/item?id=32654478

I'm not sure I completely agree w/ the comment (nor do I think it vindicates CoPilot), but I think it does provide insight into why CoPilot is violating copyright.


But fair use, as I understand it, is only about outputs, not inputs. (Licenses can apply to inputs.) A copyright violation occurs when one produces a work that infringes copyright. In this case, Google made digital copies of the books and showed snippets of the books to website visitors. Fair use refers to cases where producing that work is nevertheless legal.

The only analogy I can see is that copying the code internally to use in CoPilot training could be a violation of copyright (like how backing up your own MP3s is a violation of copyright?), but the licenses on these public repositories probably already allow that...


> There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

> The latter is obviously a violation of copyright, full stop.

It's not obvious to me that (2) is a violation of copyright. Unlike patents, copyright violation is not as simple to prove. My understanding is that, at least in the US, independent creation is a valid defense against copyright infringement. For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

The analogue to this does exist without AI, when creating something that looks like copyright infringement, clean room design (don't look at similar things) is often done to ensure that "independent creation" can be used as a valid defense in court. Given that, I think (1) is probably not safe to do at all if you can't prevent (2).


Get Stable Diffusion to output Mickey Mouse and see how far you can use that commercially without Disney stomping down on you hard.

Outputting copyrighted material is a violation of copyright, period. Whether that violation is enforceable depends on your means though.


And why is Mickey Mouse not in the public domain as of 2022? Therein lies the root of all these questions. The system is not designed to benefit people, but rent-seeking.


While I agree that copyright terms are unreasonably long, it's not relevant to this specific case.


> > There are two issues -- (1) feeding copyrighted material in to an AI model, and (2) getting copyrighted material out.

> > The latter is obviously a violation of copyright, full stop.

> It's not obvious to me that (2) is a violation of copyright. Unlike patents, copyright violation is not as simple to prove. My understanding is that, at least in the US, independent creation is a valid defense against copyright infringement. For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

> The analogue to this does exist without AI, when creating something that looks like copyright infringement, clean room design (don't look at similar things) is often done to ensure that "independent creation" can be used as a valid defense in court. Given that, I think (1) is probably not safe to do at all if you can't prevent (2).

I don't think the analogue holds; the AI has a direct view of the actual code. In the most paranoid clean-room design you have two teams: one analyses the behaviour of some software and writes a specification (without view of the source code), and the other then uses that spec to write the reimplementation.

Copilot turns that on its head, you ask to do something it then looks up the source code how to do it and gives that to you.


> For example if 2 people independently write the same story and can prove that they did, they can both hold copyright over that story.

This is the theoretical case but I don't think I've ever seen that actually happen in practice.


> how deep learning works well enough to control it.

I don't think you can control it. Machine Learning models do not create anything, they make a prediction of the expected outcome based on the training/validation data. Similar to how human beings are an outcome of their experiences, so are ML models. Ofc human beings are much more complex than a ML model.


There is no obvious violation or obvious non-violation. It is a matter of fair use and it will be settled in court. Using copyrighted code and not open-sourcing the derivative work (Copilot's model) may very well be a violation.


But is it illegal for AI to provide the said assistance ? That, I believe, is the bigger question.


when you put your code on GitHub.com, you grant GitHub the right to show that code to others.

https://docs.github.com/en/site-policy/github-terms/github-t...

this is separate from the license you specify in the repository and you can't revoke it without removing your code from github.com.


From the text:

>We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use

I would say they would have a pretty hard time justifying using the content for AI training (and selling) based on that license. Copilot didn't exist when many agreed to that license, so an argument that Copilot is part of the Service would be difficult to pull off. Moreover, they don't even provide Copilot to people hosting on GitHub.

Note that MS themselves are not claiming that they are allowed to use the code due to their terms of service. They claim they can do it due to fair use.


I would say they've used code they host to train an AI, and they charge for a fraction of the GPU time required to train and customize their model. They're selling you their model, not the code it produces.

if this is indeed how they charge for Copilot, and I don't know if it is or is not, then they will need to show that they have done their due diligence in making sure that code is not reproduced verbatim when a user requests that it not reproduce code verbatim.

I'm quite sure that GitHub can defend Copilot in court. That's part of the process of offering a new feature to customers; making sure that it is legal and defensible to do so.

All of the armchair attorneys here who think they know better than GitHub's attorneys, when operation of the service puts GitHub's ass on the line ... I wish I had 1 percent of that confidence. I would be a thousand times more confident than I am now.


Does “I give you permission to show this code to others” include “I give you permission to offer this code to others for their use in their code”?


users of github.com are responsible for their own use of any code they find, however they find it.

GitHub shows code to those who wish to see it. it is up to the person using that code to use it according to the license. when I buy a car, it is up to me to use that car according to the law. when I buy a gun, it is up to me to use that gun according to the law. etc.


> when I buy a car, it is up to me to use that car according to the law.

And yet we modified the law to mandate speedometers and seatbelts, to make you more aware of your speed and more secure against failure. We require car companies to perform thousands of crash tests to validate that the tool they give you is safe for when you inevitably push “according to the law” a little too far.

We mandate mirrors and backup cameras because we know that those who intend to follow the law closely still have blind spots and it’s in everyone’s best interest to mitigate and increase awareness.

> when I buy a gun, it is up to me to use that gun according to the law. etc.

And yet few laws have caused the US (and other nations) to question this principle quite like gun laws.

Gun laws are really both a perfect example and the worst example of why we’re having a debate around Copilot. We both expect people to be responsible for their decisions (you need to verify the legality of that code snippet before using it) while also giving them the notion that they can toe the line as much as possible (why regulate the availability of dangerous tools; crime is illegal; users won’t make a mistake).

Guns are used to kill people despite it being illegal. That’s why people want gun control laws. And in a comparison I never expected I would make, perhaps people want AI to be regulated because it will be (is?) used to circumvent copyright.

Edit: I don’t know if I really have a side I stand on in this debate overall, but I think the argument for why it’s copyright violation today is pretty compelling. We wouldn’t make the progress we’ve made without this violation and perhaps the loss of copyright is a worthy sacrifice?


> I think the argument for why it’s copyright violation today is pretty compelling.

I still don't see how there is any footing for a copyright infringement claim here, given that users who put public code on github.com explicitly grant GitHub a license to use that code to provide services to other GitHub users.

that license grant is above and beyond what any specified license terms the repo itself grants to users of the code.

you literally grant GitHub the right to do this when you put your code on github.com.


Actually no you don’t. The ToS is obviously long, but it’s surprisingly human readable and tech friendly (eg they have verbiage on reproduction of your content for search indexing).

Relevant snippet:

> you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking). You may grant further rights if you adopt a license.

The key part is the “through GitHub” portion. GitHub is being careful to not give people rights to your content beyond the right to view it through GitHub. “Perform” refers to multimedia like music and video assets (according to other parts I didn’t reproduce).

No one is gaining a license to use your code through the inclusion on GitHub.

Section D is the relevant section.

https://docs.github.com/en/site-policy/github-terms/github-t...


Fair use is baked into copyright law, "full stop".

The only way to prevent all uses of your code is to keep it secret.

If anyone wants to say that my using Copilot violates their copyright, then sue me. But if you have no loss of reputation or revenue, and I have an innocent-infringer defense, no one can stop me.


> Fair use is baked into copyright law, "full stop".

Someone recently said most statements on HN should automatically get "in the US" appended to them because of how US-centric many of the views are. This is an excellent example. There are plenty of jurisdictions where "fair use" doesn't exist.


> if you have no loss of reputation or revenue

Statutory damages still apply.

> I have an innocent infringer defense

There's no such thing. It's a meaningless phrase, in a legal sense. Claiming that infringement was accidental or unintentional is not a defense. It has no effect on a determination of guilt or innocence. All it affects is the penalty.

Fair use is a defense, but a more limited one than you seem to believe. The usual formulation is "criticism, comment, news reporting, teaching, scholarship, or research" but none of those apply to Copilot. Fair-use claims are also not accepted by default, but only by demonstration that the four factors defining it are all applicable.

Also, since you've brought it up recently, you as the defendant in a copyright case don't get to choose jurisdiction. Usually the plaintiff does, either because it's explicitly defined in the same license that grants anyone rights at all or because it's a place where they do business. If you live in a different jurisdiction that might affect whether the plaintiff or court can collect any penalties, but not whether any are assessed. Having yourself declared persona non grata in multiple jurisdictions doesn't seem like a good long-term choice.

https://copyright.columbia.edu/basics/fair-use.html

Instead of "flooding the zone" with dozens of comments offering nothing but the same few (false) claims - strong echoes of a recently banned user BTW - I strongly suggest you actually read up on copyright and fair use. They're not whatever you want them to be. Courts are unimpressed by your towering intellect.


RSI took away my ability to write any significant amount of code 30 years ago. co-pilot plus speech recognition restored that ability. What impressed me most was that, from a textual description, co-pilot gave me code that could have been written by my mind and pre-injury hands.

from the comments here, if I push copilot into giving me code that I would have written for a given problem and that code violates licenses, then who is responsible for the copyright violation? co-pilot for giving me code that looks like copyrighted code or me for tweaking co-pilot commands to give me the code I envisioned which looks like copyrighted code?

also consider that the very tools used for solving problems in code lead coders to a small number of solutions for a given problem. is it plagiarism or parallel original thought?

also consider that when I wrote code, if I was solving a similar problem to what I solved before, I recreated that previously used code fragment (or larger) and use it to solve the problem at hand. I had zero issues leaving a trail of duplicate code behind me especially if the code was a major part of a software patent.

I didn't care; my code was lauded for its readability and reliability. Reuse the same concepts in multiple variations and you get real good at writing code correctly.

Maybe co-pilot-like programs could scan existing code bases and find examples of code-fragment plagiarism, with the goal of showing that software copyrights are useless.
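To sketch what such a plagiarism scanner might look like (purely illustrative, assuming a MOSS-style shingle-hashing approach; the tokenizer and window size are arbitrary choices, not any real tool's):

```python
import hashlib
import re
from collections import defaultdict

def fingerprints(source: str, k: int = 8) -> set[str]:
    """Hash every overlapping window of k word-tokens in the source."""
    tokens = re.findall(r"\w+", source)
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(len(tokens) - k + 1)
    }

def shared_fragments(files: dict[str, str], k: int = 8) -> dict[frozenset, int]:
    """Map each pair of file names to the number of k-token windows they share."""
    index = defaultdict(set)
    for name, source in files.items():
        for fp in fingerprints(source, k):
            index[fp].add(name)
    pairs = defaultdict(int)
    for names in index.values():
        if len(names) > 1:
            for a in names:
                for b in names:
                    if a < b:
                        pairs[frozenset((a, b))] += 1
    return dict(pairs)
```

Any pair of files sharing many k-token windows is a candidate for a copied fragment; a real tool would also normalize identifiers and use winnowing to shrink the index.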


The issue isn't that something like copilot shouldn't exist. The issue is the disregard for open source. You want a proprietary code completion tool? License your training set properly. If you want to build on open source then open source it and what it produces.


This is a bit off-topic, but I wonder if there are people/teams right now creating git repos, doing the source code equivalent of "SEO" on it, and embedding backdoors in stupidly overoptimized for the training process code?

I wonder when we'll hear about the first big hack that gets traced back to production code pushed live after CoPilot "suggested" eval(base64decode({webshell}))


The less overt version of that is to figure out what mistakes copilot already makes (either things that are common in tutorials but not good in production, or things that are outdated, like hashing passwords with md5), and then systematically looking for software that includes such copilot suggestions.
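As a sketch of what "systematically looking" could mean (illustrative only; the regex signatures and the idea of grepping checked-out repos for MD5 password hashing are my assumptions, not a description of any existing tool):

```python
import re
from pathlib import Path

# Heuristic signatures for one known-bad suggestion: hashing passwords
# with MD5. Purely illustrative; real detection would need AST analysis
# and far more signatures than these two.
PATTERNS = [
    re.compile(r"hashlib\.md5\(.*pass(word|wd)", re.IGNORECASE),
    re.compile(r"md5\(\s*\$?pass(word|wd)", re.IGNORECASE),
]

def scan_tree(root: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line) for every suspicious match."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for i, line in enumerate(lines, start=1):
            if any(p.search(line) for p in PATTERNS):
                hits.append((str(path), i, line.strip()))
    return hits
```

A real detector would parse the code rather than grep it, but even a naive pass like this would surface the md5-for-passwords pattern described above.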


Is there a technique to scan for software that includes copilot suggestions? Or is this just theoretical? Sounds impossible given MS/GH's monopoly on access to the model input data.


Probably not, but I can imagine something similar to hijacking a popular library and publishing a new version that opens a certain port and waits for instructions. All a malicious actor needs to do is increase the amount of exposed servers to be caught while they later scan the internet for anyone with that port open.


What is easier is to probably sneak in a new dependency that points to a malicious fork. Something like an XML to JSON library that does do the thing that is advertised but also additional things.


If that code managed to hit production, then the problem is with the management and engineering leadership, not Copilot.


And none of us here have ever worked under incompetent management or engineering leadership...

(Hell, I've _been_ that incompetent management and engineering leadership at various times over the last few decades...)


It's best to treat copilot like an eager intern who can churn out boring code for you. It still needs to be reviewed.


I'd be rather saddened if Copilot was shut down or neutered because of a vocal few protesting against it.

It's been a massive productivity improvement to our senior devs, and I got so used to it that it's an annoyance when Copilot doesn't respond.


Copilot wouldn't be shut down or neutered because "a few vocal people" protested against it.

It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

You act like Microsoft is trying to do a public service and people are angry about it. The reality is that they're taking billions of hours of work and using it to build a product that only they control.

If they re-released Copilot as FOSS, a lot of the valid criticisms would evaporate.


They should in addition release all code generated as a combo of AGPL, GPL, MIT, etc. and put a comment on every usage. Users would then need to license their code accordingly.

For a commercial version, run it on Microsoft's internal code, the code they actually own!


> If they re-released Copilot as FOSS, a lot of the valid criticisms would evaporate.

That changes nothing at all. A FOSS license is, in a legal sense, no different from a proprietary one. If it’s illegal, it’s illegal regardless of whether it’s FOSS.


I mean, legally, yes, but the point at issue was public outcry, which has almost nothing to do with laws.

If Copilot was FOSS there'd probably be a few absolutists complaining, but they'd be mostly ignored.


If Copilot is deemed a derivative work of its inputs, it would resolve the issue that it does not comply with the AGPL because it currently does not provide source code to its users under an AGPL-compliant license.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

I wonder how many people on HN would be on the side of the creators if we were talking about content created by Walt Disney and whether pirating was ethical?


Equal protection under the law.

I am 100% on the side of content creators, regardless of who they are.

The courts tend to take a dim view of theft. Which is what this is.

The article clearly lays out that multiple requests for a sound legal basis have gone unanswered. It simply doesn’t exist, and Microsoft is operating on a forgiveness-vs-permission model.

Licensing is 100% about permissions: clear and explicit enumeration of the permissions (or lack thereof) for a work.

This class action lawsuit should surprise nobody. It’s a class that is sick and tired of being exploited.

Do not take my work that I contributed with explicit permissions and use it in a way I didn’t grant permission for. Full stop. It isn’t complicated.

You wouldn’t download a car and all that jazz….


It's not "didn't grant permission, full stop."

Fair use is a major part of copyright law. I do not have to ask permission to use your work.

For you to win in court you have to overcome fair use, you have to overcome innocent infringer, you have to overcome no damages.

Anyone leaving comments saying that there's an obvious way a court would rule on a copyright case involving those 3 things is wrong.


If my code is licensed under terms, that’s the permission.

If you use my licensed work, then yes, you do need to follow the terms of the license.

The issue of license / contract / copyright is messy. It doesn’t ever seem (in the USA, anyway) to be definitively answered or “solved.”

I chose AGPL v3 only on purpose.

Copilot and users thereof (so now two levels removed) utilizing that code in whatever work are stealing my work (unless it’s AGPL v3 licensed). The adding of intermediaries, most likely unknown and with no way to know they are infringing, is going to be very difficult to mitigate. It’s like truly unknowingly buying stolen property.


No I don't have to follow your license.

If I use it under fair use, there is nothing you can do about it.

The fourth factor of fair use is the effect on the value of your work. If I'm not affecting the works value because there is no market for it, because it has no commercial value, you are going to have a very hard time defeating this argument in court.


> The courts tend to take a dim view of theft. Which is what this is.

https://docs.github.com/en/site-policy/github-terms/github-t...

So much animosity over rights you gave GitHub when you put your code there. "Theft", gimme a break. You license your code to GitHub so they can show it to others. This is separate from the stated license in your code. Nowhere in that terms-of-service document is the means by which code is shown to users specified.


Give me a break. Did you even read the license? You read that it is restricted to "the Service". That service cannot be Copilot, because Copilot didn't even exist when most people agreed to the license. Moreover, the license explicitly states they cannot use the code to distribute it in any other way or sell it. So if anything, they are violating their own agreement.

Also MS themselves don't even claim that training is covered by their terms of service, they claim it is fair use.


What do you think is more likely: that Microsoft stuffed up their own terms and conditions?

Or that you are mistaken, and "the Service" of GitHub includes all features available on the website, including Copilot?

Even if you're right and a court rules against them, what's to stop them changing the terms to become compliant?


Did you read what I wrote? MS doesn't even claim Copilot is covered by their terms. They claim it is covered by fair use (some people have also claimed there is code not hosted by GitHub in Copilot, which would further confirm that they believe they are covered by something else).

Moreover, the terms have been largely unchanged for years AFAIK. If someone agreed to the license years ago, they can't have agreed to Copilot use. Also, Copilot is not a service on their website; it is a separate service and they charge for it, also contradicting the terms.


If you think I'm mistaken I will gladly reevaluate what I said if you could kindly restate your position, by answering a few questions

What does separate service mean? What would Copilot look like if it was not separate?

Elaborate on "not a service on their website", as it is available and listed as a feature "Github Copilot" on their website.

Is the contradiction related to payment for the service, or just because you think it is separate?

Since I thought you were arguing that Githubs own terms prevent them from using the public repositories in Copilot, this is what I argued against.

If you think fair use is involved, then that's the end of the line. If MS claims fair use, then until a court says otherwise, it is. Anyone who thinks their copyright is being violated can get an injunction tomorrow.


> What would Copilot look like if it was not separate?

Maybe some type of hint shown inside your code when it’s shown at github.com. There is already a text editor.


> They claim it is covered by fair use

they claim the training of their model is covered by fair use, but they did not say that was the justification they were using. They don't need to claim fair use.

It's pretty clear from the Terms of Use that they can use code hosted on github.com to provide any service they like, so long as it is a GitHub service. They don't need fair use, they already have the rights to do what they are doing.


It'd probably turn towards Disney's history of bribing congress for copyright extensions whenever the mouse is about to enter the public domain.


The same copyright laws that the open source proponents are complaining about MS breaking?


There is a certain irony that a likely outcome will be the strengthening of copyright laws and enforcement, at which point people will realize they won the battle but lost the war.


But isn't it copyright upon which open source is founded? The license is based on the right of the owner of the copyright to authorize someone else to create a derivative work?

Without copyright, it would be perfectly legitimate and legal for someone to not follow a license (because it would bear no legal weight because of the lack of copyright).

I would argue that open source is best served by strong copyright protections that allow the people who created the software to make sure that further changes to it are released back to the community. Weakening copyright law makes it that much easier for big companies to co-opt some software and not release the changes back.


The problem with copyright isn't that it exists and provides protections - this incentivises creators.

But life of the creator + 70 years?

Reminds me of this:

https://arstechnica.com/uncategorized/2007/07/research-optim...

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1436186


> The same copyright laws that the open source proponents are complaining about MS breaking?

As long as Microsoft can and will wield those laws against me? Darn tootin'


Steamboat Willie was released in 1928.

People probably would have less of a problem if microsoft breached the license on 100 year old code.


That's a bit of an exaggeration. There have only been two copyright term extensions since Mickey Mouse was created, and only one of those can even remotely be attributed to Disney lobbying.


Even for the term extension that is attributable to Disney, you can also attribute it to Germany and the formation of the EU. "Mickey Mouse Protection Act" is funny, yes, but also an oversimplification.

This is also why I fully expect Steamboat Willie to fall out of copyright protection in January 2024, right on schedule. There are a few countries with supra-EU copyright terms, but none of them are dealbreakers. Nobody is demanding we match Mexico's life+100 terms, for example.


I don’t think anyone here is arguing that an AI trained on a massive corpus of movies that outputs snippets of film based on a prompt would be illegal. Such a thing would literally be the same as Midjourney with images. The fact that you can likely coax any AI to output snippets close to some of the source material is not likely to really matter, any more than if you recreated a copyrighted work using any other tool.


What’s the difference here then?


I don't think many people here would object if Copilot was trained on all the publicly available source code from before 1930.


Wait, I'm not sure what you are trying to say? Can you clarify?


That many people who are criticizing Microsoft for using open source code and claiming “fair use” and saying it’s not fair to creators are the same ones that say pirating digital content is harmless.


Because issues aren't automatically symmetrical, and impact of actions isn't equal in both ways, especially when there's power difference between the sides of the conflict, e.g. it's not hypocritical to punish a bully punching their victim, but not the victim for punching the bully back.

Disney used its power to distort copyright laws in a self-serving manner. As an individual you don't have equal power to oppose them (you were supposed to have in a democracy, but lobbying is legal and corporations are people).

Disney is a huge corporation that won't even notice if you pirate a movie, which you may not even have been able to pay for anyway, because of their region-locked twisted maze of distribution and DRM.

OTOH you may be screwed if you're a creator making a living from your work, and a big corp can just take it without paying, launder it through "… in the style of $YOURNAME" query, and say they own it now, because unlike your copyright, their Terms Of Service apply.

Even people who think copyright shouldn't exist may rely on using copyright — against itself. You can't unilaterally say "I don't believe in copyright", because the law doesn't care, but if you license something as copyleft, then the law does care about your anti-copyright license.


There’s an assumption that hackers are automatically anarcho-communist, and unconditionally, wholeheartedly, vocally supportive of content piracy, which is not correct.


Hackers mostly are, but Hacker News is mostly frequented by bog-standard developers, who mostly aren't.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

Yeah, the vocal few.

Do you think I give a rats ass that Copilot is duplicating my OS code?

I have to imagine most people are completely ambivalent. Of course I have no proof, I just can’t imagine anything else.

The lines probably fall somewhere along the MIT vs GPL camps.


> Of course I have no proof, I just can’t imagine anything else.

"Ambivalent" means "of two minds," but I'm going to assume you meant that you're indifferent.

If people are/were indifferent, their licenses should reflect that. They overwhelmingly don't.

Regardless, Microsoft is legally bound to obey the licenses.


I consider the GPL as meaning, "don't make the same app as me using this very source code". Cribbing one method out of a quarter-million-line code base hardly seems to be redistributing the source code. Literally, it is, but is there no concept of scale? Can we go to an extreme and force anyone with a `catch (Exception e)` line in their Java code to prove they did not take it from a GPL (or similarly licensed) project? I think this indicates there is a line: at some point it is enough code that you are recreating the functionality of the software - to me that is the thing that matters. I don't give a crap if you use my GPL code to learn from and use parts of it to build whatever software, but I do care if you recreate the same software using my GPL code.

I would accept a claim of license violation if someone used copilot to autocomplete so many methods from one specific project that you have recreated that original project.

I still think it is a matter of scope. It can still be the case that a relatively small module is not cool to lift, but in this case we are talking about such small subsets of functionality that they are completely divorced from the original software. I could see it if a specific method were really key in some way to a unique application, a very novel solution to a difficult problem - but if that were the case, how could an AI possibly use that in a training model? The auto-suggestions of an AI are going to be common coding solutions to common coding problems that the AI has seen hundreds of thousands of times. That individual GPL-licensed, unique and novel solution is not really the stuff of an AI suggestion. In other words, the code that co-pilot suggests is going to be non-unique, generic, and not really specific to the overall application at all.


Good point, I meant indifferent.

> If people are/were indifferent, their licenses should reflect that. They overwhelmingly don't.

Apparently, overwhelmingly they do. At least if the licenses used are any indication.

https://github.blog/2015-03-09-open-source-license-usage-on-...


The funny thing is that even the MIT license requires attribution, which of course Copilot doesn't provide.


[flagged]


It definitely does violate licenses -- this is true EVEN WHEN the repo they pulled it from is the original violator.


Well... if you do a search, you'll find that there are actually lots of lawsuits related to open source license violations, mostly around the GPL. Maybe educate yourself first.


Your assertion that copyright doesn’t matter and that people should “get used to it” is… pretty incorrect on the former (surely there are still pending copyright suits on Earth circa 2022) and ill-conceived on the latter (I’ll just rob your house while you sleep and you should just get used to it).


Your comment amounts to "sometimes crime occurs; therefore, laws are pointless."


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus

I'm almost entirely certain you're wrong about the desires bit. 99% of the developers who wrote that code won't mind.


> 99% of the developers who wrote that code won't mind

This is completely unsubstantiated. I for one would mind microsoft profiting from the closed source code I wrote.


> It would be because it's illegal and violates the licenses, desires, and intentions of the thousands of workers who wrote the code in its corpus.

Copilot makes source code much more open, if you think about it. It implements code reuse in a different way than classes and libraries. It offers its skills equally to everyone, skills learned from everyone.

As for the cost of the API - it's expensive to run large language models, I think the price is justified. But there are free models if you like to run your own.


And if it preserves licenses that's fine. Otherwise it's copyright infringement.


Free/libre software ≠ free as beer.


Why should Copilot engineers, and the company that invests in it, not be rewarded for their incredible product and SaaS offering they spend resources on providing?


They built it using some publicly available resources. These resources are available conditionally, subject to licenses (such as the GPL). Which is fine.

The problem is that their product sometimes produces verbatim copies of licensed works, without attaching licensing information. This not only goes against the licenses under which the original authors made these works available. It can also put the product's users in danger of anything from bad publicity to a copyright lawsuit.

CoPilot is a very interesting research project. It's not yet an acceptably mature product though.


Are they following All the licenses they are consuming?

Why haven't they uploaded Windows source to Copilot?

Just how much code reproduced violates copyright?

If, instead of Copilot, Bob was giving me code to copy, and it was an AGPL codebase, am I still subject to the AGPL?


[flagged]


That's not theft. Was the original code deleted? No. Then it's copying. And not even that, is the model replicating the training set like Google search? No. Then it's some kind of derivative work. And especially for Github it's OK because user agreements allow MS to do it.

Is "imagining" the same with "copying"? Does copy-right cover learning-right? Can learning and practicing be restricted by the authors? Can visual styles, algorithms and facts be copyrighted? I say no to all of them.


Humans aren't even allowed to do this. It has to be done clean-room; this is not learning, it's copying, and has been proven so.

The only implementation that would be allowed is if they had to describe the code via a specification and Copilot were able to generate it from scratch. It does not do that; it just reorganizes stuff it's already seen, which would be a copyleft violation for a human, and therefore should be for a machine created by humans.

The code has had its license violated.


> It has to be done clean room

That's for patents, not for copyrights. All you need to do is make it a little different: generate until satisfied for patent-safe code. Another model could also do the patent checking.


I've heard multiple stories from people who can't even look at the Linux source before they do something in their company's kernel because the act of just looking at it compromises them legally. So which is it?


Clean-room implementation only happens for very special code, like codecs or an efficient matrix matmul that took billions of trials to develop with AI. Not for 2-3 line code snippets that do one simple thing and are already covered online in hundreds of places.


One of the canonical examples of software reverse engineering is Phoenix Technologies producing a compatible BIOS for IBM PC. They did exactly what OP described - had one person look at (public but copyrighted) IBM source and produce an extremely detailed design document, then another team went and wrote a new BIOS from scratch following that document. The issue at hand was copyright, not patents.


I think he is using "theft" because currently piracy (which is basically copying files) is legally considered theft.


Larceny and copyright infringement are two distinct legal categories for two distinct kinds of crimes. One involves the taking of property; the other involves the violation of a government-granted monopoly.


It’s not about “learning” at all; Copilot spits out copyrighted, licensed code verbatim, directly copied from the source to your project, in violation of the licenses of specific and multitudinous repositories.

They are making rips of other people’s stuff, selling the contents of people’s “books/movies/songs” sans author attribution or album credits etc., to put it in terms you may be familiar with. Vinegar and salt on open wounds. Bad.


Is there any nuance for scope? For software I've licensed as GPL, I'm concerned about the working software being re-used and re-licensed for something commercial. A given method out of tens of thousands is so un-germane to the overall software that I don't see it as really relevant (but that is my opinion). If someone likes the opening sentence of an encyclopedia (or some other giant work), and an AI says, "this is a good opening sentence", does that really make for "ripping off"? Isn't the covered work the larger contents of the encyclopedia, rather than an arbitrarily well-written opening sentence? Isn't the big part of the work the ensemble?

I'm starting to wonder about these arguments, and whether we've gone into bad faith and hyperbole territory here. Are algorithms subject to copyright? Is it the case that if a GPL work uses a well known algorithm, that GPL work cannot be used as a reference? (Given that algorithms have very limited forms they can take, using an algorithm as a reference is really just copying it. Even translating pseudo-code to code, it's still the same thing).

Can you explain to me how something like using Euler's formula to solve a math problem would not be copyright infringement? A GPL project might use that formula somewhere, but then using it would be a copyright violation?

How about HTML source code, does putting a 'copyright' notice on the webpage make it invalid to then use any of the javascript, even if it has nothing special to do with the domain of the website?

> selling the contents of people’s “books/movies/songs”

Going to this analogy, I don't know if it is really the contents, but more like the first sentence, or even the first few words of that sentence rather than any recognizable subset of that work.

Like, if I have an app that does a spreadsheet, I don't care if you take an implementation of quick sort from my code as reference, but I do care if you use the same and main features of the spreadsheet app that I made.


Whether verbatim copying infringes copyright can only be determined by a court, and only on a case by case basis.

Anyone saying "in violation" without doing a fair use test doesn't know how copyright works.


I’d claim that it’s more than a “few vocal” protestors. If the system is illegal, it needs to become legal or disappear.

If I’m writing code for a query optimizer, the SQL Server solution isn’t going to magically show up.


Is there any evidence that the PostgreSQL or MariaDB solution will though?


It’s not illegal. At worst it’s a fancy code-search tool, and GitHub has the right to show you its results via the license you grant them when you upload code and make it public on GitHub, a license far stronger than what other search engines like Sourcegraph have for showing public code.

It doesn’t mean you have the right to use any of the code it generates but Copilot itself isn’t illegal in any meaningful sense.


This is definitely not true. When your license requires you bundle said license with any reproductions of the code, and Copilot spits out said code sans license, they are breaking the law.


> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

I don’t think it’s accidental that this product is specifically Github Copilot.

But even then I think this is legal overkill. If you use the search box on Github they will display snippets of code from public repositories without the license. Same as what Sourcegraph does same as Copilot does. Nobody here is arguing ripgrep is violating the license by displaying matches without the corresponding license.


The violations are when that code is incorporated into your own codebase, which is happening in none of those examples. If you copy GPL'd code from GH search with a non-compatible license you are still in violation.


Yes but that’s your problem as a user of the tool. It doesn’t make Copilot itself illegal which is what the person I originally replied to was saying.

Yes if you use a tool to violate copyright it’s copyright infringement. If you prod Midjourney into outputting near exact Starry Night that’s on you too.

So far no one has made a compelling case that Copilot itself is violating copyright.


Code search may show snippets, but they're clearly not separated from the rest of the code base, including the license. It may be your problem as a user of that tool if you pirate snippets out of the search results without honoring their license, but GitHub at least didn't distribute the code without its license. A human judge would surely determine that a search-result page showing a snippet and linking back to the project doesn't constitute distributing the code without attribution. Copilot is a different matter: there is no way to know where the code came from, or whether it's novel or a verbatim copy of someone's copyrighted work. Microsoft _is_ distributing code snippets here sans their licenses.


Code search at least provides users the ability to hunt down any licensing concerns. Unless Copilot starts spitting out citations ("this snippet was generated based on repositories x, y, z; here are links"), the users of the tool have no way to verify whether they are in violation of the licenses.


The people protesting aren't a "vocal few"; We're the people who made copilot possible. We are frustrated that our work is being used to profit a massive corporation without any compensation and in a way that we at best did not intend to allow and at worst is in direct violation of the terms we set.


Nah, you're definitely the few. It's not a random sample, but an informal survey of my coworkers found no one who would care, and generally positive sentiment.

The people who comment on something are disproportionately those who care a great deal.


My point was not that those critical of CoPilot are in a majority, it is that our perspective is important because our labour is what makes copilot possible.


our labour is what makes copilot possible

What proportion of its capability is derived from the labor of people who don't like it? I get your point about feeling like an unwilling contributor while github/MS harvests revenue from people who like it. But there's an implication here of being in a critical majority, which I am not convinced is the case.


Perhaps a good compromise is to make it opt-out, if it’s not already. Though even this is just pandering to the developer’s ego. AI writing code is a massive boost in giving users power and thus freedom. Of course, we need to make AI itself FOSS, but I doubt a legal case could be made for that. A more productive path is to clone the model, like Stable Diffusion did with DALL·E.


Careful though: you are trading your (and their) muscle memory and brainpower to be locked into a proprietary solution.

Reread your post. Doesn't it sound scary? You are blocked from even thinking and crafting because a specific web service is down.

Even if Google is down you can go direct to Stackoverflow and MDN, and have a choice of information sources.

Also what is "productivity" ... as in features built / month or lines of code / month?


Correction: even if Google is down, you won’t notice, because DDG works just fine. Last time I went to any of Google websites was 4 years ago.


I tried copilot and found it an excellent way to inject subtle bugs into my code. It always had a seemingly plausible guess, that was never correct, and coding turned into a guessing game trying to spot the bugs it had injected and hoping I’d found them all.


If the tech becomes open (and all indicators point to it being open in the near future) then it will become impossible to shut down. This has already happened with Stable Diffusion and the related model leaks.

People's expectations have already been set by this technology, and they are only going to want more. Also, AI researchers are still publishing their work out in the open for anyone to reproduce.

If there was a Copilot model out in the wild like with Stable Diffusion then this ceases to be a valid question, regardless of the model's legality. All it takes is a single leak or decision by another entity to release their own code generation model.


It's particularly useful to polyglots dealing with multiple codebases in various languages, helping them context-switch fast.

Saves a lot of "hey how do I do this simple thing again?" memory loss issues.


> It's been a massive productivity improvement to our senior devs

I would hate to work at a place where advanced-but-untrustworthy autocomplete would, at all, impact the productivity of a senior engineer.

Not only does this indicate that your senior engineers' productivity is measured poorly (lines of code), but also that your senior engineers are paid to type, rather than to think.


It doesn't indicate either of those things. It's a great tool when you're working on complex code that just reaches the limit of your working memory/attention, and has often suggested good and clean improvements for me. You may dislike it but there are people it helps.


And if the code generated by copilot was attached to a license that you had to obey? Suddenly your propriety solution must be released as open source or rewritten, because copilot is effectively laundering open source code?

Life's a lot easier when you can just copy whoever did the hard work without crediting/paying/etc for it.


My experience is that copilot is shit for anything not super basic.

Incorrect suggestions all the time


There are lots of comments arguing for or against Copilot on a value judgment, and having an opinion on it being ethical or legal, etc isn't going to be the same for everyone. But I think regardless of where you stand, there should be some sort of legal ruling to clarify the gray areas that Butterick breaks down.


Agreed, but I also hate how so much of our substantive law basically has to be created by the courts because (a) many of our legislatures, especially at the federal level, have become more and more non-functional, and (b) IMO legislatures are especially bad at implementing technical legislation.

I think there is a good, fundamental legal/societal question of how copyright should apply to AI output. I just don't think our existing copyright structures handle this question well.

Note there is currently a very important case before the Supreme Court related to this issue, [1] in which the original photographer of a Prince photo is suing Andy Warhol's estate for copyright infringement. The fundamental question is whether the Warhol series of paintings is "transformative" enough of the original photo. While there are always gray lines around what "transformative" means, if there is any chance that Warhol's paintings are legal and non-infringing, I don't see how Copilot could be in the wrong. Copilot's output, even when it contains a substantial amount of the original source, appears to me far more "transformative" than the Warhol paintings are compared to the original photo.

1. https://www.npr.org/2022/10/12/1127508725/prince-andy-warhol...


I agree. Any law that's only clear after a court ruling is, de facto, an ex post facto law. Disgusting.


That's how common law works; it's not disgusting (unless perhaps you're an overzealous adherent of civil law), nor is it ex post facto. Legislation is produced, (claimed) grey areas are challenged in court, and if the outcomes appear unfair then legislators (should) update the law.

Badly written law and poor legislators are a problem in any system.


What's the solution?

Is it really better to only draft laws that are clear without courts?

Is that provable?



Bingo, I feel so uneasy at the thought we could risk a lawsuit because a colleague put unlicensed code in our repos.


Butterick sneakily asserts over and over that Copilot is simply retrieving code from Github ("Copilot's whizzy code-retrieval methods", "Copilot is merely a convenient alternative interface to a large corpus of open-source code", "our work is stashed in a big code library in the sky called Copilot"). This verbiage seems specifically chosen to present a misleading picture of what Copilot is and does.

Copilot is a set of trained weight values in a matrix. There is no source code stored in that matrix. The fact that someone can prompt Copilot with specifically chosen text to generate a short sequence of code that matches a corresponding segment of code used to train the model does not mean that it is somehow "just retrieving" that snippet. It is _generating_ that code, guided by the weight matrix, via pattern-matching based on the chosen textual prompt and surrounding context.
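To make that distinction concrete, here is a toy sketch of my own (illustrative only: Copilot is a large transformer, not a bigram model, and `training_text` is made up). Even in this tiny model, nothing but numbers survives training, yet sampling guided by those numbers can regenerate familiar sequences word by word:

```python
import random
from collections import defaultdict

random.seed(0)

training_text = "the quick brown fox jumps over the lazy dog"

# "Training": the model ends up as nothing but these counts.
# No copy of training_text is stored anywhere after this loop.
counts = defaultdict(lambda: defaultdict(int))
words = training_text.split()
for a, b in zip(words, words[1:]):
    counts[a][b] += 1

def generate(prompt: str, length: int = 5) -> str:
    """Autoregressively sample next words, guided only by the learned weights."""
    out = [prompt]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return " ".join(out)

print(generate("the"))  # generated token by token, not retrieved as a stored block
```

The output is produced by repeated sampling, not by looking up a stored document, which is the generation-versus-retrieval distinction at issue.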

That distinction is significant because one of the primary defenses against copyright infringement in US law is if the derived work is transformative. Copilot is a work derived in part from Github code, but it has unique capabilities far beyond returning short snippets of input code, and the work itself is clearly an extensive transformation of the input data.

This is without even considering whether concrete _outputs_ of the model that happen to match code in a repository used to train it are themselves protected via copyright or not, which is another issue entirely (and not as cut and dried as many folks on here seem to think).


Correct. He's written a great opening argument, as long as you're the sort of person who likes speeches. To me it was full of tricks to prime the reader into accepting his premises as axiomatic, from nuanced rhetoric to pull quotes with attractive color gradients. In my view his actual motivation is the typical 30% cut of any class action settlement that goes to the lawyers, and he sees a lucrative opportunity to combine two skillsets.


And if that's the case, the legal ruling shouldn't stop at code. It should encompass any images or text that aren't in the public domain, don't have copyright owned by the training entity, or otherwise permit this use. Which probably throws a massive spanner into a lot of machine learning. Which may well be fine but would probably be a big setback for ML generally.

Any higher court ruling might well draw some lines between different domains, but be clear that a ruling against GitHub would almost certainly be a ruling for copyright maximization and against fair use in other respects. So be careful what you wish for.


While the moral and legal discussions here are interesting and worth exploring, I find this text hyperbolic. Its premise is that the main way that people currently interact with open-source projects is by digging into their source code, copy-pasting away a snippet of code that solves a particular problem, and then of course giving the authors the required attribution.

This is far from the truth. The main usage of most open-source projects isn't as code, but as a product. The median user of an open-source project wants to think about the project as little as possible. They want to be as unaware as possible of the code that makes up the project. They're happy to add the project to their `requirements.txt`, add a few lines to import and use it and then never think about it again.


I agree with that.

Also, if we agree that GitHub Copilot enables you to be more productive as a developer, can we argue that it could help open-source communities by letting them finish projects faster?


I came to write the same comment as cool-RR. I'm not sure I ever copied a code block from an open-source project. I've copied plenty of code blocks from gists, Stack Overflow, and blogs. Those are the media that could be starved out by massive use of Copilot.

There could be a problem for open source projects (and closed source ones as well) if Copilot could autocomplete with code from private repositories. I can't remember if it looks at them too.


I haven’t used Copilot, but do its samples give links to where they came from? If so, that seems to be a sufficient funnel back to the OSS repo itself, without the community-harming aspects mentioned in the article.


It doesn't, because that's not how it works.

Copilot doesn't recognize what you're trying to do and then paste a code sample from a repo it has in its index. Just like DALL·E 2 doesn't produce images that say "I picked these pixels from this image and this part from this other one and these colors from this third one". When a model is trained, it's effectively a set of hundreds of millions of numbers that when combined in just the right way can produce a specific output. In my experience the vast majority of the time Copilot doesn't write code that already exists. It actually uses the variables you declared, the functions that already exist in your code base, etc.

It's not an index of best matches from GitHub for what you're trying to do.


They don't. It would be a profoundly difficult problem to find the right links for each suggestion.


Not at all. There isn't even a way to get the "source" if you wanted.


The whole point of open-source is about re-using and modifying the source though. Sure it allows using the product, but that's hardly the defining factor of open-source.


The point is access to the source. If that access is mediated through a system that doesn't tell you where the code comes from or how it's licensed, it's failing at open source.

It's not like they don't get it. Even Microsoft will provide the source of its products under certain circumstances for this exact reason.

https://www.microsoft.com/en-us/sharedsource/


It's always interesting to see the buzz that occurs when Copilot is brought up as a topic. This place is called "HackerNews", yet routinely people forget that a "hacker" is somebody using technology to overcome novel problems. Doesn't GitHub Copilot fall into this category? Why is there such an outcry over a technology that has been in the public's hands for less than a year? I'm almost certain that the team responsible for Copilot is going to try to figure out how to avoid spitting out code verbatim, as that's obviously not a good look.

It's most likely the case that in 1, 3, or 5 years, Copilot won't be spitting out code blocks verbatim. It will generate right-sized code, trained on lots of publicly available code, and start reducing the surface area required to code and develop.

Stable Diffusion doesn't get in trouble right now because the artwork looks like permutations of different works; text is easy to copyright, style is more challenging, but artists are facing up against the same reality. There's no rolling this back; ML models are going to remove a ton of cruft from creative/labor based endeavors, and people are going to need to evolve to stay relevant.


Plenty of people approve of individuals doing things they disapprove of large corporations doing.

For example, I wouldn't care if a small YouTuber used a copyright song in the background. I would care if Disney stole a small YouTuber's original song and used it in a movie.

This is entirely consistent within my ethical framework: scale and power matters.


I think it's cool that Copilot exists and it's a worthwhile scientific discovery. However Microsoft is re-selling this to companies and claiming that they can use that and owe nothing to the authors of the code, whose license terms they can ignore, and that is where the problem lies.

When a hacker finds a new way to get into systems or root their phone it's cool. When someone uses that technology to steal money or personal information or encrypt your files it's criminal.

Nothing new in this case.


Personally, all my open-source code is in the public domain (CC0), so it's fair game. However, taking any other code without regard to its license or the author's permission is unethical and undesirable.

At the very least, add a comment in the generated snippet noting where the code originates from. That won't suffice in all cases, but it's disingenuous to profit from others' work without any credit or permission.


Fedora no longer accepts the CC0 license for code. If it's public domain code, you should be using 0BSD.


Nothing is in the public's hands; they have taken the public's data and given back nothing but a black box that you pay money for.

They don't even trust the thing to train it on their own code, yet their boss is over here telling us they are "learning". It's a damned insult.


I won't shed any tears for Microsoft if people liberate or reverse engineer the model weights.


I think the main argument is that it's a proprietary piece of software that piggybacks on hacker (GPL) culture for profit. If they released an open-source Copilot it would be less of an issue, and the theft of GPL code would be less painful. Some people worry that this will kill the GPL and hacker culture, which is already extremely difficult to protect.


That assumes that the open source community is going to be the same in the future...still generating code just to have it frozen, dehumanized and capitalized upon by Copilot:

> Meanwhile, we open-source authors have to watch as our work is stashed in a big code library in the sky called Copilot. The user feedback & contributions we were getting? Soon, all gone


What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models? Where you have to opt into models or can opt out (and many wind up doing so)?

OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

Most of these complaints seem to be extremely emotional and cherry-picked. "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that), "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

It seems to me that the longer-term implications of the outcome of a lawsuit like this are far more interesting, yet almost all the comments I see are nitpicking and whining about how the world isn't the way they want it to be. I wish the conversations around generative AI could be...just better.


> "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out"

https://twitter.com/docsparse/status/1581461734665367554

An English description plus three characters of a function name is enough to coax Copilot into distributing LGPL-licensed code out of context, without a proper license. That's neither "emotional" nor "cherry-picked"; it's a clear-cut license violation.


I mean, he clearly knew about that code in advance and used his prior knowledge to coax Copilot into spitting it out, yeah? Three characters can get you pretty far, that's 1 combination out of 125,580 (considering all english letters, upper and lower, along with most of the numbers and symbols on my keyboard), plus the description of a fairly complex algorithm.

Also, this code is really just executing a mathematical operation in what I would assume is fairly standard, so it may not even fall under copyright. IANAL.

Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

That sounds cherry-picked. Just because some high profile, highly-retweeted person says something doesn't make it not cherry picked.


Of course it is cherry picked. The idea is that it allows you to INTENTIONALLY void any copyright you want.

So let's say I obtain an illegal copy of Microsoft Windows' source code. Under this precedent, what stops me from just training (overfitting) a neural network to produce the source code verbatim, sans any license notice?

But it doesn't end there. What stops me from making a neural network that exactly reproduces the bytes of Illegally_Ripped_Disney_Movie.mp4 that I obtain from the pirate bay? Copyright need not apply.
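A deliberately overfit model of this kind is trivial to build. Here is a toy sketch of my own (the `data` bytes are a made-up stand-in): a linear layer over one-hot position inputs is just a learnable lookup table, and gradient descent drives it to reproduce its "training data" byte for byte:

```python
import math
import random

data = b"Illegally_Ripped_File"  # stand-in for any copyrighted bytes
n, vocab = len(data), 256

# One weight row per byte position: with one-hot position inputs,
# this "network" is nothing but a learnable lookup table.
random.seed(0)
W = [[random.uniform(-0.01, 0.01) for _ in range(vocab)] for _ in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Cross-entropy gradient descent, intentionally run until it memorizes.
for step in range(20):
    for pos, target in enumerate(data):
        probs = softmax(W[pos])
        for b in range(vocab):
            W[pos][b] -= probs[b] - (1.0 if b == target else 0.0)

# The trained weights now "generate" the copyrighted bytes verbatim.
reconstructed = bytes(max(range(vocab), key=lambda b: W[pos][b])
                      for pos in range(n))
print(reconstructed == data)
```

Externally this is "just a set of weights," the same description defenders give of Copilot, which is exactly the point of the question below.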

At what point is the neural network I've described (which is intentionally designed to violate copyright) distinguishable from a neural network like Copilot and others, which violate copyright extrinsically?


> The idea is that it allows you to INTENTIONALLY void any copyright you want.

It doesn't void copyright.

Anyone that uses code that Copilot spits out which infringes on someone else's copyright is still liable. There's no requirement for intent. That may be a mitigating factor in terms of remediation, but it cannot void the copyright itself.

It can, however, produce a plague of completely ignorant copyright infringement, and since the user of Copilot has no idea where the code is coming from, there's no way to check if it was trained on infringing code.

If I used Copilot I would be really worried about the legal implications for me, and that I could easily be accused of copyright infringement or plagiarism.[*] Of course, people seem to not really care these days if they can cheat to get ahead, so this is probably a feature, and 99.9% of users won't have their reputations ruined by using it.

[*] Although I'm personally more worried about the fact that most of the code will be wikipedia/blogs-quality and filled with bugs, edge cases and performance issues.


Large companies don't worry about this because they already have tooling that scans code and matches it against other public code out there (to avoid legal troubles if some dev quietly copies some GPL'd code or something like that). Everybody else basically has to manually check every snippet produced, which makes all the supposed convenience moot.
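Conceptually those scanners work something like this sketch of my own (real tools such as Black Duck use token-level fingerprinting and fuzzy matching; the C snippets here are made up): hash normalized k-line windows of known open-source code, then flag any window of new code that collides.

```python
import hashlib

def fingerprints(code: str, k: int = 3) -> set:
    # Normalize whitespace so trivial reformatting doesn't hide a match.
    lines = [" ".join(l.split()) for l in code.splitlines() if l.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + k]).encode()).hexdigest()
        for i in range(max(0, len(lines) - k + 1))
    }

gpl_snippet = """
int cmp(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
"""

pasted_file = """
static void helper(void) {}
int cmp(const void *a, const void *b) {
        int x = *(const int *)a;
        int y = *(const int *)b;
        return (x > y) - (x < y);
}
"""

index = fingerprints(gpl_snippet)
print(bool(index & fingerprints(pasted_file)))  # the copied window collides
```

Running something like this over every Copilot suggestion is exactly the manual-checking burden that makes the tool's convenience moot for smaller shops.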

Between that and the typical quality of Copilot snippets, it's pretty obvious who the real beneficiary of this technology is: large sweatshops like Infosys.


Buggy code can easily come from copying Stack Overflow or even from GitHub open-source repos, so I doubt Copilot would contribute to it any more than those already do.


It doesn't void anything. If you use Copilot to copy some licensed code illegally, you are the person in breach, not the tool. People using Copilot are possibly littering their code-base with future copyright liabilities, and they'd have no idea about it. Until someone writes an AI to find infringing software and automatically sue them...


Yes, that's the other outcome, which is the industry getting GPL'd to death for using this. But I don't think there is an established precedent when it comes to this.

I don't know much about AI/ML, but I think the logical outcome is that using a transformer to generate source code will produce an over-trained system that emits sections of code verbatim, because of the relative size of the space of all valid programs vs. the space of all possible programs. It's not like art, where you get soft failure if a single pixel or word is wrong: the inclusion, exclusion, or replacement of a single instruction or symbol is enough to introduce fatal bugs into a computer program. If the system doesn't have the capacity to understand programs generally (which it probably won't if it's just a transformer), then you're going to end up with a system that spits out the samples from its training data that do work.


> to copy some licensed [anything] illegally, you are the person in breach, not the tool

ahem napster, pirate bay ...


It’s fairly obvious if Copilot is regurgitating entire blocks of code that someone else wrote.

In 999 out of a 1000 cases it’s just spitting out boilerplate though.


If you told the legal team at any mid-sized or larger company "we're pretty sure only 1 in 1000 lines of code our developers write breaches someone else's copyright" there'd be some serious hell to pay.


No, no. 1 out of every 1000 lines of code has the potential to breach some form of license.

Someone would have to be motivated to search through our entire (proprietary, private) codebase, match it against repositories that are freely available, be properly motivated to make a problem out of it (some Twitter randos?), and, most importantly, gain some benefit from spending hundreds of thousands of dollars engaging with our legal team.

By the time you satisfy all the conditions required for it to be an issue you are talking nation-state actors.


Yes, but if you have a large team you'll be using it hundreds of times a day. I will not be surprised if Copilot indemnity insurance is a thing in M&A in five years.


Yep. It would be useful if more people in these conversations had hands-on familiarity with the tool. I don't blame them; that shouldn't be expected or required. But there is a large gap between how bad this looks and how materially bad it is once you take into account how Copilot is actually used.


A lot of M&A activity uses tools like Black Duck that do exactly this; they flag partial matches.


I don't think there will be a ruling like "anything from a neural net is yours", that'd be a bit ridiculous for very obvious reasons.

I'm no copyright law expert and I'm certainly not a lawyer, but it seems to me that in your examples you're setting out to violate copyright as a goal, which seems like it would be a factor in a court case.

To answer your last question, your examples are pretty clearly distinguishable from Copilot in their final states that you describe. IDK exactly *when* during overfitting that line is crossed, maybe it's crossed the moment you personally decide to knowingly publish copyrighted content and has nothing to do with the neural network itself?


This doesn't make sense: a byte-identical copy of another work is obviously not transformed, so anyone claiming fair use who relies heavily on transformation would fail.

But the only thing that does is make factor 3 of the fair-use test harder to clear. Not impossible.

There are fair uses of copyright that use the entire identical work as is.



> Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

If you, even only once, steal lines of code that you have no license to use, and use them to make money, that's the same exact thing. "Trusting the algo" and saying "whoops, I'm sorry" doesn't make a strong legal defense.

In a company of 1000 programmers, what are the odds that Copilot increases the risk of using improperly licensed code, because "well, Microsoft gave it to us, so it has to be legit!"?

And sure, stackoverflow copying is a thing, but they clearly tell you the license by which you can use said code: https://creativecommons.org/licenses/by-sa/4.0/

If copilot gives you CC-by-sa code, will it tell you so you can properly credit?


> And sure, stackoverflow copying is a thing, but they clearly tell you the license by which you can use said code: https://creativecommons.org/licenses/by-sa/4.0/

There are posts under an earlier license which was CC BY-SA 3.0.

There are people who don't have accounts anymore or haven't logged in to accept an updated license.

Only the changes made to a post after the 3.0-to-4.0 switch are technically licensed under 4.0 (the original post is still under 3.0).

Furthermore, Stack Overflow didn't follow the proper process for updating the license.

https://meta.stackexchange.com/questions/333089/stack-exchan...

For example - https://stackoverflow.com/posts/11574647/timeline

Look at the license and the Aug 22 change and consider if that removing "Hope that helps" was a sufficient change to relicense it.


> what are the odds that copilot increases the risk of using improperly licensed code

In a company of a thousand programmers there are much easier ways to find improperly licensed code.


Not all lines of code are made equal under the law. If they were then Oracle would have a copyright on the Java API. Fortunately they do not.

So, no. You can in fact "steal" several lines of code and use them to make money and be legally clean as a whistle. It isn't that clear cut.


Oracle DOES have a copyright on the Java API. Google's use of it was found to be fair use, but the SCOTUS did rule that it was copyright infringement. https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf


The SCOTUS decision in Oracle v. Google didn't rule on API copyrightability. It merely assumed that the code in question is copyrightable, then showed how it's still fair use even if so, thus making the first question irrelevant to the decision. And this is very much intentional; they spell it all out:

"Google’s petition for certiorari poses two questions. The first asks whether Java’s API is copyrightable. It asks us to examine two of the statutory provisions just mentioned, one that permits copyrighting computer programs and the other that forbids copyrighting, e.g., “process[es],” “system[s],” and “method[s] of operation.” Google believes that the API’s declaring code and organization fall into these latter categories and are expressly excluded from copyright protection. The second question asks us to determine whether Google’s use of the API was a “fair use.” Google believes that it was.

A holding for Google on either question presented would dispense with Oracle’s copyright claims. Given the rapidly changing technological, economic, and business-related circumstances, we believe we should not answer more than is necessary to resolve the parties’ dispute. We shall assume, but purely for argument’s sake, that the entire Sun Java API falls within the definition of that which can be copyrighted. We shall ask instead whether Google’s use of part of that API was a “fair use.” Unlike the Federal Circuit, we conclude that it was."


The tragic irony here is that the understanding of copyright that those that do not like Copilot put forth would indeed make things like Java's API copyrightable and to the obvious detriment of innovation.


I don't see how this is related. The question wrt Copilot can be distilled down to "what constitutes a derived work", but there's no doubt that the original source code that Copilot was trained on is copyrightable. Conversely, with Java APIs, there was no doubt that Google's use of them produced a derived work - the question was whether the original is copyrightable and/or whether that is fair use.


> the SCOTUS did rule that it was copyright infringement.

In the US, "[T]he fair use of a copyrighted work ... is not an infringement of copyright." (17 USC 107, Oracle at 14). Whether there is a difference between a finding of no infringement or infringement with no liability is academic.

Practically, the copyright on the Java API is commercially worthless because after Oracle anyone may freely copy any and all of it and use it to compete with its creator.

Any other software vendor who thinks it has platform "lock in" because its customers built to their API should take notice. (e.g., Amazon AWS).


> That sounds cherry-picked. Just because some high profile, highly-retweeted person says something doesn't make it not cherry picked.

It's his code. This "high profile, highly-retweeted" crap is an appeal to emotion. He has a specific and legitimate interest in protecting his own intellectual property. It's not "cherry-picking" to report a crime being committed on your front lawn.


So why is he complaining about copilot and not the thousands of GitHub repositories redistributing his code with improper license? Typing the little code snippet he showed into copilot is analogous to typing it into the GH search bar and grabbing a properly-licensed result.


> It's not "cherry-picking" to report a crime being committed on your front lawn.

And as soon as he sees someone actually take the chair from his front lawn he can report it as a crime. He cannot report the people walking past because they could potentially steal his chair.

One could even argue that if he didn’t want his chair taken, maybe he should have locked it in his shed.

Of course, these chairs duplicate, so it’s not as if he loses his own chair.


Redistribution of that code, absent its license, is a violation of the license. That's already happened.


I'm done with this thread.

No, it is not.

For it to be a violation you have to lose a court case. To lose a court case a court has to find against your fair-use defense.

A fair use defense is fact specific to the parties involved. What's fair for you might not be fair for me.

Only a court can determine fair use.


Sure, duh, this discussion does not constitute a legal ruling. Nobody here is a lawyer nor a judge presiding over the case. We're potential subjects of a class action suit discussing grievances and merits of the case.


Microsoft Copilot is easily the greatest theft of intellectual property in the history of man.

You want to use my code, without ever knowing I wrote it? You want to use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation? FUCK YOU

There's no need to defend something so obviously harmful, so why do you do it?

The law should be amended to make this kind of theft illegal.

It's not ambiguous.


It is a code laundering tool so corporations can steal open source code rights. It's the plainest thing I've ever seen. And they charge for it, on top of that. The deniers are out of their minds if they can't see what's coming.

This has the potential to severely damage open source - I would not host my open source project on GitHub, especially if it was copyleft. I'm sure many others wouldn't either. Some of these authors make amazing software that we might not see because of this.


Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music, so it can be used ("written", "painted"), in whole or in part, by everyone else, without attribution, then go ahead and opt in.

Otherwise, you can't use the artist's or author's creative work for training.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.


Are you saying the act of training the model itself is theft? Or you’re saying that using it is theft?

You can have a totally legitimate business making hacksaws and bolt cutters.

Now if your customers use these tools to break into homes and steal things, then yes, that’s illegal.

But making the hacksaws and bolt cutters is not.


How is me building a novel application consisting of manually linked libraries and source code any different than building a novel application out of what Copilot generates? The difference is that Microsoft is pretending attribution and source licenses don't apply to the code it generates, even though it would in any other context.


The difference is that you could be unwittingly taking on liability in the copilot case. So that's fun. But also in the copilot case, microsoft distributed the code to you without a license, in violation of the license.


> Even if it is a copyright violation, that is one out of, IDK, millions, maybe billions already of Copilot completions?

Are you saying that because there are millions of copyright violations, Copilot is too big to fail? Or are you saying that Copilot is too big to be held accountable for flagrant violations of the law?

I guarantee Copilot’s developers knew it was spitting out verbatim code. It’s too obvious, and probably would result in a perfect rating for the prompt.


I interpreted it as them saying only a tiny, tiny fraction of Copilot completions violate copyright.


Yeah, but that doesn’t matter does it? That’s like saying, “Well, your honor, most of my cars aren’t stolen vehicles.”


Yeah, degree generally matters. Airplane flights rarely kill people, so we allow them. If 50% of flights resulted in death, we would not. Google searches rarely illegally return copyrighted content, so it's allowed. The Pirate Bay searches often return copyrighted content, so regulators keep shutting it down.

Stealing a car is a large degree of crime for an individual. If they had stolen a penny, it would be a small degree of crime and we'd be more willing to let it slide.


I'm sure you see you've been downvoted. But I want to say I agree with you on the millions / billions bit.

While I understand the rub about licenses, the fact is the vast majority of code is not all that original or unique. Some fringe amount is, and those edge cases are worth discussing.

But the rest? All in all, likely not that special. Yes, we get paid good money to do it. But is that a function of what it takes to do the work, or of the demand for the skill (relative to the supply of that skill)?

Frankly, I think some ppl just plain ol' fear Copilot. And either don't want to admit it, or they have buried that fear. I'm not advocating ignoring the law / licenses. But putting a license and lipstick on what is an everyday pig doesn't make that pig a unicorn. Does it?


This example is one of those rare pieces of code that is special though. It's the product of years of deliberate work by professor-level academics. This is exactly the kind of person who would have the least to fear from copilot if it really was just automating the boring plumbing parts and not shamelessly copying high-value, creative, insightful code.


I understand.

But that's not the type to fear Copilot. Yes, they might object to the license violation. I get that. I acknowledge that. But when you're that intelligent and that creative you don't fear being replaced - displaced? - by something like Copilot. Nah. That's a fear for the mundane and the common. That's a fear for the rest of us.


Right, so the fact that this is the person complaining implies that it's not about fear at all, and is likely a far more legitimate concern.


Or that the professor has a legitimate concern, and a lot of people in the comments here don't have any code being stolen and are just afraid of Copilot.

It can be both things. (I'm not endorsing either view, just trying to clarify.)


Fear here on HN.


Suppose you wanted to do what some code does, then you see this LGPL code. What can you do? Adjust variable names and play with line spacing and comments until it feels different?


First off, that's a library of pedagogical implementations, so I wouldn't even want to copy it -- I'd prefer a library focused on performance. Second, it's linear algebra, there are alternative implementations and libraries out there. Third, it's covered by the LGPL, so I'd be perfectly happy to link to the library. Fourth, I'd look up a pseudocode description and go from there. In no case would I sit down with another person's implementation and give it the undergrad treatment to pretend that I'm not copying.


"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...


So if it's fine to copy the pseudocode implementation of a non-public-use codebase, then you couldn't possibly object to recreating a codebase in a different language, right?


> Suppose you wanted to do what some code does, then you see this LGPL code.

Suppose you want to express what some other writing does, and then you see that writing? What can you do?

(You write your ideas in your own words, and quote and cite your sources)


But I don't care about the expression of the idea. I just want the idea to work. And I don't know how to do the idea myself. And I've seen how you've done it.

If I want to describe life with a nature metaphor, and then see you do it with a waterfall, I can probably get away with using a waterfall to the same metaphorical effect in my story.

Can I do that in code?


Many open source projects don't allow contributions from people who have worked with similar projects under incompatible licenses. I remember https://github.com/cisco/ChezScheme/pull/376#issuecomment-45... and https://wiki.winehq.org/Developer_FAQ#Copyright_Issues


Write it yourself


I wrote it myself and it came out looking very similar. What do I do?


So, then, there's no way to know whether someone wrote it themself or "copied" it verbatim.


There is legal basis for determining if software is copied.

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...


I would love to see how that applies to Copilot!


It's a violation, but this example doesn't jibe with how I use Copilot, for whatever that is worth. Usually Copilot takes into account context and integrates the various bits into a whole that fits in with the surrounding code. The "empty file, type function signature" use case feels somewhat dubious for trying to understand actual, not theoretical, harm. Though I respect this person's right to defend their rights.


It’s a contrived example. This is not how people use copilot in the real world. Copilot is designed to be used in context. The completions it provides in context are specialized to your project.

Artificially starving copilot of context and then showing that it recites parts of its training dataset is mundane.


Here is this guy's function copy-pasted on a SO question:

https://stackoverflow.com/questions/17913191/using-typedef-i...

Found it after 5 mins and a couple tweaks to the search terms.

Another:

https://vdoc.pub/documents/direct-methods-for-sparse-linear-...

Someone copied this guy's book and put it on scribd: https://www.scribd.com/document/514019650/Direct-Methods-for...

Someone put it on a "personal" edu page:

https://people.sc.fsu.edu/~jburkardt/c_src/csparse/csparse.c

A modified version of it here marked as open-source:

https://github.com/rwl/CSparse.py/blob/master/csparse.py

More:

https://tonus.pages.math.unistra.fr/schnaps/schnaps/csparse_...

Google search used to find them:

https://www.google.com/search?q=Sparse+matrix+addition+%22ch...

Could probably find more if I looked harder.

Side note, looks like in a lot of places people do the ""proper""-ish thing and leave this guy's name on the code.


From what I've seen on the art side of things, the more a certain work has been copied in the real world (and thus in the training set multiple times), the more likely it is you're able to get a very close copy out of the model with the right prompts.

For example, I'm pretty sure this is why some models turn up a near exact version of The Girl With The Pearl Earring.


Other people doing it doesn't make it okay.


Jumping into this but I've honestly lost my train of thought, lol

That said... is that any different from someone copying and pasting into their code vs copilot doing it?

If someone randomly pastes code that has a copyright, and people use it, how are they supposed to know they shouldn't be using it?

I imagine we're talking functions here though. Not sure if Copilot would reproduce entire libraries unprompted if they're not open-source, anyone have an answer?


Nah, mostly what we've learned is that AI has become lazy enough that it goes out to Stack Overflow just like the rest of us.


[flagged]


> Edit: Downvotes mean you disagree with reality.

Alternative theory: you're overconfident and wrong, and haven't even read the tweet I shared. CoPilot is observed to reproduce entire files (less two lines), down to variable names and comments, verbatim, from a repo that's covered by the LGPL.

If verbatim copies are not covered by copyright, then software can effectively never be copyrighted. And that would be an extreme departure from "reality."

Regarding your edit: this is a verbatim copy. The process you're describing is actually strengthening, not weakening, my argument: it allows material dissimilarity in the face of an overall similarity. Whatever process is meant to be applied to the licensed code and the allegedly infringing code, it will produce identical output given identical input. You're attempting to claim that the entire file would be deleted by the process and that the remainder would be an empty comparison, which would (granted) be covered by fair use. And since CoPilot has been shown to redistribute multiple files, your "case" would hinge upon the entire repository not being covered by copyright. This is beyond absurd.

edit 2:

> Or don't! Just keep downvoting in support of your fantasies! Wheeeeee!

Please review the site guidelines. You're both whining about downvotes and sneering at the community with this. Probably time to take a break from the keyboard.

edit 3:

> Care to address the content of my claims?

Probably time to take a break from the keyboard.


From your second link:

> In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program.

Copilot is alleged to be reproducing blocks of code verbatim, which fall into the expression side of the idea/expression distinction, which by your own links and statements appears to show that the allegations against Copilot are not false.

I'm not going to say you're necessarily disagreeing with reality (I'm not really sure what that's supposed to mean) but you're certainly contradicting the evidence you've brought.


This quote:

> In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program.

Seems to be trying to force everything associated with programming into a false “expressive/abstract idea” divide. Abstract ideas are distinct from expression and not subject to copyright, but not everything outside of the scope of copyright is abstract rather than detailed: notably, functional elements.


No, it's an illustration of two ends of spectrum, and only one of three parts of AFC. All I'm saying is that the original comment gave us three links and that they didn't strictly agree with what the comment said without further elaboration.

I will say, though, that the allegations of basically copying and pasting exact instances of substantive portions of copyrighted codebase appear to be at one end of the spectrum.


> If the code was covered by license in the first place

the comment you're replying to stated that it was covered by the LGPL.


Slapping an LGPL on something does not mean that the utilitarian aspects are covered by the license!

---

Yes, I know that comments are expressive. They are clearly not utilitarian.


Things like code comments are clearly expressive and not utilitarian (you don’t need comments to compile code and you can express the ideas of these comments with different verbiage without loss of efficiency).


For what it's worth I've found the links you've provided very interesting and insightful! It's bringing back bits and pieces of some of the software license training I've got at various jobs in the past.


> What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models?

A better future to me. I don't want pictures of my face training ML models, nor do I want my art, or my code. I don't want my face to be more recognizable to AI, and I don't want my work to contribute to the consolidation of power to a few big firms. And for what, what can ML models even bring me besides surveillance? Cool art? Text-to-speech?


If they hire photographers to take photos of people in public and use them for training, there’s no law stopping them really. Your only real hope would be to always walk around in a burqa.


There are laws covering that use case. It just depends on the country. Assuming your country's laws are the law everywhere is a bit of a fallacy.


Good point


IDK how generative models can really be used for surveillance?

Certainly facial recognition models etc. can be, but those would seem to be appropriately covered by the Google Books ruling dealing with discriminative models: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.... They've also been a thing for quite a while, so I think that cat escaped the bag a long time ago.

Remember clip art in Microsoft Word back in the day? Now there is an infinite supply of that. Stock images? Infinite supply. Solo filmmakers are going to have a much easier time creating their own films that can rival the best movie studios in the world. Any text anywhere will be read to you in any voice or voices you like, with tone and setting appropriate sound-effects. Smaller things will just be better too, noise cancellation on microphones? Easy and free. Image editing? Trivial to remove, relight, reposition, etc, etc, etc.

So many other things too. It's going to be magnificent. If you're not into it, then I guess to each their own, but I do think we are looking at something that can be a net good for everyone in the world, so long as it's available and cheap for everyone in the world.


Hear, hear. I've never empathized with the Luddites more than when discussing this.


what if a service could tell you everywhere your photo was on the Internet?


Such a service would be fantastic for stalkers, letting them bypass the usual difficulties in locating someone who has done their best to excise them from their life.


What if a service could scan surveillance videos and show everywhere you have been ever?

Who here is doing a startup to secure licensing rights to every company's surveillance camera videos, to make the AI/surveillance version of Equifax's Work Number? Maybe you offer to give them the surveillance system for free in return for the rights?


You mean google image search?


can’t identify your own face with GIS


>OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

I don't fucking care, I'm not in the business of competing with OpenAI or whatever. If you want to launch an AI startup but you can't, that's your fucking problem, not mine. I just don't want them violating the licenses of the open-source programs I have created.

> "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

It proves that copilot has the capacity to copy existing code without fulfilling the requirements of the license. I don't care if it's "cherrypicked", this shouldn't happen under any circumstances.

> I wish the conversations around generative AI could be...just better.

I wish that these people making all these complicated language-comprehension machine-learning systems could read the fucking license statement at the top of the file and copy that license statement along with the code. This ought to be a solvable problem. I'm pretty sure I could write a bash script that does it if M$ is looking to hire.
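To sketch that idea (purely hypothetical: the marker patterns, the 30-line window, and the file names below are my assumptions, not anything Copilot actually does), a few lines of shell can pull a license header off the top of a file so it could travel with any copied snippet:

```shell
#!/bin/sh
# Hypothetical sketch: extract common license markers from the top of a
# source file so they could be carried along with an emitted snippet.
# Assumes the header sits in the first 30 lines, as is conventional.
extract_license() {
  head -n 30 "$1" | grep -iE 'SPDX-License-Identifier|Copyright \(c\)|Licensed under'
}

# Demo on a throwaway file with a made-up header.
cat > /tmp/demo.c <<'EOF'
/* SPDX-License-Identifier: LGPL-2.1-or-later
 * Copyright (c) 2006, Example Author */
int cs_add(void) { return 0; }
EOF
extract_license /tmp/demo.c
```

A real system would obviously need to handle headers that don't use these exact markers, but the point stands that the easy cases are easy.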


The removal of the license from the code it learned from is the real tidbit that everyone needs to focus on. This is where the laundering comments come from.

The product would be useless if it prompted you with license approvals. They didn't care and removed them. They consciously decided to prioritize their paid-for product over the rights of their users. I'm amazed that MS's lawyers allowed it out the door. That's the even scarier part.


> What do people think the future looks like where publicly available resources on the Internet (art, code, etc) aren't fair use for training ML models? Where you have to opt into models or can opt out (and many wind up doing so)?

A future where technology is developed in accordance with longstanding law? Also, maybe a future where my copyrighted works are compensated for when they're being used to automate my job away? If the music industry can deal with royalties, maybe software can, too?

> OpenAI, Microsoft, Google, et al will STILL train such models that can do all the same things, but it will be much harder for non-industry-backed individuals to navigate the legal minefield where you must ensure you properly attribute your model outputs, only train on opt-in data, etc, etc. Surely no one really thinks that a court case against Microsoft/OpenAI (even if they lose) would stop CoPilot?

I'd expect injunctions against Microsoft/OpenAI from further training CoPilot with inappropriately-licensed code. I'd expect damages for all of the instances of copyrighted material that CoPilot regurgitates.

> Most of these complaints seem to be extremely emotional and cherry-picked. "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that), "look I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

How are these emotional? They're opinions, like all legal claims, supported by facts. It doesn't really matter if they're "cherry-picked" or not, only whether CoPilot actually violated copyright. If it gives seemingly novel results 999 times out of 1000, but in the other case verbatim generates copyrighted material without proper permission, then that is a copyright violation. At scale, that's a copyright violation with some modest damages even.


> If the music industry can deal with royalties, maybe software can, too?

Maybe this is not the industry to emulate: https://www.theatlantic.com/business/archive/2011/11/how-mus...


My 2 cents,

1. I think if you don't want your code re-used in CoPilot you should have that right

2. I think if CoPilot gets smart enough that it can read your open source code and then reproduce the algorithms without copying your code that should be fair use. It's the same thing a human would do. AFAIK CoPilot can not do that but I can certainly imagine it's not too many years away from that.

3. I think I would opt into sharing all my open source code mostly unrestricted with services like CoPilot. I think the group of people that choose to share their code with AI will do better over all than those that lock their code behind licenses.

Note that I'm referring to snippets of code. I don't know what a good definition of snippet is. In other words, if AI helps me write chunks of 10-100 lines at a time I don't see a problem. Effectively, S.O. answer level of snippets. Whereas, if I tell AI "create LibreOffice" and it clones the millions of lines of code, I think that is a problem. I don't know where the cut off is.


> 1. I think if you don't want your code re-used in CoPilot you should have that right

People already have that right - all you have to do is not host your code on GitHub.


Someone could still take your code if it's hosted elsewhere and put it up on GitHub, at which point it gets sucked into the blackbox that is Copilot


No, the only way is to make your code closed source.


What you describe is what Copilot already does most of the time.

The examples where people can show it reproducing snippets of code are more the exception than the rule. And they are usually produced by people who are trying to manipulate it into proving that it can reproduce copyrighted code.

Some people tend to think of it as a search engine, looking through its database for relevant snippets for the current situation and regurgitating them unmodified.

But that's really not what it's doing. It's more like the AI autocomplete on your phone, but for code.

It might not be able to understand the algorithms. But it seems to be able to adapt simple algorithms that it's seen multiple times in its training data to match the surrounding code (in style, naming conventions, and actually using the variables/functions you already have).

I don't have numbers, but from my experience, I would say it generates unique(ish), non-infringing code at least 95% of the time.

The only question is what to do about the other times when it does occasionally output potentially copyright infringing code, either by accident, or when it's forced.


Any law where the penalty is a fine only exists for the poor. Any regulation where the penalty is in the millions only exists for small businesses.


I guess that's true. If you consider laws to be strictly transactional then you can totally do the crime if you're willing to do the time.

I'm just not convinced by the idea that any penalty less than death isn't a penalty.


Lots of distance between small monetary fines and a death penalty. I would settle for executives and board members going to jail when they do crimes.


Any penalty less than the profit isn't an effective penalty and won't act as a deterrent.

The Securities and Exchange Commission (USA) has a history of giving million-dollar fines for crimes that produced billions in profit and/or took billions away from victims. And the lack of deterrence has been reflected in the actions of the US financial industry.


Penalties that scale off the offender's income/revenues work a lot better. They're common in some countries.


It's not popular to say, but I agree to some extent here too. We may need a wholesale reimagining of copyright/patents in many places to accept the new reality, both in building the tools (data to train on) and in accepting the occasional bad output (a copyrighted/patented function appears in the output). I think watching the law evolve with the tech is going to have a lot of ups and downs.


I think you hit the nail on the head. Our laws and rules were created for a cultural context that is quickly becoming outdated. I feel there are many valid criticisms of AI today, but demonizing the technology doesn't allow for fruitful discussions. We need to evolve our thinking and we need to be open minded to do so first.


What makes the rules outdated? The fact that you want to get away with what they were designed to prevent?


Yes, I like doing things that people prevent me from doing.


If they continue that path, the future will be that OpenAI, Microsoft, Google etc. will pay larger and larger fines at least in the EU, until they are blocked entirely.


Which may be entirely justifiable

Earlier HN thread today on a large chunk of OS code pasted almost verbatim by the CoPilot engine into a project, but stripped of any licensing references.

Within the last few days, another thread on artists who have spent decades developing a unique and valuable style are making parallel complaints about Dall-E/SD/etc., where inputting "Xyz in the style of [Artist]" produces exactly a copy of [Artist]'s unique style, barely distinguishable from the original.

These engines are fairly literally giant collage engines, able to parse language inputs and output a collage of the input works. Maybe some are small snippets so it could be fair use, but they are also evidently capable of outputs of a far larger scope, amounting to wholesale ripoff.

Opting out or not posting on Github or whatever prevents nothing, as stuff is posted everywhere by many, and with code, it's totally legit posting a fork under OS licensing.

Is there a solution analogous to a <NoRobots> flag? How do we verify it? Will there soon be HaveIBeenUsedAsTraining adversarial systems to probe these output engines?

Not sure of the solution, but this seems to be rather rapidly overstepping the boundaries of creators.
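For what it's worth, a <NoRobots>-style opt-out could look something like robots.txt does for crawlers. To be clear, the file name and directives below are entirely made up for illustration, and no training pipeline is known to honor any such convention today:

```
# /.ai-training.txt  (hypothetical, by analogy with robots.txt)
User-Agent: *            # applies to all training crawlers
Disallow-Training: /     # opt this whole site/repository out
Allow-Training: /docs/   # except material the author chooses to donate
```

The hard part isn't the format, it's enforcement and verification, which is exactly where the HaveIBeenUsedAsTraining-style probing would come in.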


Huh, I wonder if they decided to remove the licenses and other comments from code before training on it. That would almost be necessary to avoid comments ending up inside of code.


And the EU will continue to fall farther and farther behind in software development.


Yeah, regulation is definitely the only reason Europe is behind. /s


What are some other reasons?


If that’s what it takes to uphold EU citizens’ legal and moral rights, so be it. People said the GDPR would hinder business too.


While continuing to represent individual rights? That sounds like a good compromise to me.


> "People's legal rights are being violated!" (you definitely don't know that, no one knows that, the article is 100% right about that)

I'm not sure what the argument being made here is. If you make something opaque enough that no one can tell if it's violating legal rights, no one is allowed to say anything about it? This seems uncomfortably close to "it's only a crime if you get caught"


> "look, I prompted CoPilot for this piece of code that I already knew about and it spit it right out" (that's not how it's going to be used in practice).

So what? I’m not being snarky: does that actually make any difference, legally?


Well, what does the future where those materials are free use look like?

You argue that desiring ownership of works that you created is “extremely emotional and cherry-picked”, but you do not provide a compelling argument why artists, photographers, and indeed programmers should be excluded from the conversation when it is their art, photography and code that is being appropriated in the first place.


I think that future has a lot of really cool and cheap tools that will help all those artists, photographers, programmers, etc get even more out of what they love doing. I think there will wind up being some job markets that shrink (not with the tech we have now though) for small-medium things in all of these fields (think logo design, stock photos, client libraries, simple out-of-the-box applications) and my hope is that those job markets that shrink cause others to grow due to the increased levels of productivity that these tools will give us.

Ultimately my argument is that these tools will allow human beings to accomplish more things with less and that these tools should be distributed to as many people as possible for as little cost as possible. Part of that belief comes from the fact that I think these tools are coming no matter what and I'm slightly concerned about the potential (although unlikely-looking) future where a small number of large corporations are the only ones controlling these tools and they just rent-seek on them.


The job market for cheap rehashed garbage will grow and quality productions will suffer.

For programmers, the job market for loud-mouthed posers and plagiarizers will grow and quality will suffer. But those programmers will be fluent in marketing speak.


In a world where the cost of producing cheap rehashed garbage approaches 0 why would the job market for cheap rehashed garbage grow at the expense of the expensive unique gems market?

There will definitely be more cheap rehashed garbage online and we will be forced to invent tools to wade through it. I actually look at that as a bright side because there's already a lot of cheap rehashed garbage, we just don't have good tools for wading through it yet because it hasn't become completely intolerable yet.


It's a bit of a bait and switch (obviously not literally, these things didn't exist so nobody was ever promised they wouldn't be used).

But in terms of user behavior, it's rather the same. I used to make more stuff publicly available online than I do now, and the mass-scale surveillance and data modeling that big companies do off of publicly available stuff is a big part of that.

Generally that's how you get walled gardens - by abusing the commons - but here you'd need not just a walled garden but a TINY TINY invitation-only one if you don't want people doing mass surveillance and data modeling (CoPilot is really more of the latter than the former, but any of this "scrape the whole internet" stuff is just a tiny sidestep away from being used for more blatantly evil surveillance purposes: here we're training a generative model, there we're de-anonymizing everything you've written anywhere...).

Is there a good solution to "BigCos are gonna do whatever they want with the shit you make" other than invite-only, paid-content type models?


I'm expecting to see dual-licensing used as precedent here.

Github/OpenAI should have to pay a licensing fee to use GPL and similarly-licensed code in their closed-source derivative IP (CoPilot).


I think I agree with this comment[1] from the other thread; never previously thought that a process being transformative means input and output datatypes do not coincide, but maybe that is it.

1: https://news.ycombinator.com/item?id=33240681


That's an interesting take, the whole level of indirection thing between Microsoft-OpenAI and StabilityAI and that research group is certainly another dimension to this that sort of muddies the waters.


very interesting indeed


I foresee licenses that contain 'upon training an AI network with this code, you give us an irrevocable license to your source code and IP' clauses.

That's assuming they even get off the ground, given that their Copilot system is already suggesting vulnerabilities and license traps.


I personally expect the law to end up with a "safe harbor"-like situation. Consider YouTube. Occasional copyrighted content does not make YouTube illegal, or able to be held liable for slip-ups. See the DMCA as well, which requires takedowns upon notice, but otherwise provides near total indemnity for user-generated content.

Because of this, if the law looks at GitHub Copilot, I would expect that they would find Copilot to be A-OK despite the occasional regurgitation of copyrighted material that isn't fair-use, as long as it is removed upon request.


There are plenty of regulations that only apply to companies with more than X employees, etc. What makes you think any way to improve the law would necessarily harm individuals?


> that's not how it's going to be used in practice

If I'm writing some code and want the suggestion to be good then why wouldn't I use the name of a top programmer as a prompt?


I don't think that having emotional discussions around a technology that does no less than provoke human emotions on command is low quality.


It creates a body of knowledge that everyone can use and can't be sued over, since it would be the industry-standard way to do things.


> It creates a body of knowledge that everyone can use and can't be sued over, since it would be the industry-standard way to do things.

That's just not true. If the "industry standard way to do things" is to violate other peoples' copyright, then everyone doing that absolutely can be sued.

And while it's not clear if using these AI tools constitutes copyright infringement, it looks to me like there's at least a very strong case that could be made.

And at up to $30,000 per work in statutory damages (register your code with the copyright office if you care about this issue!), that starts to add up very quickly. Even for a company like Microsoft.


You could still train your AI on, for example, Wikimedia Commons and simply add the required license to the output of your model.


There does seem to be more heat and noise than substantive discussion here.

Though that doesn't justify such a dismissive attitude towards ordinary HN commenters. The way it's written implies that most are too stupid and overly emotional, which is more likely to fuel complaints than douse them.


You should read the article before commenting.

The WAY it was done with Copilot is the problem: no attribution, just shoving all legal liability onto the end "programmer" without providing the attribution required TO COMPLY WITH LICENSES, as the diligent programmer tries to clear all the code Copilot handed them without any metadata.

Go read the article before arguing further, please. Otherwise you are wasting all of our time.


I did read the article first. Start-to-finish. As others have pointed out, it's a very visually appealing website.

I hope that when a case on generative models hits the courts that it's found that training on data from the Internet counts as fair use. I hope that for the reasons I laid out in my comment, because I think that if it isn't then we are all in trouble since these tools will STILL EXIST, but they will be in the hands of the few instead of the many. My main reaction is to how short-sighted it seems like the authors and many others are being about this technology in general. They seem to think they can just wish it away.

I also think that training on data from the Internet is fair-use, but I'm not a lawyer and I haven't studied the law extensively, so who cares what I think about that.


The only waste of time in this thread is people making allegations about copyright infringement without applying a fair use test.


What's with the default to "if it's not explicitly legal, it must be illegal"?

Imagine if every new piece of software you wrote had to be tested for legality because you don't know that it's explicitly legal. Oh, there aren't laws for this new thing, so I guess you should challenge yourself all the way to the Supreme Court?

I get the author not liking Copilot, but I don't see that GitHub/Microsoft have any kind of obligation to figure this out just because they're GitHub/Microsoft.

If I as an individual had this obligation placed upon me I'd just never write any more code.

Ultimately I think, like open source, Copilot and the tools that will follow advance human progress in novel ways. Software getting easier to make is a good thing. If you don't like this particular implementation of something helpful, feel free to start an open source alternative without challenging yourself in the supreme court.


> What's with the default to "if it's not explicitly legal, it must be illegal"?

That's not how I interpret what's happening.

People who produce things have rights over their products. Be it artists, craftsmen, inventors, entrepreneurs or coders. There is a legitimate question here as to whether CoPilot has infringed upon those rights. I don't see it being about "making something illegal." I see it about answering a valid question as to whether CoPilot is liable for measurable damages caused to creators under existing laws.


A few snippets of code are not a product. If there was an open-source money-making product and someone built a competing product using considerable help from CoPilot, then that is a stronger case for damages than if someone just used some snippets of code in their own product.

But at that point, it would be just like someone cloning the GitHub code without following the license, and in that case it should become obvious that there is a clear violation harming the creators. But in most use cases of CoPilot, wherein people are just using it to build their own product, I doubt there is a cause for damages.


Music samples are a natural parallel.

You cannot sample music without permission no matter how short the sample may be.

Similarly you cannot steal a snippet of someone else's code without permission or the correct licensing.


Eh music sampling is a unique case because of the dual intellectual property concerns (the composition and the recording).

I can’t use a snippet from a recording no matter how short but I can use a tiny snippet of a composition. You can’t copyright a single note.


So code and mechanical music (i.e. a recording) aren't exactly the same, but code and music compositions are more similar. Can you use a tiny snippet from a composition and still infringe copyright? Yes, you can. Maybe not however short, but there will come a point.


> You cannot sample music without permission no matter how short the sample may be.

Which is a blatantly mistaken court ruling and one which I will not enforce if I am on a jury in such a trial.


I'm using the word "product" to mean "something which was produced." I searched for a few definitions because I thought they might actually be different forms of the same word. Turns out I could be wrong about that, but that was my intent. Something that you produce is a "product" of your time, effort, labour etc. Doesn't matter if it's something that you are selling or not. Doesn't matter if it's a relatively small production. The point is you produced it. It is yours.

The question is whether the courts will find damages. Everything else has nothing to do with my comment. You might have your own ideas and opinions, which is fine. So do I. Both are irrelevant. The point is that there is a legal question here that the courts alone are equipped to answer.


A few snippets of a book also aren't a product, and yet they can absolutely be infringing.


If someone bought a copy of an educational book like "Learning Go" and used some code snippets from it in their own product, that's fair use. But if someone released the book as their own product titled "Learning Go Better" then that is a clear violation.

For open-source projects, CoPilot is in the realm of fair-use for snippets but it can be mis-used just like Github can be mis-used if someone blatantly copies a repository.


Fair use only applies in certain contexts, of which writing commercial software is not one. 'Snippets' are not fair use, and it is shocking how many people here think they are.


People get snippets from Stack Overflow all the time and usually there is no concern whether it is properly attributed or not.

I'd argue that people who open-source code expect other people to learn from it in a small way of snippets and that constitutes fair-use.


Code in Stack Overflow answers is explicitly licensed under a permissive license:

https://stackoverflow.com/help/licensing

Copying code samples from a copyrighted work for use in a commercial product is not fair use.


>CoPilot is in the realm of fair-use for snippets but it can be mis-used just like Github can be mis-used if someone blatantly copies a repository.

Important difference is you don't know where your Copilot snippets come from.


If X and Y, then my point is valid!

But if either of those is not true, then the GENERAL point is true and your point is not.

---

Gais, gais, I downloaded the code using an automation tool called a browser, so it's fair use and not infringing!

yay for technicalities!


As software developers, we deal with specific points rather than generalities. Am I doing something illegal, or is it fair use? That's the important point to keep in mind, and I assume most of the Hacker News audience are software devs.


You also deal with the real world and real people, neither of which operate in binary.

Plenty of people over the years have attempted to skirt the law by putting a proxy in the middle and as such, plenty of clarifications have happened that it doesn't matter.

MS's stance here is specifically that it's the responsibility of the person/company using copilot to ensure the code isn't infringing, it is NOT their stance that the code itself is not infringing.

Their stance is that using it as TRAINING DATA is fair use, so they themselves hold no liability, only their users.

---

It's similar to chicken factories claiming they hold no liability for employing illegal immigrants because those illegal immigrants are the ones who chose to work there. And that they also hold no liability if they go through a 3rd party that exclusively hires illegal immigrants. The law very clearly refutes both stances.


It's better to know, even if you like Copilot and want it to continue.

> I get the author not liking Copilot, but I don't see that GitHub/Microsoft have any kind of obligation to figure this out just because they're GitHub/Microsoft.

Because it's a trillion-dollar company with an infinite amount of lawyers and legal resources?


> What's with the default to "if it's not explicitly legal, it must be illegal"?

Authors have explicitly and deliberately made it illegal for a person (or corporation) to do what Copilot is doing. Doing it through the legal non-entity of an AI changes absolutely nothing; it's still illegal. To say otherwise is to say that "AI-washing" can be used to nullify any law, which is of course totally absurd. The assumption you lead with is not what anyone is actually trying to argue.


Code on GitHub is "all rights reserved" by default, which means that it is illegal to copy by default. Only by adding a license does it get opened up and made legal to copy - though typically, only if you also include that license when you do so.

So if you trained yourself to only regurgitate github code with wanton abandon and careless disregard for licensing, then yeah, you're liable to violate that default copyright, and certainly going to be violating license rules if you're regurgitating large blocks from memory but not their accompanying licenses.

This is the system that GitHub and Microsoft participate in and willingly and purposely perpetuate. They benefit immensely from copyright law and the protection of their code. You can bet that they will damned well avoid letting Copilot anywhere near Windows' source, and they would very much enforce their copyright if Copilot were spitting that code out for the masses to use.


The licenses in question in this issue make it explicitly illegal for Copilot to reproduce their code.


I don't care what your license is, I'm going to use it and I'm going to claim fair use.

What's explicitly illegal about this?


> What's explicitly illegal about this?

The fact that the work is copyrighted, rights withheld in the absence of a license and limited with one. Also, you should know that willful infringement can carry 5x the statutory damages compared to accidental, and spurious claims of fair use would be distinctly unhelpful to your case. Just by posting that comment, you have probably compromised your position in any future copyright case you might be involved in, or you might even have invited one. I really recommend being more careful when anything legal is involved.


> What's with the default to "if it's not explicitly legal, it must be illegal"?

That's what "all rights reserved" means.


This struck me about the article: it's a really bad argument to fault Microsoft for not pointing to a law that explicitly makes Copilot legal. Something not being explicitly legal doesn't make it illegal, although I'm not sure how the courts will rule on Copilot.


If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion. There's no legal argument that would solely target one application of this technology, and you can't build generative AI using current ML tools without relying on a very large corpus of public data. All AI is built on free-riding[0].

While there is no US case law that explicitly says "training AI is fair use", the Second Circuit says that scanning books to make a search engine for them is. And the absolute worst interpretation of AI is that it's just a very well-compressed search engine index for its training set data[1]. So I'm not entirely sure if we can even thread the needle to only ban Copilot or AI training as a whole without also creating harmful precedent for search engines. Actual judges may try, I'm not sure if they'll succeed.

Internationally, the EU already legalized training AI on copyrighted works[2]. So if we do win against Copilot in court, all we've really done is shift AI research over to the EU where laws are already more favorable.

I fully agree that Microsoft is shoving too much liability onto their users, though. And this, again, also applies to all generative AI. My personal opinion with generative AI is that it's a nice curio, but not anywhere close to "production-ready", and Microsoft and OpenAI are trying to sell us on a lie that it's better than it really is.

[0] This also implies that all y'all playing around with image generators are just as much of a freeloader as Microsoft is.

[1] This viewpoint is also called "compressionism".

[2] This was part of the most recent EU Copyright Directive update - the one that added a de facto upload filtering requirement. It also added a copyright exception for museums and historical preservation.


> And the absolute worst interpretation of AI is that it's just a very well-compressed search engine index for its training set data[1].

I don't think this parallel makes sense, because a search engine links to copyrighted works, each of which is still governed by its original copyright, while these AIs create derivative works or reproduce the original works without even attribution.

Indeed, if an AI were just an index for the training set there would be less of a problem, because the origin of a work could be found and its license honored.


> If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion.

I genuinely think that all of them are. But that's not why I'm against them.

We've seen the effects that text and image generators have on the bottom segment of content generation (SEO pages). As the technology matures, it'll displace more and more, in both arts and engineering.


Personally, I draw the line at corporations profiting from the derived works. But if companies want to charge for tools that make these models easier to interact with, that seems pretty reasonable.


>If Copilot itself is infringing then so is GPT-3, DALL-E 2, NovelAI, and Stable Diffusion.

Not necessarily. Copilot is a special case because it is using licensed code and the model is a derived function. There is an interpretation where it needs to be open sourced.


Are the examples of stable diffusion exactly reproducing images from its training set?


My view is that Copilot is not stealing open-source code. It is learning from it just as a human reader would. People's disgust is based on the assimilation of what they thought was a human trait being machine derived from their work.

The Copilot service backed by an army of actual humans wouldn't be a story at all. Nor would anyone be angry if an individual offered coding skills as a service, having gone through the exercise of learning a great amount from open-source software to do so.

No open source license was written with this in mind. Because previously learning was something only humans could do and no one had issue with sharing that knowledge. Until licenses take machine learning use into account I see no problems with Copilot.

Source cannot be open if you restrict any viewing of it.


You aren't allowed to just read code and regurgitate it in order to claim it as your own. That is, just because you memorized a great new novel you read, it doesn't mean you can go sit down, hammer it out, and sell new copies. People go to great lengths (see: clean-room reverse engineering [1]) to try to wash themselves of liability.

[1] https://en.wikipedia.org/wiki/Clean_room_design


If the code was purely utilitarian in nature, such as something that was optimized for execution time, there is plenty of precedent stating that the code in question is not covered by copyright.

Do an internet search for “copyright utilitarian” and read up on it if you don’t believe me!

Copyright is about protecting artistic expression which is held in contrast to the useful nature of a work.


Note: In the US, this concept is explicitly in the Copyright Act:

"In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." (17 USC 102(b) [0]).

See also the "Useful Articles" doctrine. [1]

[0] https://www.law.cornell.edu/uscode/text/17/102

[1] https://en.wikipedia.org/wiki/Copyright_law_of_the_United_St...


If you think most people pay any attention to licenses or respect them you better think again. Snippets get copied verbatim with no regard to their source all the time. Licenses have no power and are routinely ignored.


The point is not whether the law is actually upheld. It's whether it is legal or not.


Mmm maybe rephrase that as “depending upon which entity’s copyright was violated”

Surely I don’t need to recite the last 50 years of tech legal precedent and case history for you to see that such a blanket generalization cannot be left unaddressed.

Litigants litigate


> It is learning from it just as a human reader would

I don't see how that invalidates the copyright/license argument. So, instead of just a straight-up license violation, it's a license violation via plagiarism.

That argument wouldn't hold up even if it were a human causing the violation. You can't just paraphrase someone's licensed work, then lie about having looked at it and pretend you made it yourself, which is basically what seems to happen with Copilot, as it doesn't automatically reproduce the license of the code it reproduces.


> You can't just paraphrase someone's licensed work

Yes you can. That's exactly why you paraphrased it instead of copying verbatim.

At the fringes, your transformation may not be enough to overcome the requirements, but that's an exception. Nearly all paraphrasing is legal by default.


I like you


It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.

The arguments against my point always assume perfect memory of everything this model has consumed. That is the plagiarism position. In reality, some patterns are more common than others and generate code that looks nearly identical. I can't speak to the reasons for this, as I'm not familiar with all of the methods. However, I don't assume that is the current working state or intent of Codex.


> It learns the same way a human does by learning patterns. It is not illegal to comprehend how to accomplish tasks by reading other people's source code.

It remains to be seen whether ML is true "learning" in the sense of developing a skill the way a human does over time.

It is however irrelevant to the manner in which this model operates today.


> People's disgust is based on the assimilation of what they thought was a human trait being machine derived from their work.

No, people's disgust is with Microsoft violating their legal privileges.

> The copilot service backed by an army of actual humans wouldn’t be a story at all.

Correct, it would be an open-and-shut lawsuit.


It isn't really learning if it's just regurgitating whole function bodies. I use Copilot a lot, and definitely see whole functions being spit out, that were presumably written by a person somewhere.


I also use Copilot a lot, and while it does suggest large function bodies, I'm not sure that it's "regurgitating" them (though it could be...I don't know). I suspect that it's seen so many function bodies that are similar that it generates another similar output. Like autocomplete in a word processor has seen so many similar chunks of text that it reproduces them based on past experience. I don't know this as a fact, of course. I'm just reacting to the word "regurgitating."


It regenerates the comments from the Quake III fast inverse square root function.
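For reference, this is the function in question, distinctive comments and all. A sketch rather than the verbatim `q_math.c` source: the original type-puns through a `long` pointer, which is undefined behavior in C (and the wrong width on 64-bit platforms), so this version substitutes `int32_t` and `memcpy`; the magic constant and the much-quoted comments are unchanged.

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root as popularized by the Quake III source.
 * Sketch: memcpy/int32_t replace the original's pointer cast so the
 * bit reinterpretation is well-defined; logic otherwise unchanged. */
float Q_rsqrt(float number)
{
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    memcpy(&i, &y, sizeof i);               // evil floating point bit level hacking
    i  = 0x5f3759df - (i >> 1);             // what the fuck?
    memcpy(&y, &i, sizeof y);
    y  = y * (threehalfs - (x2 * y * y));   // 1st iteration
    return y;
}
```

One Newton iteration leaves the result within roughly 0.2% of 1/sqrt(x), so even a single regurgitated comment like "what the fuck?" is an unmistakable fingerprint of this specific copyrighted file rather than independently derived code.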


Just did a quick github.com search on that function (with comments) and found around 131 matches. Many without a license. So yes, I believe that it would produce those comments...because it's seen humans repurpose and reuse that code without attribution or license many times.

Definitely an issue...but not as simple as copy and paste.


The fact you don't know is the problem.

I can't use co-pilot because if I am stealing someone else's copyrighted code I'm in trouble from a legal standpoint.


I definitely agree that we should find out. I've used Copilot almost since its inception, and I've seen nothing like a large copied/pasted function. If anything, it's mostly a single/double line autocomplete based on what I would have written anyway.


The Luddite reaction to Copilot is very hilarious to me. It seems to be a great way to identify low-talent coders, because who else would possibly feel so threatened by an AI?… Watching HN commenters suddenly become ardent defenders of copyright is quite the sight.


Caring about licenses, fair use, and copyright has been deeply ingrained in the open source and hacker community for literally decades.


Caring about defending peoples right to fair use has been. Which is the exact opposite of the Luddite reaction to copilot.


I see your statement as an inversion of consensus reality. What actual coder would use copilot? A beginner or dabbler.

I predict your attempt at tactically “managing” this copilot scandal will not play well on HN to experienced coders, your Microsoft colleagues chiming in next claiming it boosts their productivity notwithstanding.

Yes, I do indeed suggest astroturfing afoot.


I tried it out, but I don’t use it at all on a day-to-day basis. I have no idea if it boosts productivity or not. I also have no idea how you came up with the claim that it’s only for beginners, I’m presuming you just made this up? All of the people that I’ve noticed talking publicly about their experiences using it are highly experienced engineers.

I just think it’s hilarious how fair use is so widely supported on HN when it comes to music, or videos, or interface names, but all of a sudden is a moral crisis when it appears to threaten the value of HN member’s labor.


Goodbye and good riddance. Even just the idea that GitHub should be allowed to train their proprietary AI on other people's work is insane. Much less distribute that AI in a paid package which lets you spit out other people's code verbatim. Anyone who supports open-source and the (ab)use of copyright law to create free works should be vehemently opposed to Copilot.


The fact that it's the first major development started at GitHub after the Microsoft acquisition is such a blow, too. Way to spend their social capital. I can't imagine the money they made from Copilot subscriptions was worth it, since companies have certainly stayed away from this...


> Even just the idea that GitHub should be allowed to train their proprietary AI on other people's work is insane.

You explicitly agree to this when you upload code to GitHub.

FOSS folks shouldn’t have sold their soul to the proprietary devil but they did and now they have to deal with it.


You meant to say "implicitly"? Even then, no, the terms are much more specific.


How can you explicitly agree when anyone can upload your code to github?


I don't understand why GitHub decided to run the project this way. It's a great idea, but they messed the whole thing up. They could have made it opt-in from the very beginning and asked people to waive their rights, and I'm sure lots of people and lots of big projects would still have been interested in joining the initiative. They could have rewarded participants with, say, 3 years of Copilot access after the official launch, and people would have loved that. But instead they just took code without asking or attribution and keep pushing it, and now we are in this situation.


Everything else aside, the design on this site is among the best I've ever seen. Amazing typography, great to read on a phone.


He wrote the book on it. https://practicaltypography.com/


I think it's very hard to skim for some reason.


With this site you see about 50-100 words on a large mobile screen. On HN you see 2-4x that.

Also, the section breaks and headers and boxes lack obvious rhyme or reason. It scans a tiny bit like a classy version of Time Cube. You keep getting hit with different font sizes and font styles and lines and ribbons and colors and you're not quite sure why.


For me I gave up on reading it almost immediately despite being interested in the topic because the damn typography was too exhausting.


Eh, I think the line height is too cramped, and the hyphenation makes it harder to read on screen. Hyphens are good if you're trying to save ink or pages in a novel, but screen real estate is more than free on the internet.


Interesting. For me it was so slow to scroll I had to use archive.ph to read it :-/

This is on a powerful PC with a state-of-the-art graphics card.


That was my first thought as well. Perfect font sizing, clean & elegant design.


Can you elaborate on what makes it so? Changing font sizes, boldness, lines etc...


For me it was the 'magazine' style with proper breaks and editing, formatted for a vertical screen but still reads great on my laptop. The effort in the content, links and emphasis make it feel like journalism I would normally get paywalled on.


Although I'm aware that this tool is a boon to many, particularly those with impediments like RSI, I still have to echo what a number of other comments say: There really is a very large proportion of adult software developers in the market who are simply too young to have lived through the EEE Microsoft era. Add on to that the proportion of old-enough Microsoft-brand "dotnetter" software developers who simply don't care as long as they get to sit comfortably within C#, Visual Studio and Azure.

After that, what are you left with? A proportion of developers small enough (and Microsoft evidently thinks so) who don't know, and/or don't care, and/or don't have the time to fight the Embrace-and-Extend phase of their takeover of GitHub.

One could argue that the purchase of GitHub was Embrace, and that their involvement with OpenAI, Codex, and the potentially illegal use of OSS (subject to the legal investigations) is Extend.

It's my own personal view that Microsoft held back the progress of software development by probably a decade or so with their shady commingling with academia, their blatant crippling of C#/.NET to sell Visual Studio, and so on. So I am, along with many, upset to see a business like this EEE their way into OSS, something which is dear and special to so many.

In the end, and I must state in my own opinion (since there is an element of speculation here), I am just pleased that there are still people out there who are not letting Microsoft continue their old ways.


If this is MS trying to pull off EEE, what does the Extinguish phase look like? That they try to make it so that any codebase that uses Copilot is owned by them, and that there's no way to turn it off because no other editors or code hosting sites will exist? Plausible, I suppose, if they play the game for several decades and somehow no one else produces any innovation in the space.


Extinguish might look like Microsoft or their customers/partners writing proprietary replacements for open source products with the help of copilot. I don't know how likely this is, but what co-pilot provides is a plausible path for leveraging open source code to create closed-source products. Over time this allows the proprietary software industry to contribute back less code while still benefitting enormously.


That would mean that Copilot is, at least in large part, a front: a false-flag operation to test the legal system's tolerance and determine whether they can get away with what they are doing, right?


I'm not too sure yet, but I wouldn't be surprised if we wake up one day, and just like how it went with Facebook buying oculus, we will all of a sudden require some "microsoft account" to log into Github. Then more layers, and more, until there is nothing left.

Either that, or you wake up one day to see that Microsoft has stolen your open source software and Microsoft says "but muh AI".


It's a new approach compared to Amazon taking open-source code and building AWS services that kill off attempts at self-funding (dual licensing/support services) by the people who made it. I hope it's not as successful. Amazon at least abided by the letter of the licenses.

>> "I'm not too sure yet, but I wouldn't be surprised if we wake up one day, and just like how it went with Facebook buying oculus, we will all of a sudden require some "microsoft account" to log into Github."

See: Minecraft. It was already a goldmine when they bought it, but they built it into an even bigger one before forcing millions to have a foot into their ecosystem. Copilot might be their way of making everyone dependent on GitHub before "moving on" from git and offering a Community Edition of their own source control system.

It'll be easy. A lot of people hate Git.


Embrace: How nice that we don't have to worry about GitHub closing for lack of funds. Thanks, Microsoft!

Extend: Make people dependent on GitHub Copilot. Require a Microsoft account (soon).

Extinguish: Sunset git and transition to Microsoft's own source control system.


I think it's important to realize the exact implications here:

- MS absolutely has the authority to copy, use, and even train their models on your GPL-licensed code, because you agreed to let them do that when you signed their EULA when you decided to host your code on GitHub.

- This authority does not extend to CoPilot users, who cannot republish your GPL-licensed code without respecting the license. But remember that people have always had the ability (not authority) to copy and use open source code in violation of the license. This simply makes it embarrassingly easy for a person to do so unknowingly (although, legally, this would probably be considered negligence, not ignorance).

IANAL but I wonder if the extreme facilitation of copyright infringement here could be considered gross negligence on the part of MS, as they're almost entrapping their own customers in a minefield of copyright concerns. Can't wait to find out.

The logical next step in this arms race is for the GPL camp to build tools to automatically search for copyright infringement in large codebases. Copyright holders could set up hotlines for insiders to blow the whistle on infringement in exchange for compensation, since AFAICT all litigation precedent in the US has so far resulted in settlement.
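As a rough sketch of what such an infringement scanner could look like (purely hypothetical; the function names and the example snippets are mine, not from any real product), classic code-clone detectors fingerprint overlapping token k-grams of normalized source and compare the hash sets:

```python
# Hypothetical sketch of a clone scanner: hash every k-gram of
# normalized tokens and measure Jaccard overlap of the hash sets.
# Real tools add winnowing and indexing on top of this basic idea.
import hashlib
import re

def fingerprints(source: str, k: int = 5) -> set[str]:
    """Hash every k-gram of normalized tokens in `source`."""
    # Identifiers become single tokens; every other non-space char is its own token,
    # so whitespace and formatting differences vanish.
    tokens = re.findall(r"[A-Za-z_]\w*|\S", source.lower())
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(max(len(tokens) - k + 1, 1))
    }

def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two fingerprint sets (0.0 to 1.0)."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

gpl_snippet = "for (i = 0; i < n; i++) {\n    total += weights[i] * values[i];\n}"
suspect = "for(i=0;i<n;i++){total+=weights[i]*values[i];}"
print(similarity(gpl_snippet, suspect))  # 1.0: reformatting doesn't hide the copy
```

Token-level matching catches reformatted copies, though simple identifier renames already degrade it, which is why real detectors work harder than this.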


> MS absolutely has the authority to copy, use, and even train their models on your GPL-licensed code, because you agreed to let them do that when you signed their EULA when you decided to host your code on GitHub.

What about GPL code which you don't own, but post to Github, Like the gcc mirror repo?


read the terms of service.

you must have the right to publish the code you put on github.com, and by publishing to github.com, you assert that you have the rights to do so. you also grant GitHub the right to show that code to others, no matter what license your code is licensed under.

why does no one read the terms of service or license agreements? these questions are answered there and this "copilot is stealing" stuff won't even make it to court.


Even if it is true that the ToS that no one reads allows GitHub to blatantly violate the license of code that anyone could choose to upload to GitHub without the copyright holder's consent, people are going to do it anyway en masse, and that code is going to get eaten by Copilot. Even if copyright holders constantly play the game of reporting public repos to GitHub for removal, it's not going to be enough.

>this "copilot is stealing" stuff won't even make it to court

IANAL but I am heavily skeptical of your confidence here.


> allows GitHub to blatantly violate the license of your code

it doesn't allow GitHub to violate any license; it explicitly grants GitHub an additional license on top of the license you choose for your code.

the only way to revoke this license grant to GitHub is to remove your code from github.com.

> IANAL but I am heavily skeptical of your confidence here.

ok. it's all spelled out in the terms of use. I'll ask you this, though; who do you think Terms of Usage/Service documents are intended to protect?


Please. If someone uploads a pirate copy of a movie to GitHub, GitHub doesn't get an additional license to anything, on top of whatever license the uploader chose.

Maybe GitHub is saying it's not their fault and that they were misinformed.

Those types of arguments usually don't get all that far with copyright violations.


if someone uploads a pirated movie to github, github won’t get the license if it goes to court, but they won’t get in trouble for the assumption that they had the license, either. the uploader has both pirated and committed wire fraud as far as the github terms of service agreement is concerned.

either way, github is absolved. if a DMCA claim is filed on the movie, then that gets quarantined and removed from the list of things that they can show their users.

the user that uploads stuff to github.com attests that they have the right to upload it. by being uploaded, github can assume that it has the rights it asks of users unless and until they are told that they do not have those rights. so, github are covered until they are formally told that they are not, at which point they must restrict access to that data to themselves and others, which they regularly do.

these types of arguments do indeed work very well if github reacts promptly when they are told that they are using rights they do not have. github is not primarily used for piracy, like thepiratebay, and thus has a valid claim that they were lied to by a user. this is when the user who uploaded gets involved legally if the true copyright holder chooses to involve them.

you being mad at github, microsoft, or me doesn’t make any of us wrong.


The confidence that it won't go to court is misplaced. It will go to court, but the people saying it's obviously infringing are wrong. And even if it is, Microsoft can just change their terms to make Copilot compliant.

> Even if copyright holders are constantly playing the game of reporting public repos to GitHub to remove it's not going to be enough

There is no copyright police outside of criminal infringement.

The DMCA gives you the tools to protect your copyright, if 1000 people infringe on your copyright you need to be ready to sue 1000 people OR attack the platform under safe harbor.

If someone uploads your code to github, you need to find it and ask them to remove it.

If someone uses your code via copilot, infringingly, you need to find it and ask them to remove it.

This is the law as it currently stands.


Okay, but what gives them the right to remove the license from my code when redistributing it?


Ironically, such tools already exist, and most (all?) large software companies use them routinely on all the code that they ship.


those license tools already exist. they're not free, though, so they are mostly invisible and unknown to GPL types.


Unfortunately the tone I'm getting from many of these comments makes me feel that people see open source projects as a resource to be mined rather than as a product to be respected. A very entitled attitude ("I really don't want to lose my lovely tool")

There seems to be -- on the whole -- little respect for the spirit of the GPL and LGPL and it really is quite a change from, say, 20 years ago, when the 'free software' movement was I think more ascendant.

I think we have a generation of software developers who have only known a world where copious quantities of high quality source code has been made available to them under very liberal licenses -- which they in turn make careers and companies out of using / exploiting.

I, too, do this, and I generally open my modest projects under Apache or MIT or Mozilla style licenses. I do this because I want people to use my things, or to be able to use them as resume / portfolio material. Or because my employer at the time has helped fund construction of them.

But I also occasionally use the GPL/LGPL/AGPL, when I want to explicitly avoid corporate entities from exploiting said material without either consulting with me or in turn making their efforts free.

And in turn, I respect the value and power of the GPL for that purpose.

So many of the comments here are trivializing the value of free software and the licenses which make it possible, and acting like there's just this... natural right... to go out there and build on other people's work without recognition / compensation / contribution.

There are too many examples of CoPilot violating the spirit -- if not the actual legal letter -- of the GPL. This is unacceptable. I'm glad that someone is attempting a legal test.

Free software is not your data to mine. It is the blood sweat and tears of thousands of developers who do their work in community spirit, but under explicitly free software principles.

Putting something out under a free software copyleft-style license is not the same as saying "You can do with this what you want." It's "I made this, you can build on it, but what you made also has to be free. Or you negotiate with me."

And what I'm getting from the whole CoPilot fiasco is: GPL / free software does not belong on GitHub. And it might end up having to be put, generally, behind barriers that explicitly (technically and legally) prevent CoPilot & similar systems from getting access to it.

EDIT: I also fully expect a new version of the GPL to be published that includes clauses against this kind of datamining.


Agreed. Something else apparent from the comments is that people seem to think that some things are copyrighted, and some aren't, and that copilot would be better if it could avoid copyrighted code. Actually, essentially ALL code is copyrighted, and was so the moment it was written, and someone owns that copyright [1]. People only start noticing copyright when the terms of how that copyrighted content is licensed affects them. I think people resent reciprocal licenses like LGPL/GPL because the principle of "share and share alike" that they implement comes with real responsibilities and consequences for the user of the code, while they believe that non-reciprocal licenses (BSD) can be ignored with less serious consequences.

But the show-stopping problem is that copilot is sometimes producing code that is more than fair use of other code, and is unable to attribute that code or identify how it is licensed. It is copilot's (Microsoft's) fault that it auto-generates legal minefields, not the fault of the person who made an informed decision about licensing their own code.

In spite of the likely downvoting, I'll say that people should be grateful for reciprocal licenses not just because they were and are the foundation of free software (as you point out), but because they shine a light on what it means to license code, and how we are forced to revisit the difference between copyright and licensing when a reciprocal license is violated.

[1] https://en.wikipedia.org/wiki/Berne_Convention


Copyright is our current framework for rewarding creativity and encouraging innovation. It fundamentally depends on the assumption that protection of creative works is possible and feasible. That assumption is what's under assault by copilot and AI systems in general.

It's easy to forget that protection of creative works is only a means to end, not the goal or ideal state.

I believe AI systems will be able to help us build a new system that can track attribution of ideas and identify predecessor works from derivative products. This attribution could then form the foundation of a reward system. This is just one possible future.


What do you expect when people grant GitHub an extra license to their repos[0]?

0: https://docs.github.com/en/site-policy/github-terms/github-t...


This can't possibly hold up in court, can it? I have multiple repos that are mirrors of various open source projects. Some are not even Free Software! I have no right to grant such things to anyone let alone GitHub.


This particular licensing term doesn't seem relevant to Copilot.


My code is not on GitHub. If it's there, then someone copied it, and GitHub has no right to claim an extra license to that code.


>But how will you feel if Copi­lot erases your open-source com­mu­nity?

Jesus Christ, dramatic much? Are people that stumble upon a piece of code while googling how to do something, and end up copying and pasting the code from the repo, really building the open source community? Because that's essentially what this is. Whether I use Copilot to generate a tedious function or copy it from your open source repo, I'm on the same level of membership in your open source community.

This whole thing feels like artists screaming how AI generated art is horrible, trying to figure out how to sabotage it, or how to start lawsuits - just because their value went down just a bit. Same thing with developers.


Couldn't agree more... It's very depressing that this post is popular, wouldn't want Copilot shut down over some drama queen lawyers that have no connection to the reality of software development and ALL creative fields. Creation requires influence: https://www.youtube.com/watch?v=nJPERZDfyWc&feature=emb_titl...

The entire fucking concept of intellectual property and copyright is flawed from the get-go. The issue people are wrestling with beneath the surface is not copyright but the monetary system itself, which incentivizes this "chisel off one another" and "MINE!" behavior. Otherwise, how will you survive if you can't monetize your actions? But intelligent socioeconomic alternatives exist: https://www.youtube.com/watch?v=lBIdk-fgCeQ

People are trying to solve this problem in an ass-backwards way. Either move to universal basic income or a resource-based economy and make all ideas free, copyable, and remixable. Either way it doesn't matter, since you have access to some resources (under UBI) or all resources for free (in a resource-based economy) and don't need to monetize anything. Instead, people are content with making life shittier.

"We stand on the shoulders of giants" said Newton, but oh no.. this piece of paper called 'the law' knows better!


> The entire fucking concept of intellectual property and copyright is flawed from the get go.

Many people are upset because Microsoft is hiding behind copyright and lawyers to enforce it, while at the same time ignoring the concept of intellectual property when it comes to smaller players. I'd imagine that if Microsoft removed copyright on all their code and released it and Copilot as open source, there would be much less outrage.

The issue here, for me at least, isn't centered around copyright as a concept; it's about the asymmetry of the situation. Microsoft is exploiting those without any recourse in order to sell a product.

Your ideas about a new economy and no copyright are interesting, but they will not happen any time soon. In the meantime, in reality, Microsoft is making millions based on an enormous pile of community code while not offering their code back to that community.


This I can very much understand. I agree that this would've been the way to at least partially settle this. But then people will say "What about all the hard work of the employees / teams at Microsoft/OpenAI? Should they not get a return on their investment of time & money?"

In that case, something like the capped profit model OpenAI has (but with less profits) could work. They decide "Okay after we've reached this amount of money for Copilot, we'll both profit and have enough money to sustain it as a service until the next technological breakthrough makes this obsolete"

Then just make it free for everyone forever.


Totally agree. The fact is that this product assists developers (both experienced and inexperienced) by increasing productivity and aiding in the generation of new code, products, and services. This product is fantastic in my view. Attempts at baiting the service into outputting potentially copyright-violating code (which is synthesised because it is copied thousands and thousands of times in public repositories) only show how desperate those claims are.


Doesn't really explain how co-pilot is stealing your community. I've used co-pilot and it works great until you are past boilerplate; then it falls apart.


Agree, the argument seems pretty threadbare. Millions of programmers use open-source software every day and incorporate it into their own projects without ever engaging the authors of the code upon which they rely.

Perhaps the author means that there’s a possibility that the programmer to whom the code was suggested won’t necessarily know its provenance and how to engage the community from whence it came. If so, that’s a stronger argument, but I don’t know that it’s the best one they can make.


It's just basically a search engine, but the context is the rest of your code.

Writing software by just auto-completing from CoPilot would be like trying to write a whole novel with the predictive text on your phone. You could do it, but the results would be nonsensical and full of semantic errors. I don't think there's any real 'provenance' at play in either case.
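To make the phone-keyboard analogy concrete, here's a toy greedy bigram predictor (a made-up illustration, nothing to do with Copilot's actual model): always choosing the single most frequent next word degenerates into a loop almost immediately.

```python
# Hypothetical toy "autocomplete": train bigram counts on two sentences,
# then greedily pick the most frequent successor of the last word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the cat ."
words = corpus.split()

# Count which word follows which.
nexts: dict[str, Counter] = defaultdict(Counter)
for prev, cur in zip(words, words[1:]):
    nexts[prev][cur] += 1

def autocomplete(start: str, length: int = 10) -> str:
    out = [start]
    for _ in range(length - 1):
        followers = nexts.get(out[-1])
        if not followers:
            break  # dead end: no observed successor
        out.append(followers.most_common(1)[0][0])  # greedy choice
    return " ".join(out)

print(autocomplete("the"))  # "the cat sat on the cat sat on the cat"
```

The greedy chain collapses into "cat sat on the" forever: locally plausible, globally meaningless, which is the failure mode being described.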

This guy is a literal "who?" who is really overestimating the value of his open source contributions relative to a general development tool that can reduce the cognitive load of working in some hairy code bases.

I don't think it's CoPilot that's 'erasing' his Racket community.


> Over time, this process will starve these com­mu­ni­ties. User atten­tion and engage­ment will be shifted into the walled gar­den of Copi­lot and away from the open-source projects them­selves

The author seems to be implying that since Copilot can reproduce the code of open source repository X in certain scenarios there'd be no reason for programmers to learn/use/engage with repository X. But this is silly. Maybe some open source repositories could be tab completed with a little prompting but people will presumably choose to add a dependency instead of tab completing the code of express or something.


It strips GPL, or any license.


It strips the (mandatory, in a lot of cases) licence text. But the licence still applies.

(Or I guess more technically, the original author's copyright still applies, and the rights granted to use the work under the license, as an exception to the strict limitations of copyright, do not apply...)


It also doesn't make any sense. Copilot suggesting the signature of a function from some library to me is not actually the same as executing that library. That library still needs to be downloaded onto my computer to be executed. And who will write new features for a library if not the people who are interested in it?


Open-source licenses have terms that apply to the source code, not "execution", so I really don't understand your point?


They don't really explain in a satisfactory way how Copilot is "stealing communities" even though they themselves complain that Microsoft hasn't provided "solid legal references".

Copilot is an AI stunt, an exploration, trying something new and exciting with very mixed and not-so-useful results.

This lawsuit, however, is just lawyers doing what they do for fun. I guess the retained Microsoft lawyers love it too. Glad to see lawyers having so much fun and profit. But we would all be better off without so much lawyering; can't they do something more worthwhile?


I’d guess it is basically: you’re making a mural people can look at for free, but a business is selling canvas reproductions of it without attribution.

If I solve a problem and I say: Sure, use my code if you want, but be sure to contribute any improvement back to the community - I wouldn’t be happy seeing a tool spewing it out everywhere. And it’s not even free. In this case, microsoft is literally making money on the back of millions of programmers. And without approval.


You are willfully ignoring that the article absolutely DID address this point!

The license and attribution are stripped from regurgitated code snippets copied from code projects. If people don’t know which project the code was taken from, how can they one day contribute to that codebase? If the code projects on GitHub are not at least making the people who use their code aware of the project, that project disappears. Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


The community argument is weak but it’s just content marketing, not a thesis. The goal of the article is to generate leads for potential members of a class action lawsuit.


"It's not a problem that it steals code because it doesn't work that well right now" is not a valid argument, don't pretend that AI doesn't advance on a daily basis.

Reminds me of people who defend AI art with "you can tell it apart from real art" yeah no, at the rate we're moving you won't be able to tell at all in a year or two.


I agree with your starting point, but art actually makes the questions clearer, I think. Having AI art "you can't tell apart" is possibly a bad thing for human producers of art (or not; obviously lots of artists don't make their art just to get paid, but if you do...), but it seems like a great thing for the human consumers of art. Copilot/Tabnine etc. could certainly give you terrible suggestions, but my experience is they often help with boring scaffolding stuff which isn't likely to be full of bad ideas, just time savings. Could they mess you up? Oh, for sure. But mostly they just seem like an autocorrect that doesn't guess wrong every other time.


It doesn't, it's just a very easy to relate to argument.

Generating snippets of code has nothing to do with a fully functional software package/product/service and an organic community around it.

One might argue that a community could be more easily formed thanks to co-pilot, because it increases developer productivity and lowers the effort to contribute, so OSS projects actually benefit from co-pilot. If this sounds far-fetched, then the first claim is probably similarly far-fetched.


IF they attributed the code snippet to the project they ripped it off from!!

But they deliberately don’t tell you that… it’s just a code snippet floating in space, as if they invented it


The license and attribution are stripped from regurgitated copied code snippets from code projects.

If the people don’t know which project the code was taken from, how can they one day contribute to that codebase?

If the code projects on GitHub are not getting the people who use their code at least aware of the project, that project disappears.

Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


Does GitHub not have the right to view and train from your content when you agree to their Terms of Service and upload your code?

People are conflating their open source license with the one they give GitHub when making a GitHub account, but they are two entirely separate and parallel licenses. The former is for other people to use your code, the latter is for GitHub to host your code.

If you don't like it, you are free to host your code on your own servers.

And anyway, as noted the other day about AI, it is often funny to see people not care about (or even enjoy) AI in fields they don't work in, but when it comes to their own field, they are suddenly very worried. See programmers on HN who argue for Stable Diffusion but against Copilot, and vice versa with artists on Twitter. As I commented then, it's an act of cowardice to think our own profession should be immune from AI while we enjoy the fruits of AI in other fields [0]:

> Yes, many of us will turn into cowards when automation starts to touch our work, but that would not prove this sentiment incorrect - only that we're cowards.

>> Dude. What the hell kind of anti-life philosophy are you subscribing to that calls "being unhappy about people trying to automate an entire field of human behavior" being a "coward". Geez.

>>> Because automation is generally good, but making an exemption for specific cases of automation that personally inconvenience you is rooted is cowardice/selfishness. Similar to NIMBYism.

We should want AI. That we then try to use outdated models like copyright to enforce holding back human progress is a true shame. In my view, so what if GitHub uses people's code for training data, we are all getting a better product because of that.

[0] https://news.ycombinator.com/item?id=33226515#33228948


There are quite a few projects that didn't originate on GitHub. Some are mirrors of projects hosted elsewhere, some accept patches through other means, some include code that predates GitHub. If you get your Linux kernel patch accepted by emailing it to the responsible maintainer, it will end up on https://github.com/torvalds/linux. But you never agreed to the GitHub ToS; all you did was agree to publish it under the GPLv2. Linus agreed to the GitHub ToS, but he can't give away rights he doesn't have, so he can't be giving GitHub any rights to your patches that go beyond the GPL.


The current GitHub terms of service don't seem to mention this use when they describe the license granted to GitHub.

https://docs.github.com/en/site-policy/github-terms/github-t...

4. License Grant to Us

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


That mentions everything: Parsing the content, showing it to/sharing it with other users, using it to improve and provide the service. GitHub and all of its features are "the service".


True, but it doesn't mention doing so without the attribution that might be required by the code's licence. If full attribution of where the suggestion was derived from were included¹, there would be no issue IMO²; it is this gap that creates the grey area from which these discussions arise.

--

[1] the practicality³ of this is a different, though related, discussion

[2] because the user is fully informed and can take responsibility for the decision to use the suggestion or not

[3] or impossibility – given the code could be added by someone who doesn't include that attribution/licence information for the system to be able to pass on even if it were designed to


The terms of service are completely independent of the code's license. The code could say "no one but I may use this", but by using GitHub you give them rights to do everything stated in the Terms of Service.


But if the terms of service say nothing in contravention of your licence choice when you agree to them, and then the service does something that you consider to be in contravention of your licence choice, what you have is one party unilaterally changing the agreement. Of course the exact legal meaning of the terms, and any perceived change in them, could and will be debated long, hard, and potentially expensively…

I'll stick to self-hosting instead of using services like GH. Keeps things a little more simple in that regard.


The license agreement is irrelevant. Literally it does not come into play here. Github is not bound by the license; they are bound by the terms of service. The code is co-licensed: once however you declare it, once to Github independently.


Sharing with the license intact. If GH is sharing with the license and attribution stripped, then just punting IP vetting to Copilot users, it seems to exceed their rights.


Why do people refuse to have even a pre-high-school level of understanding of licensing? By uploading your code to GitHub you are granting them their own license to the code under their terms. Your LICENSE file has absolutely nothing to do with it. Your LICENSE file could say "everyone but GitHub" and it wouldn't matter one jot, because that's not the license you licensed it to them under.

And if you didn't have the rights to grant the licenses to Github? Then you are in violation of the copyright holder's rights, not Github.

The only remotely plausible, yes-I-have-graduated-fifth-grade argument against Github is that they ought to and certainly do know that huge portions of their users are in fact granting them licenses without the necessary authority. That's an interesting argument we should be having, and instead we're having this inane screaming match by people who have no clue what they're talking about while some of us are sitting here going WTF is wrong with you?


Calm down. No need for ad hominem attacks.

I am discussing the license grant GH includes in its terms. And that doesn't appear to give them a blank check to do anything they want with code those users have uploaded. Certainly not sell it piecemeal.


IANAL, but it's pretty clear that GH explicitly says they will NOT distribute the code. I'm not sure what else you'd call offering to copy a section of code.

"It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service."

Throughout the license, the Content is treated as an indivisible unit, and it specifically refers to the forking functionality. Notice that forking forks an entire repository, licenses included, etc. You can't fork a single file, and you can't fork a region of a file. Whole-repository forking is the kind of forking GH provides.

Copilot is fine-grained forking.

No significant software company is going to permit copilot to be used and potentially poison their code base in unknown ways, now that this kind of copying is in the open and is clearly a significant danger.

Somebody like Black Duck is going to make a lot of money for trial attorneys by tracing how code was created and finding the "hits". That will be joined with log data indicating who used copilot, when they used it, and exactly what copilot presented as the "hit". This entire process will be performed recursively on the "hit", together with classic source analysis, to find out where something is really from.
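The mechanics of such a provenance scan would presumably resemble the winnowing fingerprint scheme that plagiarism detectors use. A toy sketch, assuming nothing about Black Duck's actual method (all names and parameters here are invented):

```python
import hashlib

def fingerprints(code, k=8, window=4):
    """Winnowing-style fingerprints: hash every k-gram of the
    normalized token stream, then keep the minimum hash in each
    sliding window. Matching fingerprints between a codebase and a
    corpus of licensed code flag candidate "hits" for human review."""
    tokens = code.split()  # crude normalization; real tools use lexers
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [int(hashlib.sha1(g.encode()).hexdigest(), 16) % (1 << 32)
              for g in grams]
    picked = set()
    for i in range(len(hashes) - window + 1):
        w = hashes[i:i + window]
        picked.add((i + w.index(min(w)), min(w)))
    return {h for _, h in picked}

def overlap(a, b):
    """Fraction of a's fingerprints also present in b."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa) if fa else 0.0
```

The recursive tracing described above then amounts to running this overlap score against each candidate ancestor of a "hit" until the origin is found.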

The bigger companies are really, really serious about not copying outside code except under really strict conditions -- these conditions mostly look like "no you may not, unless you have one of these specific situations". It's no-by-default, even when it looks like it could be a yes.


"outside of our provision of the Service"

You've ignored the important words. Copilot is part of the Service.


Once someone uses Copilot, they now have code fragments outside GH, stripped of attribution and license. Looks like GH is trying to say these users need to do their own IP vetting, which seems very impractical for anyone, even the creators of GH Copilot.


There's nothing about "license intact" in those clauses. GitHub is able to do whatever it wants with the data; any users of the service do have to check on licenses, as they should with any source (including copying from Stack Overflow).


Okay, so Copilot isn't illegal, it's just an engine for doing illegal things? That's... not better?


You're right, we should ban the camera and paintbrush as well because people can make illegal materials out of them too.


Cameras and paintbrushes can easily make non-infringing works. Users of them can easily be trained how to avoid taking others' work.

Copilot, on the other hand, basically defaults to infringing behavior. Users would have to go to great lengths to be sure they aren't infringing on others' work.


> show it to you and other users...analyze it on our servers...share it with other users...perform it

I don't know, sounds pretty similar to training ML models on it, even if they don't explicitly say "machine learning" in the ToS.


> This license does not grant GitHub the right to sell Your Content.

This would, at a minimum, preclude charging for Copilot.

This is missing the point though. Microsoft claims their use of source code for Copilot is fair use. If they are correct about that, licenses don't matter, this EULA doesn't matter, etc. Everyone should be focusing on this claim, arguing about any other detail before that is decided is a waste of time.


If anyone asked me to define Copilot I'd refer back to this:

> parse it into a search index or otherwise analyze it on our servers; share it with other users

That is the most succinct and most accurate definition of Copilot I've ever seen.


> If you don't like it, you are free to host your code on your own servers.

From the article:

> “Dude, it’s cool. I took SFC’s advice and moved my code off GitHub.” So did I. Guess what? It doesn’t matter. By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub.

And:

> when it comes for their own field, they are suddenly very worried

From the article:

> First, the objection here is not to AI-assisted coding tools generally, but to Microsoft’s specific choices with Copilot. We can easily imagine a version of Copilot that’s friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus. Despite its professed love for open source, Microsoft chose none of these options. Second, if you find Copilot valuable, it’s largely because of the quality of the underlying open-source training data. As Copilot sucks the life from open-source projects, the proximate effect will be to make Copilot ever worse—a spiraling ouroboros of garbage code.


> As Copilot sucks the life from open-source projects, the proximate effect will be to make Copilot ever worse—a spiraling ouroboros of garbage code.

If I was writing this website, I would delete this sentence, because it is actually really idiotic.

1. This is an opinionated statement, except that there's nothing backing this opinion. This is fearmongering, that GitHub Copilot will get worse unless we sue Microsoft.

2. Lawsuits are not about concerns about a product's creator potentially damaging their own product. Lawyers suing Microsoft don't get a "we're protecting Microsoft from Microsoft's own bad decisions!"

The argument, completely unfounded, is that GitHub Copilot will undermine... GitHub Copilot. So, if you like GitHub Copilot, you should also be on board with suing Microsoft, so that we don't damage GitHub Copilot. What??? Good luck proving that line of argument in a court - you'd get laughed out of the room. Courts don't react well to hazy predictions about mayhem, from the suing lawyers, that have no historical facts to base them on.


Code pushed to GitHub is quite often not pushed by the actual copyright holders, and there is no way to distinguish it, even if there were a clause like this in GitHub's user agreement.


How many millions of accounts were created before Copilot existed at all? It certainly wasn't in the ToS then, except maybe in some extremely vague way.


ToS isn't some all-powerful thing, first of all. A lot of it is unenforceable nonsense. And I'm not sure how it really works with OSS.

For instance, what if I self-host an OSS project but someone puts a mirror on GitHub? Or just uses GH as a remote for their fork? Does that random person accepting the ToS now mean GH has carte blanche to do whatever they want with that IP?


Not all code on GitHub was uploaded by the copyright holder. The entire linux kernel is on GitHub and at least some of those copyright holders have never explicitly granted a license to GitHub beyond the GPL.


> Does GitHub not have the right to view and train from your content when you agree to their Terms of Service and upload your code? [...] If you don't like it, you are free to host your code on your own servers.

I do exactly this, and it does bupkis to prevent someone from downloading my code from my gitea and uploading it to GitHub. In fact, several people have.


Maybe I'm in the minority, but I think the prospect of someone autocompleting a snippet that came from me, finding it useful, and incorporating it is great. It means my thoughts and logic are shaping culture in a memetic feedback loop.


That's not the problem. If you want to license your work in a way that allows that, you are free to do so. The issue is that Microsoft did that with code that was published under licenses that either did not grant that right, or which explicitly forbade them from doing so.


Also charging for it.


"[W]e inquired privately with Friedman and other Microsoft and GitHub representatives in June 2021, asking for solid legal references for GitHub’s public legal positions … They provided none."

Well... DUH. Why would they? You want to possibly sue them. Why in the hell would they, or anyone, provide crucial evidence for your lawsuit before you've sued them, regardless of the case and circumstances? Of course they aren't going to provide evidence, because you are obviously going to then try to prove hypocrisy, whereas you might not have enough to go on if they don't talk. No corporate lawyer in their right mind would ever grant such a request. (Edit: You are quite literally asking what their legal strategy is going to be, before the lawsuit has occurred, and then trying to spin the refusal as a proof of guilt.)

That's like claiming that an alleged drug dealer who didn't talk without a lawyer present is obviously a criminal, because if he wasn't he would have talked. What a nothing of a point.


The problems have almost nothing to do with deep learning stuff. They are on the companies who develop such products.

If a company uses someone's code for a commercial product (a normal app), they need to follow the license accordingly. If a company uses someone's code for a commercial product (model training), they don't need to follow anything.

If a company uses someone's art piece for a commercial product (a normal game), they need to get consent, and pay for the right to use it, whether to the hosting platform or the artists themselves, if it is not royalty free. If a company uses someone's art piece for a commercial product (model training), they don't need to get consent or pay for anything.

All the problems actually happen before the technical details, making the entire pipeline questionable.


I really don't care if my code gets ingested and regurgitated by Copilot, but it seems rather a stretch to imagine that this is fair use, in part because it separates me from the legal protections afforded by the licenses I released my software under. In my ideal world, Copilot would be legally viable, and releasing my software without restriction wouldn't be risky.

As a long-time open source software developer, I have favored the 2-clause BSD and MIT licenses because they are the simplest licenses that provide me some liability protection. I would release code into the public domain if that didn't increase the likelihood of being sued, whether for liability, or for someone else claiming intellectual rights to code I actually wrote.


I still release under CC0; being copied verbatim is of no concern. Yet I don't think reproducing somebody else's code is 'fair use'.


All this discussion of legality is interesting to me, because I'm pretty sure that if GitHub ran a search in the background, found the corresponding license for the code snippet, then showed it to the user in some cookie-banner-like annoyance, it would be completely legal. This is what GitHub already does on their website with a search bar.

Yet somehow I think most people upset about Copilot would not like that outcome.
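Mechanically, the "show the license alongside the snippet" idea is just an exact-match lookup against an index built from the hosted repos. A toy sketch (the index shape and function names are invented here, not anything GitHub ships):

```python
import hashlib

def normalize(code):
    """Collapse whitespace so trivial formatting changes still match."""
    return " ".join(code.split())

def build_index(corpus):
    """corpus: iterable of (snippet, repo, license) triples.
    Maps the hash of each normalized snippet to its origin."""
    return {hashlib.sha256(normalize(s).encode()).hexdigest(): (repo, lic)
            for s, repo, lic in corpus}

def attribute(suggestion, index):
    """Return (repo, license) if the suggestion matches a known
    snippet verbatim (modulo whitespace), else None."""
    key = hashlib.sha256(normalize(suggestion).encode()).hexdigest()
    return index.get(key)
```

Real matching would have to be fuzzier than a whitespace-normalized hash, but the lookup itself is clearly not the hard part.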


By the way, Amazon's CodeWhisperer does that.

> in very rare cases, an independently generated code recommendation may resemble a unique code snippet in the training data. By notifying you when this happens, and providing you the repository and licensing information, CodeWhisperer makes it easier for you to decide whether to use the code in your project and make the relevant source code attributions as you see fit.


It’s interesting and pretty much uncharted territory from a legal perspective, that’s for sure. It very much relates back to the discussion about accountability of machine learning models, in that it’s desirable to be able to explain how/why some output was generated.

I don’t think a banner would be sufficient here, though; perhaps some references to the inputs that were used to generate the output, but that’s often very difficult to pinpoint.

Whatever happens, if this ends up setting some sort of legal precedent it will have a big impact on the industry, and I personally hope it leads to more accountability and transparency of the models, rather than the black boxes they are now.


How could a banner not be sufficient? Under what legal theory is Copilot + banner illegal, but code search is legal?


I’d argue that copilot’s core functionality is distributing code, which is very different from search (pointing to original code). Very different from a copyright / fair use perspective.


How do you run a search on the product of a bunch of neural network weights? Do you just mean like double-checking that the code it produced isn't a direct copy of something copyrighted?

If so, I think that introduces a lot of other issues. There's an interesting phenomenon of different independent comedians suing talk shows for stealing their jokes. It almost always turned out that those jokes weren't actually stolen. Instead, there's only so many jokes you can make about a given news story, so there's bound to be overlap between a whole room of comedians trying to milk every event of any comedic value and random independent comedians doing the same.


You run the search on the text output. Yes, I mean that they would search for the code. It's certainly legal to show LGPL code snippets along with the license (it may also be legal without this, IANAL).


100% correct takes in the piece; this is just ridiculous:

>"Tim Davis gave numerous examples of large chunks of his code being copied verbatim by Copilot, including when he prompted Copilot with the comment / sparse matrix transpose in the style of Tim Davis /."

Copilot regurgitates code and blatantly violates licenses, not even sure what there is to argue about. Not only does it seem straight up illegal and sideline open source communities, I think the next logical step of this is that people who want to avoid having their work vacuumed up and their rights violated simply to move to proprietary software, which would be a huge disaster for open source.
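For context, the function at issue implements the textbook counting-sort transpose of a compressed sparse matrix. A generic CSR sketch of that classic algorithm, written from scratch here rather than taken from Davis's copyrighted CSparse code:

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix via counting sort: count the entries
    destined for each column, prefix-sum to get the new row pointers,
    then scatter each entry into its slot. O(nnz + n_cols) time."""
    nnz = indptr[-1]
    t_indptr = [0] * (n_cols + 1)
    for j in indices:                 # count entries per output row
        t_indptr[j + 1] += 1
    for j in range(n_cols):           # prefix sum -> row pointers
        t_indptr[j + 1] += t_indptr[j]
    t_indices = [0] * nnz
    t_data = [0] * nnz
    next_slot = t_indptr[:-1].copy()  # next free slot per output row
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            q = next_slot[j]
            t_indices[q] = i
            t_data[q] = data[p]
            next_slot[j] += 1
    return t_indptr, t_indices, t_data
```

The algorithm itself is standard textbook material; what Copilot reproduced was Davis's specific expression of it, which is exactly what copyright covers.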


Copilot is trained on and returns AGPL code verbatim. It’s game over. If these licenses are not enforced it defeats the entire purpose.


That's a problem of the licensers, not for Microsoft or the CoPilot users.

If you released AGPL code but never intended to sue anyone, why did you release it like that?

And if you did intend to: someone is able to use your code without any damage to you, without reputation loss, and in a way that gives them access to the innocent-infringer defense (after you overcome fair use, after you sue them).

How is that game over?


So suppose you go out and about and a Microsoft representative punches you in the face. Now, the Microsoft representative has a billion dollar corporation backing him, willing to defend him at all cost through every institution, while you're just John Doe who went on a trip.

If you ever went on a hike but never intended to sue anyone, why did you go out in the first place?

And if you did: someone is able to punch you in the face without any lasting damage, without reputation loss, and in a way that gives them access to the myriad legal defenses you couldn't come up with if you tried, after you sued them.

How is that game over?

Just because some corporation is, because of its sheer size, above the law (as far as a John Doe is concerned, anyway), does that make it right? We could probably do away with laws at that point and just accept getting punched in the face by Microsoft whenever they feel like it as the new reality.


You have gone off the deep end, please return to sanity.

The court system is the method of enforcement for copyright.

If you want the "right" in copyright, you have to sue people.

To sue people, you need to find infringement. That infringement must be above fair use.

However - if the infringement you find is so minor that you have no loss of revenue or reputation, a court will not award you damages, and may even dismiss the case.

Nobody has any copyright without suing people, there is no copyright police in the general case.

Microsoft has no special rights from its size. Its size makes it a target; it's not beneficial. It's why they have so much trouble with internal rules about the GPL. If I infringe on your copyright, the damages will be zero or low; if Microsoft infringes your copyright, the damages could be millions, with the same burden of proof.


Microsoft is using AGPL code, therefore their code is subject to it. A lack of damages doesn't give one the right to ignore licenses. If it does, like I mentioned, it defeats the entire point.


The vast majority of GPL violations are not enforced, because those who would want to enforce them are small and their opponents are big.

For Copilot to blow up, it'd need to be licensed code from a big company demonstrably turning up in a product of a competitor, or some similar event.


It might be the case that it is fair use to train the model on public data, but the code which it produces is covered by AGPL. Github limits liability in its TOS.

(I am not a lawyer).


Folks really should take their GPL code to a platform with similar ideals and stop propping up Microsoft GitHub.


A bit of a controversial opinion: to those who are defending Copilot saying it "boosted my productivity" and would miss it if it were discontinued: maybe you are not a productive developer to begin with. I fail to see how searching for the same snippets on Google or saving commonly used macros in your favorite editor would not yield the same amount of productivity. I have used Copilot for several months and I actively stopped using it, because I was afraid I would become dependent on it and that it would actually reduce my ability to do critical code-building. I'm happy without it - sure, it takes some microseconds more to type out my code instead of autogenerating it, but I feel much more confident in my own coding skills.

CoPilot is a great research work - it is indeed spectacular to see how pre-training can achieve such impressive code completion results. However, in my honest opinion, it should not be a tool for a serious developer.


To set people's expectations: it is likely to take a bunch of lawsuits and a bunch of cases here to get anywhere useful. The problem with lawsuits on copyright is that they are rarely precedential. I get that what people see is the large cases that try to tackle big topics. But for every single one of those, there are probably 10x or 100x equally large cases that did precisely none of that.

This is particularly true of fair use; it is very fact-specific. A court is much more likely to answer a very fact-specific question about Copilot, tied to the very specific facts of the case (i.e., how is this exact thing used, etc.), than broader, abstract questions.

In fact, standard Article III courts in the US are literally not allowed to issue advisory opinions.


Oh man. I want to continue using Copilot. It has improved my productivity and made me excited to do things that previously felt like a chore.

Also, fellow programmers, please do not hinder other programmers' work. If you do, someone higher up the ladder will eat your cake at every opportunity.


Yes, we’d all find our work easier if we could just steal other people’s work.


You're probably stealing other people's code daily, willingly or not. Should we run a plagiarism scanner on all your code?


Yes, I want repositories / libraries that steal code taken down, just as GitHub Copilot should be, so I don't unwillingly steal code.


I've been trained on open source code, and there are likely many algorithms that I've internalized that are very similar to the "standard" way of performing an operation.

Is there a reason why an AI being trained on the same open source code isn't a similar situation? I agree that wholesale pasting of code chunks is an issue, but that hasn't been my experience with Copilot.

I'm not arguing for Copilot here...I'm genuinely curious why this would be considered any different.


>Is there a reason why an AI being trained on the same open source code isn't a similar situation?

You are a human. You know what's right or wrong. You know you can't just copy code 1:1 from public repositories without respecting their license. The AI doesn't know and doesn't care. It's a common problem with creative AIs that they will occasionally regurgitate near 1:1 copies of their training data, and I don't think it's an easy problem to solve.

>I agree that wholesale pasting of code chunks is an issue, but that hasn't been my experience with Copilot.

The article provides several examples of it happening. Just because it hasn't regularly happened to you doesn't mean it doesn't happen.


I don't deny that there are examples that appear to be wholesale copying, and that is definitely an issue to be addressed. No doubt.

What I don't understand is why the rest of the service (where it doesn't appear to be pasting existing code) is being maligned when it behaves like a more powerful version of autocomplete.


Humans can reasonably distinguish between when they are plagiarizing and when they are just applying their experience and knowledge. Copilot presumably isn't able to make that distinction, and, arguably, neither are the consumers of Copilot's output.


Does reading software code count as "using software"? I personally don't consider myself subject to a license when I'm reading public code on GitHub. GitHub Copilot and Codex AI seem to be doing nothing more than reading a bunch of source code, not reusing that code to incorporate its functionality into a different product.


Did you read the part of the article where it states that it has been shown that Copilot copies verbatim large sections of copyrighted code, stripped of all attribution, depending on the prompt provided?


I wonder how much code like that floats around stackoverflow


So happy to learn of this, and I wish them the best of luck in their efforts. And I'm surprised to find so many people clinging to Copilot.

We shouldn't shed any tears for a megacorporation which shows such blatant disregard for the licensed works of people's labour.

Yes, AI is here to stay but we should be able to build AI that respects copyright. Yes, it's easier to just steal data and call it fair use. Whether or not that's stealing will be interesting to try in court.


Foolish take. If ML training is not fair use then all ML progress is dead in the water.

ML training is akin to reading or learning, and licenses do not apply to that.

You’re not thinking past “megacorp = bad”.


AI progress won't be dead in the water if it respects copyright laws. Yes, being free to just freely grab any data is infinitely easier. But having to rely on properly licensed datasets or asking users for consent should be the norm for ML development IMHO.

Also, if we had trained some AI on the Windows codebase and started freely using suggestions given by it, I bet Microsoft would scream copyright infringement in a heartbeat.


Two things.

First, it would be nice to have a copilot variant that searched only my own work, so I wouldn't need to grep through other code I've written to get a reminder of how I solved a problem in the past.
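A "my own code only" assistant could start as nothing fancier than a regex search over your own source tree. A minimal sketch, assuming your projects all live under one directory (the function name and extension list are mine):

```python
import os
import re

def search_my_code(root, pattern, exts=(".py", ".c", ".js")):
    """Walk my own source tree and return (path, line_no, line) for
    every line matching the regex -- a poor man's personal Copilot."""
    rx = re.compile(pattern)
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for no, line in enumerate(f, 1):
                        if rx.search(line):
                            hits.append((path, no, line.rstrip()))
            except OSError:
                continue  # skip unreadable files
    return hits
```

A real tool would add ranking or embedding-based retrieval, but even this covers the "remind me how I solved this before" case without touching anyone else's code.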

And, speaking of the past ...

Second, I am old enough to have seen slide rules being replaced by calculators. This was a great addition to the toolbox, but it also had its downside: I've seen many students who have very clouded notions of significant digits, and many more who get quite confused with where to put a decimal point, when I ask them to compute something simple by hand.

Similarly, coding has been transformed with the advent of Stack Overflow-like systems. There are two communities of coders now: those who learn a language and then can solve problems based on a solid foundation, and those who shorten the learning phase and code by web search. The latter, it seems, are in danger of creating code that is brittle, limited, or downright wrong.

To the extent that copilot amplifies this habit of searching instead of thinking, I think it may lead to unreliable code.

So, sure, there are copyright issues. I think they have been well-discussed here and elsewhere. And courts may weigh in with new ideas. But my concern is with the reduction in code quality that may ensue. I'd love to see a discussion of the groups that are using copilot. If they are working on something I don't care about, then this is just a copyright issue. But if they are working on the "smarts" behind drug discovery, the control of dangerous machines, etc., then we have another issue, besides copyright.


For the first one, you might want to look into Tabnine https://www.tabnine.com/


>Copilot introduces what we might call a more selfish interface to open-source software: just give me what I want! With Copilot, open-source users never have to know who made their software. They never have to interact with a community. They never have to contribute.

>Meanwhile, we open-source authors have to watch as our work is stashed in a big code library in the sky called Copilot. The user feedback & contributions we were getting? Soon, all gone.

I don't see how you square the above complaint with this:

> First, the objection here is not to AI-assisted coding tools generally, but to Microsoft’s specific choices with Copilot. We can easily imagine a version of Copilot that’s friendlier to open-source developers—for instance, where participation is voluntary, or where coders are paid to contribute to the training corpus.

Is an AI that was trained on opt-in or paid-for training data any less damaging? How would these choices have alleviated the problems described above?


I do have to wonder if Copilot will last. It's going to become a legal minefield, and I can't imagine for a second that Microsoft will want to be in the crosshairs for another antitrust case.


IMO, it will last. It's impossible to roll it back.

IMO, if the lawsuit gets to a point where it's likely to be won by copyright owners, OpenAI could do the following:

- Use less sensitive code from big corps they have partnerships with for training. I bet MS and others have plenty of such code.

- Buy training rights from copyright owners of OSS projects. Many of them have CLAs which allow the owner to do much more than the license allows.

- Buy rights to train on code, and collect code generated with Copilot, from a large number of smaller software companies, likely with exclusions for some sensitive parts. MS has a lot of leverage here (discounts, partnerships, etc.).


Yes, because it turned out so badly the last time: Microsoft went from being one of the three most valuable companies in the US in 2000 to being one of the three most valuable companies in 2022.

Also back then, Microsoft had 90%+ share of the PC operating system market and was bundling IE in its operating system. I’m glad the DOJ forced MS to change its ways.


So they turned around and started buying everyone else. GitHub, Nokia, Activision. They're back to their old shit.


GitHub doesn’t have a monopoly on “hosted git repositories”.

By the time MS bought Nokia, it was already a has-been in mobile, the acquisition was a total failure, and the game market is competitive.


Great, I hope it is tried in court. It should be. But unfortunately I have not a big hope that the courts will come to understand the issue well enough.


Many courts, especially those in the Northern District of California (where a case would likely end up litigated), are very proficient and literate about software and copyright law. See Judge William Alsup’s cases if you want to see some examples that illustrate the court’s competence. And these judges frequently have technical consultants on staff to assist with technological issues.


Ok, well, that sounds nice. I have little to no insight into how courts work in the States, so I was talking about my experience of the court system at home :)


If there were no copyright problems, then why didn't Microsoft train Copilot on its own source code, like Windows, Visual Studio, SQL Server, etc.?


Brilliant point. I wonder if that can be used in any legal arguments.


If you're against Copilot as developer, you're shooting yourself in the foot.

Locking up code under non-permissive licenses stymies the pace of code development and increases the costs of progress dramatically.

We all stand on the shoulders of others before us. Including the organisations that stand to benefit the most from aggressive licensing.


Not my problem.

I put in my time and effort to open-source a program for free, and I want to make sure that my code creates an incentive to create more free software, by using a copyleft license.


Oh, so you're saying Copilot doesn't actually go far enough. It should not only give you code snippets but enforce particular licensing of the code it is helping you create?


If it feeds you code made available under the GPL, it should probably tell you that your code needs to comply with the GPL to use that snippet, yes?


Copy & paste isn't "standing on the shoulders of others". It is more like being an intestinal parasite.


Not trying to overly advocate for copy-pasting here, but isn't copy-pasting just the ugly child of calling a library function? If it's a blind copy-paste it's pretty much the same effect. Surely you wouldn't call using a library being an intestinal parasite?


Isn't copying a poem and releasing it under your own name (i.e, stealing it) the same as referring to it in a footnote, using proper attribution?


If Copilot was just a "copy and paste" tool very few people would find it useful and you wouldn't be here whining. So don't worry, Copilot is not just copying and pasting.


Critics of intellectual property theft are "whining". This is useful information in the next Microsoft IP lawsuit.


Yes. And I disagree there is any theft here.


I wonder if people realize that letting GitHub train Copilot on your open source contributions effectively devalues your own time, which (if repeated at a larger scale) devalues your experience, and eventually reduces the correlation between your experience and your salary.

For example, if an overseas firm can just as easily use Copilot as I can write original code (or use Copilot myself), why would any company hire me locally?


Open source? They used everything on GitHub with no regard for licenses, which would have included plenty of code under conventional copyright. Microsoft is now profiting from that code.


The example given is "sparse matrix trans­pose in the style of Tim Davis", but someone who wanted something with such specificity would be able to just take it from Github anyway, perhaps with a little more searching.


And would therefore have to follow the license of the code they took it from. That's exactly the point. Copilot is reproducing the same code but without the license.


A simple search on GitHub reveals that those functions have been reposted verbatim thousands of times; most people just copy and paste snippets of code they find useful, ignoring licenses. This highlights how all the power a license promises to hold is completely fictional. Any "in the style of Tim Davis" modifier only shows some kind of unwarranted self-importance complex on the part of the guy, thinking his style is widely known and distinctive (it's not). It's not the job of Copilot, the team that builds it, or the programmers that use it, to determine where functions that were reposted thousands of times under all kinds of licenses originated.

This is the same case as with copyrighted photos in newspapers: a paper prints a photo somebody allowed them to use, but then it turns out that person did not have the right to it in the first place. That did not stop newspapers from printing photos.

Here are the search terms: https://github.com/search?q=cs_transpose&type=Code
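For anyone curious what the contested routine actually does: `cs_transpose` transposes a matrix stored in compressed-sparse-column (CSC) form. A minimal sketch of the standard counting approach, written from scratch for illustration rather than taken from SuiteSparse, might look like this:

```python
def csc_transpose(n_rows, n_cols, colptr, rowidx, vals):
    """Transpose a matrix stored in compressed-sparse-column (CSC) form."""
    # Count the entries in each row of A; these become the column sizes of A^T.
    counts = [0] * n_rows
    for r in rowidx:
        counts[r] += 1
    # Prefix sums over the counts give the column pointers of A^T.
    t_colptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        t_colptr[i + 1] = t_colptr[i] + counts[i]
    # Scatter each entry (row r, column j, value v) of A into column r of A^T.
    next_slot = t_colptr[:n_rows]
    t_rowidx = [0] * len(rowidx)
    t_vals = [0] * len(vals)
    for j in range(n_cols):
        for p in range(colptr[j], colptr[j + 1]):
            q = next_slot[rowidx[p]]
            next_slot[rowidx[p]] = q + 1
            t_rowidx[q] = j
            t_vals[q] = vals[p]
    return t_colptr, t_rowidx, t_vals
```

The algorithm itself (a counting sort over the row indices, O(nnz + n) time) is common knowledge; a verbatim reproduction of one particular implementation is what a license governs, which is exactly the distinction the thread is arguing over.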


It does not show an unwarranted sense of self-importance on the part of Tim Davis.

Whether or not his style is widely known, his code is VERY widely used. Just look up SuiteSparse and try to find all of the downstream uses of it. It is one of the most---if not the most---ubiquitously used sets of sparse linear algebra libraries. If you do anything with numerical linear algebra, there's a good chance you at least know what SuiteSparse is, and possibly also know who Tim Davis is.

The bigger issue here is the effect this has on research. Tim Davis not only programmed this library, he did the basic research leading to many of the algorithms in SuiteSparse. He went ahead and released SuiteSparse open source, probably thinking that it would be a good deal for him, provided that its use was properly attributed. Provide a public service in exchange for attribution. This is a reasonable way to get support as an academic. Clearly he has had a large number of industrial collaborations which likely have provided him with a significant amount of funding over the years.

Speaking for myself, if Microsoft has no compunction against behaving this way, I can no longer see the point in publicly releasing research code that I develop using an open source model. Microsoft is clearly telegraphing that they don't give a f** about licensing, although whether that holds if they are litigated against remains to be seen. I think there's an excellent chance many other researchers feel the same way. If you think openness and reproducibility in science is important, this is a problem.


Please look up the license in the source code that Tim Davis points to: it just mentions that it's LGPL but doesn't include the full license text. And none of the C code mentions a license or Tim Davis.

And if you dig further, the whole repo mixes BSD- and LGPL-"licensed" packages together. It's probably best that Copilot not suggest from code that does not have an explicit license stated.

I think Tim Davis was originally complaining about the non-public sources for suggestions, which GitHub Copilot ignored.


I love that Matthew is investigating this and agree that Copilot warrants more scrutiny. His suggestions that Microsoft let developers opt-in to having source used for training purposes, to pay for source it uses, and to attribute or credit it appropriately all seem reasonable.

Can someone help me to imagine a reality in which these points are viable concerns?

> …how will you feel if Copi­lot erases your open-source com­mu­nity?

> …Copi­lot will become not just a sub­sti­tute for open-source code on GitHub, but open-source code every­where.

> …Copi­lot is merely a con­ve­nient alter­na­tive inter­face to a large cor­pus of open-source code.

> With Copi­lot, open-source users never have to know who made their soft­ware. They never have to inter­act with a com­mu­nity. They never have to con­tribute.

Is the author suggesting that Copilot will be used in place of `npm install next react react-dom` or `cargo add tokio --features full` or `raco pkg install pollen` — that developers will be content to use augmented autosuggest in place of large, well-tested, well-documented open source libraries?

Does he see Copilot's final form as some kind of AI package manager that drops a library of untested unattributed undocumented files into our projects?

Or is it more that he thinks those libraries won't exist because open source contributors will grow to feel more abused than they already do, perhaps quitting the scene or developing in private, like certain artists have already done in response to the AI art movement?

There is already such a huge disparity between paid package consumers and unpaid package contributors. I haven't seen that change since Copilot launched in beta or under general availability. I see the same ratio of help/feature requests compared to code and documentation contributions that I always have. And package usage has not declined so far for the open source things I work with.

It would be nice to learn more about the “Copilot will lead to the death of open source communities” line of reasoning — what is the author's perceived timeline to open source's decline and fall as a result of Copilot's current path?


Don't confuse what you want with what the law says.

"Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device" [https://www.copyright.gov/help/faq/faq-general.html]

A copy is made whenever that text is displayed, e.g., in GitHub's UI. Even that copy is subject to copyright.

Is there an excuse/exception? In this case, there is no "fair use" exception, because exceptions have to be litigated case-by-case to be recognized, and there are no remotely similar situations. Don't forget: Lexis is a multi-billion-dollar business built on protecting the copyright to the page numbers in the otherwise public court opinions.

Does the law actually protect people if it's too costly to enforce? Not really; hence the blasé attitude. Congress is considering a "small claims" system for copyright, to remedy the big-firm bias. [https://www.copyright.gov/title17/92appm.html]

In the ML era, data is the new gold. Many, many firms nowadays get a good chunk of their revenues from selling their private view of "public" data: Facebook, LinkedIn, credit reporting companies, ADP, etc. Microsoft has gone all-in on stealing that gold from open-source developers.

It's not just that the code replication reduces any need to get the code from the source. But removing any link to the source destroys the value most-commonly sought in open-source software: recognition.

Salaries are the biggest expense of tech companies. They do everything they can to increase labor competition and reduce reputational rents: outsource, cross-train, promote open-source (for competition) and destroy any reputation networks or systems that justify higher rates. And, of course, standardize on containerized copy-paste or AI-generated software if they can.

So, no: copilot is not legal, it's socially and economically destabilizing, and it presents structural challenges to developers.

It's not good, but most will keep using it because, although the vast, vast majority of developers are wage laborers, they aspire to be founders. They see it can make code fast, and they'll think it makes them better.


If they really don't think that they need to comply with any license, then why not include all private repos in the training set? Could it be that they're worried about legal repercussions, whereas OSS is easier to (ab)use for this purpose because there's much less legal muscle behind it?

It is also very telling that they have not included any of their own proprietary code in the training set. If it's merely suggestions that are generated, why not also train on the NT kernel? Office?


It seems like there's no good license that places absolutely no restrictions or requirements on people using your code (such as attribution and respecting patent rights) worldwide.

I want my code to be used the way people treated text in the old days. There are texts that have been rewritten, added to, and edited by thousands of people over the centuries, and yet they don't come with thousands of pages of attribution notices, because why would they?


You don't need to license it; you can just publish it with a declaration that, as the author, you are releasing your work into the public domain. However, in terms of licensing, I believe MIT is the most permissive.


MIT still requires the license text to be included with the source. Copilot, if it is not fair use, violates the license of MIT code it re-emits.


Okay, so just take "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software" out of MIT and call it the Do Whatever You Want Licence. You're not obliged as an author / copyright holder to impose restrictions on people using your work if you don't want to.

Whether Copilot is breaching MIT depends on what constitutes a substantial portion, which I am not qualified to rule on.


CC0 or unlicense, probably. Although I concede that it's hard to do in all jurisdictions globally.


Just put it in the public domain.


I think that Microsoft should train Copilot on their own code (they certainly own enough lines of code, after all). If they think that would not be fair use, then why should it be fair use to use somebody else's code?


Nobody would want to use Copilot if the quality of the code it produced were like code from Microsoft. Garbage in = garbage out.


I get the impression that many people's grievance with generative AI (text, code, images, etc.) isn't _really_ about the data provenance. Or at least, it feels secondary compared to the general disruptive nature of the tech.

If tomorrow someone released a Stable Diffusion, Copilot, etc. with the same functionality but respecting the provenance of the data (i.e., licensing), what concrete difference would this make? Programmers and other creative professionals would still (reasonably) be nervous about the implications for their livelihoods and communities.

At some point it will be possible to prompt a model for music in the style of <random artist>, and having never heard <random artist>, the model will generate a convincing emulation, based purely on statistical knowledge gleaned from millions of unrelated songs and text pairs. (I give it 5 years).

Now what? <random artist> should still be concerned (or not), but at least we're talking about the correct issue: How do we co-exist with generative models that massively disrupt/alter the process of doing creative or intellectual work?


Sometimes when working on software the goal is not to get a competitive advantage, but to promote some ideas. A copyleft license is a tool that aims to guarantee that derivatives of the work remain available to the public to read and modify (i.e., to prevent the source code of a commercially sold program from being closed).

The concrete difference you ask about is that work derived from copyleft code retains the license, so the source code can't be closed. If you scrap the license, then code written by someone whose explicit goal was that improvements to it never become closed source can end up closed source anyway.


You're missing the big picture: first they let a lot of licensing violations get littered throughout your internal code, and then they can sell you an Azure-hosted open-source licensing-annotation AI to fix it.


It's tragically beautiful how the copyleft crowd is putting so much effort into drastically expanding the scope of copyright.

"I used the copyright to destroy the copyright."

That sort of plot never works in practice.


> drastically expanding the scope of copyright.

I think you need to explain that more. The problem (or at least one problem) being explored here is that by using any code from co-pilot, you are responsible for making sure the licensing is correct. You could unknowingly be using and modifying GPL-licensed code in your non-GPL project, which is a violation if you don't publish your modifications.

We're not talking about expanding copyright, just protecting the existing copyright systems from being trampled by microsoft.


> Arguably, Microsoft is cre­at­ing a new walled gar­den that will inhibit pro­gram­mers from dis­cov­er­ing tra­di­tional open-source com­mu­ni­ties.

This is extremely far fetched.

User bases (let's avoid one of the four dirty C words) are organized around something which builds, executes, and is documented, not around searching for snippets.


I don't like that open-source code is being used in a commercial product. I feel concerned about NNs learning stuff they aren't really "supposed to" learn because somebody published something by mistake a long time ago. But this general argument about reproducing copyrighted code is stupid, and actively trying to shut Copilot down because of it is why lawyers are cancer.

Basically, what Copilot (or anything like it) is supposed to do is speed up your work: ideally, to write exactly what you'd write, but orders of magnitude faster. How do you write code? Well, you may have a solution in mind; if it's something really original, rest assured, Copilot won't guess it. It can only hope to guess something that, in a sense, "has a correct answer". In fact, it does this worse than it should: graph traversals, matrix operations, and the like ought to be guessed flawlessly (in a perfect world every PL would have primitives implementing them in the best possible way, but ours is not perfect). If you don't know how to traverse a graph, you'll go and look for a reference. 15 years ago it was likely a book; later, looking it up on Wikipedia or StackOverflow became way more likely. For the last 5 or so years, literally searching for it on GitHub has been viable because of better search engines and the sheer size of the site.

Now, suppose I find a matrix transpose function in an open-source project which I cannot include as a library for some (usually technical, but maybe not) reason, so I memorize it, close the page, and re-type it in my IDE. Do I have to be restricted by its license? Doing it that way is obviously stupid, so how about just copy-pasting it while renaming some variables so the teacher won't notice? Given that this is not homework, there is no teacher, and the variables are named perfectly as they are, doing that is also really stupid, so I might as well just copy-paste it. So how about now: do I have to publish my code under GPL3? Is this theft? If any lawyers say yes, fuck those lawyers. It is nonsense.


> I don't like that opensource code is being used in a commercial product.

The vast majority of open source code would be almost entirely worthless (or more likely, would straight up not exist) if it couldn't be used in commercial products.

Open source software licenses were a mistake.

Agree about the rest.


First, the author’s book Beautiful Racket is very cool, recommended.

I largely disagree with this article, at least as far as training on MIT-, BSD-, etc.-licensed code examples goes. The small autocompletions, even if they are several lines long, sort of seem like fair use to me.

I do think that Copilot should have an option to use a smaller model trained only on code with very liberal licenses, because I think the use of GPL-, etc., licensed code is problematic - at least for me.

For what it is worth, I have a lot of Apache 2 licensed repos on GitHub (largely examples from my books) and I am pleased if my code contributed a small bit to the Copilot training data. I also publish my recent books under Creative Commons licenses that allow reuse, even commercially: basically, anything I do that might help someone, I am all in for sharing.


Sharing is distinct from attribution. Are you okay with your code being reused without attributing it to you? If yes, then why have you published it under licenses that explicitly require such attribution?


MIT requires attribution, which copilot does not seem to include in the cases where it fully reproduces existing code.


It seems that Copilot could address this issue by searching its source repositories for matches to the strings it generates, with appropriate criteria, and, for cases where a match length exceeds a threshold, giving the user a link describing the origin of the code, who wrote it, and what its license is. So you wouldn't just get the Quake fast inverse square root routine; you'd get a pointer to the Quake repository and the license info it came with. A separate model could be trained up to find the closest match in source code repositories. A user could then use Copilot safely, attribute code correctly, and avoid code with incompatible licenses.

This would be a better approach than "shut it down".
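A rough sketch of how such a match check could work, assuming (hypothetically) a shingle index built over the training corpus. The five-line window and the hash scheme are arbitrary choices for illustration, not a description of anything Copilot actually does:

```python
import hashlib

NGRAM = 5  # flag matches of 5+ consecutive non-blank lines; the threshold is arbitrary

def line_shingles(source, n=NGRAM):
    """Yield a hash for every n-line window, normalized for surrounding whitespace."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    for i in range(len(lines) - n + 1):
        window = "\n".join(lines[i:i + n])
        yield hashlib.sha1(window.encode()).hexdigest()

def build_index(corpus):
    """Map each shingle hash to the set of (repo, license) pairs containing it."""
    index = {}
    for repo, license_name, source in corpus:
        for h in line_shingles(source):
            index.setdefault(h, set()).add((repo, license_name))
    return index

def find_origins(generated, index):
    """Return every (repo, license) sharing an n-line window with the generated code."""
    hits = set()
    for h in line_shingles(generated):
        hits |= index.get(h, set())
    return hits
```

On a hit, the tool could surface the repo and license next to the suggestion rather than suppressing it, which is roughly the attribution flow described above.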


This does more harm than good. If this sets a precedent, then things like Stable Diffusion will also be illegal, since they're trained on public data. OP just wants to make money from Microsoft using fearmongering and a false sense of righteousness.


From all the discussions, it seems people are rooting for an MPAA-like organization and a ContentID-like system for code.


Abolish all copyright. We're all happily pirating movies and music but code is for some reason sacred.


FWIW, not all of us "happily pirate movies and music".

I want there to be more good music and movies. I want to support artists who create entertainment I enjoy. I go out of my way to buy physical copies of music from artists, wherever possible from the merch table at their shows or from their own websites. I pay to go see movies on the big screen (partly because I like the big screen cinema experience, but also because I understand "opening week revenue" is a key performance indicator for the success of a movie).

I think copyright is old, outdated, and probably not really fit for purpose for forms of creative work invented in the last 50 years. But I also think creative workers need to get paid for their effort (just the same as software developers), and absent a FAANG-style set of companies employing teams of songwriters, musicians, authors, and the like on FAANG-style salaries, copyright seems to be the option that is working (however badly).

I'll join your "abolish all copyright" crusade as soon as there's an alternative that's at least likely to work as well as (or better than) the system copyright allows. Just abolishing copyright and erasing the publishing/music/movie/art industries without a transition plan isn't a thing I can support. (At least a transition plan for the artists/editors/producers/writers/etc. I'll admit there's a large chunk of management and legal in the fairly abusive parts of the music industry I wouldn't shed a tear over if they all became homeless and destitute overnight...)


> We're all happily pirating movies and music but code is for some reason sacred.

Speak for yourself. I pay multiple streaming services, music and video, because I prefer creators be able to eat.


As someone who works at a streaming service: thank you.

We are people, we have ambitions and families. We're not just a faceless corporation.


Maybe MSFT should have one instance of Copilot for each common license, and then the user gets to pick which licenses they want to deal with when using Copilot. If you're writing code for a BSD-licensed codebase, you might accept Copilot trained on BSD- and MIT-licensed code, as well as any other license that's compatible with BSD. If you're writing code for a proprietary codebase you might want to exclude Copilot trained on any copyleft licenses. And so on.
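The filtering itself would be trivial once you commit to a compatibility table; the hard part is the table, which is a legal judgment. A sketch, where every entry in the table is an illustrative assumption and not legal advice:

```python
# Hypothetical map: target license -> corpus licenses assumed safe to train on.
# These entries are illustrative assumptions, not legal determinations.
COMPATIBLE = {
    "Proprietary": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "Unlicense"},
    "BSD-3-Clause": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Unlicense"},
    "GPL-3.0": {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0",
                "LGPL-3.0", "GPL-3.0", "Unlicense"},
}

def select_corpus(target_license, repos):
    """Keep only the repos whose license is assumed compatible with the target."""
    allowed = COMPATIBLE.get(target_license, set())
    return [name for name, lic in repos if lic in allowed]

repos = [("a", "MIT"), ("b", "GPL-3.0"), ("c", "Unlicense"), ("d", "SSPL-1.0")]
print(select_corpus("Proprietary", repos))  # ['a', 'c']
print(select_corpus("GPL-3.0", repos))      # ['a', 'b', 'c']
```

One model (or one training corpus) per row of that table is essentially the proposal above: the user picks the target license, and only compatible code ever feeds the suggestions.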


Don't give them a more complicated problem. It seems like they're already struggling with any distinction between code they can and can't use for this. :|

Which is ironic given that this is Microsoft. Whatever happened to "don't use programmers' code without paying them" and the whole "proprietary software is better because it sustains the programmer"?


Here's the thing about GitHub that most people do not realize. I find this funny because, for all the talk of following license agreements, very few have taken the time to read GitHub's terms of service.

from their terms of service: "Short version: You own content you create, but you allow us certain rights to it, so that we can display and share the content you post." emphasis mine.

that's what they call the "Short version" of the following paragraphs, which are found here: https://docs.github.com/en/site-policy/github-terms/github-t...

they allow themselves the right to display content you upload to others. GitHub does not seem to really put a cap on that in terms of what intentions it needs to have or for what purposes it needs to share your content.

this seems to me that, by putting your code on github.com, you are granting GitHub license to show it to others. period. IANAL, but it seems like all code anyone puts on github.com is dual-licensed, at least: GitHub gets its own rights to your code.

I read this before I signed up, and while I can't remember if this exact passage was present at the time, I was ok with everything GitHub wanted at the time, and I continue to be.

githubcopilotinvestigation.com doesn't seem to have much hope of doing anything except getting people mad. but you all were already mad anyway, weren't ya?


> but you all were already mad anyway ...

This seems to be the line of argumentation agreed upon by several waffling pro-GitHub posters. Many comments have some variation on that diversion from the issue.

> GitHub gets their own rights to your code.

This is preposterous and false. GitHub has the right to display the entire work, properly attributed and licensed, to others.

No new licenses are given, no dual-licensing takes place, no code-laundering is permitted.


> No new licenses are given, no dual-licensing takes place, no code-laundering is permitted.

I suggest you read the terms of service again.

here, I'll link directly to the license grant: https://docs.github.com/en/site-policy/github-terms/github-t...


I suggest you take your own advice:

    This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


There's been a lot of discussion around licenses but I'm not even sure if they matter for Copilot. I was reading their terms and conditions and there's a paragraph that basically says they have the right to display and share your code with other users. So even in the case where people are directly prompting Copilot with specific function names, I think the terms and conditions still cover them.

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

https://docs.github.com/en/site-policy/github-terms/github-t...


How can I, as the lead of a small team, make sure none of my code ends up on copilot (or any other submission of our IP to third parties)? We use Devops internally, and IDE decision is up to the developer.

I'm unsure if VS Code etc. submit samples or just interact with GitHub.

Edit: and furthermore, make sure it doesn’t import code from third parties. I don’t want my code being infringed upon, but also don’t want to accidentally infringe on others’ work. Legal or not.


In countries where there is no fair use (most of the world outside the US), it seems quite likely that Copilot is willful, commercial-scale copyright infringement.


Fair use is unusually permissive in the US, but most countries have very complex copyright rules to allow e.g. a televised interview in a room with contemporary paintings, without getting permission from the copyright holders of those paintings. It'd certainly make for interesting cases.


I wish code didn't have any copyright at all. It should just belong to our species for the benefit of our species. If your entire business model depends on having some private code that you lord over, versus, you know, having some expertise in the field you are in and the ability to generate more code to solve ongoing problems, it seems like you are structured on shaky ground to begin with.

For example, there are plenty of academics these days who are at the tops of their fields and open source all their code. They end up considered experts not because of a black-box code base they apply to problems, but because they can think of potential solutions to the problems at all; writing up some code is just one of the tools used. The code is a shovel or a hammer, not the one wielding it. They have competitors too, of course; it's just that the secret sauce isn't the code but what goes on in your actual brain.

It's too bad most business leaders fail to understand this, and think it's a black-box code base that makes a decent business. It's the ability to solve problems that matters.


This Copilot saga is another good reminder of why nothing is free. Developers have been using Github for free for years - now the chickens have come home to roost. The copyright licenses are just a formality - a form of kayfabe. If you aren't hosting your own code (GNU style), you should assume Microsoft owns it, for all intents and purposes.


Those who want to insist there are no instances of infringement or evidence thereof should take a look at this link first.

It's face-saving.

https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...


It makes sense to copyright a book, but it doesn't make sense to copyright a phrase (unless you are using it as a trademark, motto, or something like that); normally phrases are free for anybody to re-use. It makes sense to copyright a program, but it doesn't make sense to copyright a piece of code.


Oh my god, round and round on this topic. Leave it alone. Copilot is an amazing tool and a demo of what AI can do. I will happily pay for good ML products, which are notoriously hard to monetize.

Copilot may produce results from the training set, but if you're letting it do that, that says more about you than about copilot.

All of these claims use the example "Write me a function to foo the bar that takes baz as an argument". If you prompt it to write entire functions and classes for you, then it will lean on its training set.

But if you actually just write code, then it will complete small single lines in exactly the style you've previously written, with code that is unique to your program, because it can synthesize new code.

In this role copilot is no different than a search engine. By prompting it lazily, copilot isn't the one stealing the code, you are.


This reminds me of pirating music. Lawyers tried futilely to stop it, but if something is technically possible people will find a way to keep doing it. Maybe you set some legal precedent on fair use with AI, but it won't prevent the real world usages if there's a benefit to the technology.


Lots and lots and lots and lots of people confusing copyright (an inherent property right granted and protected by the government) and license (a privately granted privilege to use). Butterick—who is no IP fundamentalist, just go look at the license he used for his typefaces—is doing two things: looking at the enforcement of open source licenses so that they are not invalidated by nonenforcement and, related, asking Microsoft to respect the community. I didn’t see him suggest that Copilot is bad or should be shut down, just that they play by the rules. A lot of the reactions here echo a lot of non-developer middle managers who insist that open source code is free and freely usable by anyone for any reason, which simply isn’t the case if FOSS licenses have meaning and value.


1. New player shows up, changes the value chain, and creates abundance.
2. People who benefitted from the old value chain whine.
3. New player throws them a bone with a small fund or maybe a settings checkbox, but doesn't really change.
4. (A few years later) no one cares about the kooks who whined.

I’m not even 30 yet and I’ve seen this happen again and again - it’s frankly boring at this point. We’ve seen this with Spotify and music, newspapers and the internet etc.

The practical truth is that Copilot is a useful tool for humanity to have. It is exceedingly unlikely it will be stopped because a small percentage of programmers - themselves a small percentage of people who benefit from code - feel their interests have been hurt. Change or get left behind (but make sure to enrich some lawyers on a pointless suit in the meantime).


There's a big difference between learning and memorizing.

If the AI is "learning" how it works by studying public code then using its knowledge to create, that's okay.

But if it's just memorizing code and reciting it back, not okay. Just like if a human were doing this.

Of course we don't currently have ways to know the difference [that I know of] since AI is a black box.

Interestingly, current AI is not capable of truly understanding how code works and how it will execute, so it has to learn in its own way. I suspect it can learn what valid syntax is, but I doubt it is aware of how the code will execute.

It's possible this is just a case of Overfitting. https://en.wikipedia.org/wiki/Overfitting
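The memorizing-versus-learning distinction can be made concrete with a toy sketch (purely illustrative, and nothing to do with Copilot's actual architecture): two "models" trained on the same data, one of which memorizes it verbatim and one of which learns the underlying pattern.

```python
train = [(1, 2), (2, 4), (3, 6), (4, 8)]  # samples of y = 2x

# "Memorizer": stores the training set verbatim.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x)  # reproduces training data exactly, knows nothing else

# "Learner": estimates the slope from the data (least squares through the origin).
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def learner(x):
    return slope * x

# Both are perfect on the training set...
assert all(memorizer(x) == y for x, y in train)
assert all(learner(x) == y for x, y in train)
# ...but only the learner generalizes to unseen inputs.
print(memorizer(10))  # None: pure memorization, like an overfit model
print(learner(10))    # 20.0: learned the underlying rule
```

An overfit model sits at the memorizer end of this spectrum: great on the training data, useless (or, in Copilot's case, verbatim-reciting) beyond it.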


BSD 5-Clause

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the organization.

4. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

5. Use of this source code for the research or training of machine learning models is permitted.


The issue with copilot is that it is not respecting clause 1.


Only if you agree it is "redistributing" anything!

Small enough pieces of code can't be copyrighted. No one would support an argument that I violated copyright by using the code "else if {" from some GPL library.

So the question becomes what is the minimal unit of copyrightable code? What if you wrote a nice big function exactly (or almost exactly) the same way as someone else did? Whose copyright are you violating?


So what? You can put on a cowboy hat and larp as one, but that does not mean everyone else around you has to take it seriously. Same with these made-up licenses. If it's on the internet, it belongs to all. Or else keep it to yourself.


So, if you are not compliant with the posted license, you are in violation of copyright.

This is no different whether you are honkler or Microsoft (other than in how vigorously or not someone may enforce it).

If you really believe that anything posted to the internet ‘belongs to all’ then I don’t know what to tell you other than you live in a fantasy land where Oracle Corporation does not exist. We might all prefer it if things were that way, but they simply aren’t, and that’s just tough.


Funnily, for an article all about copying, everywhere the author writes "Copilot" it appears as "Copi lot" in text browsers: the word contains a soft hyphen (U+00AD, whose UTF-8 bytes appear as octal 302 255 in the dump below). The HN title is the same (check it in a hex dump).

For example, from TFA file:

0005e10 o f C o p i 302 255 l o t


Has anyone spotted licenses in the wild that specifically prohibit AI tools like Copilot?


Copyright only covers the expressive parts and not the utilitarian parts:

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

https://en.wikipedia.org/wiki/Idea–expression_distinction

https://h2o.law.harvard.edu/cases/5004

Most of your code is probably not subject to copyright in the first place, regardless of license.


Doesn't Copilot reproduce the exact expression given the right prompt, though?


Even if it does, it may not matter. For example, APIs are not copyrightable (see Google v Oracle), and if there is only one obvious efficient way to make something work, it does not follow that the user must be prohibited from using that way even if someone else did it first.


“Expression” in the creative sense, as opposed to utilitarian in a functional sense.

Copyright is meant to protect “useless” things like poetry and music.


So does a random number generator.


Reality: *GPL licenses are proprietary licenses.

I hope Copilot and similar technologies weakens the copyright establishment.

Do Business WITHOUT Intellectual Property - Stephan Kinsella http://www.stephankinsella.com/wp-content/uploads/publicatio...

Against Intellectual Property - Stephan Kinsella https://mises.org/library/against-intellectual-property-0


A good solution might be to add a new license clause stipulating whether the owner is okay with their code being used to train AI models.

Part of the clause would explain that if you are okay with your code being trained on, then you're also accepting that it may be copied verbatim at some point down the line during code completion.

You do get a bit of tragedy of the commons where everybody wants to use the AI model but nobody wants their own code trained on.

I don't like the idea of a world where licensing and copyright law prevents us from enjoying the progress of AI. Caveat: I am not an expert on open source.


There's no need for the repo owner to do anything: they already indicate the license. GitHub even shows a simple explanation of the license on the repo's main page. GitHub has all the data it needs to respect the license. If their trained model can't reproduce the license for the repo a fragment comes from, then they've failed in their social and legal responsibilities.

I do understand how ML works. I know it's probably not possible with how it's currently done. That doesn't make it legal or ethical.

It would actually be great for everyone if it showed both the license and repo. Imagine you pull up a great function with Copilot and want to explore the source for more insights. You can't with how they've done this.


It does actually ask if you want to use your code to help train it. The problem is that even when people have said no, they're still seeing their code pop up in copilot's auto-complete.

I don't mind it using my code because in my opinion, we as a software industry are way behind on where we should be and copilot is helping a lot of developers finish their projects quicker.

That said, software licenses should 100% be respected. I would hate for FOSS projects to start being sued over code. It's not in the spirit of FOSS, but neither is stealing code. Copilot should be doing a better job excluding code and none of this would be a problem.


There's a "Pictures in Boxes" comic about the internet stealing content on his page. It doesn't name the author or link his site on the image or in text.

But since the comic is not used so the page author can comment on the comic itself, but rather to support his discussion of another misuse of IP, does it constitute fair use?

The page author is going deep on the content misappropriation theme and on what constitutes fair use, so it seems oddly ironic he'd be so seemingly cavalier about using someone else's content on that page.


I am glad all the legal bs didn't stop MS from making the product. Copilot is surprisingly effective, it truly makes life easier for me, as a developer. The fact is that if you give your code away publicly, you cannot finely control what the world does with it. If this is not acceptable to you, keep your IP private.

If these guys manage to shut down or cripple Copilot using legal mechanisms, you can bet there will be a Chinese/Russian alternative that will be even more indifferent to your LICENSE.md, and you won't be able to get it shut down using the courts.


Most of these points can also be raised against DALL-E 2, but software has one extra thorn: patents.

It's common advice to not read software patents[1], because the infringement penalties are lower if you infringed unwittingly, that is, by reinventing the patented technique yourself.

I wonder if using Copilot doesn't push the penalties back again to wilful infringement. Or worse, patent trolls poisoning the training data with patented algorithms.

[1]: https://queue.acm.org/detail.cfm?id=3489047


Has Dall-E 2 yet reproduced 1:1 anything from its training set?


I don't know about DALL-E, but with Stable Diffusion, if you type in "Mona Lisa" or "Van Gogh" you have to fight pretty hard with your prompt to NOT get near-identical reproductions of those respective works


> Why couldn’t Microsoft pro­duce any legal author­ity for its posi­tion?

Absence of proof is not proof of absence.

They don't owe anyone anything beyond what they agree to provide to users of Copilot via its license agreement or to GitHub users whose code it has used in accordance with that license agreement. Those agreements define what they owe. That's it.

The only way those license agreements don't hold up in court is if they are somehow deemed invalid. I do not see Microsoft making that kind of mistake.

This website is designed to get people angry, and that's all it is going to accomplish.


Hey, I despise bait and switch from large corps. But I also find unsustainable this idea that society's legal resources should be wasted fighting over IP.

The code is out there. Millions of people are being trained and are writing code based on what they learned from open data.

Designers have "mood boards". Developers have open source. Right now I don't have sympathy for MS, but in a few years any developer could do what MS is doing with Copilot from their bedroom. Why would you care about the kid in their bedroom training an AI on free (as in public) information?


This is copyright. I put it out there with a government guarantee that I retain ownership of it. Society at large benefits because more people put their stuff out there. Break that deal and you will end up with less sharing. Why are you for information silos? I am for open ideas and sharing. You are for stealing and breaking the moral agreement, because distributing my code in the form of an AI analysis doesn't register with you as the same thing as distributing it in any other manner.


Copyright is not universal. And it will increasingly be less so.


You will then increasingly get LESS sharing and MORE silos. There is a reason copyright laws exist: they are a NET PLUS to society. It's not just 'for the benefit of EVIL corp'.


Maybe I will start writing open source code intended to trick Copilot. Stuff that just about works in the given context, but will fail badly if copy-pasted into another program. Imagine if we all did that.


To stay sane: for myself as a developer, I consider GitHub Copilot a (much) faster google/code search workflow. I can copy or remix code I find in a google search, but it's my responsibility to figure out the copyright situation of that code.

Imagine if something like google didn't exist, and then it suddenly did. People would be saying: "This newfangled computer algorithm is giving everyone copies of my code with a misattributed licence, just by typing the function name and site:github.com !"


It's too bad we can't experiment with interesting things like Copilot without worrying about remuneration and the respecting of rights. But that's the way of the world - we must think of these things. MS/Github should give code copyright holders a simple and easy way to opt-out of contributing their code to the Copilot corpus. Currently the only way to opt-out is to make your repo private. That's not good enough.

It would be better, of course, if Copilot was opt-in, but they'd never go for that.


They have done so already with their license and there's no legal reason for them to have to opt-out.


A bit meta but anyone know why the submission title contains unicode between various characters?

It's hidden on both Chromium/Firefox when viewing the page but when saving the page it reveals them in the text field, eg: `GitHub Copi_lot inves_ti_ga_tion`

Plugging the title into a unicode converter shows they're 'soft hyphen' characters

GitHub Copi [0x00AD] lot inves [0x00AD] ti [0x00AD] ga [0x00AD] tion

Edit: apparently they're for indicating to formatters where character breaks should be, though I can't see any consistent pattern here.
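If you're matching or deduplicating titles like this (say, for an HN search), it's worth normalizing the soft hyphens away first; a minimal sketch:

```python
title = "GitHub Copi\u00adlot inves\u00adti\u00adga\u00adtion"

# U+00AD renders as nothing (or an optional line break) in browsers,
# so naive string comparison against the visible text fails:
assert "Copilot" not in title

# Stripping the soft hyphens restores the expected text:
clean = title.replace("\u00ad", "")
assert clean == "GitHub Copilot investigation"
```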


> Arguably, Microsoft is cre­at­ing a new walled gar­den that will inhibit pro­gram­mers from dis­cov­er­ing tra­di­tional open-source com­mu­ni­ties. Or at the very least, remove any incen­tive to do so.

The walled garden bit I get. But I'm lost making the leap to "remove any incentive to do so." Is Butterick suggesting that someone is going to put aside their code and do a deep dive on GitHub looking for a snippet that might not exist?

I'm not trolling. I'm sincerely trying to grasp the argument being made.


Question for all those who are pro Copilot in this argument and are claiming fair use: do these same rules apply if I manually copy someone else's copyrighted code into my codebase?


It seems to me that in principle it should be possible to maintain attributions through the training process, so that Copilot outputs could come with a list of weighted sources, possibly discarding those that fall below a certain weight threshold. Doing so would likely be much more expensive in terms of the computational power needed for training, and probably also in the size of the model. But it would be great to actually be able to see what went into a specific Copilot output.
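A cheaper post-hoc approximation of this idea (nothing like how Copilot actually works internally) is to score a generated snippet against an index of training snippets and keep the sources above a weight threshold. A toy sketch, with a bag-of-tokens vector standing in for a real learned embedding:

```python
import math
from collections import Counter

def vectorize(code: str) -> Counter:
    # Toy stand-in for a learned embedding: a bag of whitespace tokens.
    return Counter(code.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_sources(output: str, corpus: dict, threshold: float = 0.3):
    """Rank training snippets by similarity to a model output,
    discarding sources that fall below the weight threshold."""
    v_out = vectorize(output)
    scored = ((src, cosine(v_out, vectorize(code)))
              for src, code in corpus.items())
    return sorted((p for p in scored if p[1] >= threshold),
                  key=lambda p: -p[1])

# Hypothetical corpus entries, purely for illustration:
corpus = {
    "repo-a/util.py": "def clamp ( x , lo , hi ) : return max ( lo , min ( x , hi ) )",
    "repo-b/circle.py": "def area ( r ) : return 3.14159 * r * r",
}
```

Calling `weighted_sources` on a snippet copied verbatim from `repo-a/util.py` ranks that repo first with weight near 1.0, while loosely related code scores lower; a real system would need proper tokenization and an approximate-nearest-neighbor index to scale.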


All this just shows one thing: copyrighting / licensing "code" is meaningless... but of course that was already known by all those people who think that the US laws on copyright should not have been propagated to the rest of the world. "Code" is merely an algorithm put to work. There should be nothing inherently copyrightable about it, no more than a chocolate cake recipe, which is just a way to put chocolate and a few other ingredients to work.


Do the same copyright issues arise with AI-generated videos learned from Shutterstock?

https://news.ycombinator.com/item?id=33239706

https://waxy.org/2022/09/ai-data-laundering-how-academic-and...


Most of the controversies posted on HN end with a feeling that nothing will change, because only we, the tech community, know the details of the issue, and we are too few to have an impact.

But this one solely affects a product where we are the target audience; if we oppose it, things should change. Now I wonder whether it will show that we actually care enough to act, or whether we are just as much regular consumers as non-tech people are in all the other cases.


An interim update to Copilot could link to where the code was pulled from. Or maybe there's a way for open source devs to add a comment to the code that links to their community/repo. If it was standardised then any data gathering would need to follow the collection rule.

I agree with the article's long-term outlook about community and code quality, though it is a very long-term outlook. It makes me wonder if humans will actually still be writing code.


My biggest concern regarding GitHub Copilot is that it is cloud based and opens up our previously private coding activities to continuous surveillance by third-parties.

It's only a matter of time before intelligence agencies will get their hands on the data. And if use of Copilot becomes an industry wide practice then those who wish to preserve their privacy will become uncompetitive.

I really hope we have some decent offline alternatives eventually.


New tech creates winners and losers, and losers inevitably complain. See looms, VHS, Napster, etc. The more of this complaining I see, the more it falls flat. The only interesting thing is which side different communities end up being on.

To be fair, record companies were not in the least bit sympathetic. Open source contributors are easier to identify with, though imo it doesn't actually make their concerns more valid


What does this comment even mean?

I cannot parse what you are suggesting.


As far as I can tell it's just a more convoluted way of saying "new good, old bad, only old people disagree"


It's more nuanced.

Copilot exists publicly, which also means some copilot-lite thing trained on a smaller subset of repos probably exists privately in many different places. It may not be as good today, but these private instances will improve over time. Since the demand for a copilot-like service exists, eventually a large VC-funded public instance will show up.

In that lens, it is more sensible on the individual level to prepare for a world where copilot thrives than to put all of your eggs in the "ban copilot" basket.


He's saying you're a luddite for not wanting Copilot to steal all your code and use it for furthering the grand vision of AI generated code (which supposedly represents the forward march of progress)


“If I had asked people what they wanted, they would have said faster horses.” Henry Ford


Learned weights should be considered a derived work of all the things the model was trained on.

I think 'training an AI' is actually a distinctly new use of IP, and should probably be considered under a specific kind of 'AI-use' license. Open Source licenses should be updated to indicate whether they allow or do not allow AIs to be trained on covered work as well as the other rights they allow.


For me, as a (granted very minor) contributor to some open source, I couldn't care less about attribution. The ethos of open source is specifically about sharing stuff (probably for free) for the benefit of everyone, take a penny, leave a penny. It's more of an interesting question if Copilot is suggesting code verbatim from source-available rather than open source repos though.


Imagine you are a history or philosophy teacher in 2100.

How cool would it be to discuss these kinds of issues? What do you think of the "erasing the open source community" argument from a historical perspective? What does it have in common with the industrial revolution?

Even though the real life implications are real, I find it fascinating and not so simple to unravel.


This is such a disingenuous use of the word “investigation”.

They are “investigating” whether they should start a lawsuit. So this is not an investigation; it’s somewhere between “due diligence” and a PR stunt.

I very much disagree with the idea of a lawsuit that seeks to establish ML training as not being fair use. It is an utterly foolish thing for them to wish for.


Is GH Copilot only on public repos? My assumption is that code from private repos was also showing up. I feel like I read an HN article about this previously. Don't have evidence but that seems like a much bigger issue and trust violation if that is true.

Kinda like how gmail was reading everyone's emails and showing ads based on them.


I wonder if in court they will rule that this is no different from a human reading open source code to learn how to code. I guess the main difference is that the human cannot be used in parallel, whereas Copilot can be used by millions of people at one time.

It will be interesting to see where this goes.


Well I am "a fan" of Copilot and I do think AI is the future, but I think the author has a valid point.

I think the fair use violation he describes doesn't happen during training. I do think training AI on anything that is publicly accessible is fair use just as in an example of a person learning by reading/watching the same materials.

However, this fair use rule is being violated the moment the resulting AI starts suggesting verbatim copied code from licensed works without attribution.

So one could argue the source code is not being used in a transformative way, and that Copilot is just a more efficient method of retrieval of licensed code. This misses the fact that Copilot actually is capable of writing new code. I've used it as "an autocomplete on steroids", letting it suggest maybe half a line or one line of code at a time (or trivial stuff we automate even without Copilot, like getters/setters in Java). But when actual licensed code is suggested, yes, that is IMO a license violation.

Therefore one way of resolving this would be to pair Copilot with a tool that scanned the resulting code for the presence of licensed code and then made a list of "credits" or references. Also, measures should be taken (perhaps during training) to penalise generation of verbatim (or extremely similar) code. Would this make Copilot less useful? I'm not sure.

One thing that's not going to happen is putting tools like Copilot back "in the bottle". We now have similar models anyone can download (FauxPilot), and I, as well as many others, have found those tools to speed up mundane tasks a lot. This translates into a monetary advantage for users. Therefore there is no way this will disappear, lawsuit or no lawsuit.
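A crude version of such a scanner (winnowing-style shingle matching, not anything Copilot actually ships) could hash k-token windows of the licensed corpus and flag verbatim overlaps in generated code. All names and the corpus below are hypothetical:

```python
import hashlib

K = 8  # shingle length in tokens: smaller catches more, with more noise

def fingerprints(code, k=K):
    # Real code would use a language-aware tokenizer; whitespace
    # splitting keeps this sketch short.
    toks = code.split()
    return {hashlib.sha1(" ".join(toks[i:i + k]).encode()).hexdigest()
            for i in range(max(0, len(toks) - k + 1))}

def build_index(corpus):
    """Map every k-token shingle hash to the licensed source it came from."""
    index = {}
    for source, code in corpus.items():
        for fp in fingerprints(code):
            index.setdefault(fp, source)
    return index

def credits(generated, index):
    """Sources whose k-token runs appear verbatim in the generated code."""
    return sorted({index[fp] for fp in fingerprints(generated) if fp in index})

# Hypothetical licensed corpus:
corpus = {"gpl-repo/foo.py":
          "def foo ( a , b ) : return a + b if a else b - a"}
index = build_index(corpus)
```

Any 8-token run copied verbatim from an indexed repo then shows up in `credits`, at which point a tool could attach the license and attribution automatically; hashing means the index never stores the raw licensed code.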


The narrative here seems to be a David and Goliath story in which Microsoft profits by stomping on defenseless open-source communities. There are two problems with this story.

First, the huge majority of open-source projects are at no real risk because Copilot offers something totally different from what they offer. Open-source projects generally take highly-complex domains and expose them as simple interfaces or executable programs. This encapsulation is where the value lies.

In contrast, Copilot just dumps code. Never once doing front-end work have I thought "if only there was a way to dump verbatim React internals directly into my codebase." In general, Copilot only replaces tasks I would have otherwise done myself.

The second problem is the biggest loser if Copilot gets shut down is not Microsoft, who can easily take the loss in stride. The real loser is the community of developers, many of them bootstrapping their own projects or trying to develop open-source in their precious off-hours, for whom every minute counts, and for whom tools like Copilot can be the difference between success and failure.


I think the test for whether an AI is infringing or not should be:

Can this AI regurgitate the vast majority of the creative aspects of an original/novel piece of software with minimal prompting, to the point where the output code looks mostly and directly cloned to a reasonable person trained in the art?


_Maybe_ software is fundamentally different to other "creative works" which rely on copyright protection, but it's not immediately clear it is, and as far as I know it's certainly not a "special edge case" as defined in copyright law in general.

So "Can this AI regurgitate the vast majority of the creative aspects of an original/novel piece of software" is not the test that, for example, the music industry uses when determining if a sample is infringing. The test there is "is a sample, however small, identifiable as part of a copyright work by a reasonable person trained in the art?"

You can't own copyright in a composition of a single middle C note. But lawsuits have been won for copyright infringement of melodies of two bars (fewer than about 16 consecutive notes). Men At Work lost a copyright case over the flute melody in "Down Under", which is the same as the 90-year-old tune "Kookaburra Sits in the Old Gum Tree" https://www.claytonutz.com/knowledge/2010/february/men-at-wo...

Whether that's done by a flute player or an AI, really doesn't make any difference as far as copyright law sees things.

(Whether copyright law is a "good fit" for source code, and whether it makes sense to apply laws meant for books/literature/music/film to software is a different but very good question. I don't have much in the way of other ideas which take original author's efforts and potential rights to benefit from then though...)


FWIW, in this case I was trying to feel out what I think is a good fit for copyright in general. I think the same test could be applied for books, music, art, etc.


If this becomes illegal, it will pretty much mark the death of free/open ML and its data sets.

If you can't train on data without asking for permission first, the data set becomes sparse. The only people who will be able to afford this will be, you guessed it, established giants who can build their own sets.


Getty Images already handled this issue with graphics. Most of their catalog was scraped early on by AI art generators. 'Errbody knows this because the Getty Images watermark appears in a lot of AI generated art. Getty Images, in turn, banned the sale of AI generated art because it is legally tainted.

The same thing will happen to source code produced by AI code generators. Github itself, or some entrepreneur, will come up with a way to identify and flag projects containing AI generated code based on models constructed from open source projects, so that those derivative works will not inadvertently be incorporated into other software that is concerned with such a flag. (They probably will also come up with an NFT-based mechanism of some sort to allow open source project rights holders to authorize incorporation of their code into AI models such that derivative works containing those fragments would not be subject to flagging.)

Hey YCombinator, give me $10M to make a billion dollar company that "lives at the intersection of" blockchain and open source. (Haha, No.)


> how will you feel if Copi­lot erases your open-source com­mu­nity

How will you feel if the greed of a lawyer erases the progress of your tools?

Lawyers are a detriment to anything they touch. Letting them into software was the biggest mistake we ever made. We should have kept them away, the same way they are kept away from math.


I am feeling very greedy, but...

With all that intelligence, if GitHub Copilot can't yet produce an easy-to-use-and-manage full-stack framework with a distributed database built in, in either an existing programming language or perhaps a new one it creates itself, then it's not useful for me.


I'm against software patents for the most part.

Especially with algorithms.

I was rooting for Google when the JVM case happened, and I'm rooting for GitHub with Copilot.

And yes, there is source code from me on GitHub too — use it! I have used so much other code over the last 15 years.

Copyright on algorithms or basic code should be a no-go.


Open source has trained me and countless others. We have learned from it. Why shouldn’t machines learn from it too? Is co-pilot copy-pasting slabs of code verbatim?

I see Copilot as a net positive. Open source is for sharing and learning. Copilot is sharing and learning on steroids.


Yes, copilot IS copying code verbatim from GitHub hosted repos, right now.

The license and attribution are stripped from regurgitated copied code snippets. Verbatim with no context, no attribution, no citation, no reference to the project it’s part of…

If people don’t know which project the code was taken from, how can they one day contribute to that codebase?

Copilot is an interloper who doesn’t even tell you which project the code snippet was ripped off from!!


Oh god please no. GitHub Copilot is a wonderful technology. I am not taking anything away from you if Copilot suggests code that is similar or identical to your copyrighted code. You were not going to sell it to me anyway.

The following is supposed to be OK: somebody reads your GPLed code, learns abstract concepts from it, teaches it to me, I write code that uses the same algorithm. But it's not OK to abbreviate the process and reach the same result directly with Copilot. That is some Talmudic level reasoning. In a sane legal system, one would note that it is legal to do when jumping through pointless hoops, so it should be legal per se, and the system should be adjusted.

Copyright is increasingly at odds with technological development — not just since AI applications, but at least since Napster, or since floppy disks. Of course Matthew Butterick, as a lawyer, would disagree: "It is difficult to get a man to understand something, when his salary depends on his not understanding it."


> The following is supposed to be OK: somebody reads your GPLed code, learns abstract concepts from it, teaches it to me, I write code that uses the same algorithm. But it's not OK to abbreviate the process and reach the same result directly with Copilot.

The trouble is that this apparently is not what Copilot is always doing. If it had only "learned abstract concepts" from GPL'd (or any other form of copyright) code, then that would not be a problem, and of course that is kind-of what Copilot purports to be doing, supposedly learning the association between concepts described in comments and corresponding forms of implementation.

However, apparently Copilot is sometimes NOT generating its own code based on the concepts it has learned, but is instead just regurgitating chunks of potentially copyright-protected code verbatim. It'd be interesting to know if it is doing this deliberately (to maintain the coherence of what it is generating) or not - I guess the more of something it has already copied exactly, the more likely it is to continue copying, since that is the best "predict next word" continuation. Of course, while it would be interesting to learn more about the mechanics of Copilot, that doesn't change the legality, or not, of what it is doing, another aspect of which (although IANAL) is how much of the original work is being copied.

At the end of the day it shouldn't matter whether it's you or Copilot either learning from or copying someone else's code - exact same copyright protections apply.


Honestly, Github Copilot seems fine. It's just a tool that you're responsible for using responsibly. If I Google something, and copy and paste that, then Google is not responsible for my infringing. It's just "intelligent autocomplete".


Google search doesn't return random snippets of text without indicating their source.


I wonder what Dictionary companies thought about Autocomplete...


I’m really interested in seeing how this gets litigated. I imagine it will involve a lot of philosophical arguments about attribution and what the software is actually doing.

I’m also curious to see if/how Amazon CodeWhisperer takes advantage of this whole debacle.


Perhaps the only way out of this is to start suing the users of Copilot, much as some jurisdictions target the users of a product (e.g. drugs, prostitution) as a means to shut it down when the providers are too difficult or numerous to challenge effectively.


Sadly, I think this marks the beginning of a winner-takes-all economy fueled by AI.

Just imagine: in a lawsuit like this, OpenAI could use GPT-3 to generate eloquent court arguments, statistically confident that it can defeat human lawyers. It just comes down to TPU power.


I wonder what will happen when a company pays some overseas developers $50 for some code, they copy it from Copilot and it copies a bug from a US developer and that company gets hacked for $10 million.

Will the lawsuit fall on the overseas developer, US developer or Github?


No one? They'd probably stop doing business with the overseas developer and that's it.


Sorry to ask a shallow question. His "photo" is so interesting. It feels exactly like old school Wall Street Journal "photos" from 1990s. Is there a plug-in or service to create this type of image from a photograph?


It’s called a hedcut. WSJ built a generator in 2019, but it's only available to subscribers. [1]. There are artists that offer commissions, including at least one WSJ artist. [2]

1 - https://www.wsj.com/articles/whats-in-a-hedcut-depends-how-i...

2 - http://www.hedcut.com/


Is there a license that explicitly forbids corporations from ingesting my code and making a billion dollars off of my work for free? The AGPL? I've been using the MIT license for more than a decade, but it's time to change that.


Almost every FOSS license requires attribution, and Microsoft already seems perfectly happy to violate that, so I don't see why they'd be any less happy to violate whatever other license you'd come up with.


In this next episode of "[corporation name] seems like a cool corp but is revealed as Selfish and Malicious Inc.", we saw [corporation name] act selfishly and maliciously, as in every other episode. See you next time, kids.


> GitHub Copi-lot inves-ti-ga-tion

lol rarely see such aggressive use of soft hyphens in page titles


This investigation should not stop at GitHub Copilot; large language models trained on huge amounts of data should also be investigated, as I'm sure there are lots of problems to be found there.


Can someone clarify if copyright violation is actually considered "illegal"? As far as I know it's a civil matter and not something a state or federal government would attempt to prosecute.


It’s a crime in certain cases, as piracy site owners can attest.


But this has long been the deal. In order to offer their services gratis, Big Tech makes money on your data, which you've freely provided. Welcome to the last twenty years of the software economy?


I have a badge on GitHub showing that I am an Arctic Code Vault Contributor. Why can't Microsoft do something similar for Copilot training data contributors? That would at least be a start.


Part of me feels like this will help big tech and hurt potential startups that would compete in this space. Microsoft has the resources to make this issue “go away” while smaller challengers do not.


Maybe software engineers are worried they're being made redundant, but it is entirely fair for them not to allow their own work to be used to make them redundant without permission.


Uhm, strangely I get a Connection Reset error in the browser when I try to access the URL from the corporate network, but it works without problems from my phone


> the biggest concern of the decade is that some stupid autocomplete can violate your license which never existed in the first place



Crazy idea.. have the automatic code generator check if the code is too similar to a source it was trained on, and if so, automatically include attribution as well.

Ta da!


I have a badge on GitHub showing that I am an Arctic Code Vault Contributor. Why can't Microsoft do something similar for Copilot contributors?


Let us just work on cool technical things without having to worry about this kind of bullshit.

Knowledge data should be free to copy and do whatever we want with it


> Knowledge data should be free to copy and do whatever we want with it

I'm more of a copyleft fan. Feel free to copy my stuff, but you have to make it open source as well.


I'll scream this at the top of my lungs whenever I get the chance: If you attribute copyright to open source code you are a patent troll.


To me the whole point of open source is selfless giving and sharing. You build something and release the source code in case it's useful for whatever purpose people might have: learning, understanding, contributing, forking, copying, etc. And companies might build on it, train models from it, use it internally, who knows. Great. Other companies can do the same and compete. So can other open source projects.

For some reason when a company benefits from your work instead of some other entity that's bad? Please explain.


Because a company of Microsoft's magnitude will build a walled garden around their ecosystem over time? Haven't we seen this effect in play like a million times?


If by walled garden you mean that their service is better than competitors, that isn't necessarily a bad thing. Codex does nothing to lead to a walled garden, it is just providing a useful service and could be the spark of more competition.


ETA: honestly asking, this is just my heuristic, not something I've thought a lot about


could this be solved by MS brute-force shipping all the licenses (w/ references to their original projects) of all the repos they used to train Copilot, along with Copilot itself?

it wouldn't cover cases where people illegally copy pasted some code into their projects with dubious / not explicit licenses, but this is the same as using any open source project in general.


This is so stupid. I can’t believe how hostile this community has become toward some of the most inspiring new technology I’ve seen in a decade.


Soon machines will step up from being an aid to doing the creative work themselves and copyright will be an artefact of the past.


Would an opt-in system fix this? Where your code is only learned from if you opt into using Copilot to help you develop faster?


It seems a benchmark of a transformative technology is whether or not people attempt to use the legal system to stop it.


Copyright is the problem. The rest of this is just dancing around the legal framework built to support the bullshit.


I wouldn't be surprised if Microsoft lawyers didn't like the word "github" in the domain name...


I think the infringement that is relevant in practice comes from users of Copilot rather than from its authors.


As a joke, I made a webpage where you can do attribution to ALL GitHub repositories:

http://thanksforthecode.com

It scrolls past all the repos movie-credits-style. Doing it that way takes several days! It shows how abstract and absurd giving attribution to such a large body of works is.
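For a sense of scale, a back-of-envelope with assumed figures (tens of millions of public repos, a fast movie-credits scroll showing a couple hundred names per second) does land in the "several days" range:

```python
repos = 50_000_000        # assumed: order of magnitude of public GitHub repos
names_per_second = 200    # assumed: a very fast credits-style scroll

days = repos / names_per_second / 86_400   # 86,400 seconds per day
print(f"about {days:.1f} days of scrolling")  # → about 2.9 days of scrolling
```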


You're missing your <noscript> tag


Time for a new open source license specifically allowing fair use for machine learning?


This would for sure have effects on anything related to AI content generation.


It seems like Copilot is simply a search engine in this context. When I search GH or Google or <insert tool> I can get code snippets without seeing the license.

How is Copilot doing something fundamentally different?


Why can't Tim Davis (or another software author whose code is emitted verbatim by Copilot) demand that Microsoft take down Copilot, or at least the part of Copilot that contains his code?

Microsoft is distributing his software without a license, isn't it?


It could potentially be allowed under "fair use," which completely overrides any and all copyright claims if the conduct is found to be, indeed, "fair use."

Fair use in code is broader than just copying. For example, in Google v Oracle, APIs were found to be not copyrightable. Even if you copied the names of, say, 86,000 different functions in a proprietary library, you did not violate copyright.

Then comes the second problem. Let's say there is a function, say, `AddTwoNumbers(int a, int b)`. Just because John Fitzgerald in 1999 implemented that as `return a + b;` doesn't mean you can't too. There's a degree to which you can copy the code that made a function work, even if that code existed earlier. It's fuzzy but it is legally real.

Finally, there is your third problem, which is that you risk a "safe harbor"-esque judgement. Just because YouTube has occasional copyright-violating content doesn't make YouTube illegal. Similarly, the person suing here risks a finding that GitHub Copilot is legal as long as any occasional long proprietary code regurgitations are removed as needed.

If your code falls under the first two conditions, copyright be damned, license be damned, it's all irrelevant. See also Linux copying Unix.


There must be a line past which copying is no longer "fair use", otherwise no copyrights in code would be enforceable at all. I suppose it is up to a court to decide, but in the Tim Davis thread from yesterday it looked to me like Copilot was emitting entire, nontrivial functions verbatim.


How can I personally and proactively fight against this effort?


Is there an AI system for those dot woodcut prints, i.e. the ones in the WSJ?


I wonder if it emits stable diffusion samples? ;-)


The discussion has been "cleaned up" massively. All Copilot discussions are heavily manipulated.

I don't know why one can freely pile on, e.g., AirBNB here but Copilot is a sacred cow.


This right here is why we can’t have good things.


ITT: armchair lawyers go after GitHub


suing Github for this seems like a neat idea to make money on our open source projects


Big money here. Good luck


I find it hard to see a scenario where MS doesn’t get absolutely wrecked in court.


I feel "stealing your community" is lawyer hyperbole, but people also seem ok with what MS is doing with copilot, and I am not.

If you think what copilot is doing is ok, and there is nothing wrong with it, I'd love it if you could go through this small thought exercise, and see if it impacts your view at all:

Say you write a bunch of code, and release it under GPL. For the sake of argument imagine it is something complicated that you care about.

Now say another person is trying to do what your code does, and they find your code and say "excellent". They then copy and paste it into their project, and release their code under a BSD license instead.

Would you consider this theft of your IP? The law certainly would, and I think most devs would as well.

What would you say if they instead release "their" code as public domain?

Now we'll go a bit further. Another person is trying to solve this problem in some commercial software. They find your code, copy-paste it into their project, then sell their software and don't release the source, or even acknowledge you.

Would you consider _this_ theft? again the law would.

Now, what if instead they found your code through the invalid BSD relicense? or the invalid public domain one?

To me every one of these would be theft, and every one would be required to release the source of projects that made use of my GPL'd code, under the GPL. That is literally the whole point of the GPL.

But let's imagine a different route.

A person is writing some code and can't work out how to solve a problem, so they ask on StackOverflow. Now another person comes along and answers the question by copy-pasting from your project into SO. The first person says "yay!" and then copies that code, and we repeat the above scenarios.

In an even more extreme case, imagine both of the above people work at the same large company - so neither knows or is even aware of the other - how does this impact what is going on? It's two people, but fundamentally the company is copying the original GPL code into SO, then copying it from SO into its proprietary code.

I get that MS and GitHub try to position it as if copilot is "creating code", but it is simply doing a statistical code completion that is demonstrably happy to copy and paste from the original source into the recipient code. To my mind all it is doing is providing a mechanism to launder GPL (or whatever) code into your own without the license, by slapping "ML" and "AI" on the process and requiring more than 3 keys to be involved.


> Another person is trying to solve this problem in some commercial software. They find your code, copy-paste it into their project, then sell their software and don't release the source, or even acknowledge you.

Let's be honest, copy-and-pasting happens all the time. In software, in engineering, in marketing, in everything. Whether people acknowledge it or not.

Everyone looks at Stack Overflow all the time. You do it. I do it. Nobody reads the licensing terms. We all produce software with reskinned and taped together functions. A collage is still unique, creative, work despite being glued together with other people's art.

Most songwriters will write a song with a part like someone else's song.

The products you buy at a store are rip-offs of someone else's product.

Everybody stands on the shoulders of giants before them. Such is learning, such is life. Get over it.


> Let's be honest, copy-and-pasting happens all the time. In software, in engineering, in marketing, in everything

If I copied code without the rights to it into code I have written at any company I would absolutely be fired. It would not be up for debate.

> Everyone looks at Stack Overflow all the time. You do it. I do it.

I don't - it is very infrequently that I would look at SO answers

> Nobody reads the licensing terms.

Yes, they absolutely do, because again OSS or the GPL is meaningless if people are ignoring the license. Moreover, I would suggest you talk to your employer's legal and IP departments to let them know you're copying code you don't have rights to into their product.

> We all produce software with reskinned and taped together functions. A collage is still unique, creative, work despite being glued together with other people's art.

Wow.

Absolutely not.

Competent engineers know how to write code themselves, they aren't copy-pasting their way to a solution. That's why you get paid a lot - if I was happy with copy pasta solutions I would hire a bunch of badly performing uni students.

> Most songwriters will write a song with a part like someone else's song.

If two people write similar songs that does not mean one copied the other. If one person copies part of another person's song they will end up in court, and they will lose all the revenue from the entire work.

> The products you buy at a store are rip-offs of someone else's product.

The ripoffs that stay on the market are not made by copying the entire implementation.

> Everybody stands on the shoulders of giants before them. Such is learning, such is life.

I didn't say anything at all to imply that we didn't. What I said was you don't get to just copy other people's work and pass it off as your own.

> Get over it.

Just because you apparently can't actually write new code yourself doesn't mean that that applies to other people.

Also, you should really get your employer to tell you whether your proposal of ignoring copyright is ok.

What you are saying is that if I google some problem, and copy some code from, say, Gecko, or Linux, or GCC, etc. into my proprietary closed source product, that's perfectly ok, and those silly open source people should keep it to themselves if they don't want me doing so.

But there's also a difference between "I searched for this and copied the code I found" and "I typed some letters


Daaaamn, this post has been at the top of my feed almost since it was published.


TL;DR: GitHub (Microsoft) declared that "training [machine-learning] systems on public data is fair use". When asked for the relevant jurisprudence to support its position, it could not provide any.


Great article!


Training AI on copyrighted works is literally what Google always did.

Look how Google News enraged news orgs.

Now they come for the programmers. So now it’s a problem.


Fair use is about more than just the size of the excerpt, and even open source software still has a copyright and terms.

If you write an article about good writing, and quote a choice paragraph from someone else's work to show an example, and credit that quote, that is fair use.

Is it fair use if you read an awesome paragraph, something that really is the result of the author's unique intellect and effort and craftsmanship, and makes you think "damn", and then drop that same jewel into your book?

You can probably get away with it, because you probably just won't be able to convince a judge that any single paragraph is that big of a theft.

But I don't mean to ask if you can get away with it, I mean to ask if it should be considered fine honorable behavior.

The difference is, the paragraph isn't being included for examination or comment or transformation; it's being included to directly copy and perform its original function as part of what makes a work a great work, and it's not being credited in any bibliography or footnotes or directly.

The reader reads the paragraph and is impressed by your deep insight, which you never had, and the original author did.

How about if your new book has many such uncredited snips from other authors, such that your new work is denser and richer than any of the other individual authors?

This is what copilot is doing, or rather it's facilitating people doing it, as far as I can tell.

The original snippets are functional, not there for examination, copied verbatim, not transformed (sometimes), and not credited.

Most of it comes from open source works anyway and most authors would probably be fine with it if the stuff was simply credited.

I think as a tool, in the context of software vs literature, the tool is probably more good than bad for everyone as a whole. It probably results in the generation of more, and more correct software. Since software is more like a machine than a novel, it benefits all of humanity when machines work well.

But it needs to somehow credit the original authors, or if that's not possible then users do not get to claim credit for any work it was used on. Or, they can only claim a sort of tainted credit.

Maybe it needs a combination of policies that together make a fair system. One element would be, the training set must be composed of strictly open source software (pick some definition). Then another element would be, any work that uses it is tagged as such. You only get to say "I wrote this, with copilot," not merely "I wrote this". And any work that uses it is itself GPL. The individual snips maybe don't have to be credited, because the theory will be that the training set as a whole was credited, and those are all available somewhere. You as a contributor won't get credit for being in someone's mp3 transcoder app, but that app WILL declare that it used the training set, and the training set WILL declare all of your material that is in it.

Maybe there can be a special version that only includes code where the original terms did not require anything at all, not even preserving the authors name or the license that says it's free, and that version's output can be used without credit.

If proprietary software wants to benefit from a tool like that, they can pay for licenses from other proprietary software developers to include their software in their ai's training set, just like with normal software licensing for inclusion and re-sale in a new product.

But right now, as copilot currently exists, as far as I can tell it's blowing past and ignoring ANY considerations like that and Github are simply outlaws.


[deleted]


All class actions are a mix of both


It's hilarious how when I express displeasure about AI image generators looking likely to take a huge bite out of my profession of "artist" and playing extremely fast and loose with fair use, I get told that it's completely inevitable now and I should either retrain as a prompt engineer or go join the buggy whip manufacturers, but now that this is clearly violating programmer copyrights, you folks are starting to get angry.

I'll just leave y'all with my favorite of the things you keep telling me to STFU about art AI with: If you're the kind of programmer who feels threatened by this, then you're not a real programmer.


You're absolutely right.

Copilot and Dall-E (and so on) are all bad in the same way.

Many of us agree with you.


I'll admit it took me longer to connect the dots on this one but when I was tinkering with an image generator and it gave me a clear istockphoto watermark, I knew something was amiss.


Unless the image generators routinely generate specific works produced by you (or other artists) then it’s not a directly comparable situation to Copilot.


Like this? https://news.ycombinator.com/item?id=32573523

> I just got a Dall-E render with a very intact "gettyimages" watermark on it.


Yes, this would be a good example of genuine copyright infringement that shouldn't be tolerated.

Of course, it doesn't mean that all or even most DALL-E output infringes on someone's copyright. The same is true for Copilot. I think both have many legitimate uses if and when the "copyright laundering" issue is solved.


I'm a programmer, I don't really feel threatened by Github copilot. If the world can produce code more cheaply, the world becomes a much better place for everyone and a little worse place for developers.

Overall seems like a good trade. (for the world)


You are totally correct. I am embarrassed by programmers complaining about this.


Why would anyone want to stop Copilot is beyond me.

Reinventing the wheel, millions of times a day, is an atrocity.

Millions of (wo)man hours, wasted, every single day, on writing solutions to problems that have already been solved. There is a partial solution to this, and it's making people angry, it's crazy.

If you put your code publicly on the internet, you should expect that people will reuse your code at some point; no one broke into your private repositories.

Why would anyone waste their time to make other people waste more of their time is really beyond me.

Let go of your egos for once.


You want to use my code, without ever knowing I wrote it? You want to use my hard work, regurgitated anonymously, stripped of all credit, stripped of all attribution, stripped of all identity and ancestry and citation? FUCK YOU!

Training must be opt in, not opt out.

Every artist, every creative individual, must EXPLICITLY OPT IN to having their hard work regurgitated anonymously by Copilot or Dall-E or whatever.

If you want to donate your code or your painting or your music so it can easily be "written" or "painted", in whole or in part, by everyone else, without attribution, then go ahead and opt in.

But if they don't EXPLICITLY OPT IN, you can't use the artist's or author's creative work for training.

All these code/art washing systems, that absorb and mix and regurgitate the hard work of creative people must be strictly opt in.


Every human is using the hard work of other humans down through the entirety of history, mostly without credit or attribution. None of us exists in a vacuum and we are all copying each other constantly.

Should students need to attribute the copyrighted textbooks and lessons that they learned from for all their future work?

Should artists attribute every reference they've used? Even if they draw stick figures based on the reference? Even if they only use small parts from multiple references?

What's different from a machine learning something and a human learning it?

I think in terms of practical open source/permissive licenses it makes the most sense for new licenses to be made that include no-training clauses for the rights holders that dislike machine learning.

Dall-E's use of training on non-permissive copyrighted web-scraped data seems more complicated and I imagine there will eventually be lawsuits to figure that out.


I just don't understand this at all. I publish my code as open source when I can because I want others to find it useful, either by using the software that I wrote or by reusing the code. If I didn't want that, I wouldn't publish the code. But I do want it, so I'm glad there's a way for people to access it more easily.

I understand the argument from an artist's perspective much more, since they don't really have the option to publish their work in a way that any AI or any other artist can't copy off of.


Simply being public doesn't mean it's in the public domain - this applies to movies, art, code, etc.

One example of restrictive but public licenses include requiring others to share their source code if it's derived from yours, allowing individuals to use a product but not allowing business to use it (businesses can use it under a different - likely paid for license), or requiring attribution or acknowledgement that they used your code.

There is an argument for fair use if it counts as a substantial derivative, which is a different discussion from why people make it publicly viewable without making it flat out public domain.


That's great for you. I hope you choose a license and copyright terms that enable this specific vision.

The vast majority of open source licenses and copyright terms specifically stipulate the legal requirements for reproducing even just parts of the code. Which at a minimum require reproducing the license and copyright with all software including the licensed and copyrighted code.


Do you place your published code in public domain or use something like CC0? Or do you use a license with some strings (e.g. attribution) attached?



Is your code public?


You’re missing the point. It’s not an ego problem: if you put your code on the internet with a license you should expect people to respect the license’s rules…


I think it's a gray area in the license. Much of the code was intended to be used freely and commercially by others, but not for AI training. It follows the license to the letter, but not the intent.

I expect we'll see new licenses appear making it clear whether or not the content can be used for training.


Who's to say the intent? I've published lots of code with very permissive licenses and I did so because I want people to be able to use that code for any reason. That's why I choose those licenses.


I think that's exactly why AI training (allowed or not) should be added to licenses.


There's nothing gray about it. The license requires attribution, and Copilot doesn't provide that attribution.


It's reading the code and generating similar code, not copying it.


Are you saying that's fair use? If so, then we won't see new licenses appear related to it, since a license can only give you more permissions on top of fair use, not take away fair use. If not, then we still won't see new licenses appear related to it, since the existing licenses already don't allow it.


Good point. I'm not a lawyer, but looking it up, the factors for fair use are:

1. the purpose and character of the use;

2. the nature of the copyrighted work;

3. the amount and substantiality of the portion used;

4. the effect of the use upon the potential market for the original work.

All of these are quite debatable, and I'll leave it to someone more familiar with the law.

Though if it's not, I believe there are licenses that allow derivative uses of code and licenses that don't. For many of these, the intention is that they create more code, but not be used to fuel AI behemoths.


Not everyone believes in intellectual property and good luck enforcing that license worldwide.


It doesn't matter what you believe. It matters what the judge and jury say when this goes to trial, and it will go to trial because Microsoft has a lot of money.


So? Most of the developed world have legal systems that does believe in intellectual property. The fact that a few people "don't believe in intellectual property" because they want to torrent movies/games is mostly irrelevant when it comes to the software engineering profession.


Except you're not licensing functions, you're licensing a repository. If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.


> If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

Note that this may not actually be true, and you may need to pay to license even shorter excerpts of creative work. Copyright is a complex topic. It's not always safe to assume that you have the rights you think you have, in terms of reproducing others' work.

For example: "The proportion of a total work is not the only factor, though. If you are including the most crucial aspect of a work, even if it is only a small part, then the question of “substantiality” comes into play." [1]

[1] https://www.dukeupress.edu/getmedia/3363cb6e-04b6-43ec-b004-...


It can be if you fail to give attribution. Plagiarism isn't just unethical, unprofessional and immoral (not to mention evidence that the plagiarist is an uncreative dullard). It's illegal. How many words or sentences it takes to trigger a complaint is mostly governed by what it takes to prove a violation. The more material copied, the easier that can be. In this situation providing attribution (tooltip when you mouse over the code?) would probably satisfy 9/10 of potential complaints. But big companies usually won't make that kind of minimal effort without being hit upside the metaphorical head with a piece of metaphorical lumber (like with an actual lawsuit).


If you take a function from a repository (or a sentence from a book), it is the unlicensed use of copyrighted material. Everything in the repository is covered by the license, functions, files… everything.

Whether or not it is infringement depends on if the use can be considered fair use. This is a more nuanced question and is not always clear.

In this case (Copilot) the real question is how transformative the AI training is. Given how verbatim some of the outputs are makes the argument less clear.


>If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

I'm assuming you're referring to fair use. In that case whether it's copyright infringement or not is very situational (the legal standard consists of a test with various subjective factors) and isn't as simple as "it's less than a paragraph so I can copy whatever I want".


> If I use a sentence or even a paragraph from a copyrighted book, it's not copyright infringement.

It is…


then why do people put copyright and license notices on the top of every file in the repo?


> Reinventing the wheel, millions of time a day, is an atrocity.

> Millions of (wo)man hours, wasted, every single day, on writing solutions to problems that have already been solved. There is a partial solution to this, and it's making people angry, it's crazy.

Following this line of thought, do you think that all code from all software should be open source and publicly available (and free to copy and use), in the interest of saving more person hours from reinventing the wheel?


> Following this line of thought, do you think that all code from all software should be open source and publicly available

Let's help shape this thought: Copyright should be abolished entirely. It is one of many monetization schemes and its negative effects greatly outweigh its positives.

We know people won't stop writing software in the absence of copyright. We know they won't stop writing books, singing songs, etc. Copyright is not the primary motivator for either science or art.

Will we need new monetization structures? Of course. But generally speaking we already have them where it matters.

End copyright entirely.


> Let's help shape this thought: Copyright should be abolished entirely. It is one of many monetization schemes and its negative effects greatly outweigh its positives.

Even if you're right in principle (and I would love new monetization structures), this will never happen in reality.

Meanwhile, this idealism will get applied asymmetrically in the real world. If you (or the comment I was replying to) say "Copilot is fine, all code should be publicly available anyway", it downplays the fact that this wish will never happen with big players like Microsoft and will only happen with little players like anyone who used Github to host their code. The big player will typically hide their code behind copyright and lawyers to enforce it, whereas the little players have no similar recourse.

So, I see the issue as an exploitation, as Microsoft is selling a product built on the little players and not the big players. The debate around whether copyright should exist at all, while interesting, is not that relevant to most of the concerns being aired in the context of Copilot.


Yes, I agree the asymmetry needs to be addressed and the rules need to be enforced as they currently stand.

> this will never happen in reality.

Don't be so sure. These kinds of changes start with education.


Agreed, I was a bit pessimistic when I wrote that. Amended: "will not happen soon enough".


> Copyright is not the primary motivator for either science or art

Having a way to own works is probably pretty important to either of those endeavors, right?


No, I really don't think that it is.

There is no such thing as "owning" a work. We use that as a euphemism for owning copyrights, and the only function of a copyright is to prevent others from making copies. To prevent others from sharing.

The question is whether the monetization model presented by copyright is a net positive for the author, after accounting for its chilling effect on communications for all other people in the world.

The answer is almost certainly "no," as empirically demonstrated by entire segments of IP work opting out of copyright. The open source model clearly demonstrates that you do not need to own a work to fund it or monetize it. There are similar models in other areas of art and science which allow for the funding of works without preventing others from copying them.


> Following this line of thought, do you think that all code from all software should be open source and publicly available

Why are you asking this as if the answer might be no?


It says why in the linked post. People aren't doing open source for free; they do it for the community. But Copilot is there to extract value from it, giving nothing back, not even credit.


Can't open-source programmers improve their own open-source code with Copilot? Do the inherent improvements that Copilot offers just not apply to people who write open-source code?

I understand that there is a balance, but as an open-source advocate who would love better tools to make their open-source projects better I'm lost as to why this point doesn't counter the "giving nothing back" we hear so often.


Since there's no way to know how code generated by Copilot might be licensed without expensive code-scanning tools, I don't think OSS can safely derive any substantial improvements from it.


If that's the only issue, I can't see the difference when I search for something on the web, copy the code and paste into my solution. There's no attribution, there's no giving back, nothing. Because I'm the community that you are saying the code is supposed to benefit.


Yeah, that sounds like a good definition of someone who is not part of the community. I copy code off Stack Overflow too, but often provide attribution in a comment. But like piracy, it's easier to hunt the whales than the small offenders.

Stack Overflow facilitates the same thing too, so it's an interesting comparison, but SO makes attribution easy and clear, and it actually made it effortless to contribute back.


No, you're not. The community reads and respects the attached licensing.


You plagiarize code as a professional?


Like everything in life. It's all about extracting value from someone else who has no control over the exploitation. You only notice when you are the one being exploited, though.

As long as some company can improve its bottom line it’s all good though


That's... the exact opposite of a community. Communities are about contributing whatever you can, and taking what you need. There's more joy in giving than taking. Exploitation happens when someone is taking advantage of that tendency to give.

Eventually someone comes in and takes everything that isn't nailed down and then sells it, and that becomes the problem.


No one wants to stop what Copilot is being sold as. They want to stop the company selling it from doing what it is famous for doing.


Copilot is not selling code, they're selling GPU time. If you're ready to buy a hundred GPU/TPUs to train a new Copilot that is just as good, but for free, then go do it please, everyone will thank you


The tagline is "Your AI pair programmer"

That's the pitch. You're renting a pair programmer.

If your pair programmer is stealing code, you're going to have a bad time. This has nothing to do with...whatever you're on about.

Seriously though. Did you click the wrong reply link?


So you are saying copyright does not apply to Spotify when it comes to music, because they aren't selling the music but rather a service to play said music from a catalog?

Also to your other comment about copyright not being an issue if you just use a paragraph from a book - I am not a lawyer but I would think that copyright applies just the same way it applies to musicians who use portions of the melody of other musicians’ songs.


You're asking for people to be okay with potential copyright violations and a removal of attribution because of the common need. Like all things, there must be balance. Open source would not exist if the only use of its output was to train ML models that hide where the code comes from. Part of the allure of open source--maybe the biggest allure, honestly--is the community aspect. I get to find friends, contribute philanthropically, and feel proud of my contributions. Copilot removes any incentive I have to produce code for free.


it's not copyright violation. no one reads...

https://docs.github.com/en/site-policy/github-terms/github-t...

when you put code on github.com you grant GitHub the right to show that code to others, independent of the license you choose for your code. full stop. doesn't matter if it's on a webpage, a git client, or a github-developed plugin to an IDE.


So this doesn't negate the license. Microsoft cannot just roll the code into windows for example, closed source and proprietary. They have to abide by the license, regardless what their ToS says.

Here's a fun way to see it: suppose someone writes code licensed GPL. I take it, fork it, modify a line in it or not, and also license it GPL because I have to by law. I put it on my github account and what, I now just gave Microsoft rights to the code I don't even have? So by putting it on github I'm violating a license? It doesn't add up. The license to the code is the license to the code, no matter what site it's on and no matter what any ToS says. Otherwise what's to stop me from putting a ToS on my personal website pertaining to your use of my eyeballs that says "if your creation becomes viewable by my eyeballs in any way I can use it however I want, publishing your work in such a way that it can be viewed by my eyeballs is consent to this ToS"?


> So by putting it on github I'm violating a license?

yes. if you don't have the rights to upload code to github.com, including all of the rights required of one that uploads that code to github.com, and you do so anyway, then you are in violation of the GitHub terms of service.

fortunately for you, the GPL allows what you are describing: "1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty;..."


While that certainly covers some code on GitHub, much of the code on there is just mirrored from other locations by non-owners: you can find copies of the Linux kernel and SQLite on GitHub, for instance. The users who upload those to GitHub have the right to do so (legally) but do not have any rights that they could grant to GitHub.


again, read the document. all this talk of license violation and almost no one is reading the agreements which say what rights users have given GitHub...

by uploading code you attest that you have the rights necessary to grant that license to GitHub: https://docs.github.com/en/site-policy/github-terms/github-t...

without the right to grant those licenses to GitHub, by uploading that code to GitHub, you are in violation of the terms of service, and the responsibility of acting in compliance with the license is on the shoulders of the user which uploaded that code to github.com.

Said another way, GitHub has no way to know if the person mirroring SQLite (for example) is acting in accordance with their rights, so the terms of service require that you attest that you are acting within your rights, acknowledge that it is solely your responsibility if you are not, and that by uploading you grant license to GitHub and its users.


So I can't fork code on github legally according to the ToS?


Read the terms of use yourself. it's all there.

the right to allow forking is granted by a user who uploads their code to github.com to other users of github.com. those rights are listed here: https://docs.github.com/en/site-policy/github-terms/github-t...


Nice edit.

When you upload code to github you give other people the right to fork it... I knew that already. But you license it. You don't give anyone the right to fork it and not abide by the license. So if I fork it, I'm still giving Microsoft rights I don't have: I'm giving them the right to violate the license. That makes it illegal for me to fork it.

Let's say I am on a git mailing list, following a project, and I upload that project to github one day. It's licensed GPL. Microsoft says I give them the right to violate the license, and in uploading it I implicitly attest that I have the right to do so. I've violated the license? It's illegal for me to upload the code, with the license, to github, because Microsoft demands rights I don't have to give? Then let's say someone else forks it. They've now also violated the law?

It's nonsensical. The license is the binding ToS here, period; it doesn't matter what Microsoft's lawyers argue. Everything else is secondary.


> When you upload code to github you give other people the right to fork it... I knew that already. But you license it. You don't give anyone the right to fork it and not abide by the license.

you are talking multiple separate things here.

when I upload code to github.com I attest that I have the rights required to do so, and the rights required to grant GitHub the licenses I've agreed to grant it by uploading.

> You don't give anyone the right to fork it and not abide by the license.

correct, you can't grant a right to violate the rights granted. users of the code hold the responsibility of acting in accordance with the license.

> So if I fork it, I'm still giving Microsoft rights I don't have, I'm giving them the right to violate the license. That makes it illegal for me to fork it.

no. you did not upload code that you forked from a GitHub.com repository. if you are talking about uploading code that you copied somewhere else, and you're calling that a fork, you have violated the terms by uploading code that you do not have rights to upload. remember, by uploading code to github.com you attest that you have the rights required to do so, according to the terms of service. if you lie, you are responsible for that lie and its consequences.

> Microsoft says I give them the right to violate the license

your premise in this part is flawed. see above.

> Microsoft demands rights I don't have [the right] to give?

by uploading to GitHub.com you attest that you have the ability to grant those rights. If you lied, and you don't have those rights, that's your responsibility and your ass if a lawsuit comes around because of it.

perfectly sensible to me. GitHub gets to say that they require users to grant the rights in order to upload, and that the users necessarily had the rights to give to GitHub. if a user lied, that is not GitHub's fault; the user entered into a legal agreement saying they had the rights needed.


So you've got nothing? Because I'm seriously asking that.

Read the GPL.


If an AI model is allowed to emit copyleft code verbatim in proprietary software, you can effectively create a GPLv3 stripper. I don't think that ultimately serves your goal of intellectual sharing.


I don't want to stop copilot. I put my code publicly on the internet precisely so people can use it, and not just as a user either, they can repurpose it, incorporate it into their software, whatever they want to do. It's called free software for a reason, and i mean it when I say it.

But they have to abide by my fucking license.


If the copyright holders are so difficult, why not restrict the scanning to code of enlightened people like yourself?

My guess is that there wouldn't be much to scan ...


Never forget this is how people who dare to reverse engineer Windows are treated: https://www.theregister.com/2019/07/03/reactos_windows_resea... https://marc.info/?l=ros-dev&m=118775346131654&w=2

I don't use Github, but fuckers upload my code there anyway.

Copyright is evil, but only large corporations having copyright, even more than they already do, is even worse.


This I feel like is one of the better points in the thread.

The asymmetry in copyright law, where large corporations can enforce their copyright to the point of breaking the law themselves (YouTube's Content ID is another non-legal, but still very impactful example), is absolute bullshit.

Unfortunately, I think that if training ML models on Internet data is found not to be fair use, then things will get harder for individuals training models, while corporations will be barely inconvenienced, as they can afford to pay for sources, make deals with other large institutions for data, etc.


They're treated with an email to the mailing list? Was there C&D or lawsuit? You make it sound like the ReactOS devs were thrown into prison.


Either the user-base of HN suddenly became a bunch of unethical folks who don't CARE about copyrights, usage licenses, authorship, or the future of open-source projects,

OR

This place is currently crawling with Micro$oft employees who have been instructed to swamp the place with disingenuous comments basically amounting to:

1) "fair use" is anything I want it to be

2) gimme your code NOW, because I want it, and it's MINE

3) get used to habitual violation of licenses as the new normal

4) you are ruining progress! harming kittens!

I can't see the actual HN crowd all suddenly being copilot users and fans, so that leaves me to conclude the latter.

I find Microsoft's continual business model of evil rather threatening and annoying, and they need to be checked, as they have only gotten worse with the decades. They abuse their market position to stifle any and all tech innovation. Break them up already.


I'm more worried about the status of freedom in software, open source feels like a mirage to divert the attention away from the original issues from the FSF.


One way to fix the problem would be to somehow feed Copilot a corpus of closed source code. This would either force Microsoft to add the necessary copyright protections, or (which is imho more likely) would prove that those protections are already in place, but disabled for open source code.

A good start would be to take leaked Windows code, mechanically adjust all the names, constant values, and code formatting, and then publish it and observe.
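The "mechanically adjust all the names" step is easy enough to sketch. Here's a toy Python illustration using the standard tokenize module; it only renames identifiers and ignores strings, attributes, scoping, and constant values, so treat it as a thought-experiment aid rather than a workable laundering tool:

```python
import io
import keyword
import tokenize

def rename_identifiers(source, prefix="v"):
    """Toy obfuscator: replace every non-keyword identifier with a
    generated name, leaving program structure intact."""
    mapping = {}
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        tok_str = tok.string
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok_str):
            # Reuse the same replacement for repeated identifiers
            tok_str = mapping.setdefault(tok_str, f"{prefix}{len(mapping)}")
        # Emit (type, string) pairs; untokenize's compatibility mode
        # rebuilds valid (if loosely spaced) source from them.
        result.append((tok.type, tok_str))
    return tokenize.untokenize(result)

print(rename_identifiers("def add(a, b):\n    return a + b\n"))
```

The renamed code still runs identically, which is exactly why a naive text-diff wouldn't catch the copying.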


> Microsoft characterizes the output of Copilot as a series of code "suggestions". Microsoft "does not claim any rights" in these suggestions. But neither does Microsoft make any guarantees about the correctness, security, or extenuating intellectual-property entanglements of the code so produced. Once you accept a Copilot suggestion, all that becomes your problem:

> "You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn’t write yourself. These precautions include rigorous testing, intellectual property scanning, and tracking for security vulnerabilities."

I can't help but recall:

"Linux is a cancer that attaches itself in an intellectual property sense to everything it touches."

- Steve Ballmer, while CEO of Microsoft


> intellectual property scanning

With "normal" code I can generally see (or figure out) who posted/published it and reach out for explicit permission. It's not uncommon for me to do this.

How is one supposed to do that for the generated stuff? Seems like an awfully hands-off attitude. As challenging as it is, they really ought to be qualifying the input samples of training code before ingesting them.


There are some techniques used mostly to detect when students copy-paste code. I've seen some of the tools in that space and they have varying degrees of accuracy. MOSS is a common one[0].

There are some vendors in this space too (BlackDuck comes to mind) but they're $$$ so only within the scope of large corporations.

If anybody has any ideas relating to this type of analysis, I'd be excited to chat. I am working on a project[1] in this space for "Software Composition Analysis" which could potentially overlap with snippet detection for code like Co-Pilot. (We basically just have a big pipeline of analysis jobs that run on code and store the results. I need to update the docs!)

0: https://yangdanny97.github.io/blog/2019/05/03/MOSS

1: https://github.com/lunasec-io/lunasec/tree/master/lunatrace
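For the curious, the core idea behind MOSS-style snippet detection (winnowing over k-gram hashes, from the original MOSS paper) fits in a few lines. This is a toy sketch, not MOSS itself: real tools also tokenize and normalize identifiers, so simple renames don't defeat them, whereas this version only ignores whitespace.

```python
import hashlib

def fingerprints(text, k=5, window=4):
    """Winnowing: hash all k-grams of the normalized text, then keep
    the minimum hash from each sliding window of hashes. Shared
    fingerprints between two files suggest copied text."""
    s = "".join(text.split())  # ignore whitespace/formatting changes
    hashes = [
        int(hashlib.sha1(s[i:i + k].encode()).hexdigest(), 16)
        for i in range(len(s) - k + 1)
    ]
    return {
        min(hashes[i:i + window])
        for i in range(len(hashes) - window + 1)
    }

def similarity(a, b):
    """Jaccard similarity of the two fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / max(len(fa | fb), 1)

original = "for i in range(10): total += values[i]"
verbatim = "for i in range(10):\n    total += values[i]"
print(similarity(original, verbatim))  # whitespace-only change -> 1.0
```

Keeping only a sample of hashes (rather than all of them) is what makes this scale to large corpora, which is also why exhaustive scanning of Copilot output against all of GitHub is expensive but not impossible in principle.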


I don't think it's right to characterize it as hands off after they had their hands all up in the generated code. It's just malfeasant. They've produced a tool that is fundamentally (legally) unsafe to use and said that's not their problem.


Could you help me understand the link between the two?


It isn't so much a connection as an example of cognitive dissonance from the organisation.

On the one hand stating plainly that mixing in copy-left code and similar can be disastrously dangerous because it is a rampant virus. On the other hand not understanding why people think it might be a problem that their tool could encourage mixing in copy-left code.


Microsoft released a product which gives you cancer the moment you use it.

At least according to the ex-CEO of the company's own opinion of what including open source code in your projects does. That seems a far-fetched conclusion, but then, Ballmer did say it.


The point is not clear, but if I were to guess, it's that GitHub Copilot should come with a California Prop 65 warning, because it can give your code "cancer" (GPL-licensed snippets from sources like the Linux codebase).


Linux is open source, and Ballmer's quote displays the same negative attitude from Microsoft towards open source that the author's arguments about Copilot point to.


> rigorous testing, intellectual property scanning, and tracking for security vulnerabilities

Seems like a best-practice recommendation that everyone should apply when downloading a torrent.


Seems like we need MITpilot.


> Steve Ballmer

They have some really good blow in Redmond.

If anybody could win an award for being coked up and sweaty on stage...

https://www.youtube.com/watch?v=Vhh_GeBPOhs


Fun story: That was my first employee town hall, in 2000. I was concerned for the fellow (and so very glad when he left, Satya has been so so so much better for the company and morale). It was definitely an... interesting introduction to the company.

See also this Domo video that turned it into a song. :) https://www.youtube.com/watch?v=f7ZDH45OAt8


At the time, I was doing Linux, OpenBSD and FreeBSD stuff in Bellingham. The reaction from the local and regional non-Microsoft community was really like "Holy shit what is going on down there?!"


Copilot is great and this is a waste of time.


Being a useful tool doesn't make it legal.


Technical progress takes precedence over pitiful intellectual property discussions. If you don't believe that, I'm not sure what you're doing in a community like this.


> Technical progress takes precedence over pitiful intellectual property discussions.

Let's pretend for a moment that your value judgement is reasonable and the advancement of technology should reign supreme over minor things like the rule of law. Do you really think that letting people ignore copyright is always good for technical progress? Say, letting people use GPL code in proprietary software that they then refuse to share with others? Because that sounds questionable even if we agree with your casual disregard for the law.

> If you don't believe that, I am not sure what you are doing in a community like this.

Being interested in tech without being a fan of breaking the law and running roughshod over other people's work.


> Do you really think that letting people ignore copyright is always good for technical progress?

Only in very, very, very, very specific circumstances would I say it is not good. And they involve thinking about the counterfactual: "would this thing have been created if there weren't intellectual property rights in place?" Code doesn't pass this test, because people enjoy writing and sharing code. Pharmaceuticals, maybe.


You are not the authority on what this (or any other) community is about. Fetishizing "progress" (whatever you think this means) is next level idiocy.


You guys are trying to halt the very thing that will allow us to generate intelligent agents and hugely boost human productivity and I am the idiot.


"Generating intelligent agents" is not a worthwhile (or safe, in any meaning of the latter) goal.

I don't want anybody being able to generate "intelligent agents" and will support all legal changes likely to slow this down or halt it.


Being a hacker?


> No match for domain "GITHUBCOPILOTCLASSACTIONLAWSUITSETTLEMENT.COM".

> Last update of whois database: 2022-10-17T23:07:12Z <<<

Just sayin'...


Sad to see people trying to make copilot illegal

Using it is exactly like using Google. Google scrapes the internet and trains a model that gives you results for search queries on their website. The results may be copyright protected

Copilot scraped the internet to train a model that gives you results for code snippets in your code editor. The results may be copyright protected


Surely the difference is that if you find something via Google/search, you can then read any copyright notice to determine whether the code is OK for your use case (no joke if you're a corporate developer). If you're using Copilot to "generate" (but sometimes regurgitate?) code, then AFAIK Copilot doesn't show you the copyright notices, or any indication of whether what it is giving you is a "fair use" derivative or an exact-copy regurgitation that violates copyright.


MS needs to give up and terminate Copilot.

The potential legal issues are there, but that's not why Copilot should die.

Copilot should die for any (or a combination of all) these reasons (and more which I don't mention):

- the operator has to already understand the emitted code to be able to determine if it is what is needed, or to modify it if it is close but not quite right

- the operator may have a false sense of capability, leading to bugs and other problems that would appear later (in production?)

- wrong suggestions are a distraction from the careful mental structures which one maintains while writing software

- any problem that Copilot can solve with guaranteed correctness is probably trivial or already met by a (battle tested) library

Forgive the analogy, but effective automated code generation is like autonomous driving systems. Anything less than 100% accuracy is a risk, and in these examples risk of incorrect behavior is not acceptable.

Copilot seems like a pointy-haired boss fantasy where they can hire only junior programmers and expect successful software products.



