This might be overreacting but is there a way to opt-out of Copilot using your code in open source repos?
It feels morally wrong to me that I can spend thousands of hours working on projects of my own free will but then a company can sell the code I wrote to others in the form of snippet completion as a service. In fact they end up selling your code back to yourself if you plan to use the service.
If the answer is no, that moves the needle pretty far in the direction where I'd at least consider the idea of moving all of my repos to GitLab. I don't care much about stars or popularity. I open source things that are interesting and useful to me and if other folks want to use them they can, but I don't gain motivation from others using the projects I release. I like GitHub and its UI and it's no doubt "the spot" for open source, but selling code written by others rubs me the wrong way a lot. It stinks because it also means no longer contributing to other code bases too. It's moving us in the opposite direction of what open source is about.
This is a really good point that I hadn't considered before. It's Facebook all over again: selling your own content back to you. Repo owners should at least be compensated when their code gets used. That would be an incredible market.
I for one welcome our new CEO (Copilot Engine Optimization) overlords.
Jokes aside that will likely cause GitHub to be filled with lots of low quality repos (even AI generated, oh the irony!), to trick Copilot into using their code.
It is like the search engines vs. SEO arms race. The hope is that such proliferation can be managed by disincentivizing the abuse. The reality might be vastly different, because code has more regularity than natural-language text and so gives an AI a better chance of emulating humans.
It should be automatic based on license. GPL code definitely shouldn't be included but MIT could be. They already have this information in most repositories, and if it's missing they have no right to use the code at all. We don't need extra options; the licenses already restrict use and derivative works.
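A rough sketch of what that kind of license-based filtering could look like, in Python. The repository records, the `license` field holding an SPDX identifier, and the one-entry allowlist are all hypothetical; nothing here reflects how GitHub actually selects training data.

```python
# Hypothetical allowlist of SPDX license identifiers considered acceptable.
# Which licenses belong here is exactly what the thread is arguing about.
ALLOWED_LICENSES = {"MIT"}


def eligible_for_training(repo: dict) -> bool:
    """Return True only if the repo declares a license on the allowlist.

    A missing license grants no permission at all, so it is excluded.
    """
    license_id = repo.get("license")
    return license_id is not None and license_id in ALLOWED_LICENSES


# Made-up repository metadata for illustration.
repos = [
    {"name": "a", "license": "MIT"},
    {"name": "b", "license": "GPL-3.0-only"},
    {"name": "c", "license": None},
]

selected = [r["name"] for r in repos if eligible_for_training(r)]
print(selected)
```

This only decides which repositories go in; it does nothing about whatever obligations the allowed license itself carries.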
Not without the text of the license. I, as a developer, cannot just poach open source code under MIT without including the copyright and terms from the original project. From the license:
"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."
They might argue that a snippet isn't a "substantial portion" of "the Software", and they're only charging for the service not the content - regardless, I don't like it, this is exactly what certain licenses attempt to prevent.
I would argue that substantial shouldn't be measured in lines of code, it should be measured in importance. Something like the fast inverse square root is substantial even though it's short.
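For a sense of how short that routine is, here is a minimal Python sketch of the widely documented bit trick. It is an illustrative re-expression using `struct`, not the original C from Quake III, and the function name is made up here.

```python
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive 32-bit float value."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The "magic constant" minus half the bit pattern gives a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


print(fast_inv_sqrt(4.0))  # close to 0.5
```

The whole thing is a handful of lines, which is the point: a line count says little about how much of the value of a work a snippet carries.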
The fast inverse square root makes for a poor example when it comes to notions of intellectual property and copyright because there is prior art. The Wikipedia page has a history.
And imagine if Microsoft had been able to copyright the fast inverse square root function before Carmack sat down to write Quake!
Prior art matters for patents, not copyrights. Carmack's code is still protected by copyright even if he didn't invent the algorithm.
And copyright doesn't prevent someone else from implementing the same algorithm, only from copying the code. If Microsoft had been able to copyright the fast inverse square root function, Carmack could still have written his own version and even copyrighted that version himself.
What? You are free to reverse engineer any non patented object and reproduce an almost identical object. Prior art for copyright is meaningless unless it is about authorship. You can still do a clean room implementation and ignore prior art.
But the current interpretation is that you cannot claim authorship over things that were already in the public domain. Feel free to read the case notes and follow some of the links in there for more info. Am I not interpreting the court rulings correctly?
——
Considering de novo the evidence before the district court, we hold that the district court did not err in granting summary judgment. Johannsongs failed to offer admissible evidence to rebut Ferrara's analysis, so there is no genuine dispute of material fact as to his conclusions that Söknuður and You Raise Me Up are not substantially similar and most of their similarities are attributable to prior art. Based on these conclusions, Johannsongs has failed to satisfy the extrinsic test and Defendants are entitled to judgment as a matter of law.
I too have reservations about Copilot, but does the MIT license define a "substantial portion"? I doubt a snippet would fall under either "copies" or "substantial portions"
I doubt many licenses define that kind of terminology. That's left to precedents established by actual cases. My point was just that you're not free to use code from an MIT-licensed project without following the terms of the license. The other details get worked out when legal actions are taken.
Is your take that Microsoft should offer this for free? Or if they are not willing to do it for free, Microsoft should cancel this service and we should wait for Apache or someone else to offer the service?
Microsoft should make this service free for open source (not just thought leaders), and compensate people otherwise. I should have a 0.01% equity in OpenAI if they're using my stuff like this.
Half serious/flippant, we need MS to create a cryptocurrency so that developers can be credited with micropayments each time their code gets “quoted” in the IDE.
Yeah, but those numbers translate to food on the table for my kids, a roof over their heads, better education, etc. Come on, this is a tired response. Nothing is wrong with people making money. There is a lot wrong with people making money off of the hard work of others without any consideration or remuneration.
I feel like you're missing the forest for the trees here - making code freely shareable and remixable is exactly the purpose of GPL and other free-software licenses, but you can bet that the proprietary codebases Copilot will be used in will go out of their way to prevent any such uses of _their_ particular code snippets.
IMO, the only way to use Copilot's output in an ethically sound way is to only use the output it produces in AGPL licensed projects (assuming that Copilot has not been trained on any non-free software codebases which in itself is a strong assumption).
> IMO, the only way to use Copilot's output in an ethically sound way is to only use the output it produces in AGPL licensed projects
Even then, that is missing attribution which should really be the default for all code reuse and derivation even when you legally are allowed to omit it.
Based on this comment, you may not understand what the GPL's purpose actually is, because it is NOT simply for promoting sharing and remixing. The GPL is for ensuring that code, and its derivatives, are all able to be shared and remixed in perpetuity. The biggest (imo) difference between GPL and MIT/BSD licenses is that you CAN NOT use GPL'd code in a non-GPL* codebase. (*or GPL-compatible license)
It's just as shareable on Gitlab, no? And the issue isn't that code is not shareable - it's that a huge corporation is profiting from this code without consent from the developer.
The Twitter thread’s title seems unnecessarily incendiary and clickbaity.
I don’t buy that producing/synthesizing code snippets based off public repos is a problem.
There’s nothing proprietary or original about eg. the syntax of a for-loop, or the boilerplate of setting up some JS framework MVC.
Besides, it’s basically just a (semantic and contextual) search engine inlined within the IDE. Copyright infringement hasn’t taken place until the user activated the autocompletion and actually placed the code within their own and released their code containing the infringing code.
> There’s nothing proprietary or original about eg. the syntax of a for-loop, or the boilerplate of setting up some JS framework MVC.
Of course there is something proprietary or original about that. Why else would they need such an enormous AI to suggest it. Auto completing simple boilerplate was already solved in a much simpler way.
> Copyright infringement hasn’t taken place until the user activated the autocompletion and actually placed the code within their own and released their code containing the infringing code.
Copyright infringement takes place as soon as some company publishes/sells material without explicit license or permission. So not the moment the user hits accept, but the moment just before that: when the tool shows it to the user.
Applying your logic, is any search engine infringing on copyright because it contains a snippet of the source page?
After all, if showing a search result in the IDE is “publishing” (let alone “selling” (?)) why hasn’t Google been sued out of existence for showing search results (oops, “publishing” copies of original work, billions of times over), as well as selling related advertising?
> Examples of fair use in United States copyright law include commentary, search engines, criticism, parody, news reporting, research, and scholarship.
> is any search engine infringing on copyright because it contains a snippet of the source page?
In some jurisdictions, it is. And in other jurisdictions, it is only allowed as long as it shows a link to the source page, which Copilot also doesn't do.
Don’t most licenses require at least attribution? I don’t believe GitHub is restricting themselves to only licenses that don’t. In fact the only software licenses I can think of that don’t require attribution are 0BSD, WTFPL, CC0, MIT-0 and Unlicense, and they all aren’t super popular. Also in some countries creators have inalienable moral rights which can be enforced regardless of the license. For example in Germany it is impossible to relinquish certain rights you have as the creator of a work, including the right to attribution.
Copilot is a black box at the moment. Microsoft claims they used the public corpus on GitHub. There are plenty of GPL, AGPL, and "source available" projects in the public corpus. So what exactly is the licensing?
The argument may make sense if they limited themselves to public-domain (CC0) works, but that is not what happened here. If Copilot attributed something to an AGPL project, does it mean the "virality" applies to all projects that use code from Copilot?
There's also a good amount of commercial and leaked source code on GitHub, including MS's own leaked Windows XP source. I haven't played around with Copilot yet, but if I ever do I plan on copy/pasting some win32 API definitions to see if I can get it to spit out any of the leaked source.
> if I ever do I plan on copy/pasting some win32 API definitions to see if I can get it to spit out any of the leaked source.
If that works, then I can't wait for that to be a boon for Wine and ReactOS: "Microsoft itself provided this code and allowed us to use it, so therefore it's totally legal. Neener neener."
Some trivia: CC0 is a public domain declaration.. at least in the US. There is no process by which an author can make their works public domain, CC0 is just a (weak) promise that the copyright holder will treat the work as if it were public domain.
This feels like a tool that can easily be destroyed by a lawsuit. I can't imagine a TOS can force you to give away your copyrights (especially if they allow and encourage you to post your own copyright).
If it can't then Wikipedia is doomed; its entire licensing status rests on the notion that editors grant such a license as part of their clickwrap ToS.
Does GitHub verify that the code that is in my repository is actually in accordance with the license that I’ve added? I could just upload any proprietary code with an incorrect license, and GitHub would just use that to feed their AI. Like any other dependency that you incorporate into your application, GitHub should verify/audit whether the license allows them to do so.
Even if GitHub did provide that setting as a courtesy, someone could clone or fork the code to another repo (if you use any licence that allows it) and not enable that setting.
In a case like this, GitHub itself could set up a bot account that forks all projects as soon as you make the switch. The company in fact would be incentivized to do so.
I'm not sure using a different license actually opts you out. By merely hosting your code on GitHub you grant them the right to analyze your code on their servers[1]
They may be morally in the wrong, but I'm unsure they are legally in the wrong here. To boot, denying them the right to create this tool in your license is technically a violation of OSS principles and problematic
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
> This might be overreacting but is there a way to opt-out of Copilot using your code in open source repos?
I don't think there is a way to opt out if it is a public repo, regardless of license, and Microsoft's copyright theory suggests that they wouldn't feel obligated to exclude any code they got their hands on except under a specific NDA preventing such use; the use of public GitHub repos isn't based on legal constraints but on practical convenience.
> If you read free code yourself it’s fine, but if a machine does it for you it’s not? We overvalue humans.
No, it's not fine. Apparently, you missed the SCO and Oracle v. Google cases. Both of these cases argued that somebody looked at the code and copied it. In the SCO case it was not true, but the argument stretched the timeline rather successfully. In Oracle v. Google, copying function signatures opened a big can of worms.
So, just by copying a function signature, without filling it with the very same code as the original, and even for interoperability, you're getting into a huge gray area in a legal sense.
Similarly, no sane Wine developer will read leaked Microsoft source code, let alone copy it. Again, no sane emulator developer will read leaked Nintendo code.
Reading the code "colors" your creativity, and if you're tried in court and enough similarity is found between your code and the leaked code, it's game over.
So, reading code and copying is not guaranteed to be legal, depending on its license. When this is done by a robot, it's still illegal (you're breaching licenses during the code generation process), and immoral and unethical on top of it.
So, we don't overvalue humans, but overvalue AI, which is just informed search, BTW.
No, I didn't and don't read other people's code to understand how something works. I use books and official language/library documentation for that.
On the other hand, this is irrelevant to the issue at hand.
GitHub Copilot is not a tool for education. It's a tool for auto-completing code, which can be put into production, where licenses and other stuff come into play.
The issue is not code sharing per se. It's more of a legal problem, and an important one at that. In the software copyright sense, even reading code you can't import into a project (whether it is leaked, incompatibly licensed, or off-limits for any other reason) puts you at risk of legal trouble. This is why we have methodologies like "clean room development".
In Copilot's case, you're possibly deriving code from a source which contains many licenses, and some of them are not compatible with what you're doing. As a result, you're in direct breach of any license which is not compatible with your code.
On a higher level, you're also breaching ethics and morality by using code, or a derivation of it, whose license is incompatible with your own, and disregarding other people's wishes codified as a case-tested and valid license.
As a result, if you think that using a derivative of GPL licensed code in your closed source application is OK on every front, then the reverse is true too. I can disassemble and reverse every part of your code, re-implement it as GPL bug for bug, and open it.
Because if you can breach my license, and expect no consequences, I can breach your license without consequences, as well. It's a two way street.
I find this whole topic very annoying, this is like the 3rd variation to reach the front page today. But it has made me realize why I instinctively dislike Free Software as a movement.
Copyright and licensing are bad, actually. Stop getting worked up about the idea of using courts to punish theft. Stop getting into a frenzy of arousal about the police kicking down doors to drag Billy Gates to jail because 80 characters of fast square root is theft but 79 isn't.
Where on earth is the ambition and vision!? Knowledge is public domain. A commons of knowledge is a public good. The cost of code copying is zero.
Sure, in our day job we have to pretend to care about this stuff. But when did the ideological scope of what can be achieved become rules lawyering over license text?
Copy my MIT licensed code without attribution? I don't give a shit, go ahead, I hope it helps, in fact I want a truly public domain license but copyright law is so hostage to corporate interests no such thing exists in many countries.
Yes but this copilot model takes that, adds value and doesn't itself join the public common good. Instead it takes it, and makes you pay to have it back in another form.
If copilot were open source and the model released for the public good, being built of public data (in your scenario) we would have a very different conversation.
It costs money to run a huge language model with low latency, in the loop with you - charging $10/month is reasonable. You need multiple GPUs to load even a single copy. Copilot is adding something extra to the original code - it selects the recommendation from the whole corpus, while taking the surrounding context into consideration and adapting to your variable names.
And in reality 99.9% of the generated code has no long n-grams in common with the training set; it's already original. All they need to do is ensure the model never generates output identical to the training set, something that can be implemented with a Bloom filter. Then the generated code is impossible to attribute and should have no legal problems.
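To make the Bloom filter idea concrete, here is a minimal Python sketch. The whitespace tokenization, the n-gram length `N`, the filter size, and every name in it are assumptions for illustration; this is not how Copilot is known to work.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(tokens: list, n: int) -> list:
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


N = 6  # hypothetical n-gram length, in whitespace-separated tokens

# Made-up stand-in for the training corpus.
training_code = "for i in range ( 10 ) : print ( i )"
seen = BloomFilter()
for gram in ngrams(training_code.split(), N):
    seen.add(gram)


def overlaps_training(completion: str) -> bool:
    """True if any N-token run of the completion was (probably) seen."""
    return any(gram in seen for gram in ngrams(completion.split(), N))


print(overlaps_training("for i in range ( 10 ) : print ( i )"))
print(overlaps_training("x = 1"))
```

Because a Bloom filter can return false positives, a check like this would occasionally reject original output, but it would not let through a run of `N` tokens that was added to the filter. It only catches verbatim overlap under the chosen tokenization; renamed variables or reformatted code would pass.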
In the end what do models like Copilot do? They act like culture - absorbing and replicating memes. They free the knowledge and make it reusable. They can act like a general purpose NLP tool for information extraction, classification and text generation. You can implement your ideas faster with it, don't need to label much data.
It works even with just a prompt. Try OpenAI Codex to extract a receipt to see what I am talking about - it gives you the output in JSON. It's a new tool and a new interface to the computer. There are going to be plenty of open source implementations as well; some are already being trained.
You are incorrect. The code it generates is substantially the same (complete with comments) as the input, which is often sourced without permission and in violation of license.
And offers nothing back to those authors in return.
Thank you for this. I would never have been able to articulate it better - people are just annoyed that someone is making money and they aren't, without considering why that is.
I want every line of code I've ever written to be used as much as possible.
I find "intellectual property" to be dubious to the core. I'm not confident enough in my feelings to be a zealot, but if I had to pick sides then I know which side I would pick.
You're welcome to use a "do whatever you want" license on your code, and people should respect that. (Though even those licenses tend to require attribution, and copilot doesn't do even that.)
Other people use licenses that try to create a commons where if you want to use it you need to share your own code, as a counterpoint to the non-commons in which you can't use code at all. And if people use those licenses, they should be respected as well.
By all means, eliminate copyright, and let all code be copied freely. And until that happens, as long as proprietary code exists and doesn't let anyone copy it, respect copyleft licenses as well.
If an AI "listened" to music and created new samples for musicians to use for a fee, do you not think the original musicians should be compensated?
The value transfer is basically theft.
It isn't about the usefulness of the service, or even that something similar is a good thing ... it is about the execution and what it says about fairness for those that worked to create the data it depends on to produce value.
I'm not sure I was clear enough when I expressed my doubts about the concept of intellectual property.
Your musical example is playing out in the courts in multiple forms. The Marvin Gaye case, Led Zeppelin, Katy Perry etc.
And each case pushes me further towards wanting to rip down the whole rotten edifice.
We've lived through 4 or 5 decades of unprecedented expansion of the domain to which IP lays claim. Surely it's time for the pendulum to swing the other way?
Yes, they haven't paid it forward, or back, but why fight on the occupier's territory? By calling for legal frameworks to enforce this we accept the language and terms of the dominant party. By using courts and the law and creating new law for copyright we actually move further from the goal of abolishing copyright and IP entirely.
Every time we use courts to enforce IP we're strengthening the Walt Disneys and Nintendos of the world.
(I accept I am in a group of like 3 people with this goal but it's my view)
Edit: to expand slightly more on this. People should be able to decompile/reverse engineer whatever the hell they want. They shouldn't have to worry about armed goons kicking down their doors. Every time cases are used to strengthen the enforcement of IP/licensing, whether for the light (FSF) or dark (Micro$oft, Google, etc) the outcome is the same, we move further from that goal.
> the goal of abolishing copyright and IP entirely
Completely agree with you. It's the 21st century, once data has been published there is no controlling it anymore and all attempts to do so lead to the destruction of computer freedom. No doubt people all over the world copy code every single day with nobody even finding out about it. I'd rather get rid of all these monopolists than limit the potential of computers to whatever reality enables them.
>I accept I am in a group of like 3 people with this goal but it's my view
> Every time we use courts to enforce IP we're strengthening the Walt Disneys and Nintendos of the world.
Can you actually point to substantial examples where Disney or Nintendo benefited significantly from a precedent set by an open source court case? Open source has been around for decades, so it should be trivial to find numerous clear-cut examples at this point... if your theory is actually correct.
No, I honestly have no idea. I know nothing about the law and understand even less. I may be wrong about all of this, but if we take the (laughable) idea of justice being blind it stands to reason any precedent that protects a single open source developer also protects Amazon's code.
Funny thing is ALL these legal frameworks are there to protect those 3 people like you.
If there were no enforcement of IP/licensing, or no legal enforcement at all, M$, Google etc. would not be nice: they would just come over, kick your door in, and cut your head off, because they could. With a legal framework they at least have to ask someone else.
You just have to understand you don't stand a chance with your 3 buddies against 10 motivated attackers.
Writing about "accepting terms of dominant party", you have clearly never had a robbery at your house. Now imagine corporations doing the same if there were no legal frameworks.
Read up on the Dutch East India Company, or just Nestle. By comparison, Microsoft and Google are still quite nice companies, along with Walt Disney and Nintendo.
This is a slight misreading of my general political position. I am pro-government in general. I find the term "monopoly on violence" to generally indicate someone who lives a very cosseted and easy life who can spend time getting mad about, like, seatbelt laws or speed limits, so I use it somewhat tongue-in-cheek.
There's quite a lot of possibilities between DMCAs of youtube-dl repositories and Big-co death-squads decapitating people in their homes. I'd prefer where we are now to the Brazil end of that spectrum but we can imagine better models of digital and intellectual 'property'.
People aren't whining because the price is too high; they are upset because some (myself included) believe Microsoft is exploiting developers by copying their work against their wishes and then turning around and selling other developers a product which may or may not be generating code which violates copyright/patent licenses. A developer who inadvertently uses a Copilot suggestion which gets them into hot water is going to be spending a lot more than the cost of a latte to defend themselves in court.
If someone contributes to open source, then they shouldn't be surprised that someone else uses this code. The licensing hell is something that shouldn't belong in IT.
When source code is made available under an open source license, there are strings attached; attaching those strings is the author’s right! Assuming you or any company has the right to do anything you want with that code without respecting the license is immoral.
That “licensing hell” (i.e. strong copyleft protections) is the reason we enjoy such a vibrant and large open source community today. I don’t take it for granted that open source as we have it today was inevitable: it required a lot of work and I’d hate to see that slip away.
The licensing hell is exactly the problem. If someone contributes to open source, which is a praiseworthy activity, then they do it with the intention that anyone can use this code but also re-adapt it, bundle in new products - it's all about bringing humanity forward.
And all those "you can do this, but you can't do that" licenses are things that only invite lawyers to the tech world. IMHO, licensing open source is a bullshit activity.
You are making a lot of assumptions about what someone wants/intends when they contribute to open source codebases. If an author chooses, for example, the AGPL, I think they clearly had a different intention. Like it or not, not everyone wants to dedicate their work to the public domain.
GPL code is open source but what you do with it also needs to be open source as a condition of its use. Will Copilot inform developers if suddenly the code suggested requires them to re-license their software?
Every large successful open source project I know is explicitly not in the public domain/licensed CC0. I understand that there are some people that are very against copyright/intellectual property but you surely must interact with a large number of projects/people that disagree.
Yep, anything useful has to be legal and welcomed. Microsoft should start breaking into people’s houses and sorting their underwear drawers for them while they’re out. Million dollar idea!
> "Yes but this copilot model takes that, adds value and doesn't itself join the public common good. Instead it takes it, and makes you pay to have it back in another form."
$10/month ... how much do you think this thing cost to build, and to maintain?
That's the whole point. Without the data, it would be worthless. Microsoft is not paying the full cost because it is ripping the data without asking consent. I'm not saying what they are doing is illegal per se, but it's definitely immoral.
But why is it immoral? All that code is still out there, if I had the time and the resources I could build a language model. Unlike commons in the real world (e.g. land, fresh water, etc) a code commons is purely additive. With the release of Copilot (which I don't intend to pay for or use) nothing has been destroyed, instead we'll get more code for less work where companies do pay for their developers to use it, some might even find its way back into the commons as new open-source code (whether more code of copilot generated quality in general is an unalloyed good is left as an exercise to the reader).
Because copilot is violating the terms I put for my code. My code is GPL. It cannot be put into projects with incompatible licenses. That’s my code, and I share it with strings attached. You can’t just copy my code and sell to other parties no strings attached.
If that’s fine and dandy, Microsoft should also train Copilot on their source code repositories, so we can use that knowledge, too.
I guess I've just never had to work with GPL code before, but the complaints essentially only seem to be coming from coders who like this style of open source where you still get to make it kind of a pain in the ass to actually use your software.
I guess you have the right to do this, but it doesn't mesh at all with why I personally contribute (without any expectation of attribution), which is that (much like stack overflow), programmers mostly agreed awhile ago that it's just easier if we all share.
So much of what's wrong with the modern economy comes down to seeking rent on an idea that should just be public knowledge.
Sorry if my viewpoint towards your work is apathetic, but the whole field is already infested with academics who only understand citation as a useful metric. Further, the point remains that anyone with enough money could do this - not just Microsoft (Salesforce has released several models for python competitive with Copilot). Times are changing - maybe don't share code anymore? I imagine in ten-twenty years this whole conversation will seem pretty petty though when your entire program is trivially recreated from its GitHub description without ever needing to have seen it in the first place.
>from coders who like this style of open source where you still get to make it kind of a pain in the ass to actually use your software.
Most "coders" don't publish anything if they don't have to. Using proprietary code is an even worse pain in the ass because you don't have access to it.
The point of the GPL is to force people to share their code.
>which is that (much like stack overflow), programmers mostly agreed awhile ago that it's just easier if we all share.
>So much of what's wrong with the modern economy comes down to seeking rent on an idea that should just be public knowledge.
The entire point of the GPL is to force e.g. hardware vendors to share their driver code under the GPL or any other open source license so it can be included in the Linux kernel.
>Times are changing - maybe don't share code anymore?
The entire point of the GPL is to force people to share their code.
> I imagine in ten-twenty years this whole conversation will seem pretty petty though when your entire program is trivially recreated from its GitHub description without ever needing to have seen it in the first place.
What the hell are you talking about? If that is the case then why did humans ever bother with extensively documenting and testing their software if three sentences are enough to encode it? Your perspective is particularly annoying because Copilot isn't learning to write its own code, it's entirely reliant on an army of unpaid software engineers publishing code on the internet. If it knows how to recreate a project from just the GitHub description, it basically had the codebase inside its model to begin with and merely pretends that it did everything on its own. That is actually a form of rent seeking.
> extensively documenting and testing their software if three sentences are enough to encode it
Was just hyperbole for "from plain English specs/requirements".
I'll admit to being uninformed about GPL, but your understanding of large language models is also limited. They actually learn to interpolate between data points meaning they can compose sequences not found in the training data. Further, GitHub added a feature that checks existing code for a match and rejects predictions if any match occurs.
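To make the "check for a match and reject" idea concrete, here is a minimal sketch of one way such a filter could work. It is not GitHub's actual implementation, which isn't described here; the function names, the tokenizer and the 8-token window are all assumptions made up for illustration.

```python
import re

def tokens(code: str) -> list[str]:
    """Split code into rough tokens, ignoring whitespace differences."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def ngrams(toks: list[str], n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive tokens (empty if the input is shorter than n)."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(corpus: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Hypothetical index of every n-token run seen in known public code."""
    index: set[tuple[str, ...]] = set()
    for source in corpus:
        index |= ngrams(tokens(source), n)
    return index

def is_verbatim_match(suggestion: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """True if any n-token run of the suggestion also occurs in the index."""
    return not ngrams(tokens(suggestion), n).isdisjoint(index)

known = ["i = 0x5f3759df - (i >> 1);"]
index = build_index(known)
print(is_verbatim_match("i=0x5f3759df-( i>>1 );", index))  # same tokens, different spacing
print(is_verbatim_match("return a + b", index))            # too short and not in the corpus
```

A filter like this only catches near-verbatim runs. Renaming a variable is enough to slip past it, so it says nothing about whether the output is a derivative work.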
Nobody disputes their ability to interpolate, I think (at least not me), but the problem is that the starting points for these interpolations contain GPL licensed code, hence the output derives from GPL licensed code.
This derivation brings GPL in, and the model doesn't understand this. As a result, every time GPL training data is mixed into the interpolation, you're either converting your code to GPL or, if you're not, violating the GPL.
It's plain and simple.
On the other hand, I've been hearing the "we'll write the specs, and the computer will just auto-generate it" gospel since 2002. This time it won't be different. The human brain, intuition and creativity are beyond algorithmic modeling.
So, no, the computer will not autogenerate the code from specs. It might link boilerplate together, which can already be done today.
But GPL owners aren’t seeking rent, so you’re just asking those who believe all code should be open source to unilaterally let large companies use all their code, while they reap no such benefits from the large companies
Like I said, I understand the premise, just not the emotion behind why you want to release code to the public at all if it isn't simply a donation to all human knowledge.
There are better ways to gain notoriety as a coder than by essentially legally requiring your name is attached to a thing for all time.
I personally would be thrilled to know my work was valuable enough to be used by a company because I really just couldn't care less about the "credit" part of it. I know what I've done and don't have anything to prove.
> Why you want to release code to the public at all if it isn't simply a donation to all human knowledge.
On the contrary. I donate my code to all human knowledge. Just not to a corporation's private code corpus. I intend my code to be open to all humans to run, study, modify and share, forever. I don't give you the freedom to take it to a closed domain, and not share the further knowledge you derived from my code. If your primary intention is to return this knowledge to humankind, GPL is an enabler, not a hindrance.
> I personally would be thrilled to know my work was valuable enough to be used by a company because I really just couldn't care less about the "credit" part of it. I know what I've done and don't have anything to prove.
I personally don't care whether my code is good enough to be used by a company. If I want to contribute code which can be used by a company, I can contribute to MIT projects (which I also do). I don't have anything to prove.
I release my code with the hope it'd be useful for somebody, and I don't want it to be included in any permissive or closed source base. It doesn't matter whether it saves your beef today or not. That's not my problem. Go write a better one, then. I don't care.
When actually using the software means "taking it, adding it to a commercial software and never telling anyone, incl. the developer of the original code, and not giving any attribution whatsoever, and earning money over that piece of code", yes GPL makes it hard. It's by design, and this is why I license anything and everything I put in the open GPLv3+.
If anyone contributes to a GPL software, they're clearly attributed. Moreover, Git makes this attribution irrevocably visible. Before that patches were sent in with mails, and mailing lists were open, so attribution was also visible back then. So, no, GPL makes attribution visible, and irrevocable, by design.
GPL doesn't seek rent over any idea. It forces ideas to stay open, forces you to put your improvements back in the open. You'll be attributed, your code will be in the open all the time, and nobody can grab your code, run with it and hide it in their software to make any kind of unjust profit, which makes "Open Source" coders visibly and literally wince and cringe, because they can't grab and paste a piece of code and make their days easier.
Again, this is by design.
Sorry if my viewpoint towards your view is apathetic, but the whole field is already infested with programmers who only understand being able to copy and paste code left and right to develop software as a useful metric.
It's not about Microsoft, it's just about honoring a license. A case-tested, lawyer-written, trusted license which many developers chose for licensing their work. It's a breach of contract, plain and simple.
As I said elsewhere, some of the code I'm writing is backed by papers. I don't obfuscate my papers to prevent anyone from implementing it, but if I open my reference implementation as GPL, this is because I don't want someone to grab it and run with the code, change it a little, put into a closed source program and call the idea theirs, possibly patenting it in the process.
I have a serious piece of research, my Ph.D. actually, and I'm still developing the code powering the whole idea. I was planning to open it under GPL license, to force its evolution in the open, but I understood that people don't appreciate that. So, probably I won't open the code. Binaries maybe. Highly obfuscated, protected binaries, probably.
You can say the exact same about piracy, when I take a game or a pdf book from a pirate site, nothing is destroyed, nothing is subtracted. The server still owns the data and can copy and share it infinitely, all that changed is that I now have a copy too, and I use it to enrich my own intellectual life.
The argument has 2 main flaws
1- It's not symmetric. The massive corporations with paid armies of lawyers aren't hugging trees and talking about how "Knowledge is - like - just free, man" with dreamy eyes, I would love if they were like that but no. They are constantly on the lookout for anyone remotely using their work. They don't deserve the language of free knowledge and open data, that would be like extending peace to an invading army, or defending a tyrant with the lingo of free speech. He Who Lives By The Sword Dies By The Sword.
2- If the person(s) behind the data or the code lives off their intellectual labor, you are ripping them off by using it without compensation. Sometimes the compensation is as little as simply citing them, just mention their names so that they get visibility and prestige they deserve for toiling in the intellectual field to produce the ideas and brain patterns you use and benefit from.
The whole thing is a huge minefield, digital reproduction of information and abstract structures is an extremely novel phenomenon that breaks tons of human intuitions about how ideas and thinking work and spread. But the involvement of a corporation allows you to shortcut the entire thing by invoking (1), also known as the fundamental theorem of ethics: Do Unto Others As You Wish They Do Unto You. Do corporations allow you to freely take and mix their intellectual produce and sell it back to them? No? Then they DON'T get to do that either, except maybe among themselves.
What I find strange is how nobody talks about how inherently repulsive and ugly the "Copilot" philosophy is, how it is fundamentally a dead end and how much it betrays a lack of understanding of how programming works on the part of those who fund and market it. Code is different from natural language, the fact that we call the symbols we write algorithms in "Programming Languages" is purely a historical accident. Code doesn't have the redundant resilience and error-correcting properties of natural language, removing or modifying or adding even a tiny bit to correct code can give you atrociously-slow correct code, or full-of-security-holes correct code, or non-correct code, or any of the 3 mixed together with other disasters. If you're going to steal people's open source code, at least do something interesting and intelligent with it, don't be a lazy fuck and apply an NLP technique to a highly formal and rigid domain then smile smugly and charge people for it as if this is going to end anywhere useful.
To allow access to a service that grants you the accumulated knowledge's output in small bits.
I'm all for a world where these tools help developers, but i'm not here for a system that isn't open. I want to own my tools.
Copilot is a bit like musicians paying a monthly fee for access to a loop library. Except all the loops are rip-offs of other people's hard work and there's no effort to compensate them.
If I made an AI that resampled music into derivative tracks ... you can be damn sure i'd be sued until my ears bled.
> monthly fee for access to a loop library. Except all the loops are rip-offs of other people's hard work and there's no effort to compensate them.
the analogy works if there were an open access library of music (restricted licenses tho they may be…) that was available to search and browse without the tool
then an auto-composer could suggest music to fill in gaps in my own composition, using snippets of audio from the otherwise freely available library
that's a plug-in I would pay for too, but yea if my "no commercial use allowed" melody made it into someone else's composition, I would want my license terms to be surfaced to them as well
except I personally wouldn't want to live in a future where every line of code has to have some claim of "who authored this function first" or "who wrote this melody and rhythm first", pursuant to licensing terms in perpetuity. that sounds terrible.
I'm all here for openness and tools you own, so there could be a FOSS implementation. Microsoft could just open it up and still charge the $10/mo for hosting the model, and I hope that happens.
Making the tool better without verbatim copying and making it more effective should be the priority, IMO. Trying to control it too much would be missing the point of the tool.
Seeing as Copilot is known to output code that's a straight copy from non-permissive code where the author's permission wasn't obtained ... I'd say it is helping you steal from code authors without giving back (as there is no obligation to open source your code).
Given Microsoft's record of pursuing IP violations aggressively through the legal system, I'd say the whole thing is ironic.
The issue is that whether the free software people want it or not, the copyright system over code exists, and historically has been used as a cudgel against smaller players. If we got rid of copyright over code entirely I'd totally be down for this. And IIRC RMS has said the same thing; that he'd be in favor of the removal of copyright over code as a concept even if it meant neutering the protections of the GPL.
Until that happens, and copyright protections are still used by larger entities, using the same system to protect yourself and (more importantly) your users isn't turning your back on your ideals, but instead simply adjusting your strategy to the current material conditions. Remember that Google v. Oracle (while ultimately a win versus what could have been) was a step back, with de minimis claims left on the table as not a valid defense. The playing field is heavily slanted towards the big players and software freedom requires every tool it can get its hands on at the moment.
> The issue is that whether the free software people want it or not, the copyright system over code exists, and historically has been used as a cudgel against smaller players. If we got rid of copyright over code entirely I'd totally be down for this. And IIRC RMS has said the same thing; that he'd be in favor of the removal of copyright over code as a concept even if it meant neutering the protections of the GPL.
As someone else asked, I would also want a citation, but I agree.
Actually, I want a license that you can do pretty much anything you want to do with it (including: lack of attribution, distribution without source codes, distribution with source codes (whether they are the original source codes or reconstructed), lack of copyright notices, reverse engineering, circumvention of your own copy and write reports about anything you want to do, to use or not use the software (and to modify or not modify) at your choice, etc), but that you are not allowed to add further legal restrictions to it (with a few exceptions dealing with trademarks (but not all) and allowing conversion to GNU (A)GPL 3 and CC-BY-SA 4.0 if you are able to satisfy the conditions of those licenses) or to derivative works, and that if someone will try to use legal processes against you relating to this, then anyone can countersue.
I think at its root the problem is copyleft is a mirror image of copyright. It relies on and replicates all the cultural and legal requirements and constraints of the copyright model and curtails an imagining of other possibilities. Every sentence or thought spent on copyleft is misdirected in my view.
Which is why I find Microsoft doing this (potential) en-masse license violation and then a bunch of GPL folks getting mad pretty funny overall. I just find the high and mighty tone annoying, like sure, they've (allegedly) screwed you, but they're going to (theoretically) get away with it because they're rich and powerful, sorry that didn't turn out how you wanted.
I don't think that's true: copyleft is right to repair for software. Even if the software is not copyrightable, without the source code users are still relatively powerless. (Incidentally this is related to why patents were created: not to constrain or encourage innovation, but to get people to publish inventions instead of keeping them secret.) If copyright were abolished and so too copyleft destroyed, Linux users' freedoms would probably materially go down, not up (though in general user freedom would marginally increase because most software is not copyleft).
> And IIRC RMS has said the same thing; that he'd be in favor of the removal of copyright over code as a concept even if it meant neutering the protections of the GPL.
Do you have a citation? I was under the impression he defended copyright because copyleft depends on it.
> Copy my MIT licensed code without attribution? I don't give a shit, go ahead, I hope it helps
This is my feeling as well. I don't build stuff in the open so that I can get bent out of shape at someone not properly licensing it. It's in a public repository, FFS... I assume that if anyone even notices my repo, that they may copy/paste a few lines out of my solution if it helps them.
Exactly! Do they really think every single line of their code is so precious it requires attribution? If I publish code, I assume it might get pushed, pulled, refactored in a million ways and no one will ever know my name’s attached to it. And guess what? I DON’T CARE. It’s code. Not a self-constructed monument to my own intelligence that needs a little placard with my name on it to follow around some clever async function I wrote.
If it's a couple lines of generic code, of course. That's also an indefensible copyright, btw. But if it's hundreds of very specific lines of code written to do one thing under a license you don't follow, that's something else.
This isn't just an issue of code. You can write a program that combines songs, or combines novels creating a different work that has sections that are essentially the original protected work. I don't think the authors of those novels are going to be OK with you selling or giving away a version of their work just because an AI edited it or combined it somehow.
But this isn't everyone's feeling. And they have a right to choose how their work is used. That's the basis of commerce being possible here.
The mechanised license ignorance and the way original authors are not compensated is the issue.
If you had a repo you'd worked really hard on, and offered a commercial license or GPL depending on the use (so you can be funded to work on it) ... do you think it is fair that copilot ingests that code and allows others to benefit from your work and knowledge without the commercial license as you intended?
Note how Microsoft always throws out the capitalism "rules of engagement" when it benefits them and undermines everything else. The fact we are even trusting the situation Microsoft are creating is dire, and speaks to the short memory of our industry.
When it's demonstrated that it can generate whole function bodies intact (fast inverse square root debacle), and autocomplete it with a wrong license, it's not a stretch anymore.
You may not care about licensing or copyright, and I imagine many others who create code under an attribution license don't. That's still not the same as saying "copyright and licensing are bad." Too many businesses depend on them to exist for me to have that opinion.
If an AI takes a copyrighted work and makes its own version, say combining two novels by popular authors in a way that is unique but keeps large parts of the text intact, can I sell that? I think if I were the authors I would be unhappy.
Also, how hard would it be for copilot to include a comment saying "// I got this line from x repo" when you are copying from a new repo? I am guessing not hard at all. Then at least the user would be aware of where their code was coming from and could be expected to make a judgement. If the line is "let a = b" then probably no worries. But if it is hundreds of lines of a simulation, all from the same repo with no changes, then I think some attribution is good for both parties.
Don't get me wrong, I know this (copyright abolition) is pie-in-the-sky stuff. I'm using an anon account to post because even advocating for it could be troublesome for employment. But I don't accept we have to be meek or have small goals in talking about this ideological stuff. And I think this has made me realise why I find the Free Software vision so disappointing and weak. And hence why I find all these (ideologically) Free Software aligned takes of sending Billy to jail for a thousand years so irritating.
> I find this whole topic very annoying, this is like the 3rd variation to reach the front page today.
Me too. I also find three iterations of the same subject not enough discourse. We need to take this matter more seriously.
> But it has made me realize why I instinctively dislike Free Software as a movement.
On the other hand, this whole discourse reminds me why I absolutely love Free Software as a movement.
> Copyright and licensing are bad, actually.
This is why we have "Copyleft".
> Stop getting into a frenzy of arousal about the police kicking down doors to drag Billy Gates to jail because 80 characters of fast square root is theft but 79 isn't.
And, stop getting into a frenzy of arousal about being able to use any and every code piece you see elsewhere in any project regardless of its license.
> Where on earth is the ambition and vision!? Knowledge is public domain. A commons of knowledge is a public good. The cost of code copying is zero.
This is why GPL is important. It forces knowledge to evolve in the open, stay in the public domain and actually do public good. It also doesn't hinder ambition and vision; it just keeps the knowledge out of the private domain and open to everyone.
> Sure in our day job we have to pretend to care about this stuff. But when did the ideological scope of what can be achieved become rules lawyering over license text.
You might be pretending to care about this in your daily job, but we really care. Some of the projects I take part in can't ever include GPL code (because the projects are MIT licensed). These texts are court-tested licenses, so they're as proper and serious agreements as the EULAs of "particular" software companies.
> Copy my MIT licensed code without attribution? I don't give a shit, go ahead, I hope it helps, in fact I want a truly public domain license but copyright law is so hostage to corporate interests no such thing exists in many countries.
If I want my code to be copied and possibly closed, I'll license it with MIT or 0BSD and forget about it, but if I'm licensing my code with GPL3, it means I want that code to stay open. As the licensor, I expect anyone using that code to respect that license.
> Free the code.
Yes, and respect the license the author selected for his/her code.
This. Exactly. It's surprising how many developers have strong anti-copyleft/anti-GPL opinions while being completely uninformed on what they're talking about (but hey, I guess "uninformed but strongly opinionated" is HN in a nutshell). The purpose of GPL and other copyleft licenses is exactly to combat the insanity of intellectual slavery.
If that's what you want, you should license your code not under MIT, but under a license that allows replication/distribution without attribution. Meanwhile, others who do care about such things can license their code under licenses that require attribution/copyleft/etc.
And I can't because there are a bunch of, for want of a better word, dweebs who care about this stuff. I don't give a single solitary frick about the finer points of MIT vs GPL vs BSD 3 clause vs CC-BY-NC or whatever-the-hell. But y'all are forcing me to care by making the legal frameworks for software ever more strict and confusing.
I take a maximalist view, don't want the code copied, sliced up, re-used in any form whatsoever with no credit? Don't post it on a code sharing site. Like I say in the OP, in my job I obviously have to follow the rules, but on an ideological level I'll ignore them where I can get away with it outside of work.
If you don't want the code to be used, don't post it online.
I'm curious if this view is software specific or relates to any work released online? For example, do you feel similarly about a novelist or graphic artist? I reckon at least a few software engineers look at what they produce not entirely differently from how an artist or writer looks at theirs.
First, to be flippant, the idea of a software developer with that view sounds so unbearably insufferable and full of themselves I hope never to meet one. All code is terrible, be less attached.
Stream of consciousness: Should artists or writers be paid for what they produce? Yes. So why not software developers? I'm paid for what I produce. But then I don't release the stuff I'm paid for for free on the internet. But I'm against DRM, I also think Winnie the Pooh shouldn't have IP protection (now expired). What makes art or literature a different commons from software? I also think all scientific journals should be available for free. Do artists and writers have an alternative route to make money from what they publish, what is the artistic or writer equivalent of open source? I think this is the crux of it, if we're going to do open source let's actually do it and stop being precious about it but this only applies to freely-entered open source. So does that mean I support some form of copyright after all? Then again some old out-of-print books will sell on Amazon for like $4000 so we should be able to copy those for free.
Ultimately it's a question of what a vision for society without copyright would look like. I think software is uniquely placed to start exploring that idea. How would we make a living off software if anyone could reverse engineer (even our proprietary) code freely and safely?
The reason I ask with writers in particular is because, like code, having access to it necessarily means that the viewer has the ability to copy it as much as they'd like. Unlike software, however, there is no ability to keep the source code private in a book while still having users.
I definitely agree that copyright protections have become far too strong but I don't think we can really ever know if we would have been able to build the strong open source community we have today without coopting the copyright system for copyleft protections. At the same time, perhaps we are past the point where it's necessary and now it's holding us back... it's entirely possible!
To the first thought, I personally see some coding as a creative act (some is doing _a lot_ of work there though). It's not because I fancy myself a Picasso but because I think some (again, doing a lot of work!) solutions/ideas have a bit of their creator in them and, for those works, the author should be able to exert some control over their works. I think this is more philosophical than legal/political, but I would disagree that it's flippant :)
You don’t need an “official” license, although I agree creative commons is closer to what you want. I feel like you can pretty easily write a license file that explicitly waives all of your rights and responsibilities. Such simplicity is after all what made MIT such a popular license, even though it’s not substantially different from Apache.
I suggest you read up on the history of free software and open source. It exists as a reaction to intellectual enclosure, to prevent that ill and create greater freedom of ideas. Yes, it uses the tools of copyright to fight greater ills of copyright, because those are the tools available, and actions like these are necessary to keep the enclosure from happening all over again. Anyone who has actually studied the matter for even five minutes can see how silly the "free software is anti-freedom" FUD is.
So glad this sentiment is becoming more common in the OSS community! I MIT license everything, if someone wants to make money using stuff I wrote that's awesome, and I wish them the best.
I don't think users owe me anything at all. If people want to PR back that's cool but if not that's cool too.
A "license" implies that there is a copyright holder who allows usage of the work under the terms of said license.
While "Public domain" implies that there is no copyright holder (e.g. because the copyright expired, was explicitly waived, or is for some other reason not applicable).
If you want to put your work in the public domain, you can do so; simply include a note saying that you dedicate it to the public domain.
You're right that it does contradict itself, but the unfortunate situation is that public domain declarations don't work and would make it harder for people to use your code safely in the current licensing model. The closest options are Unlicense and CC0 afaict and both don't work in many European jurisdictions.
I just want people to be able to take my code and do whatever the hell they want with it (including commercially) and optionally contribute to it. Having a license currently makes that easier but every time the Free Software lot go zooming off into the weeds of GPL v3 versus GPL v2 versus LGPL my eyes roll back into my head and I internally start screaming "get a life!".
I think because this kind of ML is so new, we have no choice but to frame arguments for/against in terms of the structures that have been in place for decades past (copyright, open source licenses). We don't yet have the legal language to express dissent against ML in clear yes or no terms.
I think if there were an option to add a machine learning clause and ask individual creators if they wanted it applied in that context, we would see a considerable amount of uptake. It's just that we couldn't foresee this progress happening so soon, and the issue is still not visible enough. I think it's only a matter of time before the culture catches up and new creative works in the coming years are excluded from training sets by their authors with clear and direct language.
By that point there would be no way to argue "but they shouldn't care, they licensed it like this, so I'm assuming it's fine for ML use."
If copyright is not enough to stop another entity from using a person's data for training, then some other protection should be invented that does.
The problem with this is that 'freeing the code' in this instance leads to Microsoft building a wall around it and asserting complete control in a few years.
Copyleft exists for a reason and without the ongoing fight for the commons we lose it all.
I totally agree, this reaction seems very hypocritical. If some rinky dink startup did exactly the same thing - as they are entitled to do under the licences of huge swathes of code on GitHub - hardly anyone would bat an eyelid. But just because it’s a Microsoft-owned company, it’s somehow verboten?
That seems totally inconsistent with decades of people clamouring for more openness/liberty when it comes to IP rights.
Regardless of the size of the offender, if you're not respecting the terms of a license, you'll get pushback. It's natural.
If you're a company which executes Embrace Extend Extinguish on any technology you like yet don't own, you'll get quadruple amounts of pushback. That's normal too.
Microsoft isn't a saint, and Copilot is breaking a lot of legal, ethical and moral rules. It's doubly natural to react to this.
I think we share a lot of the same goals but they presuppose openness based on violence, if you don't do what their license says exactly then they're going to use lawyers and courts and the state's monopoly on violence to make you comply.
I think at a fundamental level this abandons any vision of a true commons since as copilot discussions reveal the well is now polluted (to mix metaphors) and though in some frames the code is more free you certainly won't be if you fail to pay the penalty levied in a civil case for misusing it.
These are incompatible concepts. RMS's vision of 'free-as-in-freedom' software doesn't let people do whatever they want. It forces those who distribute binaries to also distribute source. This is not possible with a public domain work.
In this thread: many engineers nervously sweating. The moats are drying up and the wizards are about to be thrown out of the castle. This tech is the first product in a long line of products that will massively lower the barrier to entry. It has been a good run, but it was never going to last forever. We are not part of the capitalist class and were never going to be.
The world might change, but software engineers have been working with and within change their entire careers presumably. I think we'll be OK, as people, no matter what happens.
I was sweating nervously before I started using Copilot awhile ago but I've stopped since because A - it really doesn't replace me, tried really hard; B - I don't sweat nervously for IntelliSense either.
There's also C, where being of an entrepreneurial mindset, I'd love the opportunity to hand over the software to an AI dev and just direct the implementation to my desire until I have a working product. I bet I could secure a higher room in the castle if instead of coding for 8 hours per day I could work on n products with capable AI Software Engineers. We're not there yet though.
Is this supposed to be a joke? You're arguing that software developers are being replaced by themselves because ML just takes in training data and is entirely dependent on real humans to provide that data. If anything, this will simply result in another productivity explosion where software developers will get paid even more.
Copilot replaces code monkeys, not engineers. Ultimately it's just faster stack overflow, proper software engineers and system architects are going to be just as in demand as they are right now for the foreseeable future. At the point at which that stops being the case, we'll have much bigger societal and existential problems (because it implies the singularity is nigh)
(You're correct on not being part of the capitalist class, though)
I don't agree that we are that close to that (or that Copilot is a significant contribution to bringing it closer) but ultimately eliminating mindless jobs is a good thing. The problem only comes from the expectation built into our current society that people need a job in order to be allowed to survive. Or to put it another way, the profit from automating away jobs en masse should be shared with the whole of society, not privatized.
> the wizards are about to be thrown out of the castle
You have completely misunderstood who the moat-building wizards are. That's proprietary software. Heard of it? I ask because a lot of young people nowadays don't seem to understand how dominant it used to be and the threat that it represented. (Plus a few older folks who never knew, forgot, or deny reality for other reasons.) We've been trying to throw the wizards out for decades, by making code available to everyone and making sure it stays that way via licensing. Code without a license is subject to re-enclosure as important enhancements - even necessary ones, such as security - are made behind locked doors. The open version becomes out of date, the proprietary one wins, we're back to wizards and moats. What Microsoft is doing is the same thing for code that was supposed to have legal protection so it could remain open and avoid that fate. It's taking magic back from the people and making it exclusive to the "capitalist class" (eye roll) again.
It is now proven that copilot returns code from codebases with non-permissive licenses [1].
I'm curious - what are the legal implications of this going forward? I've so many questions.
1. Will Microsoft ever face lawsuits for these license violations?
2. If so, who/how? Class-action?
3. Will copilot be forced to open-source in the future? Under which license? Some open source licenses are incompatible with others, but copilot uses code from probably every OSS license conceived.
4. If Microsoft faces no justice, will we start seeing more OSS license violations? Will Google start using AGPL-licensed code?
"jethrodaniel" does not appear to have the copyright to offer that license, but it's hard for Github to determine that in general, so I doubt they would be liable for the error.
Even if it's somehow available under an MIT license (which is questionable on the part of jethrodaniel), there's still infringement. MIT isn't public domain, it still has
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Replicating it without complying with those terms is still infringement.
this. People are being willfully blind here, like cult members looking dead-eyed at their leader and chanting "This is great" as they drink the kool-aid.
And from Microsoft no less, once outcast for mass poisoning.
Actually the legal system is evidence based. Microsoft has evidence that the code they are producing is licensed under MIT as far as they can reasonably know. There's no definitive way to know who actually owns the original copyright. I could grant permission to use my repo, but maybe I got that code from someone else, who then got it from someone else and so on and so forth. It's a similar situation with stolen goods: if you unknowingly purchase stolen goods you usually cannot be charged for theft as long as there aren't obvious signs that it's stolen, such as the goods being priced far below market value.
Microsoft has evidence that the code they are reproducing is MIT licensed, so are they intentionally violating that license or does this AI thing include the license and attribution in every snippet it generates?
Major aspects of copyright infringement are strict liability, like a lot of civil actions around damages. It doesn't matter if you thought it was OK, there's still a damaged party that needs compensation according to the law. At best you'll simply avoid the criminal and punitive penalties.
No, PornHub doesn't have liability in a lot of cases because of 17 § 512, but has still had to deal with liability in general, which is why they nuked some 80% of their library not backed by verified individuals a while back.
A huge part of 17§512 is the DMCA takedown process mainly in 17§512(c)(3). Does Microsoft even have the ability to truly remove training data from the model? Or do they have to retrain on each DMCA takedown?
I personally don't want to have to upload proof of identity to GitHub and a signed document swearing that I own the copyright to all the code I upload to GitHub, or proof that I coded it. We need to be careful what we wish for.
> THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
If they had a reasonable basis for believing they had a license they're in the clear. "I didn't know" might not be enough but "I had good reasons to think otherwise" is.
I’m not a lawyer but my understanding is that these are torts, so all you have to prove is that Microsoft has liability. I think this would be easy to prove due to the way neural networks work, since it’s just a way of performing a search.
Since it’s a tort I don’t think you have to prove they should have known it would return copyrighted code; the fact that it does is enough to have liability.
IANAL. My understanding is that the general legal precedent in the US is that a) datamining text has no copyright implications (in the same way that reading a book has no copyright implications) and b) it is not a copyright violation to use a small amount of copyrighted material provided the context is sufficiently transformative. This might seem silly or unfair to you, but that is the current legal reality.
But even ignoring that, everybody uploading code to GitHub has given GitHub the right to analyze that code as per the GitHub ToS. This is the same mechanism by which you can't upload code to GitHub with a license that says "nobody is allowed to display this code on the internet" and then sue GitHub.
I can't imagine a scenario in which any lawyer would consider granting Github the right to "analyze" code anywhere close to granting Github the right to spit out that same code verbatim without your copyright notice (even if laundered by AI).
Here's Kate Downing, an IP lawyer specializing in software license:
> According to Downing, the answer depends to a certain extent on where that code is hosted. If it’s on GitHub, there very clearly would not be copyright infringement.
> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
Downing cautions that copilot output of large chunks of code complete with comments are more questionable to use, but that for the most part it looks above board.
> The licence is broadly worded, and I'm confident that there is scope for argument, but if it turns out that Github does not require a licence for its activities then, in respect of the code hosted on Github, I suspect it could make a reasonable case that the mandatory licence grant in its terms covers this as against the uploader.
To me, regardless of whether it is technically legal, it certainly doesn’t feel right. Furthermore, contracts rely on people understanding what they are agreeing to, and I don’t think many developers would agree to letting the code be used outside the terms of the license they uploaded it under.
I am very surprised there hasn’t been a legal challenge to it.
“I’m sorry your honor I didn’t understand what I was signing” I don’t think has ever been a valid reason in a courtroom, similar to “I’m sorry I didn’t know I was committing a crime” is not a valid defense.
Courts interpret the intended and understood meaning of contracts and terms all the time. Research the term "meeting of the minds" and case law around it.
When the terms were written, it's exceedingly unlikely that they intended it or anyone understood it to be blanket permission to allow a trained AI to copy code for others and no user would have interpreted it that way. Microsoft/Github can't necessarily unilaterally increase the intended range without making it clear in the terms.
If it got to a court case, and both sides could afford it, it could be a lengthy one.
(This comment is not legal advice. I am not a lawyer.)
How does "[allowing] a trained AI to copy code" change the interpretation of the ToS?
By uploading your code, you give Github an exclusive license to use it to improve their services. Copilot is such a service. Just because it's an AI and it provides others code does not somehow invalidate the license you gave.
Again, research "meeting of the minds". It's a standard legal term directly relevant to all contracts and terms. Also, "transparency" is another important one.
Many online services have very wide terms around what they can do with your data, which most people who bother to read them interpret as being what is required for them to handle the service for you without breaking copyright law. In that context, being able to use and analyse your data to improve their services could be another catch-all that lets them do specific performance optimisation on their backend.
One party instead deciding they've got blanket permission to do whatever they like with your work, including selling it to others, may well not hold up in court.
Contracts aren't programs and one party tricking the other rarely works out in court - courts world-wide tend to rule against trickery and deception.
> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
That's assuming that all code on GitHub is uploaded in good faith by the copyright owner, which is not always going to be the case.
Many repositories on Github were put there by people that do not own the copyright and never agreed to GitHub's Terms of Service.
Linux, for example, does not require copyright assignment. The original contributor of a change owns the copyright for that code and may have never used Github.
5. Even if it is illegal, is it actually bad?
No one can possibly sell code snippets, the transaction costs are many orders of magnitude greater than any reasonable price.
In my opinion, at least in this case the benefits massively outweigh the costs and the law should not apply here.
I really, REALLY like the idea of Copilot. I think it is a glance at what the future of AI can bring to improve programming. I understand where all the litigation and "uneasiness" is coming from, both from commercial and open-source projects.
I've not installed or used it for the same reason (don't want to use AGPL or GPLd code by accident, and don't want my closed source code to be used accidentally as well), but the thought of Copilot being "killed" due to litigation/copyright/licensing issues is sad.
For me, It's kind of like when MP3 first appeared: Sharing music in Napster or downloading Mp3s from Geocities was just amazing. The idea of having such things at your fingertips. Even though I understood the issue the authors had with the unpaid distribution of their music... still, the idea of "what could be..." made it amazing.
I guess Microsoft could be a bit forward thinking, and implement the "Spotify" model in code: Pay OpenSource developers (whoever owns the repo, or whoever made a commit?) a small amount whenever their code gets used through Copilot.
I'm super excited by how "Copilot" related services will look like in 10 years. And I really really hope that the technology/idea doesn't get killed by litigation.
Microsoft could have trained this on their own code and there would be no issue. The problem is instead of doing that they knew full well the approach would reproduce the code and they decided they would rather breach GPL than expose their own code. But I bet Microsoft has more than enough lines to train an AI, there was a clear choice to breach other peoples licenses in preference.
Huh... These comments have given me an idea: MS needs to be forced to train a model to compensate (pay) code authors and codebases based on snippet suggestions given by their tool: the Spotify model replacing Napster!
Some people won’t let you use their copyrighted work no matter how much you pay, that’s reasonable.
By all means allow repos to opt in, although if it’s licensed under something like GPL there’s no way to convert it to non gpl without permission from every contributor. I for one am not interested in Microsoft or anyone else paying me to close my code.
Allowing people to pay $xxx to copy my copyrighted work without my agreement is simple piracy.
Either they get international agreement to drop copyright as a concept, or they obey the law.
Of course it's bad. No one who put up their work as open source wants some huge company taking it and selling it to get even more competitive advantage and influence in the world. And that's without mentioning the people who put that into their license pretty much explicitly. Taking GPL code and getting away with it is a failure of our justice system, and that can't be made right by throwing pennies at developers.
Is there any leaked Microsoft code on GitHub? Someone should check if Copilot regurgitates that as well, then see how Microsoft reacts when someone slaps an AGPL license on that…
It seems like Microsoft could be in the clear on the basis of it being essentially "search". But it also seems like anyone who uses it could be risking to a high degree getting infected with copyright violating code.
My question is, if it isn't a copyright infringement issue to use copilot in its current form right now, why not just claim copilot was used whenever accused of copyright infringement henceforth?
> why not just claim copilot was used whenever accused of copyright infringement henceforth?
Without speaking to the particulars of copilot, this situation where laws seem toothless because of the ease of plausible deniability is actually fairly common. And in many such cases, the law is not as toothless as it seems, because
1. Getting multiple people to stick to a script under oath is difficult and dangerous.
2. Criminals frequently send each other messages like
A: "lol I just crimed, hope nobody figures it out."
B: "lol just say you used copilot".
A: "lolol yeah fuck the law"
Obviously this only gets the worst criminals, but there seems to be lots and lots of them.
Microsoft is trying to legally position Copilot like StackOverflow. It is possible to post copyright-infringing code on SO even though their TOS requires a CC BY-SA 4.0 grant to the company and its users.
> It is now proven that copilot returns code from codebases with non-permissive licenses [1].
That same Quake example from last year is repeated every single time.
Aside from the fact that GitHub has since added a protection for this, that this example gets repeated time and time again instead of a list of examples leads me to believe this is not (and was not) a common occurrence.
3) Not likely. Worst case a judgement will go against them, they'll effectively pay a fine and then they'll retrain it on a more restricted set of source code.
4) OSS has a pretty tragic history re: enforcement. It wins nearly every skirmish but has no interest in the war so from a big picture standpoint, it loses due to apathy.
You don't think a mountain of MSFT lawyers in every state, including partner law firms around the world, have thought about this? Do you practice law or are you speculating based on emotions?
No, SCO was founded in 2002, from Caldera Software, which was a Linux distributor [0]. How could Microsoft in the 1980s own a company that wasn’t founded until 2002?
That doesn't imply ownership and the article [20] that you pointed out doesn't make a specific claim. All good, but MSFT never fully owned or operated SCO at any level is the point I'm trying to make.
You aren't really saying anything at all. SCO would never have existed without Microsoft and Microsoft had a very significant stake in their business and gave it direction.
I mean, if it's autocompleting a fairly simple line, and can do that because it's analysed a lot of lines, I don't really see that as "stealing anything".
If you are using it to write whole complex functions that are the same as other people's, I guess that is copying.
But if you do the second thing you are not a great dev, and would have probably ended up copy pasting it anyway.
I think the first use case is far more common, and creating boilerplate that is so generic you could never really attribute it anyway.
You are responsible for your tool use. That's the same discussion as with whether uTorrent is responsible for your torrenting copyrighted stuff or with Tesla's auto-pilot. You buy the tool, you are responsible for what you create with the tool.
True, however, the users have been liable too. If my company gets sued because I used Copilot, it won't matter that much that the plaintiff also sued GitHub/Microsoft.
>If Copilot does it for you, it's GitHub's/Microsoft's responsibility.
GitHub/Microsoft says that it's still your responsibility.
>You should take the same precautions as you would with any code you write that uses material you did not independently originate. These include rigorous testing, IP scanning, and checking for security vulnerabilities. You should make sure your IDE or editor does not automatically compile or run generated code before you review it.
I'm not really sure how I am supposed to go about validating that I can in fact use this code that the magical black box barfed into my IDE using a bunch of different weights.
Let MS buy the BlackDuck scanner and integrate it into GitHub/Copilot. They could then suggest code and also scan it for any license violations, and give you both sides of the equation in the same tool.
If I pay for Grammarly, and it plagiarizes an existing work but represents it as an entirely new, independent work and I am unaware of the existing work that is being stolen, who is doing the stealing?
This makes more sense for text message autocomplete: if you just take the suggested next word after a one-word starting seed, it might reproduce a Wikipedia entry. But what did you expect? The same would be true with Grammarly if you somehow got it to produce a bunch of new text. You expected garbage, but somehow infringed on copyright instead. But I guess I think the user bears some responsibility for realizing that their expected garbage output isn’t garbage for some reason.
If you pay a shady character to get you a modern laptop for $100 you can't claim that you were unaware that it was most likely stolen, and the fact that you paid something for it doesn't absolve you morally.
So it's my job to check my supplier, to make sure lines from co-pilot are legit.
At the same time, when fast fashion companies sell T-shirts made with slave labour, it's not the company's responsibility to check what their suppliers are doing.
And if Tesla Autopilot kills you and your family, it's not their fault either.
Neoliberal morality - companies are never accountable for anything, it's heresy to suggest they should do their job properly.
Other than the first sentence nothing you wrote is true. If a company doesn’t do due diligence on their suppliers they face fines and possibly criminal charges. The news came out the other day that the NTSB is considering whether to require Tesla to recall all their vehicles with self driving enabled. Companies of all types face huge fines and civil liability for product safety issues.
I was talking more about ethics than the law. I try to live adjacent to the law (to an extent that is reasonable and sane) in all fairness, as it will never be in-line with my ethics.
It changes the code for use. I'm not sure it can be considered a copy. It's much like reading someone else's code and drawing ideas and patterns from that code.
I don't see it as "stealing" either. The neural network was trained with code as input. It's creating code as output. The output has nothing to do with the input once it is trained. Do people not know how neural networks work?
It's like saying GPT-3 created text is copyright infringement, because some author used the same sentence in a book before.
1) Copilot is not designed to output the source code for a project source file
2) It does not re-create the whole source code, just parts of it (sentences, not chapters)
3) The source code license, e.g. BSD, works on "the code" - copying a line like "void main(void) {" will not trigger it, obviously
My problem is with the weights not being released. They are a derivative work of open source code in the most literal sense. The weights would not exist without those lines. Gradient descent is using literal derivatives.
> If you are using it to write whole complex functions that are the same as other people's, I guess that is copying.
> But if you do the second thing you are not a great dev, and would have probably ended up copy pasting it anyway.
How would I know that the boilerplate I ask copilot to write for me is copied verbatim from a codebase that neither I nor Microsoft has a license to use?
>Hector Martin: If you use Copilot, you are basically playing Russian Roulette that the random mashup of existing, copyrighted, heterogeneously licensed code that you get out of it qualifies as an original work, mostly by chance. Or that nobody will ever sue you otherwise.
Well, that's already the case with Stack Overflow copypasta enterprise code. If anything, use of Copilot would be an improvement...
I feel this is more a meme than reality. I do check StackOverflow, but never have I taken an answer verbatim.
I try to see if it's the same problem and what was the approach in deconstructing it, which I find more useful in the long run.
Well, to be fair, most of that is probably just copying a particular syntax or built-in function, which (I think?) has nothing to do with copyright.
At least for me, that's most of the copies I do, followed by the ones that basically are 'call these functions in order', then paste it as a comment and use it as cheat sheet, and only very rarely I copy a 'creative' snippet almost verbatim, like a regexp matching email addresses, a to-hex or a crc calculation. And perhaps that's actually tricky.
It depends on what you need. In most cases the code on StackOverflow is not exactly what you need, so you need to understand it in order to adapt it. But if you're looking for a specific well-defined algorithm (MD5, say) then you can just copy & paste it.
I catch people using cut and paste code all the time. If there is a spelling error in code (Especially if it is in a code comment), I can guarantee you that someone copied and pasted it from StackOverflow.
He talks about code, and Copilot works with code, so I'm not sure how it "applies to any".
If you mean that if you make a "random mashup of existing, copyrighted, heterogeneously licensed" works of art (audio/video), it also applies that you might be sued for it, then yes.
But that's not much of an issue with Copilot if you're using it for enterprise code that's already a mashup of copypaste "existing, copyrighted, heterogeneously licensed" and that you won't release and nobody will see anyway.
Whereas audio/video you generally want to release.
If you make them for your own consumption, then it's my response that rather applies: since nobody will see it, and you don't release/sell/circulate it, you can go ahead and mix Michael Jackson, Disney and Star Wars material - nothing will happen to you.
> If anything, use of Copilot would be an improvement
What do you mean, Copilot regularly pastes stuff directly from SO. One of those automatic doc generators was able to point me to the exact answer where one of them was from.
A good and well argued opinion made hostile by saying "get over it" twice! Saying "get over it" discourages further discussion. Your comment would be better without it.
Not an expert, but fair use generally covers education, criticism, parody, and satire. There is a test for meeting fair use and it includes things like amount copied and commercial or non-profit interest.
The amount copied from any particular source might be small, but an aggregate strip-mining of many copyrighted sources is an interesting twist. Another might be, as you suggest, it might be a machine that itself does not violate copyright, but has the effect of causing users (who accept the suggestions) to violate copyright.
Yes, the copyright clause gives as its purpose "the progress of Science," but that doesn't mean that anything which claims to be "progress" gets a free pass.
Indeed, the US Supreme Court pointedly refused to accept that the purpose clause limits the power of copyright in "Eldred v. Ashcroft", originally filed as "Eldred v. Reno" (at least, that is my understanding as a non-lawyer).
Bit of a stretch to fashion AI-derived/AI-coauthored works as other people's work. Are DALL-E portraits done Picasso-style unrightfully selling Picasso's works? Is an individual selling portraits done Picasso-style unrightfully selling Picasso's works?
No, of course not. Joyce's literature was influenced by Ibsen, Mozart looked up to Haydn, Newton was humble enough that he openly professed he stood on the shoulders of his predecessors, Perelman refused the Millennium prize because it wasn't also offered to his colleague Hamilton.
Our skill doesn't grow in vacuums, without outside mentorship and guidance. There are areas where I am upset about the application of AI, but this is not one of them. Consider copilot a gentle guiding hand for those without access to a second pair of eyes nearby to give you reminders on what you may otherwise have on the tip of your tongue.
But in the same way that Led Zeppelin's refusal to recognize how heavily their music was influenced by Delta blues artists was unbecoming, I can accept the argument that it is perhaps douchey of Github to sit on Copilot as squarely their creation.
I do feel these arguments are valid if a little overstated. Most devs have googled, found some code, and pasted it in without thinking about attribution. Doesn’t make it right, but it is a question of how much code is being copied and how specific. For example, I peruse open repos to learn - I learned about the spread operator in JavaScript that way- doesn’t mean every time I use it I need to attribute whatever repo I saw it in. But, yeah, if I copied a larger chunk and the owner wants attribution, probably wrong.
I like the idea of having the bot automatically update an attribution file if it detects it has used licensed code. Seems like it would be fairly trivial. Also a robots.txt for repo owners to control automated use.
Also, they should totally pay back a portion of revenue to the community and support the repos used to train. That seems like it would be a good PR move if nothing else.
I like this take. Copilot to me seems a glorified (very intelligent) auto-search-paste/autocomplete service. It is just mimicking what usual devs do, which is to copy-paste code from StackOverflow/github for many mundane types of code, e.g. for loops, mongo find queries, and callback func definitions for JS devs.
The idea of auto-attribution if copilot surfaces licensed code is best, because then it keeps the copilot user honest about where the code is coming from and honors the original license.
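To make the attribution-file idea concrete, here is a minimal sketch in Python. It assumes some upstream detector has already matched a suggestion to a source; the `matches` records, their `path`/`source`/`license` keys, and the one-line entry format are all hypothetical and are not anything Copilot actually emits.

```python
def update_attributions(existing: str, matches: list[dict]) -> str:
    """Return the attribution file text with one line per new match.

    `existing` is the current file content; `matches` is a hypothetical list
    of dicts with "path", "source" and "license" keys.
    """
    lines = existing.splitlines()
    seen = set(lines)
    for m in matches:
        entry = f"{m['path']}: from {m['source']} ({m['license']})"
        if entry not in seen:  # skip entries that are already recorded
            lines.append(entry)
            seen.add(entry)
    # Keep a trailing newline only when there is something to write.
    return "\n".join(lines) + ("\n" if lines else "")
```

Calling it twice with the same match leaves the file unchanged, so a bot could run it on every accepted suggestion. The hard part, detecting the match in the first place, is exactly what this sketch leaves out.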
> It is just mimicking what usual devs do, which is to copy-paste code from StackOverflow/github for many mundane types of code, e.g. for loops, mongo find queries, and callback func definitions for JS devs.
I’m genuinely disturbed to see how many people in this thread think that casual plagiarism is the norm for “usual devs”.
Again, I get the argument, just think it’s overstated. First, when referring to stack overflow and blogs, generally, that’s intentionally shared with the express purpose of people copying it- hopefully while learning from it at the same time. Second, again with some code bits it’s not really plagiarism any more than all iambic pentameter is plagiarizing Shakespeare.
Devs often look at code to see basic syntax, understand algorithms, etc. There is absolutely nothing wrong with this. One should draw a line somewhere, but to say I need to attribute [...somevar] every time I use it because I happened to see it one time on a blog post is silly.
A thought experiment may help: Scrape Github for all unique strings longer than X and store in a file with a timestamp and owner. How large does X have to be before attribution is required? If not length, then how do you determine whether attribution is required?
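The thought experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "strings longer than X" is approximated by runs of X whitespace-separated tokens, and the `Origin` record with its `owner` and `timestamp` fields is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    owner: str
    timestamp: int  # hypothetical: e.g. Unix time of the first commit


def windows(code: str, x: int) -> list[str]:
    """Return every run of x consecutive whitespace-separated tokens."""
    tokens = code.split()
    return [" ".join(tokens[i:i + x]) for i in range(len(tokens) - x + 1)]


def build_index(corpus: list[tuple[str, Origin]], x: int) -> dict[str, Origin]:
    """Map each x-token window to the earliest origin that contains it."""
    index: dict[str, Origin] = {}
    for code, origin in corpus:
        for w in windows(code, x):
            seen = index.get(w)
            if seen is None or origin.timestamp < seen.timestamp:
                index[w] = origin
    return index


def find_origins(snippet: str, index: dict[str, Origin], x: int) -> set[Origin]:
    """Return the origins of every indexed window that appears in snippet."""
    return {index[w] for w in windows(snippet, x) if w in index}
```

With a small X, nearly every snippet "matches" someone (a bare for loop header is enough); with a large X, almost nothing does. The code makes the question mechanical but does not answer where X should sit.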
So, how often does it actually happen? Does it happen more often than for a human? Does anyone actually have numbers on this?
Of course, if you already provide a copyrighted prefix, and it has seen that code, the chances are high that it would complete the copyrighted code (because that is what you would actually expect).
So, for real use cases in the wild, where you write your own novel code, how often would it suggest some copyrighted code? And how often would a human?
I have used Copilot in recent months and I have never seen such a case (I can be pretty sure because all the identifier names are really unique, and the code was very custom).
However, I assume that I myself might have produced copyrighted code unknowingly because if you write common patterns (e.g. some tree or graph search, or some sort function, implement LSTM or Transformer, whatever), the chances are not so low.
It’s the same problem with those ML models, the other day someone generated a children’s book using GPT3, turned out that there is a real children's book with the same name and a very similar content: The Very Lonely Firefly by Eric Carle.
Another thing I'm worried about: how do you retract facts from an ML model? I guess it's impossible; you need to retrain from scratch with part X removed from the training set. Or... people could invent layered ML models similar to docker, where each layer would be marked with the data it was trained on. Then at least you'd have some cache of the trained model to re-use in the next training session. Nasty stuff.
Or instead of inventing complicated layered ML models Github could just use each repo's license information to decide what's okay to use. Detecting licenses is already a feature on that site.
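As a rough sketch of what license-based selection could look like, assuming each repo record carries a detected SPDX license identifier: the record shape and the allowlist below are illustrative choices in Python, not GitHub's actual data model or policy.

```python
# Hypothetical allowlist of permissive SPDX identifiers.
PERMISSIVE = frozenset({"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"})


def eligible_for_training(repos: list[dict], allowed=PERMISSIVE) -> list[str]:
    """Return names of repos whose detected license is on the allowlist.

    Repos with no detected license (missing key or None) are excluded,
    since no license means no permission has been granted.
    """
    return [r["name"] for r in repos if r.get("license") in allowed]
```

Note that this only filters the training set. As pointed out elsewhere in the thread, even MIT requires the copyright notice to travel with copies, so a filter like this would not settle the attribution question on its own.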
Interesting, it's a big question I've had for a while, how 'original' stuff coming from these AI systems is, and also the distribution of uniqueness over many answers. I haven't dived into it yet, but I find it surprising how little this comes up when these systems are discussed (ie here on HN).
Does anyone even know? Can we even check? What if 1 in a thousand, or one in a million outputs is (very close to) something existing? I find this especially relevant when generating faces.
I'm a bit mixed on this. The code Copilot usually autocompletes for me is not particularly novel; it's just mundane stuff I would write anyway. Most of these snippets are not copyrightable in my opinion, because they were obvious in the first place. Like CSS nth-child odd / even logic, or one case where it filled in ~10 lines of JS logic for filtering rows by a category stored in the dataset, which I would have written anyway.
Then there are cases where it amazes me completely: it wrote 10 lines of C++ code for rendering monochrome glyphs with bits using the Freetype library. It did have an odd subtle bug, though: the glyphs came out reversed, and it worked with only a certain font size, which it seemed to pick up from a different file altogether.
This type of argument always distracts from the fact that the hard part is figuring out where we draw the line between theft and reimagining.
The Magnificent Seven for instance was a reworking of Seven Samurai, but stands on its own as an original creation. Going into a cinema and filming a picture to later put on a torrent site is not artistic reworking.
The hard discussion is about what is acceptable, we all know prior art exists.
What if we just say "both"? Libraries were a thing for millennia and writers still wrote books. There are costs to IP laws and the benefits aren't obvious.
Because writing a book, shooting a movie, composing a song, takes time?
So either those pieces are IP-protected, and their author can make money with it, or we have to set up a basic income for everyone, and art becomes free.
It's perfectly consistent to say both that there needs to be a system to ensure creators are compensated and that the current system for doing so is terrible.
It is consistent but useless if you have no suggestion as to what would replace the current system in a way that preserves the benefits to both parties.
1. Creators get a sustainable reward for their work. They wouldn't do it otherwise. I certainly don't do it for fun.
2. Consumers get to access that work as they wish.
(Of course, this being HN, I'd expect any ideas to apply to developers as well as to writers and artists i.e. if writers have to give up copyright, so do developers, startups, and so on.)
Keeping the benefits intact for both parties is a non-goal.
How about 14 year max copyright terms? Make copyright unsellable and uninheritable, so you don't get massive copyright hoarding entities that can distort legislation for their own benefit?
That's just two suggestions off the top of my head. I do get tired by false dichotomies.
The grandparent comment said the benefits of IP laws were not obvious. So this is about the benefit of the laws as they currently exist, and that implies enforcement of said laws.
Not in the USA, where the "first-sale doctrine" means once you buy a book, you can do whatever you want with that copy of the book (lend, rent, sell, destroy) without needing a license. Libraries in the USA definitely don't pay a fee beyond the purchase price of the book (or they can legally lend donated books etc). Copyright holders don't make any additional money from library lending.
I am not familiar with how it works in other countries, but I have heard something about there being such a fee.
(It's not quite true to say libraries have existed for "millennia" though, with regard to this issue. Mass produced printing hasn't in fact existed for millennia; libraries 1000 years ago had hand-copied manuscripts, probably mostly scrolls. The effect on "the market"? For whatever reason authors were writing then, it was not to make money by selling reproductions of their writings; that wasn't a thing. Which means, yeah, btw, people still wrote things and made up stories even when they couldn't make money by charging people for copies to read...)
It was news to me so I checked and it's true. Since 2016 in my country ;)
And it's a symbolic amount for the vast majority of authors (country-wide it's around 5-5000 USD per year per author and the distribution is heavily skewed towards 5 USD).
So yeah :) I think authors were fine without these 5 bucks a year.
EDIT cause it might not be obvious. It's not per library. It's per country.
> This type of argument always distracts from the fact that figuring out where we draw the line between theft and reimagining.
This seems to be missing a word, could you clarify?
Also: since you mentioned theft: this actually comes down to the discussion whether you can own thought and/or digital artifacts which can be replicated without taking anything away from the "owner".
Given the absolute choice I'd rather pick complete freedom than restriction. I suspect that anyone's opinion on this follows what they value higher: creation or exploitation.
Sorry, I should have double checked, that sentence was incomplete. Yes, I meant to say that a more nuanced approach is crucial, and that means rejecting that we have to choose between Disney-backed extreme IP laws or total freedom.
There are many differences between those acts of thievery or inspired creation however you might call it. But there are many similarities too. Fascination with the original is one. Desire to own it in one way or another is one too. Differences are in the skills, the means, the result, what was stolen and financial success that came out of the act.
> Pretty soon the world is going to come to realize art/creation is just blending, incrementing and repurposing prior art
If that happens, the big copyright/IP conglomerates will immediately jump on that and make sure that laws are adjusted and they get their cut of every single word and line anyone puts near their smartphones ;)
I'm not sure what you mean by "theft-less", but I believe you might be conflating inspiration with derivative work: Copilot can produce verbatim copies of open-source code, which would make it more similar to how some musicians sample other people's music to create new music.
>Pretty soon the world is going to come to realize art/creation is just blending, incrementing and repurposing prior art.
That applies to everything, it's even a basic law of physics, and there's absolutely nothing wrong with it. Any layperson already knows what a remix is anyway, so not sure what you think will change.
Unless every invention is gonna be AI generated (which is kind of a scary situation), intellectual property still needs to be a thing (otherwise people won't have incentive to invent, it'll just be stolen from them).
People have an innate desire to invent and create. This is why so many people do it for zero extrinsic reward. Hell, this is the case for almost every musician. They are fed a pittance in streaming, only a bit more than most OSS developers get.
This intrinsic motivation is more normally "farmed" by investors who capitalize and capture the IP value for themselves. This actually has a detrimental effect on innovation.
Doing away with or watering down intellectual property protections will just take big meaty chunks out of the stock market and partly equalize wealth distribution.
It'll probably spur innovation too - historically it usually has, but preserving the existing social order takes precedence over that which is why a lot is invested in persisting the myth that it aids rather than hinders innovation.
> otherwise people won't have incentive to invent, it'll just be stolen from them
Citation needed. Speaking personally, I spend most of my creative energy on a project which is open source and permissively licensed to the point where I’m fine with anyone stealing it. I expect to earn negative money from it at the limit.
Why do I do it? I dunno it’s fun. Can’t that be enough?
They have been the same for most of history. People could openly copy titles, plots, parts, phrases, etc from prior work. Same for mechanical designs. The only thing preventing them was obscurity (e.g. the inventor trying to make it hidden) not any law or ethical idea that it's bad (there wasn't any). That's how things from math to gears to tunes got better (or changed over time, in the case of art, as better/worse is subjective there).
E.g. globally and historically folk music has been basically taking whatever you want from tunes and songs where everybody does the same with no "permission" asked or needed to be given.
Like 4 verses but want to add a fifth or change some part? Go ahead. Want to play it exactly like you've heard it? Go ahead again.
The idea of "theft" in that regard came in the last 2 or so centuries, and was enforced with artificial legal barriers and new "ethical" concepts that are neither "natural" nor present for the vast majority of history (including golden ages of art production).
Not sure why I'm being downvoted here - I agree that this idea has been the same for most of history.
Your example of folk music is an odd one, for exactly that reason - it largely repurposes existing art. For example, Wagner wrote extensively about why we shouldn't respect folk music for this reason. I mostly disagree with him, but his comparison at least illuminates that this isn't so black and white. And that's really just scratching the surface of a complex topic.
I sense that if someone came along 2400 years ago with the exact play that Sophocles had just produced and claimed they had just composed it themselves, immediately after a public performance, someone would claim that theft had occurred. Do you disagree?
>I sense that if someone came along 2400 years ago with the exact play that Sophocles had just produced and claimed they had just composed it themselves, immediately after a public performance, someone would claim that theft had occurred. Do you disagree?
Yes. They would say it was "plagiarism", which is different than theft.
>because nobody gives a sh*t about art created by AIs. And nobody will.
You'd be surprised. Especially if people don't care, or aren't told, whether it's "created by AI or not".
Whether in "high art" or lowly pop, "generative music" (and fine art) has long been a thing. And people do attach to it (e.g. to Brian Eno's generative works made by rule based systems he programs).
No, I will not be surprised. Outliers are outliers.
"Art" created by AIs will just have price (and cost) of ~0 and, like everything that has a price/cost of 0, nobody will give a sh*t about it. The only real question is how human artists (provided they exist in your preferred dystopia) will prove that they have created something themselves.
>No, I will not be surprised. Outliers are outliers. "Art" created by AIs will just have price (and cost) of ~0 and, like everything that has a price/cost of 0, nobody will give a sh*t about it.
Art doesn't touch people because it has cost.
In fact, for ages certain types of art had no cost - poetry, public festivals, and so on. And many still don't (e.g. free punk/underground/indie/etc public performances), Soundcloud music, and so on.
Most movies and series seen on TV are also ~0 (and for kids, everything is ~0, as their parents foot the bill), but they're still touched by them.
>The only real question is how human artists (provided they exist in your preferred dystopia) will prove that they have created something themselves.
Note the loaded words "your preferred dystopia" (who says whether I prefer it or not? I merely describe what's the case. You have some ethical/political point to make).
As for the answer to the question, they won't have to. People respond to the quality of the work, not who made it (and whether they used AI or chance - another popular method - or not).
In fact tons of genius artists have described themselves not as the creators but as "mere conduits", and say the music/words/etc come from "elsewhere" (implying god, some muse, some spirit, etc). Especially when they feel the most "inspired" (the word itself means "visited by the spirit").
None of those things had zero price and zero cost. The fact that the consumer didn't pay directly for them is irrelevant.
You can try testing your theory by trying to sell a "painting" created by DALLE/whatever for more than a third-rate amateur painter can sell one of his. Good luck with that, especially when access to the model becomes easy.
>People respond to the quality of the work, not who made it
This is so painfully incorrect and naive (and contra anything we know about the value of everything which creation has been automated before) that I think it's meaningless to continue this conversation.
>You can try testing your theory by trying to sell a "painting" created by DALLE/whatever for more than a third-rate amateur painter can sell one of his. Good luck with that, especially when access to the model becomes easy.
As if that proves anything? Sale price is irrelevant. There are paintings sold for millions that 99.9% of the people could not give less fucks for, and "amateur painter" stuff that touches most people who see it.
It's also not like a $2 million in production costs Michael Jackson song with $50M sales is "better" artistically (as opposed to commercially) than a song composed and played by some random guy on an acoustic for ~0.
>This is so painfully incorrect and naive (and contra anything we know about the value of everything which creation has been automated before) that I think it's meaningless to continue this conversation.
It was meaningless to begin with, as you don't discuss, you present your "ultimate truth" ("contra anything we know", lol).
In fact there are tons of works where the creator is anonymous (from folk music and art to early house, techno and rave music, a scene with cherished anonymity), and people respond to it just fine...
> The idea of "theft" in that regard came in the last 2 or so centuries, and was enforced with artificial legal barriers and new "ethical" concepts that are neither "natural" nor present for the vast majority of history
This is true for other forms of property as well, like land ownership.
If you assigned a task to a junior dev, and he/she used some code from open source projects and Stack Overflow to develop a custom program for the task, would you say that this person is selling you other people's code? Is it common or expected for this type of use to be acknowledged?
People I've worked with have different philosophies on this, but personally, if you check in code that is distinctive enough that I can identify the source you copied and pasted it from, and you provided no indication (whether in a comment or a PR description) that you copied it, I will really get quite grumpy at you about it.
Way too often I burn half an hour needlessly during review in one of two ways:
* trying to figure out how the heck someone figured out some "magic" code that achieves something by invoking a bunch of poorly documented library or framework internals, and trying to reverse engineer WTF all the magic does by diving into the framework's source... only to eventually think to google the whole snippet rather than each individual method call, and discover it's copied from a Stack Overflow answer
* trying to figure out why something was written in an unidiomatic or overcomplicated way rather than a more obvious approach, and commenting at length on how I'd simplify it... only to eventually realise it was copied from a Stack Overflow answer
Attribution isn't just about making sure the right person gets credit, or about license compliance; reviewers and maintainers frequently need to be able to see where stuff was copied and pasted from in order to do their jobs effectively, even for snippets of just a few lines.
I understand where you are coming from. However, I think you are making the assumption that this person simply copy/pasted some code with no understanding of it, or that this code is then very different from your codebase and needs to be refactored. If using Stack Overflow did not add to your overall development time but subtracted from it, because it was used as an appropriate piece of a much bigger puzzle (a far more realistic scenario for both Copilot and our general use of SO), then I see no issue with it whatsoever. Certainly no moral or copyright issues as this person on Twitter implies.
No copyright issues in the sense that no entity is likely to ever pursue the matter, sure. But copying and commercially using someone else's nontrivial bit of code that doesn't have a license that says you can is quite blatantly a copyright violation.
About 10 years ago or so, I was working at a certain place. They put me into a small team apparently focused on some R+D project under the direction of an "architect".
Basically, the project was to package Cordova + Backbone + Marionette, plus a couple of tools, under their own commercial name. Then they'd go around potential clients presenting it as the perfect solution to build hybrid applications for web/mobile/smartTV/whatever.
A certain Monday, the "architect" arrived boasting. He did that often, but this time he was more boastful. He explained that he had spent the whole weekend coding. He had written an incredible tool that would create a skeleton for a project from zero. You would type something like `tool create` and it would create the whole project with all the scripts and some example views and whatnot.
It was Yeoman's yo CLI tool, of course. He had just changed the copyright in the comments, removed most of the other comments, deleted any mention of Yeoman or the original creators, and changed the name of the executable script, and that's it.
The whole thing was OS code picked up from various repos and packaged as their own. The company used it to sell development projects. The so-called-architect used it to sell himself inside the company and then jump away into a startup as CTO.
Is this common or is it just anecdata? I don't know. It's clearly not the only time I've seen something like this and I do know that in certain companies around here it isn't exactly uncommon. But I can't say how common or uncommon it is.
Would I call this "selling other people's code"? Yes, I would.
If the solution was made up of ideas from OSS and snippets from Stack Overflow? No; that's fine.
If the solution was copied from an OSS project without proper attribution? Yes. Absolutely. And they'd have words with a senior dev and maybe even legal if the code they copied made its way into production without attribution.
Many copyleft OSS licenses require attribution and distribution of derivative works that we wouldn't allow.
It depends on the source of that code and the expected license of the code you paid them for. If everything is MIT/BSD (and attributed), no problem. If the code was GPL and I’m making a commercial product, we have an issue.
I’d also expect for any stack overflow code to include a comment with a link to the stack overflow page.
I think one of the key points is to make sure any code taken from another source is cited appropriately. If it isn’t, or the junior dev is passing it off as their own work, then we have problems.
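For what it's worth, the kind of attribution comment being asked for here is cheap to write. A minimal sketch: the `chunked` helper is just a stand-in for "some snippet you copied", and the URL is a placeholder with the answer id deliberately left out, not a real link.

```python
# Illustrative attribution header for a copied snippet.
# Source: https://stackoverflow.com/a/<answer-id>  (placeholder, not a real link)
# License: CC BY-SA, per the site's terms
# Changes: renamed variables, added type hints and a docstring.
def chunked(items: list, size: int) -> list:
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunked([1, 2, 3, 4, 5], 2))
```

Three comment lines are enough for a reviewer to find the original, see what license it came under, and know what was changed.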
If I found out a junior dev had been copying copy-left or proprietary code then I'd have to rip out that code, have a chat with them and figure out what to do from there. Even if the code isn't copy-left it's still someone else's code, sometimes that's ok but sometimes it's definitely not.
No matter how complex a program is, and no matter whether it uses techniques sometimes described as "AI" in its implementation, it's not a person. Copilot is just a very complex pipeline from other people's code to your editor, which ignores the license of those other people's code.
This is a good thought exercise. I wouldn't call it stealing, though I am not sure how legal liability is assessed, say if they picked up GPL code unknown to the company, and the company is later sued over it.
This isn't derived from principled reasoning, but I think of it as similar to community norms. Not the best example, but you wouldn't mind someone subletting their homes to Airbnb, but if all of your apartment complex does it, it invites regulation. A product like copilot enables copying code (even if inspired, and not verbatim) at a scale that individual developers can't. So respecting software licenses needs to be codified (legally?) while previously it was left unmonitored.
Could you elaborate on why you think a computer program and a person should be treated the same way in this respect?
We can take as self-evident that a human is capable of reading about something, conceptualising it, and then writing something completely new with the knowledge they have gained.
I think it's also pretty uncontroversial that the primitive "AI" we currently have is nowhere near the level of even an average human at these things, and thus we can't just blindly assume it is conceptualising rather than copying. Copilot regularly produces verbatim copies of existing code when working on non-trivial things.
Forget about the "AI" label: Copilot is just a complex computer program, that takes code from other people and inserts various permutations of it into your editor, whilst ignoring the license of that code.
I think it's best if we sidestep these big conceptual questions about what cognition or creativity really are. It's hard to find agreement, and perhaps it is not necessary to do so.
My position is that if a person hired in a company can currently use Google, Stack Overflow and GitHub to help develop their custom scripts, and no moral or copyright issues are infringed (ie, you don't try to say you came up with it on your own, and you use only enough that it is clearly fair use), then I think an AI should be able to assist in that task. There is no need to complicate things by legislating what the AI is doing and what Google is doing, as they are very similar things and in fact even use similar methods.
I would agree with you if the AI was genuinely assisting with that task, but it isn't.
It's taking inputs, ignoring their licenses, permuting them in ways that are not understandable to the user, and then outputting them.
That's an entirely different task than the user reading SO or using Google and then writing their own code, because the "AI" is not capable of writing its own code at that level.
Relying on this tool means ignoring the license of code that you're copying, without even knowing that you're doing it.
> That's an entirely different task than the user reading SO or using Google and then writing their own code, because the "AI" is not capable of writing its own code at that level.
I would say it's a very similar task. If I need to remember how to use a certain function, I can Google for documentation and examples, or I can tell Copilot what I want to do. The fact that the solution was presented by Copilot or a SO thread is, in my view, irrelevant. And to compound on that, I doubt anyone checking SO truly knows where that answer came from. The person could simply be reproducing a snippet from somebody else, you have no way of knowing if it was licensed.
I don't think this is bad either. Even our current shitty copyright laws protect that kind of use. I shouldn't have to worry whether my little prime number generator uses an algorithm first created by John Carmack or Microsoft. Programming has evolved rapidly in great part because we can all use other people's work and use it to improve ours. Of course you shouldn't just copy and paste everything and call it a day, but that's hardly what Copilot enables anyway.
You really seem to be ignoring the core issue by focusing on SO though. Everything on SO is fair game, but code on GitHub is under a variety of licenses, and when Copilot regurgitates it, no matter how complex and inscrutable the process is that leads it to do so, it may be causing the user of Copilot to misuse that code because it doesn't even give them the opportunity to know where it came from or what license it was released to the public under.
> Do you go and check whether a given reply belongs to a licensed project?
All SO questions, answers and comments are CC BY-SA. The terms of the site say that anyone submitting this content agrees that it's licensed that way, and when you visit the site you agree that you are provided with the content under that license. It's not necessary for you to check whether the submitter had the right to offer it under that license; that's their problem. The same goes for any content offered to you under a given license on any platform. I don't understand what your question has to do with the conversation.
The problem with Copilot, and I really can't believe this has to be restated over and over again, is that it takes code from projects with various licenses, and outputs it in your editor in various transformed-or-not-transformed ways (the fact that the transformation is extremely complex doesn't change anything), and gives you no way to know where the code came from, how it was licensed or how it has been transformed. So, despite the fact that if you use it enough you are virtually guaranteed to use code in contravention of its license, you cannot even know which projects you have stolen code from or which licenses' terms you are breaking.
> Also, please consider that there is a toggle that allows you to block Copilot from using public code.
Great. I'm sure its utility doesn't go down at all if you turn that toggle off...
> All SO questions, answers and comments are CC BY-SA. The terms of the site say that anyone submitting this content agrees that it's licensed that way, and when you visit the site you agree that you are provided with the content under that license.
Have you ever read GitHub's conditions to know whether they also have the right to use your code that way, no matter how you decide to license it? I feel that you are overly focused on the legal part here, which I'm sure was handled by Microsoft's lawyers. I'm more interested in the question of principle.
No matter what the terms of use at SO say, anyone can give you an answer that is a copy of some code they don't own. You may consider that immoral, but I don't, not at the scope SO is used for. In addition, the vast majority of cases at SO and Copilot are not about complex functions being stolen, it's about some dumb code you would have found in 2 minutes of googling. What I'm trying to argue here is that if we are all cool with SO and think it's useful, there is no fundamental difference here. We never cared too much about licenses for boilerplate code, and I think we shouldn't start now.
> Have you ever read GitHub's conditions to know whether they also have the right to use your code that way, no matter how you decide to license it? I feel that you are overly focused on the legal part here, which I'm sure was handled by Microsoft's lawyers. I'm more interested in the question of principle.
I have, and there is not. Neither could there be — in many cases the person uploading code to GitHub is not the copyright holder — they are just doing something permitted under the license — and for a large open source project there could be thousands of copyright holders. A random person mirroring some source code to GitHub is in no position to negotiate different license terms on behalf of the copyright holder(s).
> No matter what the terms of use at SO say, anyone can give you an answer that is a copy of some code they don't own. You may consider that immoral, but I don't, not at the scope SO is used for. In addition, the vast majority of cases at SO and Copilot are not about complex functions being stolen, it's about some dumb code you would have found in 2 minutes of googling. What I'm trying to argue here is that if we are all cool with SO and think it's useful, there is no fundamental difference here. We never cared too much about licenses for boilerplate code, and I think we shouldn't start now.
I don't understand why you think a person writing an answer on SO and a computer program outputting some permutation of its inputs into your editor are the same thing. The person writing an SO answer is intelligent and capable of conceptual understanding, the computer regurgitating code without regard to its license is not.
>> Have you ever read GitHub's conditions to know whether they also have the right to use your code that way, no matter how you decide to license it?
> I have, and there is not.
At least one IP lawyer strongly disagrees, suggesting anything you host on GitHub is fair game [1].
> The person writing an SO answer is intelligent and capable of conceptual understanding, the computer regurgitating code without regard to its license is not.
From a copyright perspective, that is irrelevant. In fact I would think Copilot has more incentives to not infringe than a random SO user, who is very unlikely to be sued. I already argued in another post that in my view, from any perspective, it is also irrelevant whether it's a person or AI doing the same work Copilot does.
> At least one IP lawyer strongly disagrees, suggesting anything you host on GitHub is fair game [1].
The question is whether Copilot's users can use the regurgitated code without following the license terms, not whether Copilot was allowed to train their model on it. I agree it's likely fine for them to train the model, but the use of Copilot would seem to be a legal minefield.
A little thought makes it clear that an affirmative answer would be absurd. This would mean that using a simple tool (let's say `cat`) to make a copy of some code and subsequently ignoring its license terms is infringement, but if the software used to make the copy is more complex (or perhaps if it has the "AI" label stuck to it!) the same actions are not infringement.
If I make a script and train it on Windows source code, do you think MS will like it if I use that script on Wine? I am sure MS will say the license did not allow it and your script's transformations are not original, so GPL or similar licenses should be respected by Microsoft too.
>My position is that if a person hired in a company can currently use Google, Stack Overflow and GitHub to help develop their custom scripts, and no moral or copyright issues are infringed (ie, you don't try to say you came up with it on your own, and you use only enough that it is clearly fair use),
Only a judge will determine if it is actually fair use. If you by chance copied some super clever and unique code into your code base, then I am sure a judge will not say it is fair use. Copilot has been shown to do this (though MS said they put some IF-ELSE checks in the AI to keep the plagiarism from being detected, by removing obvious results and maybe obfuscating stuff more).
Maybe Stack Overflow license allows you to copy paste the answers in your code, but GitHub code has repo specific license that you need to respect.
If MS trained the model on all their private repos too and made the model free software, then many would not have these issues. Or keep the model proprietary and train it only on the MS repos and on BSD and similarly licensed repos.
You are saying that the AI should be treated the same way as a person would regarding its 'output'. I disagree. This is a conceptual disagreement and you cannot just sweep under the rug "what cognition or creativity really are".
At the end, when in several (2-5) years we start seeing structural unemployment emerging because of AI deployments, this will be resolved by the legal system, most likely by some sort of partial prohibition of training/monetizing such systems.
I think I still have not understood your argument. Are you saying that you are afraid that AIs will become too powerful and cause unemployment, and therefore we should regulate them now before they do so?
Many people are worried about this, which is why there is a lot of debate about minimum income programs. However, at present, what Copilot is doing is similar to what Google does, and it is certainly not going to replace devs any time soon. Personally, I think we should exploit technology to its fullest, and the only reason we can have this conversation is because in the past, we haven't given too much consideration about the mailmen, secretaries, delivery workers and everyone else who got displaced by our use of the internet and similar technologies. We merely adapted to better exploit them.
I am not saying (in that last comment) what should happen, I am saying what will happen.
Past automation in terms of impact is nothing compared to what's coming and people and lawmakers will react accordingly - not in favor of the automators.
Copilot understands concepts as well as many humans. You can see primitive versions of this in the old Word2Vec demos showing how those models understand how London:England ~= Paris:France.
Copilot is much more sophisticated than that, and it no more copies code than a human does. It generates on a character-by-character basis, given the contextual probability of the next character conditioned on the previous set of tokens, with the "heat" being a factor in how randomly it will choose characters.
This is much more similar to how a human writes than "copying".
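To make the sampling description above concrete, here is a toy sketch of that "heat" knob, which is usually called temperature. As far as I know, models of this kind sample whole tokens rather than single characters, but the mechanics are the same. The logits are made-up numbers and `next_token_probs` / `sample` are illustrative names, not anything from Copilot itself.

```python
import math
import random


def next_token_probs(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature.

    Low temperature concentrates probability on the top-scoring choice;
    high temperature flattens the distribution toward uniform.
    """
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Pick one index at random according to the scaled probabilities."""
    probs = next_token_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
for t in (0.2, 1.0, 5.0):
    print(t, [round(p, 3) for p in next_token_probs(logits, t)], sample(logits, t, rng))
```

At a low temperature the top-scoring candidate is chosen almost every time, which is also the regime where reproducing a memorized continuation is most likely once the prompt strongly matches something in the training data.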
"it no more copies code than a human does" < that's a very big call right there, considering how much verbatim copying has already been documented in Copilot. The primitive understanding Copilot has of what it is generating doesn't even approach that of the most average programmers. It's classic AI: impressive on the surface.
All the "copied code" I've seen is where the person prompts it with a large amount of very unique preamble and then it fills in the exact example they are quoting from.
Try it without doing that.
And it's weird people think it can't understand conceptual relationships. Word2Vec demonstrated that nearly 10 years ago and that's a much weaker model in terms of both size and techniques than this is.
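For anyone who hasn't seen those demos, the analogy trick is just vector arithmetic plus a nearest-neighbour lookup. Here is a toy sketch; the 3-D vectors are hand-picked for illustration and are not real Word2Vec output (a trained model learns hundreds of dimensions from text), and the `analogy` helper is a hypothetical name. Whether this counts as "understanding" is a separate question.

```python
import math

# Hand-made toy embeddings (assumption, not trained):
# the first two axes encode "which country", the third encodes "is a capital city".
EMBEDDINGS = {
    "London": (1.0, 0.0, 1.0),
    "England": (1.0, 0.0, 0.0),
    "Paris": (0.0, 1.0, 1.0),
    "France": (0.0, 1.0, 0.0),
    "Berlin": (0.6, 0.6, 1.0),
    "Germany": (0.6, 0.6, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Solve a : b ~= c : ? by finding the word nearest to b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    # Exclude the three query words, as the classic demos do.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


if __name__ == "__main__":
    print(analogy("London", "England", "Paris"))
```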
> And it's weird people think it can't understand conceptual relationships. Word2Vec demonstrated that nearly 10 years ago and that's a much weaker model in terms of both size and techniques than this is.
Saying that Word2Vec or Copilot have "understanding" of their input requires a redefinition of the word "understanding".
If we're all standing on the shoulders of giants (specifically code that other people wrote) then really what Copilot is selling is a ladder to get onto those shoulders faster. I think that's a legitimate aim, as such. However it should be careful about not including unlicensed code and should have a specific 'GPL' option for a model trained with GPL code included.
I suppose it should also generate appropriate copyright notices to satisfy many open licenses. I'd be surprised if copilot could actually link back to the original code like that, though.
Let us also assume that this snippet, unsurprisingly, has been in several copyrighted repos that didn't grant Github the right to share this code.
So I start typing "getName" and Copilot suggests the exact snippet above. If I use this snippet, is it plagiarism? Even though the above code is the most "obvious" way to write this getter and I would have written it this way even without Copilot's suggestion? Or does the "uniqueness" or "non-trivial quantity" of the suggestions have any bearing in determining copyright violation? How/where do we draw the line?
Lucky for you, if you wrote a noise function that Copilot returned as an implementation of Perlin noise, you'd be breaching a _patent_! Said patent just finished its 20-year run and expired, so you'll be okay this time!
I disagree. Most large projects, software or otherwise, use existing parts. If you design an innovative device you'll still use some standard components like chips, memory modules etc.
There's already a way to quickly solve the boring parts in development - libraries which were built and licensed around that purpose. But Copilot passes you code of unknown origin, with unknown license terms and no information about how close it is to an existing codebase. It's like a person trying to sell you Macbooks for a hundred bucks per unit but you don't know where they came from and who made the holiday photos stored on the harddrive.
99% of the "problems" I'm solving when I'm working even on very interesting and challenging problems, are boring subproblems. If I can get those out of the way then that would be great.
Even the most interesting problem will have extremely boring bits. If you write a CLI tool to solve all of the world's problems by channeling magic, you'll still need to add the parameter handling and do some error management, which is repetitive and likely well generalised and predictable based on other projects.
That hypothesis is easily disproven by spending an afternoon on a side project with Copilot.
No matter how interesting your problem is, translating it into code is going to involve a lot of grunt work. This isn’t just boilerplate, but also the large portion of your code which is going to be gluing things together.
The time you spend working through those menial parts of your code is time when the context of the interesting part of the problem fades. Once you get the mechanical stuff out of the way, you have to load the interesting stuff back into your brain.
This is where AI coding tools really shine. They dramatically reduce the intervals between when you can think about the actual problem you’re solving by letting you get the boring mechanics out of the way more quickly.
I'm very curious to see some examples where Copilot autocompleted something truly useful and saved you time - and that also disproves my hypothesis that you are doing something boring or with the wrong tools/languages/frameworks. Things that a non-ML autocomplete could do don't count.
I can give you an example of an entire (well, I still consider it alpha) library I wrote several months ago, using Copilot: https://github.com/osuushi/triangulate
This is an implementation of a 1991 paper on polygon triangulation into Go. So the deepest thinking about how to solve the problem was obviously already done for me, but there were a number of edge cases that I had to invent my own solutions to, and the translation itself involved keeping a lot of context in my head.
I can’t tell you in precise detail what Copilot did, and what I wrote by hand. I wasn’t taking notes or recording my screen. But there’s a reason you don’t see a lot of blocks in there where I forgot to comment anything, because my entire process for this was “type what I want to do in English, and see if Copilot will generate the next snippet, or something close”. I didn’t do this out of bloodyminded dedication to the AI cause, but because it continued to be an extremely effective way to get the code written quickly.
I can give a few specifics:
- My linear algebra is rusty, and Copilot was extremely helpful here. I would often just type the basic thing I was trying to do in pretty vague linear algebra terms, and it would generate the formula.
- I wrote a lot of tests like this https://github.com/osuushi/triangulate/blob/main/internal/sp.... This is a minor thing, but those aren’t copy-pasted. Instead, I would write the first test, and for the most part, I could just type something like `func TestConvertToMonotones_SquareWithHole`, and it would figure out how to adapt the previous test automatically.
- It generates exactly the error strings I want based on context an enormous percentage of the time.
I want to stress that I’m just giving a few examples of things that I specifically remember because I talked about them at the time, not characterizing the majority of the experience of using Copilot. The majority of the experience of using Copilot is that you write comments, and then the things you were about to type appear on the screen before you have to type them.
When I find myself writing comments in the style I see here, I usually ask myself whether the thing would be better extracted into a function. These comments are primarily stating the obvious.
If I find myself writing a 200 line function with nested or repetitive loops I expect to hear from colleagues about how I should refactor it.
I feel that the solution to writing boring, repetitive boilerplate shouldn’t be to automate writing more of it, but to reduce or remove it entirely. Seeing things like this just reinforces my preconception that Copilot acts in low quality code environments to produce fittingly low quality code, or with languages like Java where the language is married to boilerplate.
The problem may not be boring. Typing boilerplate code is.
I work on games as a hobby. Sometimes I implement mechanics requiring vector math. Working on mechanics is interesting. Writing down math is not. Copilot helps with the latter.
Then another hypothesis: you probably haven't found the right tools for it yet. I find myself writing boilerplate mostly around some obscure system framework calls (iOS/macOS), but that's rather rare. But even OS APIs and frameworks do evolve over time into requiring less boilerplate. Just take the evolution of CoreAudio: the modern Swift interface is so much better. So at the end of the day it's about the tools and interfaces: boilerplate is rarely absolutely necessary with the right tools.
That’s not how you use Copilot, any more than it’s how you’d use any other autocomplete tool. I don’t know why so many people seem to think that using Copilot is just closing your eyes, hitting tab fifty times, and then committing.
You work on your code, Copilot makes a suggestion. You read that suggestion and verify that it’s close to what you were already going to do. If it is, you hit tab, then you tweak it. There’s nothing blind about this process.
Seems like a narrow vision. Is every line of code you write to solve a problem “not boring”? I solve problems I find interesting, but writing matplotlib code to visualize data never is.
This is true for the current iteration of the model. Probably won't be true at least to an extent in 5 years.
Besides, there is nothing wrong with solving boring problems. Not everyone can be Bjarne Stroustrup.
We stand on the shoulders of giants. That has been the way for decades: a newer stack built over the older one without much thought. And someone in the future will build an even newer stack over the current ones.
No, it doesn't do any of that. However, it does not "copy code" except in marginal use cases, the far more common scenario is that it will suggest you very basic code that is akin to a Stack Overflow reply.
I read a lot of open source code and might subconsciously absorb techniques and patterns that are common. When I write code I might be influenced by what I read, not line per line, but rather generally.
Kinda, but I think you are imagining something bigger than it is. At least in my experience, it works well for simple stuff like "iterate over x and extract y" or similar queries that I imagine are well represented in its training data. When you get to very specific functions, its answer will be less reliable and more likely to be a wonky rehash of the few examples it has for that case.
My personal reasons for not using copilot are a bit simpler. I believe the act of researching which solutions to use for a given problem is not so much about time, or the code you end up with, but about developing a better understanding of what you are doing. You may end up just cutting, pasting and modifying a piece of code you found, but hopefully, you were exposed to a few different ways to accomplish the same thing, and it made you aware of other choices that could have been made.
You could think of the evolution of practical problem solving in software engineering like this:
1. I have to invent a solution (because nobody else in the world has a computer)
2. I have to know of a solution (education, word of mouth...)
3. I have to look up a solution in the books I have (commoditized knowledge)
4. I can look up solutions on the internet <-- (we are here)
5. The computer suggests something and I accept (some are here too)
From 1 to 4 the amount of cleverness required to solve small problems drops a bit, but your productivity and exposure to knowledge probably goes up.
I'm not quite sure what happens from 4 to 5. Personally I'm actually more interested in the context solutions are presented in than just the solution. In fact, I rarely copy and paste code from the Internet, but I often look at multiple suggestions/solutions and then borrow ideas or combine ideas from several sources.
At least the way I use it, it's not taking much away from my problem solving. It's just that instead of having to type `particlesGeometry.setAttribute('position', new THREE.BufferAttribute(positions, 3))` I just write `//Add as an attribute` and then hit TAB, since Copilot is smart enough to see that I've just prepared some geometry and populated an array of positions (both operations also sped up by not having to type the obvious bits).
You're still having to think through the solutions (I'm not just typing '//make a cool particle sim') but no longer need to hit SO every few minutes for syntax examples when using a new library or something.
And yet after all of these decades, after countless advances in libraries and languages, I am still writing boilerplate in C, JS, Python, et al.
I’m not sure that a language or library can ever understand the context of code without following an ML approach.
Languages and libraries will always allow for more than the immediate task at hand. The innovation is that this tool understands which specific language or library features are probably going to be needed next!
It replaces a few google searches to look up how to do something with a new language or library. Keeping you in your editor and from having to context switch, and possibly distract/derail you, is worth it.
I would be interested to know how many people are actually using Copilot to generate entire chunks of code that they don't understand. For me it's just autocomplete on steroids; it's not answering any questions I don't know the answer to (other than syntax I've forgotten), it's just making the boilerplate faster to write so I can think about the actual problem I need to solve.
I might start considering Copilot if Microsoft were to train it on their own internal codebases (Windows, Office, SQL Server). Until they do, it's clearly a "tool for thee but not for me" type of situation.
Sorry for the unproductive tone of this comment, but there's something about the attitude of this tweet that really grinds my gears.
Any time someone invents something new and incredible, there's always a crowd of negative nancies eager to discredit it and explain why the invention is nothing new and a detriment to society.
I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
'co-pilot just sells code other people wrote' is such a ridiculous understatement of what co-pilot does. Instead of marvelling at the human ingenuity that went into creating it, they sneer at the audacity of openAI to do something without first asking their permission.
They own their code and it either has a license for use or is implicitly rights retained if not. If Copilot regurgitates their code, from a project that is public but with a non-permissive license they are having their IP rights violated so are totally correct in being unhappy about that.
Just because you’ve made something cool doesn’t give you the right to harm others in the process.
If MS or OpenAI don’t think this is the case then they should have also included their private repositories.
The entire point of a fair use right is that you don’t need the copyright owner’s permission to be able to exercise it. Fair use allows you to do things that the copyright owner doesn’t like.
Is fair use on a massive scale still fair use? Courts generally think so, otherwise Google would have been out of business a long time ago.
Is this fair use? I don't think that's been established yet. And if it is why didn't MS and OpenAI train it on their private code repositories? Fair use for thee not for me isn't very in keeping with the spirit of that claim.
Gosh, can you imagine if they had trained it on their internal source code repositories and it constantly suggested using Hungarian notation for your variables? ;-)
I sometimes read people's open source code on github and use the ideas from that to develop my own ideas. In fact sometimes I copy and paste short passages and then rework them. I also employ a team of people who may do the same. Is that fair use, yes of course it is. Is co-pilot automating that fair use, I would say so.
Many people would claim what you're doing is a derivative work. I'm not sure the "of course it is" is very clear-cut (at least in the US). I've worked at big companies that have lawyers that care very much about this topic and what you're describing is prohibited. But, maybe it's different if you're not distributing your source.
> I've worked at big companies that have lawyers that care very much about this topic and what you're describing is prohibited.
They are doing this to make sure that any lawsuit can be easily dismissed. It has nothing to do with the legality of the action (which sounds like fair use as the parent described it), and everything to do with the expense of a potential lawsuit compared to the cost effectiveness of simply telling developers “don’t do that”.
Most people think that the law has two shades: lawful vs unlawful. But the more practical distinction is expensive lawsuit vs dismissed lawsuit. This is the lens through which corporate lawyers see copyright and it might explain why so many programmers think that copilot is “obviously” breaking the law and “stealing” their code.
If the usage was very clearly fair use, there'd be no need to be defensive about it; the case could be dismissed trivially. In reality, the question would need to be sorted out in court.
Questions of derivative works and fair use come up fairly frequently even in the open source world. This isn't solely a question of corporate lawyer posturing. I don't know any copyleft authors that would be okay with someone copying & pasting their code, making trivial changes, and saying it isn't a derivative work. Of course, their understanding of the law may be flawed. You'll get to find out in court.
You're right. A lot of this boils down to how much you want to spend in court proving your usage is just under fair use. We've moved beyond the question of ethics if you're intentionally violating a project's source license and relying on fair use to do whatever you want with the code. If you want to poke someone with a stick, you can't be surprised when they hit back. I contend what the OP described isn't clearly fair use (note I'm not saying that it clearly isn't fair use either). It ultimately doesn't impact me because I'm just not going to copy & paste code from projects without attribution and following the license, but I'd be worried about anyone reading that comment as objectively true.
Or alternately, "I sometimes listen to other people's songs and use those ideas to develop my own. In fact sometimes I copy and paste short melodies and then rework them."
Courts have held that it doesn't apply to music, why do you think different rules apply to code?
Courts are definitely aware of the need to protect the creative process and that no song is truly “original” in all aspects. e.g. the Katy Perry case[0]
Songs are different from code, in that the “hook” that makes the money may be only a few seconds long. There are many creative choices that a songwriter/producer can fit into just a few seconds: the harmony, melody, rhythm, lyrics, timbre, effects, ...
Whereas for code, the space of creativity is limited by functional considerations. A creative choice is protected by copyright but not all choices that programmers make are creative. Often the choices are limited by the API/interface or by efficiency considerations and it turns out that there’s only one good way to do something.
A function may be very intricate, yes, while still containing almost no creative value (e.g. a Vulkan setup function). Music doesn’t have an equivalent to this - the placement of every note is a creative act.
It reminds me of the Google vs Oracle case. Apparently a court found that copying even a small amount of code in breach of its licence is not permitted. [0]
Just because there has not been a test case yet does not make it illegal! If MS think it is fair use then they are free to go ahead. Business is all about recognising and assessing risks like this.
And even if there had been a legal test case, that does not make it moral! If people think this is socially wrong then they're free to argue their case. Business is all about ignoring ethical quandaries if it gives them an edge.
"Microsoft does it, therefore it must be right" does not a sound argument make.
> I don't think releasing a commercial product that copies people's code without complying with the license is anywhere near fair use.
The whole point of fair use is that the license doesn't matter. You can have a license that says I'm not allowed to use what you wrote for any purpose ever and I can still use it under fair use.
Yes but among the four factors that are used to evaluate fair use claims are whether it is being used commercially (it is) and how it affects the market for the thing that was copied (it clearly would since one way code is used is being imported by other code, if Copilot didn't insert my code into the new app, they might very well use my open source project that provides the same code).
I wasn't staking a position on whether Copilot is fair use, just pointing out that fair use doesn't care about license.
That said, copilot itself is not a replacement for your open source project that it was trained on. The code it generates may or may not be, but that's probably not Github's problem as far as copyright law is concerned.
IANAL but fair use is primarily about the public interest. What public interest is served by allowing proprietary software vendors to copy GPL code that's reserved for the commons?
> I don't think releasing a commercial product that copies people's code without complying with the license is anywhere near fair use.
It's just automating the copying and pasting (and slight reworking) of boilerplate code that would normally take me much longer to do, especially when I am working with a language I'm less familiar with but that is necessary for my stack. I've literally never seen it suggest code that isn't more or less exactly what I would have come up with given a lot more time. In essence, it eliminates tedium, which is exactly the point of all of programming: work elimination.
It seems fairly similar, at least to me, to a search engine copying snippets of other people's web sites and displaying them on a page. Admittedly, there's still some discussion as to whether or not _that_ is fair use, but I think enough of the population think it is (with many news organizations disagreeing).
Unless Copilot is "commenting" on or "parodying" the code you've written, it's not fair use. Copying and using the code in another project sure as heck IS NOT fair use.
An automated system will devour all my code, which is under a case-tested copyleft license, and regenerate its parts in any place, without respecting the license terms, and call it "fair use".
I have two questions:
1. Why have licenses, then?
2. What if I just use leaked sources of closed source software and call it fair use?
The default under copyright law is that any substantial copy is infringement.
A license is a legal document that grants someone permission to use a work that they otherwise would not have had.
However the law also gives its own permissions to use a work - it defines what is unlawful infringement and what is lawful fair use.
The code snippets that copilot generates look more like fair use than infringement. They are small, adapted to the destination context, and usually not direct copies of one source but more of an average of many different sources. And usually the programmer does not keep the suggestion that copilot suggests unmodified - the programmer does their own editing of the snippet afterwards to further tune it to the surrounding context.
2. What if I just use leaked sources of closed source software and call it fair use?
As pointed out upthread, if the source code is leaked then there may be trade secret protections. The GPL specifically allows the code to be posted online, so by design it is not secret.
> As pointed out upthread, if the source code is leaked then there may be trade secret protections. The GPL specifically allows the code to be posted online, so by design it is not secret.
The reverse may be true. I may be GPL'ing code to prevent a useful algorithm from being buried deep inside commercial code with an incompatible license. What makes it "trade secret" level code? I have a 25-line algorithm which is worthy of its own paper. What if I open its reference implementation under AGPLv3+?
I have no problem with you reading the paper and implementing it. I don't obfuscate my papers, but I put the implementation out under AGPLv3+. You can't use that in a codebase with an incompatible license. I expect and want you to respect the license of my implementation.
> The code snippets that copilot generates look more like fair use than infringement. They are small, adapted to the destination context, and usually not direct copies of one source but more of an average of many different sources. And usually the programmer does not keep the suggestion that copilot suggests unmodified - the programmer does their own editing of the snippet afterwards to further tune it to the surrounding context.
Emphasis mine. First, there's no consensus on fair use yet. Second, they may be direct copies of the code. Third, they're remixed with other code pieces, which makes the result a derivative work of many code pieces at once. Lastly, the programmer re-derives the derived work, which is clearly a derivative of GPL code and brings the GPL license with it (if Copilot derives the code from GPL-licensed repositories, which it does).
I have no problem with Copilot as a technology. I have no problems with other licenses, which are not breached when their code is used and derived from by Copilot. The point which makes my blood boil is Copilot using this GPL corpus without admitting it publicly, breaching the terms of the GPL en masse, and outright ignoring it. Then feeding this GPL-derived code to any and all projects which pay for a Copilot membership, and calling it a day.
When Copilot reproduces substantial parts of someone else's code without respecting the license terms, it is not fair use, it is just disguised license abuse.
> otherwise Google would have been out of business a long time ago.
I do think there are ethical questions around whether it's right for google to digitise physical books without the permission of the authors, and keep them on their servers and make money from them without recompensing the authors. That's something an individual would not get away with doing, so it seems wrong that it's OK for google.
Fair use is quite narrowly defined though. This doesn’t look like fair use to me, especially when it’s been shown that Copilot does, at least sometimes, spit out code that is completely unchanged from the source material, without advising the user of any license requirements (most permissive licenses require at least attribution).
The SCO vs IBM lawsuit was over only a few lines of code, after all.
I can't use a derivative of Mickey Mouse in my product, even if I change his colour and give him a hat, even if these changes were made by an AI. Why would it be different for code? I can only use Mickey Mouse under fair use if it's done for a specific narrow set of purposes (satire, news reporting, etc.).
There are just minor deviances, not relevant to this case, such as how long Disney bullied the countries to protect a work.
Software is usually considered a work. The AI needs to know whether it has permission to copy and use the code, and then offer the derived work under the proper terms and conditions. Copilot doesn't do that. It might copy GPL code into non-GPL code, thus violating the GPL license, which makes it an extreme risk.
What are examples of Disney getting countries to extend copyright terms?
In the US there have only been two extensions of copyright terms since Disney came into existence.
The first was in 1976, as part of a major overhaul of US copyright law to update the previous law (from 1909) to take into account the large changes in technology since then, and to make US law work more like the rest of the world to pave the way for the US later joining the Berne Convention. The changes for Berne compatibility included longer terms.
I assume Disney did support this, but only because as far as I can tell it had pretty widespread support. It had enough support that it would have passed even if Disney had adamantly opposed it.
The second was in 1998, and that was specifically a term expansion (as opposed to a term expansion like that of 1976 that was a side effect of harmonizing US law with the rest of the world). Europe had expanded terms a few years earlier, so the 1998 change in the US might have been motivated at least in part by harmonization, but I don't think the differences in terms between the US and the EU would have been enough to get it passed without some major interests pushing for it, so it is probably fair to give Disney a good part of the credit or blame for this one.
I was referring to the extension in 1976 from 28 to 50 years, and the subsequent extension in 1998 to 70 years; everybody agrees that both were made at Disney's request (hence the name "Micky-Maus-Schutzgesetz"). Other lobbying partners were the George Gershwin heirs and the movie industry (Jack Valenti).
There was of course no widespread support for these extensions, as all its arguments were flawed and not only violated logic but also several constitutions.
https://de.wikipedia.org/wiki/Copyright_Term_Extension_Act#G... (the en version is mostly cleaned on these counter arguments)
I'm sure the MS lawyers thought long and hard about this and are patiently awaiting any actual lawsuits with confidence in their position. It would be very hard to prove ownership of any snippets. To the point where you can argue that it is just fair use and to the point where companies would think long and hard before committing any resources to fighting MS on this in court at great expense.
I don't think that will happen but it might be interesting if it did.
MS is unlikely to be sued here because the infringement claim would be against their users, and my guess is the license indemnifies them against you suing them for defects in the tool you use (i.e. use at your own risk, and if you get sued you agree you won’t sue us).
> It would be very hard to prove ownership of any snippets. To the point where you can argue that it is just fair use and to the point where companies would think long and hard before committing any resources to fighting MS on this in court at great expense.
One of the (ex?) programmers from Valve managed to get it to spit out parts of the Source engine verbatim. He posted a Twitter thread yesterday I believe.
If it's minus a license then it should be assumed that rights are retained (in the same way you can't just take ownership of an image you find on the internet), so if it were filtering it shouldn't take code from repos without explicit and favorable licenses. If it is taking code only from repos with permissive licenses (e.g. MIT) then why aren't they following the attribution requirements?
I don't think you can have your cake and eat it on this one.
If I steal some code and put it on Github under MIT that doesn't really make it MIT, I'm just lying that it is. If Copilot then uses that it's still in violation of the law I'd assume (ignorance doesn't exonerate you etc.). So they'd have to verify on a case by case basis, which they obviously haven't given the volume of data they had to feed the thing.
It's kinda shocking that they think they can sell this, even providing it for free is extremely sketchy but at least complies with BSD/GNU/CC licensed stuff I guess.
> Is every product user liable when a vendor ships some stolen code?
The user would be unlicensed, and in lieu of the vendor resolving this then the user would need to purchase licences to continue using the software legally (ie if a vendor gives you a pirate version of photoshop, you can’t just use it forever just because someone sold it to you).
There are usually clauses in enterprise software agreements that attribute liability for unlicenced components to the vendor for this reason. But ultimately if there isn’t a contract or the vendor vanishes, the user will need to go get a licence.
If you want to test the theory, I’ll send you a few images to put on your website, and when you get a claim through from the copyright owner you can try to argue that I sent it across without a copyright notice so I am liable ;)
> Is every product user liable when a vendor ships some stolen code?
No, but the difference is the users of a product are typically not making and distributing copies. That’s not the case if you use someone else’s code in your project.
It would prove that it doesn't honour all licences - just because the source code exists on Github without a licence doesn't automatically grant a licence to Copilot from a legal perspective.
I mean, even if the license was placed on the code, that doesn't mean, if it's not protected by copyright, then it's fair game for copilot to scrape, learn from, and emit variations of, the code.
I believe GitHub's lawyers would have had hundreds of hours of discussion about this and at this point, they believe they are in the right, and anybody who disagrees should use the legal system to resolve the matter.
In the meantime, what it is and isn't doing wrt licenses seems to be poorly understood externally.
All they could do is filter by the LICENSE file in the repo.
Unfortunately for them, by law copyright and license are determined by the authors and merely represented by a LICENSE file, which could be lying about both.
The court isn't going to accept that excuse when this goes to trial.
And you can have multiple licenses in the same repository, folders with copyright exceptions, etc.
It's hard enough for us humans to find our way in this mess, I've little hope for an AI.
But maybe it's just the first step. The final step being able to sell an AI that understands Copyright management. I'm sure there is a big market for that.
I feel like a few guidelines and standards could help simplify a baseline process:
1) Require each repository to opt-in to be learned from.
2) Require any source file used for learning to have an SPDX license heading.
3) Have a list of approved permissive licenses to avoid any proprietary or copyleft arguments.
Using SPDX headings as the explicit guide would solve the problem of different code content using a different license within a project. An example being QtWayland: the client pieces are Proprietary/LGPL/GPL, whereas the compositor parts are Proprietary/GPL. That's not something you'd know from the license files at the root of the project (and post-6.3 they use SPDX instead of the prior license template heading).
Granted, this doesn't solve the problem of the chain of trust (is the individual publishing the code truly the copyright owner), but I think it would be a basic start for a program like this. The opt-in nature would make things... difficult, but I think that's a fair trade-off for something like this.
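To make the three rules above concrete, here is a rough sketch of what a per-file eligibility check could look like. The `SPDX-License-Identifier:` tag is the real SPDX convention; everything else (the function names, the allow-list, looking only at the first ten lines) is an assumption for illustration, not how Copilot or any existing tool works.

```python
import re

# Hypothetical allow-list; which licenses count as "approved" is a policy choice.
APPROVED_LICENSES = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Apache-2.0"}

SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def spdx_identifiers(source_text, max_header_lines=10):
    """Return the identifiers named in the file's SPDX heading, or an empty set."""
    for line in source_text.splitlines()[:max_header_lines]:
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing block-comment terminator such as "*/".
            expression = match.group(1).strip().rstrip("*/ ")
            tokens = re.split(r"[\s()]+", expression)
            # Keep only the identifiers, not the expression operators.
            return {t for t in tokens if t and t not in {"AND", "OR", "WITH"}}
    return set()


def eligible_for_training(repo_opted_in, source_text, approved=APPROVED_LICENSES):
    """Rule 1: repo opted in. Rule 2: file has an SPDX heading. Rule 3: approved licenses only."""
    if not repo_opted_in:
        return False
    identifiers = spdx_identifiers(source_text)
    # Conservative on purpose: every identifier in the expression must be approved.
    return bool(identifiers) and identifiers <= approved
```

The check is deliberately strict: a dual-licensed file such as `MIT OR GPL-2.0-only` is rejected even though one branch is permissive, which sidesteps the copyleft argument at the cost of excluding some usable code. It does nothing about the chain-of-trust problem.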
There was a (since-deleted) tweet by Nora Tindall with a screenshot of an email directly from GitHub stating that GPL code is included in Copilot's training data and that Copilot will indeed use it.
> from a project that is public but with a non-permissive license
Permissive or not doesn't matter. Public Domain or not is what matters. Permissive licenses still require you to propagate the copyright notice, which Copilot strips.
Unfortunately the way IP law works, at least in the US, is that you can use essentially whatever you want as training data and it's up to the user to make sure none of the generated code violates licensing agreements.
If that's the case then GH/MS should at least disclose that for the code generated to actually be legal you have to hunt down the actual source (will be hard in a lot of cases) and check the license against your own license.
However, even if infringement occurs during machine learning, training AI with copyrighted works would likely be excused by the ‘fair use’ doctrine.[ii] For example, in Authors Guild v. Google, Inc.[iii], Google had scanned digital copies of books and established a publicly available search function. The plaintiffs alleged that this constituted infringement of copyrights. The Second Circuit held that Google’s works were non-infringing fair uses because the purpose of the copying was highly transformative, the public display of text was limited, and the revelations did not provide a significant market substitute for the protected aspects of the originals.
That's training for search to lead to a full copy of the original work with citations, not training for regurgitating verbatim chunks of copyrighted works to be incorporated at scale into other copyrighted works while obfuscating their original source.
The Second Circuit's tests listed in your citation specifically fail in this case. It's not highly transformative since it's just regurgitating snippets to be used in other competing works rather than applying the body of works to a different domain. And it's specifically to provide a market substitute for the protected aspects of the original works.
Additionally, none of this says 'it's all great and it's on the user to figure it out'.
In the US copyright violation is a strict liability statute. Regardless of whether or not a court directly confirms or denies Microsoft's right to use code in that way, the end developer is still liable for whatever he or she uses.
Has the exact issue of a remixing AI been tested in court? No. But everything even remotely similar has been deemed legal. Considering the legal and financial backing on both sides of the issue I expect it to go Microsoft's way even if it does end up before a judge.
> the revelations did not provide a significant market substitute for the protected aspects of the originals.
It fails this test of your cited case law, though. The courts clearly drew a line at developing AI to inject snippets of copyrighted works into similar copyrighted works. And in context it would be the developer of the AI at fault (in addition to the end users who also used it to infringe other works in the creation of partially derived works; multiple parties can be at fault).
Basically the courts are making it pretty clear that they would have been against what Google had made if it were suggesting phrases in a plugin for a word processing program to create books that would compete with the original books. But being a separate domain of simply collating existing books and providing better search for their corpus (which led you to the original) was allowed.
I think you might be missing the point of their frustration.
Lots of companies do not put their code in public repositories. Granted, I understand the perspective of violating a license, but the point is if you don’t want your code used by someone else (even with the risk of not getting credit, don’t know why that matters) then don’t make your repo public, period.
To that point, what’s to stop GitHub from making a policy that states: “All public repositories will be utilized in AI training”?
> even with the risk of not getting credit, don’t know why that matters
The point is that it’s not respecting the license, not just that it’s not giving “credit”. If I release code under a GPL license, I damn well don’t want someone using that code under a license that’s not GPL-compatible, no matter how it got there.
If not, is it because it is too small? What’s the minimum number of lines at which ownership kicks in? What if I change the function name and the variable names?
I don't expect, or support, licensing that small amount of code, and suing everyone to oblivion.
The point I'm trying to make is if something is under a copyleft license, you can't copy and paste it verbatim to something non-copyleft. It's what the license says.
Also, to be pedantic, the function I'm commenting on is pure maths, and you can't license/patent mathematics.
On the other hand, if there's some magic sauce of doing something, let it be 25 lines, what will you say? It's just 25 lines, so you can't license it? To be more pedantic, I actually have an algorithm, which is around 25 lines and does something novel. I've published a paper on it.
If I license the reference implementation with AGPLv3+, and you use it and close it, and if I can't go after you, what's the purpose of the license?
You can read the paper and try to implement it. It's free in that regard.
Let's say I have 25 line function which does something novel and can be published as research (which I did, BTW, no joke), and I opened its reference implementation with AGPLv3+.
If you write it yourself, it's fine. If you directly copy it from somewhere you aren't allowed to copy from, then it is wrong.
There are no rules about the form of the code itself that govern whether or not someone owns it. Common sense applies. Sure you could "steal" very small, common, code snippets and get away with it; but that doesn't make it less wrong.
When a commercial entity explicitly does it, however, sometimes we can catch them. Like if they do it through algorithms that we more or less know how they work - i.e. the algorithm is using advanced control flow logic to copy and paste from its training data set and copyrighted material is in that data set.
Point is that you can ask 100 programmers to write an average function and probably most of them will come up with this answer verbatim. How can copyright law handle this? There is also the opposite problem, I can copy a complicated snippet and change the variable names. Am I absolved from liabilities now?
If they come up with it on their own, it shouldn't be an issue. Likewise, swapping the variable names does not absolve you from liability.
Copyright really is not only concerned with what exactly is on the page, but also how you got there, and where the knowledge came from to get you there.
What if I read your codebase, and then years later while programming for myself I inadvertently use solutions you came up with while thinking I came up with it myself?
There really are no hard set rules, and this is something that is handled on a case-by-case basis based on whether or not a convincing argument can be made that you copied a novel idea from someone else and claimed it as your own.
We can argue the semantics of it all we want, but the subject area is an active battleground. Typically it only matters when money starts to get involved, since no one usually presses the issue or gets involved with random personal projects. So when an enterprise level company leverages that lack of caring into a proprietary pay-to-use project that operates by copying and pasting code from copyrighted material, then it seems like a case might be able to be made for it.
It doesn’t really “regurgitate code” all that much in practice though. It’s a super impressive product and these arguments seem more like people looking for an excuse to hate new, scary technology.
In light of this potential new paradigm it's bewildering how people still manage to focus on the license of training material as if it even moved the needle in this context, even a little bit.
OSS knights: THE LICENSE.
MS: Aight, I guess we have a few lines of hq src to help out with…
Github: Same.
Other OSS people: We really don't care one way or the other.
As long as the word of the license was upheld for another 2 weeks before it ceased to matter for the rest of all time.
Jesus fucking christ. People. I get that oss licensing is dear to the collective hn heart – but, at best, it's completely irrelevant in regards to where this will inevitably lead, regardless of current questions/issues with license violations. You can (if all the repos of MS and Github are not enough to train this thing on, which is a laughable idea) even fucking buy additional source code if that's what it takes to strengthen Copilot's legal foundation. The cost is insignificant. People will be happy to sell for super cheap. It's a non issue.
Why do you wilfully choose to be distracted instead of facing and thinking about the future together?
> Instead of marvelling at the human ingenuity that went into creating it, they sneer at the audacity of openAI to do something without first asking their permission.
Something being cool doesn't exempt it from discussion of its ethics and certainly doesn't exempt it from legal consequences. What people call "disruption" is often just exploiting resources/people/their work in unsustainable ways until oversight is introduced.
If CoPilot is copy/pasting large amounts of code with unknown licenses, that is a large and real risk for users aside from violating open source projects' licenses.
Moreover it's a genuine danger for non-hobbyist developers since you could be including stolen code into a market product.
Even including something banal like Linux is already problematic since it's GNU licensed, which by extension makes your entire project GNU licensed and you can't keep the exclusive rights to it.
Just to clear this up, since I’ve heard this a lot before:
> since it's GNU licensed, which by extension makes your entire project GNU licensed and you can't keep the exclusive rights to it
This is incorrect. Including GPL code in your product cannot automatically relicense your code. It’s just a copyright violation if your product’s license isn’t GPL-compatible and you don’t abide by the GPL.
> I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
Because they shared the code under a license, and they have the right to complain if people use that code but don't follow the license.
For example, what happens if Github Copilot spits a copy of some copyrighted code verbatim? Is laundering open source code through a machine learning model a loophole for not having to follow the license?
Often following the license is as simple as giving credit to the original author.
I've done a fair number of technical due diligence projects on acquisitions and potential partnerships, and on some projects I've hired outside firms to analyze the code and figure out its origins and what licenses apply.
There are tools that will analyze a codebase and identify where chunks of varying size seem to come from. Mostly to determine if the code is encumbered by problematic licenses, but also to detect where the programmers may have borrowed code from.
If memory serves, some of these companies also have closed source codebases in their database, enabling them to detect if unpublished code has been re-used.
The times I've used this in due diligence it has rarely been a deal-breaker when we do find large chunks of code that may be problematic, for instance due to licensing terms that are not acceptable. You just make a note of it and have them rewrite the code before the transaction can take place. (Or you figure out if you can accommodate the license terms.)
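I don't know how the commercial tools do it internally, but the basic idea of tracing chunks back to known sources can be sketched with line-window fingerprints. Everything below (function names, the three-line window, whitespace-only normalization) is an illustrative assumption, not a description of any real product.

```python
import hashlib


def fingerprints(source_text, window=3):
    """Hash every run of `window` consecutive non-blank lines, ignoring whitespace."""
    lines = ["".join(line.split()) for line in source_text.splitlines()]
    lines = [line for line in lines if line]
    result = set()
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        result.add(hashlib.sha256(chunk.encode("utf-8")).hexdigest())
    return result


def build_index(known_sources, window=3):
    """Map each fingerprint to the names of the known sources that contain it."""
    index = {}
    for name, text in known_sources.items():
        for fp in fingerprints(text, window):
            index.setdefault(fp, set()).add(name)
    return index


def possible_origins(candidate_text, index, window=3):
    """Return the names of known sources sharing at least one window with the candidate."""
    origins = set()
    for fp in fingerprints(candidate_text, window):
        origins |= index.get(fp, set())
    return origins
```

A sketch like this only catches near-verbatim copies: reformatting is tolerated, but renaming a variable defeats it, so real analysis would need something more robust.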
Yeah, but wouldn't it be great if the tool that produced "AI-generated code" were also required to run such analysis itself, to eliminate the licensing violation at the moment the code is inserted?
It's as if Microsoft were banking on the fact that most violations will go unnoticed.
People don’t have a problem that AI is being used in some form to provide the service.
The complaint is pretty clearly that code is being lifted from repositories without attribution or compensation, and being redistributed into other applications.
How impressive the work behind copilot is or is not really isn’t relevant.
I've made use of a ton of open source tools and have not paid any attribution or compensation. By made use of, I mean I used them as their intended purposes and not their source code. I have a FOSS OS, server, CMD tools and libraries powering my ideas, it's part of the deal that I don't have to pay.
If I modify them I know what I have to do, but Co-pilot is somewhere in-between the two, it's abstracting knowledge from these codebases. We don't yet know how to deal with it properly, but this will change with time, that's why having these conversations is important.
I think that AI models will gain a new legal state, whatever they learn will be considered original work if it's not repeating non-trivial work 1:1.
I tried it for the first time today, so treat this with a grain of salt.
https://twitter.com/ayourtch/status/1539928018138931200 is my experiment. The code in question has a very specific format - it’s C with a lot of macro sauce. I described the intent in the comment and pasted the includes lines. Then I started the #define of a unique looking token, and it added the lines with the correct boilerplate. What you see in gray is more boilerplate that it suggests when prompted.
I would dare to assert that “xxxayourtchtestxxx” is not going to be in anyone else’s code than mine.
So you can see the example of copilot generating completely new code.
Not saying it’s 100% of what it does - but this side looks very useful.
I also did a test with Rust: described a function canonicalizing a MAC address, and then when it saw #[test] prompts, it started to make very passable unit tests for the function which was not even written yet - it was only the comment of what it would do.
Also a massively useful lever to have, if it can do so consistently.
My attempts to make it generate a bug-free canonicalization function didn’t work - but it was interesting to see it try different approaches based on the existing test code (and no, they didn’t always satisfy the tests, unlike one would expect :)
So this angle is “pair programming with a creative novice”, which also can be useful - it can give ideas to explore that you didn’t think of.
Of course this was all fairly trivial code, I do not know yet how it will behave in a more tricky situation.
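For a sense of the scale of the task in the MAC address experiment, here is a minimal sketch of such a canonicalization function. The experiment above was in Rust; this is Python, and both the function name and the choice of canonical form (lowercase, colon-separated) are assumptions, since the original comment doesn't say what form was targeted.

```python
import re


def canonicalize_mac(text):
    """Return the MAC address as lowercase, colon-separated hex, or raise ValueError."""
    # Accept the common separators (":", "-", ".") by simply removing them.
    digits = re.sub(r"[:.\-]", "", text.strip())
    if not re.fullmatch(r"[0-9A-Fa-f]{12}", digits):
        raise ValueError(f"not a MAC address: {text!r}")
    digits = digits.lower()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

Even this small version has a sloppy edge: it doesn't check where the separators sit, so oddly grouped input is accepted. That's the kind of detail where a generated "bug-free" version can easily go wrong.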
But it kind of is when you think about it. Network weights are just a db written in an incomprehensible format and the synthesis part is searching and converting it back to readable data.
Even if it changes the var names and formatting a bit, it's still at best highly derivative. And at worst it spits out the exact code verbatim.
> Network weights are just a db written in an incomprehensible format
That makes all the difference IMHO, its complexity makes it much more than "just a DB". The synthesis part takes into account the context also, so it does intelligent things automatically, a smart SQL query does not.
My brain also works kinda like this. My knowledge is encoded in an incomprehensible format and I convert my knowledge into code based on the problem at hand.
This is how it always works, though. Moderna is standing on the shoulders of centuries of cumulative human knowledge without compensating all the sources of that knowledge. Musicians learn from other musicians and imitate to an extent, which is why all the musicians in a genre sound very similar, and we don't see present day rappers compensating the previous generation of rappers.
This is where some modest taxation comes in. To reallocate a slice of the output of value creation to its actual source in a rough kind of way wherever more direct compensation isn't feasible.
> Musicians learn from other musicians and imitate to an extent, which is why all the musicians in a genre sound very similar, and we don't see present day rappers compensating the previous generation of rappers.
You clearly don't know how copyright around sampling works. Yes rappers are paying shitloads to previous generation musicians for samples they use.
Sure, if we're talking about sampling, which is analogous to co-pilot copy and pasting chunks of code verbatim (which we've seen happen). But the complaints about co-pilot go far deeper than that. Quoting from the tweet: "it just sells code other people wrote". Do musicians "just" copy from all the people they've been inspired by and learned from?
Yes, but those humans are humans, not machines. With machines the scale changes dramatically. Which, incidentally, is something copyright law has addressed explicitly: if you mechanically transform a work, at best you end up with a derived work.
I don't understand the difference between Co-Pilot on the one hand and Moderna (on the shoulders of medical research) or SpaceX (on the shoulders of physics knowledge and cumulative rocket engineering knowledge) on the other. They all heavily use technology, automation and machines. I don't see where the distinction is coming from, and if there is a technical legal distinction, is it an ethically important one?
The distinction is a legal one: intellectual property can not be re-used without permission of the rights holder, be it a patent or a chunk of source code.
And you can bet that SpaceX, in using physics knowledge and cumulative rocket engineering knowledge, is very careful to either license the tech it uses or be very explicit about documenting its own.
That you can't see the difference is entirely on you, going 'against the flow' of society sometimes leads to change but more often it simply results in friction and a lack of comprehension.
Keep in mind that open source is based on copyright law, and without copyright law the protections that open source offers would be gone.
To give an extreme example: if you had a chunk of software that was constructed in such a way that it would spit out a complete copy of 'the Gimp' without the license file if you started to write an image processing program that would be a very clear case of copyright violation.
If you then start breaking the Gimp down into smaller and smaller re-usable fractions at some point you might be able to argue that such a generic and oft used snippet should be free of copyright. But that only works as long as you then don't string together a whole pile of pieces that you each copied somewhere else, the whole idea is that your creation is an original one.
Medical research (which quite often leads to patents, which I don't believe should be possible, especially if that research was publicly funded) and physics knowledge are of a different kind than copyrighted program code. The latter would be better compared to universally present language constructs and constraints, such as 'memory management', 'data manipulation' etc. Once you make those explicit in an implementation copyright applies.
Or, to make another analogy: it's like comparing the skill of writing to the product of that skill. The skill isn't protected, but the output of the act of writing is.
There are thousands of novel decisions in the work of Moderna and SpaceX beyond their cultural starting points. Same thing with art. Copilot isn't inventing nor is DALLE-2 being artistic.
> I don't understand the difference between Co-Pilot on the one hand and Moderna (on the shoulders of medical research) or SpaceX (on the shoulders of physics knowledge and cumulative rocket engineering knowledge) on the other. They all heavily use technology, automation and machines. I don't see where the distinction is coming from, and if there is a technical legal distinction, is it an ethically important one?
They are all in compliance with intellectual property laws? Seriously, that's a bloody big difference.
Co-pilot is not in compliance with the licenses of much of the source code it is using!
Whether you like it or not, compliance with the law is necessary.
> This is where some modest taxation comes in. To reallocate a slice of the output of value creation to its actual source in a rough kind of way wherever more direct compensation isn't feasible.
I was with you until this statement. The vast majority of society consumes, but doesn't create something new in the process. I'm bewildered as to why you think taxation is a solution rather than a disincentive towards creating. As far as compensating the giants upon whose shoulders most stand, there are plenty of vehicles for that: royalties, patents, copyrights, pensions, awards and prizes, paid fellowships, etc. These are relatively easy to calculate and write a contract for.
If I enter 'Mickey Mouse' into an ML-TTI thing like Craiyon (Dall E mini) do you think I will be able to sell the resulting image on a T-shirt?
No, I won't, because Disney has fancy lawyers, the average open source developer hasn't. What you are saying is: Screw little people, let M$ make their money.
Either copyright is for everyone, or for no one. I prefer the latter, but this is not the world we live in.
The big difference is that by copying Mickey Mouse you are hurting one of the best-known and most powerful corporations in the world; by copying code you are just hurting open source projects and individual developers.
It should not be different, or if anything, it should be worse to punish people with fewer resources. But here we are.
This is more like entering "cartoon mouse nose" into Craiyon though. You're getting incohesive code snippets returned to you based off a single line (appropriate word for code and a drawing).
My code is shared under a license (MIT) that mandates attribution.
That’s all I ask — if you use my code, give me credit.
Stealing my code to train your bot — which will replicate portions verbatim! — is no different whatsoever than the casual plagiarist that copies and pastes a novel snippet manually.
It's absolutely my legal and ethical prerogative to complain about people stealing my code by failing to respect the license under which it was freely provided.
It does not matter what the internal representation is. What matters is that Microsoft is selling a tool which reproduces non-public domain works while claiming to grant the user ownership of the output.
They are complaining about license violations, they are not pissing on this incredible (is it?) achievement.
Reselling other people's content like this without attribution (which is a pretty mild form of payment) is not nice. But at least you now have one more reason in the list of reasons why Microsoft acquired Github: to be able to launder their open source contributions and resell them.
I also disagree with the tone of that tweet, but your dismissal is equally shallow and gear-grinding.
There are real, serious, and genuinely interesting issues to be discussed regarding Copilot. It is neither "just selling code that other people wrote", nor is it something that we should applaud merely because it demonstrates "human ingenuity".
The comments here regarding this are honestly a total dumpster fire. It's mostly a bunch of paper-thin hot takes, either:
- The blatantly stupid "you willingly shared your code so why are you complaining that one of the world's biggest companies is now hoovering up code from your carefully-selected open-source license and reselling it as a service!!!"
- The blatantly lying "I have literally never looked at any other computer software while developing anything; obviously anybody who has ever seen other source code is a plagiarist"
It's dumb because there is an actual interesting discussion here but I guess we're not going to bother having it.
I actually didn't intend for my comment to be an argument in favour or against, and I am a bit surprised it is the most upvoted of the section.
I agree that there's a pretty interesting discussion to fair-use and the limits of copyright, and that my original comment was not conducive to having that discussion. In my defense, neither was the tweet this thread is about!
I share my code without a license because I want others to be able to see how I solved things. However, this doesn't mean I'm okay with wholesale copying my code. If it's some random guy, then whatever. If it's a corporation like Microsoft, then yeah, I have a problem with it. Under German law, the code is legally not allowed to be reproduced or used without explicit permission even if it doesn't have a license. I retain ownership of it until and unless I explicitly relinquish my ownership rights.
Well, it depends on where you post it, right? Because if you are using GitHub, which is probably US-based, you follow US law?!
Demanding that the law of one country be followed by another makes no sense at all. They can make agreements about it, and even take legal action up to the highest court so it can be evaluated, but using your nationality as an argument for what you can do is just plain wrong.
I always find it weird how people respond to my comments. Why didn't you check what the US law is like for source code? A lot of places have similar laws around source code, primarily in the West because of efforts to normalise laws across countries, driven by US efforts. And other countries? Well, it's the same for any kind of IP. Either the country has strong IP law and you have the resources to pursue an issue or not and you can't do anything about it.
US law is pretty similar in this regard, isn't it? If you don't have a license for a particular piece of code, you can't use it without the author's/copyright holder's permission, even if you found it posted online.
You're expected, wherever you are, to look into where any code you use comes from and what legal rights you have to use it. (The author not offering you a license means you can't use the code, nearly anywhere in the world - pretty basic Berne Convention stuff.)
This is the legal expectation in general, not just for software - you can't just come across a design for a neat widget somewhere and start using it in your product, there's probably both copyright and patent on it. Software isn't special. Not everything in Github can be copied into your code verbatim.
That’s how US law works too. Works are automatically under copyright, even if you don’t say so. It needs a license to lessen the copyright restrictions.
> I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
People like you should understand that publicly available code doesn't mean "do whatever you want" code.
The majority of publicly available code hosted on Github has a license that tells you what you can and what you cannot do with that code.
If someone uses this code without respecting the license, authors have the right to complain and even legally enforce the license if they want.
Now, you should know that there's nothing "cool" about taking other people's work without permission.
Meanwhile, creators of FOSS projects are often underfunded and lots of people are in such dire straits that rich people talk of mollifying them with a few paltry dollars via UBI rather than fix anything.
That's likely the crux of the issue. If you do it right, you can steal from other people and get rich. Meanwhile, those same people (whose work was stolen) may be left out in the cold no matter how original, creative, hardworking etc they are.
> why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
For other individuals to collaborate, to make the software available to other people, etc. Certainly not for github's profit and much less for the benefit of github's customers who will have access to open code that violates license agreements.
My problem with this conversation is how we can have a 200 comment thread without anyone providing any kind of proof for these claims. Is there any instance of this bot printing an actual copyrighted algorithm instead of a mundane uncopyrightable piece of logic?
I mean I'm not an expert but it's a valid point as people share code under a given license, and as far as I'm aware Copilot does not make this knowledge available. Nothing to do with the fact that Copilot is an amazing technological achievement.
If I, as a human, go to a public repository on Github and copy/paste a non-trivial 200 line code snippet into my proprietary code base I have to abide by the license of that original code, even if I slightly modify it. I don't see how this cannot be true for Copilot. I'm sure the legal folks at Github have thought of a response though, you could e.g. argue that the snippets produced by Copilot are not affected by the copyright of the original author as they do not reach the required threshold of originality. Seems rather shaky to me though.
I think copilot is amazing. I don't care what, if any, of my code snippets it uses because I also gain from it by skipping boilerplate (as well as things like bash idiosyncrasies). Using it feels like I am working with dozens of invisible collaborators.
> Any time someone invents something new and incredible, there's always a crowd of negative nancies eager to discredit and explain why the invention is nothing new and a detriment to society.
That is not true. Whenever there is something really useful, everybody is happy, and while of course there are always some naysayers, they're very few.
However, when you do something controversial, you can expect to hear criticism. You are of course free to dismiss that criticism, but when a lot of people are telling you what you are doing is unethical, maybe it's time to stop and think about it.
Wouldn't you rather have a healthy dose of skepticism and pessimism surrounding new inventions? Even if the negativity is off base, it's far preferable to a world where everyone is always positive and praises what geniuses the creators are. The former at least breeds discourse while the latter only serves to make people feel good.
Why can’t startups understand what an open source license is? Apache 2.0 code could be ingested by this tool, but it is a horrible license for your database as a service. AGPL would be a great license for a database as a service, but AGPL code should not be ingested by OpenAI / GitHub Copilot.
Both things can be true. It's clear that it violates the licenses of many software projects. But I do agree that denigrating it as "just selling other peoples code" is missing the whole point of the product and of what you pay for when you subscribe to it.
This doesn't address the point of the Tweet, you are simply attacking the form of their argument.
Moreover it is possible to BOTH marvel at the human ingenuity that went into making copilot AND disagree with their methods. Some things can be marvelous and wrong at the same time.
Sorry for the unproductive tone of this comment, but there's something about the attitude of this tweet that really grinds my gears.
Any time someone invents something new and incredible, there's always a crowd of negative nancies eager to discredit and explain why the invention is nothing new and a detriment to society.
I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
'co-pilot just sells code other people wrote' is such a ridiculous understatement of what co-pilot does. Instead of marvelling at the human ingenuity that went into creating it, they sneer at the audacity of openAI to do something without first asking their permission.
whoa, I think this should definitely be highlighted far and wide on the internet, think of the ingenuity of the people who made the HN-Comment-AI, it's probably the smartest comment bot out there, able to take the ramblings of people on HN and nonetheless generate a comment so astute!
Although I have to say the use of the phrase 'negative nancies' shows that even the best machine-learning algorithm still comes up with text that is unlikely to occur in real life.
People get paid to write code having learned from writing code for others and from reading code others wrote. In this regard I don't see why github copilot is any different.
People often do write things because they learned a common approach at a previous job or because they saw such an approach when reading someone else’s code. People are often hired specifically because they have experience in a certain area from a previous employer, so are doing the same sort of thing at a higher level.
We fought this battle over a couple of decades with remix culture (“you stole that line/beat out of my song!”) and the world is better because the over-clingers lost.
There is no shortage of reasons not to like copilot, but I don’t consider this one of them.
Sometimes people do, and in any case copyright isn't limited to just verbatim copies. You can't, for example, reuse characters or plots from other works of fiction in your own novel, even if you rewrite it in your own words: https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...
‘Facebook just sells personal information of other people’ is such a ridiculous understatement of what Facebook does. Instead of marvelling at the human ingenuity that went into creating surveillance capitalism, they sneer at the audacity of Facebook to do something without first asking their permission.
Of course you can and you should. But going from the Twitter bio and personal website, it doesn't appear to be the case here. They're an activist who lives off soliciting donations and selling $30 videos on how to be anti-racist (like, literally).
Fully agreed. It's just people getting mad and jealous but hear me out.
Copilot is NOT SELLING code other people wrote, it is simply acting as a curator to show you all the solutions people HAVE WRITTEN for free.
Copilot does NOT write entire programs, it's simply an assistant. And there is not much copyright you CAN apply to 3-4 lines of generally understandable code.
I've used Copilot and am actively paying for it, and I have not seen many cases where it generates bad code. It's only there to remove boilerplate and solve common problems, not to write entire applications.
Thanks. Personally, I feel like such small and widely used mathematical algorithms should not be copyrightable (or using them should fall under fair use). It even has its own Wikipedia page[0], where the source code is also reproduced without copyright notice.
% Constants
FISRCON GREG #5FE6EB50C7B537A9
THREHAF GREG #3FF8000000000000
% Save half of the original number
OR $2,$0,0
INCH $2,#FFF0
% Bit level hacking
SRU $1,$0,1
SUBU $0,FISRCON,$1
% First iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
% Second iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
(Note this assumes that the input number is not too small; if it is, then it will not be possible to compute half by this algorithm. Also, like with the original code, the second iteration may be omitted if desired.)
(I agree to release this comment and the MMIX code it contains, and all other comments that I wrote on here, to the public domain.)
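For anyone who doesn't read MMIX, here is a rough Python transliteration of the routine above. Treat it as a sketch rather than the commenter's code: the names `fast_inv_sqrt` and `iterations` are made up for illustration, and it halves the input with a plain multiplication instead of adjusting the exponent bits, so the small-input caveat does not apply in the same way.

```python
import struct

# Same 64-bit magic constant as FISRCON in the MMIX listing.
MAGIC = 0x5FE6EB50C7B537A9


def fast_inv_sqrt(x: float, iterations: int = 2) -> float:
    """Approximate 1/sqrt(x) for a positive, normal double."""
    half = 0.5 * x
    # Reinterpret the double's bits as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # "Bit level hacking": shift right by one, subtract from the constant.
    bits = (MAGIC - (bits >> 1)) & 0xFFFFFFFFFFFFFFFF
    y = struct.unpack("<d", struct.pack("<Q", bits))[0]
    # Newton-Raphson refinement: y <- y * (1.5 - half * y * y).
    for _ in range(iterations):
        y = y * (1.5 - half * y * y)
    return y
```

Passing `iterations=1` corresponds to omitting the second iteration, at the cost of some accuracy.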
So what? Selling code other people wrote is the foundation of the free software movement. It is the entire business model of countless companies, and it is a good thing. Among them are most major linux distro vendors like Red Hat and Canonical.
The value added by Copilot is that, out of billions of lines of "code other people wrote", they sell you the ones you want.
I still think it is derivative work, and that they should only process code under permissive licenses or, if they want to include GPL code, make a GPL-only version, usable only for GPL projects. I thought that is what they did, since there is so much code under permissive licenses that it should be enough to train their model. But apparently they don't care: as long as it is public, it is included. To me, they are shooting themselves in the foot; several companies have already banned Copilot due to the potential issues with copyright.
I started self hosting when Microsoft bought github and with this mass theft of copyrighted material and then reselling it for money I'm even more happy with my decision.
Copilot very rarely copies code verbatim, and when it does it's very short snippets. When Oracle sued Google over allegedly copying short and fairly trivial snippets of code they were justly derided.
I can't speak to the legal side, but I just don't understand the moral outrage over very occasionally copying such short snippets of code. The key innovations and the actual value that licenses are intended to protect aren't in these short snippets.
And what does copilot bring to the community? Free use by students, free use by open source maintainers, and a huge boost in productivity for a modest fee for professional devs, for a service that no doubt costs a lot to run, even on the margin.
On a side note, I do believe that short programs or functions should be copyright free by law.
Or we as a community need to create a better bsd, a cc0 for everything.
Almost everything is nontrivial, and almost everything is copyrighted, at least with the pressure to name the original author (BSD, GPL, other major permissive licenses).
Say you want to use a library, so you check for examples in the documentation. Now you have to note somewhere that the example is from the documentation (ideally in the source code, so you don't lure other people into copying what you copied and referring to you as the author).
What about a law that makes all code available but then requires you to use a portion of your earnings to compensate the people whose dependencies you used?
When my last company got acquired, part of the due diligence process was a scan of our codebase for snippets from stack overflow. Every snippet found that wasn't posted with a clear license by the author was challenged and we rewrote it.
Now, I'm not entirely sure how necessary this was from a legal perspective. But introducing an AI into the mix will bring up a lot of uncertainty when it comes to how much change is required for something to no longer be considered a copy/derivative.
This is exactly where it gets murky. We had the usual 1-4 line snippets. We went the extra mile to change them, rewriting them from scratch, partially with different implementations. Did we need to do that? Would it have been enough to just change a variable name or some spacing or similar? I don't think there's a clear standard.
The music industry has struggled with this for a long time. When is a song derivative, when a copy, when is it "inspired by"...
Copilot is a new way for corporations to break copyright while enforcing it for everyone else, this will be the first big use for AI when other corpos follow.
Technically, programmers search, copy and modify code all the time.
One might argue copilot puts into software an algorithm that humans are already doing. Software like that is usually inevitable.
Still, it sucks there's no benefit for the contributors.
The most ethical thing I can think of is some kinda 'Spotify-like' revenue sharing model, based on how often their code is used by others. Not that they'd ever implement that if they can get away with it!
> One might argue copilot puts into software an algorithm that humans are already doing.
That argument only works if you think what Copilot is doing is meaningfully similar to what humans are doing. The debate about how these models relate to human thought might have legal implications.
As I understand it (IANAL) copyright doesn't protect ideas and concepts. It protects the content itself. In theory, if I read some copyrighted work, understand some idea in it and then create a new work using that idea, without copying that original work, then that is not a derivative work. (I think this is at least how it's supposed to work - would love to be corrected if that's wrong.)
So if I took a copyright work and rot-13ed it before distributing copies, I think that would be clear copyright violation, but if I made my own works using concepts I gleaned from reading it, it wouldn't be.
So should Copilot be treated like the rot13 algorithm or like me understanding concepts and generating new works using them? That sounds like a fascinating legal debate to be had.
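To make the rot13 end of that spectrum concrete: the transform is purely mechanical and fully reversible, so the encoded copy still carries exactly the original expression. A minimal Python sketch (the sample string is a made-up placeholder, not a real work):

```python
import codecs

# Placeholder standing in for a copyrighted passage.
original = "Some copyrighted text, reproduced verbatim."

# rot13 shifts each ASCII letter by 13 places; everything else is untouched.
encoded = codecs.encode(original, "rot_13")

# Applying the same shift again restores the input character for character.
decoded = codecs.decode(encoded, "rot_13")

print(encoded)
print(decoded == original)
```

Whether a trained model sits closer to this kind of lossless re-encoding or to a reader who only took away the ideas is exactly the open question.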
I don't consider copying a 3 liner from stack overflow and not writing an attribution plagiarizing (regardless if technically speaking it is or isn't according to the law).
> Plagiarism isn’t a legal concept, it’s an ethical one.
Well if it isn't a legal but an ethical concept, then that's just your opinion, since there isn't some universal body that establishes exactly what is ethical and what isn't. And as I said in my previous comment, "I don't consider".
> You need to either attribute the source, or rewrite it in entirely your own words — just like when writing a paper.
Often times a three liner can not be changed in any way, and is the only solution to a problem. In some cases you may be able to change it only in terms of indentation and variable names (in others you can't even change that).
But assuming you can do that, it makes no sense at all just changing indentation and variable names just for the sake of changing it.
> Conforming to the license is also required; iirc, SO requires attribution under the CC-SA license.
You should use the example to understand the underlying problem, at which point you will be well-equipped to write your own one-liner.
If you can’t write it using your own understanding of the problem, you’re not an adequate programmer and need to improve your skill-set … which won’t happen if you just keep plagiarizing code you don’t understand.
You're basically just repeating that your opinion is the right opinion.
I don't agree that such example is plagiarism and I'm sure a lot of people also would disagree that that's plagiarism.
> You should use the example to understand the underlying problem, at which point you will be well-equipped to write your own one-liner.
> If you can’t write it using your own understanding of the problem, you’re not an adequate programmer and need to improve your skill-set … which won’t happen if you just keep plagiarizing code you don’t understand.
Who says you can't write it on your own, or that you don't understand it? Stack overflow and tools such as copilot are often about saving time, not about being unable to figure it out by yourself.
And besides that, the point of those examples is that a lot of people without searching for those stack overflow posts, would type that exact same code character by character.
> The most ethical thing I can think of is some kinda 'Spotify-like' revenue sharing model, based on how often their code is used by others. Not that they'd ever implement that if they can get away with it!
Based on my understanding of how NNs work, I'm not sure it's even possible to implement something like that.
Yes, though in a way so does stackoverflow & friends. Large chunk of dev ecosystem is copy paste and I don't think this is inherently problematic. It is always a case of standing on the shoulders of giants.
It's more of a licensing issue to me. As far as I can tell it was trained on a blend of licenses, which to me makes it inherently non-compliant. At least some of it is going to be copyleft and find its way into closed source.
I'm not a lawyer, nor very well versed in the vast world of licenses and their definitions in court contexts, but I've been wondering about something with the growing appeal ML-generated content has for the average person (and the "high" barrier for entry in the market) — are licenses in some form or another going to adapt to this phenomenon? From a brief search, I have not found any new license with a no-dataset-usage clause (assuming fair use does not apply, that's another big question). What are the chances anything of the sort will become an option for any "creative" work that's usually shared freely (such as artwork, code, et cetera) even despite copyright? What about the ownership of the dataset? It seemed to be questionable years ago already that possibly IP-protected content goes through the black box and resembling material gets on the other side, whose ownership is it really? I'm guessing some notable court cases in the future could define this in the following years if the popularity continues growing.
People share and reuse unattributed snippets of MIT-licensed and GPL-licensed code on the internet all the time, on StackOverflow, etc.
StackOverflow is profiting from that activity indirectly by facilitating it. They profit passively through ad revenue, and actively through the Teams subscription offering.
But nobody seems too upset about that.
How is an AI which facilitates the same code sharing fundamentally any different? Because it’s scraping it itself, rather than humans contributing it?
Traditional 'real' (as opposed to 'imaginary') programming is like writing in assembly code; It's outmoded because of generative models, in a way similar to 'C' outmoding assembly code.
The most important thing, I think, is that free (libre) software developers are able to work with the language models directly, so that libre software is allowed to continue progressing into what I call Imaginary Programming.
That's because with a generative internet all you really need is blockchain + prompting.
The models themselves should be clear about where the data came from.
However, this is only possible in a fair world which we do not live in.
Compromise must be made to protect national interests.
Generative models are license blind and there's very little that could be done to prevent progress. Like what the invention of the camera has done for art.
Large language models including Codex are a transformative technology.
Bi-directional fair-use is probably the best result we can hope for.
So long as Microsoft and OpenAI are not selling back usage of the model to the open-source community, I think it's OK, though it's the bare minimum obligation.
I know this isn't really related to the whole copying ethics debate, but I definitely feel like there's some sort of foul play happening here. For all of the unlicensed projects out there, the license that is automatically granted to Github includes:
> the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time
It's insane how vague this is. Is Copilot a "Service"? Sure, by its definition:
> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.
And since much of the code was published before Copilot's inception, this means Github can just arbitrarily add more "services" and milk the code for whatever it wants. Automatically service-ify any public repository? Sure, pay us for quotas. It's like a legal loophole to let Github just bypass any license restrictions you put on it.
The point of this Tweet is about licensing. When using an MIT licensed library for example, you would have to give attribution. But you can easily rewrite a portion of that library yourself using Copilot, which could potentially use code from the initial lib, without any attribution whatsoever. It's even more problematic with licenses such as the GPL.
I guess Copilot could address this by checking the licenses of the projects it uses. Even when combining code, it could pull in the required attribution or avoid GPL licensed code (unless enabled) for example.
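A license check like that could look roughly like the sketch below. Nothing in this thread says Copilot works this way: the `Repo` type, the `eligible_for_training` function, the `allow_copyleft` switch and the allowlist are all hypothetical, and the SPDX identifiers are only examples. It also trusts whatever license a repository declares, which can be wrong.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist of permissive SPDX identifiers (an assumption).
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}
COPYLEFT_PREFIXES = ("GPL-", "AGPL-", "LGPL-")


@dataclass
class Repo:
    name: str
    spdx_license: Optional[str]  # None means no license was declared


def eligible_for_training(repo: Repo, allow_copyleft: bool = False) -> bool:
    """Decide whether a repository's code may enter the training set."""
    if repo.spdx_license is None:
        # No declared license: the author keeps all rights, so skip it.
        return False
    if repo.spdx_license in PERMISSIVE:
        return True
    # Copyleft code only when explicitly enabled, e.g. for a GPL-only model.
    return allow_copyleft and repo.spdx_license.startswith(COPYLEFT_PREFIXES)
```

Even the permissive licenses here still require attribution, so a filter like this only narrows the input; it does not settle what has to accompany the output.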
Code plagiarization is not a thing for all practical purposes (it's even almost impossible to go to court with that for very obvious reasons). And that's good. Because with that insane lockdown of "Intellectual Property" nothing would ever get done. So, think what you want.
I don't think that's true and, if it was, it would be the death knell for open source.
Code Plagiarism is taken very seriously by every company I have worked with. Multiple companies have been sued for violating the GPL. The SFC is currently fighting Vizio in court for example. While not commonplace, to say it's "almost impossible" is a stretch. Every large company complies with code copyright obligations for a reason. My company publishes changes to GCC and a dozen other GPL projects. Entire products like Protocode and BlackDuck exist to ensure code compliance. Even small code snippets are flagged.
Over the past few years the source code for Windows, SQL server, Bing and Cortana have all been leaked. If someone built a product using that code, how long do you think it would take Microsoft to sue? CoPilot is one rule for mega-corps and another for everyone else.
> Ok, then there doesn't exist a single reputable company with a tech division and we're all unethical. Have a nice unethical day.
I’m deeply disturbed that you think this form of plagiarism is universal — I can assure you that is not the case.
I work at a FANG currently, and plagiarism is absolutely not tolerated.
In fact, plagiarism has been considered a fireable offense at every other company I’ve worked at over my 25 year long career, and prior to that, considered a serious form of academic misconduct in school.
It’s clearly unethical and I’ve never plagiarized in my life.
I’ve only run into one instance of someone else plagiarizing code in my career, and that individual was fired.
> I’m deeply disturbed that you think this form of plagiarism is universal
This thread is an eye opener for me too. Do engineers not get trained on their legal obligations? My company is old and not a traditional tech company but we have been running workshops on the issue for years. Even if they don't, what about their legal teams? Or CI tools to scan for licence violations? Some of the responses here are so naive it's crazy. I hope no one is identifying the companies they work for.
Obviously we do. Don't copy paste 10 pages of source code unaltered and sell it as your own.
But that's something entirely different from small code snippets, changed and adapted to solve the same problem a thousand other people already had. That is all developers are doing when they go to GitHub, StackOverflow or any other website to find answers to their questions. That's not naivety, that's how coding works (partially). If you had to re-invent the wheel every time you built something new, good luck.
There isn't a threshold for copyright violation. If you copy a 3 line function from a GPL library, you have to comply with the licence. Tools like BlackDuck will pick it up.
Snippets aren't exactly defined but I see them as more than just a single line like "here's how to flatten a list in Python", it's some functionality - e.g. an algorithm implementation or some task.
> I’m deeply disturbed that you think this form of plagiarism is universal — I can assure you that is not the case.
> I work at a FANG currently, and plagiarism is absolutely not tolerated.
It's universal in any company that doesn't take measures against it. So basically startups, small, medium, and even some large companies.
So, rearranging conditionals or loops or variables then, problem solved. You cannot 1:1 copy paste anyway. That never works. You always have to adapt it to your particularity. So it's "reworded" by default. And CoPilot is doing nothing else. It's not just 1:1 memorising code, it's a tiny bit smarter than that. I strongly believe you're not a developer. Point taken. I understand your considerations. You should write code sometimes to solve a complex problem that uses some libraries and see how far you get without consulting the internet or books.
I absolutely attribute things I find on SO to where I found them. You finished college maybe a year ago and are already making some absolute judgments about what makes other people qualified to call themselves developers simply because they don’t develop as you do.
As to your first point, there are many repositories on github that the author of code did not upload there or where not all contributors to the code are on github or agreed to let their work be used in such a case.
That's really no different than somebody uploading proprietary code they don't own (stolen, leaked, whatever reason etc) on Github. Github has to assume that you are allowed to do so. What are they going to do otherwise, somehow manually verify that each repository is legit?
Now you might say, what about GPL code you don't own. You are allowed to redistribute it (upload to github). But because you are not the owner you can't license it to Github under new terms (that allow them to use it for ML training). But the question still is, is there anything in the GPL that forbids its code from being used for ML training? Even if the generated model is proprietary, has no attributions, etc?
Ok, takedown requests exist. Say Qualcomm finally wises up and asks github to take down a copy of the millions of lines of their super proprietary 4G modem firmware implementation from github. Will github retrain the model after each such takedown? :D
If not, then it's kinda stupid to argue the point about the lack of knowledge, since lack or not lack of knowledge clearly doesn't matter. Github will happily continue using confidential code even from trigger happy companies like Qualcomm for copilot.
I guess they would add some kind of filter to copilot output that removes results that clearly come from code that was DMCAd.
It's kind of like some employee that worked at Qualcomm and has seen the code. Do you retrain him (aka hit his head until he forgets) after leaving the company?
The comparison might seem silly but as AI advances I expect more and more arguments (especially in court) to come from analogies of humans and AIs.
What kind of filter? I thought copilot does not output the input data verbatim.
Creating an output filter based on millions of lines of DMCAd code that would not cripple the copilot output completely at the same time sounds like one of those hard problems. Especially if there's no agreed upon definition of copyright "violation" here.
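For a sense of what the naive version of such a filter might be, here is a sketch that measures how many token n-grams of a suggestion also appear in a blocklist built from taken-down code. This is purely illustrative and not how Copilot is known to work; `token_ngrams`, `overlap_ratio` and the default `n=8` are invented for the example.

```python
from typing import Set, Tuple

NGram = Tuple[str, ...]


def token_ngrams(code: str, n: int = 8) -> Set[NGram]:
    """Split on whitespace and collect every run of n consecutive tokens."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, blocked: Set[NGram], n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in the blocklist."""
    grams = token_ngrams(candidate, n)
    if not grams:
        # Too short to form a single n-gram, so nothing to compare.
        return 0.0
    return len(grams & blocked) / len(grams)
```

A suggestion would be suppressed when the ratio exceeds some threshold. Renaming one variable per line already defeats whitespace tokens, and loosening the match starts flagging ordinary boilerplate, which is more or less the hard problem described above.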
> 1. You most likely agreed to that by using GitHub.
Are you saying that I would need all the original authors' consent to upload a repo to github even if I include all the original attribution and licenses? Because what you are implying is that when uploading I'm granting github a license far outside the bounds of the license included, which only all the contributors can do. For example, would the linux project need to contact each and every contributor ever to upload a mirror to github, since their contributions were under GPL but you are implying that the license given to github is much, much broader?
This would make any project not originally started on github and with a few contributors basically impossible to host there.
> 2. Copy&Pasting Code by manual search exists.
The question is who is doing the infringement here. Github copilot is obfuscating the copying and telling its users that the code is theirs to use, own, etc. as they please but is also taking large chunks of code it does not have the right to redistribute, even less grant licenses to.
I don't think that something like CoPilot is what most GH users had in mind when they published their code. Also, licenses exist (which CP demonstrably doesn't give a shit about).
There's some truth there, but there is more negative in outright dismissing the uncomfortable but important ethical dilemmas one might be introduced to.
Humans make original patterns, but since Copilot cannot think, then Copilot does not. It squashes together a bunch of small individual patterns, each under their own license, but at no stage does it do anything more than pick a line from here, and a line from there.
It doesn’t think, and it doesn’t create new IP.
It is like making a picture out of small snippets of a thousand other pictures, and then selling it.. clearly not OK. You still ripped off the original artists.
Or like plagiarising 100 of your class mates’ assignments. Are you less guilty because you went to the effort to steal just a few sentences from each?
A criminal who steals a cent from every account at the bank is a more sophisticated thief than someone who holds up a petrol servo.
If Copilot doesn’t create new IP (it doesn’t; we established this), then it uses existing IP. And in that case it is no different to any of the three analogies above.
I think this problem has no good solution until IP laws around the world are properly reimagined from the ground up. I'm of the quite radical stance that code, music, art in terms of their intellectual existence should be free for anyone to take. (You can own a hard drive with code on it, and claim no one should steal it, but not the idea of the code itself.)
If you have ideas, code, music or art which you wish for no one to partake in, do your best to keep them secret. Certainly, breaking into secret areas should be illegal, but once the cat gets out of that bag it gets out of the bag.
The creative people behind these ideas I believe will be able to find good compensation nonetheless in society, IP-laws nowadays only serve to protect megacorporations to the detriment of creativity and ideas.
I agree. This will fix it. I think that copyright and patent should be abolished, but that if it is secret then it is still secret (unless someone else manages to come up with the same thing (e.g. by decompiling a published computer program to reconstruct the source code), in which case it can be public). And so then the AI can also copy the code just as much as you may do so manually; if it is published then you can do it and it should not be illegal to write such things.
I don't think any professional community is aligned on how to think about ML-generated content yet. We don't know how to apportion rights between the data owner, the model owner, and the end user, and I don't think existing copyright law is ready for it. At least for software, I think the way forward is for the next generation of software licenses to explicitly state whether the code can be used to train ML models and what those models can be used for. Without explicit language, we'll be squabbling over interpretations of fair use.
There's going to be some big cases here. It's going to end up in the Supreme Court sooner or later, and if it were to go there today I think I know what they'd say.
If the portion of code that Copilot lifts is the "heart" of the original work, that would be much less likely to be considered fair use[1], regardless of the length.
> For example, it would probably not be a fair use to copy the opening guitar riff and the words “I can’t get no satisfaction” from the song “Satisfaction.”
I wonder how this could be integrated into the system?
There's a good argument that demanding copyright protections on scraped datasets and short snippets is a double-edged sword. It could harm search engines, distribution of news, and non-commercial ML research too.
At every turn, in every instance, for decades, all stories involving Microsoft end in "...and then Microsoft fucked people over." I've witnessed this firsthand since the 80s.
Should the snippets that Copilot is regurgitating be considered for copyright in the first place?
It seems akin to trying to copyright a certain drum pattern or chord progression.
Also, the history of the GPL, MIT, commercializing lisp machines, Symbolics, infighting, etc… seems a very different context than Copilot so I am having difficulty seeing the systemic problems that tools like this encourage.
There is of course a surface level similarity in that a corporation is profiting from IP in the public domain but the devil is in the details.
It'd be nice to see some proof here. Copyright is not absolute and does not extend, for example, to things that have no creativity in them. There are only so many ways to write a for loop or an if condition. Training an ML model from a large body of code IMHO violates copyright no more than any of us reading code and learning from it, as long as GH Copilot doesn't spit out code that's exactly the same as something already existing.
Programmers are fine when their creations, pretty much all of tech, resell content that other people wrote for free, but no, not code, that one must be expensive.
I also don't think it's acceptable for TurnItIn to monetize content without paying the authors. My opinion about whether students should have their work stolen and monetized by a company doesn't seem to have much impact though.
It is incredible to use though. I pasted the return value of an API call in comment, then started to write a schema class. Codepilot just created the entire class for me. wanted to extract a subset of the data, I typed get_<_name_of_the_subset>(), it wrote the code I would have written.
So even without using someone else code, just the pattern understanding and the production of simple boiler plate code is great.
Why is it a bad thing? You either have people spending time reading code and learn every little thing and produce the same work in days, or have Copilot saves human life time for hours. Coding would be more efficient, it is a win-win for everyone in this industry, right? I know people attach to the code they write, but we all learn from books, and the result is common enough.
I somewhat agree with that. Yesterday I edited some exotic configuration (Kubernetes CSI driver for Cinder) and Copilot suggested a config which looked like someone's config. There were no values, so it was good at filtering them out, but it definitely looked like a cleaned-up part of code which resides in some project.
I don't think that's bad though. Code sharing is good for overall productivity.
MS and Github are thieves, all their code is closed source, yet they sell copyrighted code they don't own. If they told us years ago that our code will be automatically stolen by an "AI", most coders would not have created an account. The innovation here is that they have access to most of the world's open source code and automated the stealing.
If GitHub could guarantee that the code Copilot had ingested was only made with OSS licenses, then I don't see what the problem is.
But as far as I understand, GitHub trained Copilot on every public repository on GitHub, even those without a license specified (in which case the user publishing it retains the copyright), and I don't see how that can be OK.
> I checked if it had code I had written at my previous employer that has a license allowing its use only for free games and requiring attaching the license.
yeah it does
That's a pretty bad example. He prompted it using the exact function header taken from the code he is complaining about.
It'd be much more interesting if he set up a function that was doing a similar thing but with different parameter types and names, and a different order of parameters (i.e., like a real problem).
Does that matter? Any code provided should come with the license needed to use it; otherwise the user is opening themselves up to litigation.
Hence why I agree with another comment somewhere that Microsoft is banking on software developers not litigating about use of their open source code in closed source projects.
It is hard to see how verifying licenses is a solvable problem when licensing for code dependencies can be transitive. For example, what if I copy code from a GPL codebase like Linux and create a GitHub repository with an MIT license?
Sorry, to be clear, I meant even if a Github user asserts their code is public-domain/no-attribution/unlicensed, they could have lifted it off a codebase that doesn't allow it. It would be tricky for Github to establish the code was indeed original and hence their agreement with the user allows them to train their models on it.
> they could have lifted it off a codebase that doesn't allow it
Ah. But then someone else is guilty of redistributing code without permission.
But you're suggesting, GitHub should implement something like ContentID but for code. Which should be cheaper (since code is cheap to analyze, while videos are much more bandwidth-intense). And this would kill two birds with one stone.
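To make the "ContentID but for code" idea concrete, here is a minimal sketch of one way such matching could work: hash overlapping runs of tokens and compare the resulting sets. This is only an illustration, not how GitHub or ContentID actually do it; the names `tokens`, `fingerprint`, and `similarity`, the 5-token window, and the naive `#`-comment stripping are all assumptions made for the example.

```python
import hashlib
import re

# Crude tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(source: str) -> list[str]:
    """Split source into tokens, ignoring whitespace and '#' comments."""
    out = []
    for line in source.splitlines():
        line = line.split("#", 1)[0]  # naive: also cuts a '#' inside a string
        out.extend(TOKEN_RE.findall(line))
    return out


def fingerprint(source: str, k: int = 5) -> set[str]:
    """Hash every run of k consecutive tokens (hypothetical window size)."""
    toks = tokens(source)
    grams = (" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1))
    return {hashlib.sha1(g.encode()).hexdigest()[:16] for g in grams}


def similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

A copy that only changes whitespace and comments scores 1.0 against the original, while renamed variables would lower the score, so a real system would need something smarter (and a policy for what counts as "too similar").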
I can’t say I remember the terms saying anything to the effect of granting Microsoft a perpetual unlimited license in addition to whatever license I package with the code when I signed up. Not doubting it, but I would have expected that to raise some suspicion long before Copilot was around.
It could be something as innocuous as "you allow your code to be analyzed, processed or otherwise handled by Github software" I suppose, which wouldn't raise suspicion.
There needs to be an update to either licenses or GitHub (and other) software directly, or even software terms of service, that gives the user an opportunity to opt out of their data being used to train proprietary AI models.
'I don't agree with having an AI trained on/with my data.'
IMHO, all other problems with copilot stem from this.
Sure, the concern is valid but I feel like this tweet adds absolutely no substance to the discussion and just repeats the same opinion that was already rehashed to death since Copilot originally launched. As such, especially with the tone that the tweet has, I don't expect constructive discussion to arise here.
Reading many of the comments here I feel like one important thing is being left out that is not related to legal, but to social issues:
Who is on the side of open source? Where are the big, powerful institutions and companies that deeply care about authors and communities providing free software that so many of us rely on?
There are a few reasons why this could be considered ethical. First, open-source code is typically free to use, so the company would not be taking advantage of anyone by using it to train their AI. Second, the company would be providing a service that people are willing to pay for, so they would be generating value for society. Third, the company would be transparent about what they are doing and would not be hiding anything from the public.
...the above was generated by GPT-3 (text-davinci-002). Prompt: Write an argument for why using open-source code to train an AI and then sell the code generating service (without open-sourcing it) is ethical.
The main argument against this is that it takes away from the open-source community that contributed to the development of the code in the first place. By selling a code-generating service without open-sourcing it, the company is profiting from the work of others without contributing back. This is unfair and takes away from the overall open-source ecosystem.
It's SO frustrating that even on HN people still fall for this naive and incorrect analysis. Pasting bits I've said before on this topic:
Language models do not work like this. They can copy content but usually that's for something like the GPL language text.
Generally they work on a token-by-token basis, predicting the most likely token to appear next.
This very rarely results in copying text, and almost never rare text.
Mechanically it has learnt both the syntax of the language and how concepts relate. So when it starts generating, it makes sentences that are syntactically valid but also make sense in terms of concepts.
That's really different to just combining bits of sentences, and it gives rise to abilities you wouldn't expect in something just cutting and pasting bits of sentences. For example, few shot learning is mostly driven by its conceptual understanding and can't be done by something with no way to relate concepts.
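For anyone who wants to see the bare "predict the most likely next symbol" loop, here is a toy character-level sketch. It is emphatically not how Codex or GPT-3 work: those use a neural network over sub-word tokens, whereas this uses a plain frequency table, and the names `train` and `generate` and the 3-character context are assumptions for the example. It shows only the generation loop, not the learned relationships between concepts described above.

```python
from collections import Counter, defaultdict


def train(text: str, order: int = 3) -> dict[str, Counter]:
    """Count which character follows each run of `order` characters."""
    model: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model


def generate(model: dict[str, Counter], seed: str, length: int, order: int = 3) -> str:
    """Greedily append the most frequent next character, one at a time."""
    out = seed
    for _ in range(length):
        counts = model.get(out[-order:])
        if not counts:  # context never seen in training: stop
            break
        out += counts.most_common(1)[0][0]
    return out


corpus = "for i in range(10):\n    print(i)\n"
model = train(corpus)
print(generate(model, "for", 5))  # prints "for i in"
```

With a corpus this tiny the table can do nothing except replay its training text, which is the extreme case of memorization; the argument in this thread is about how far large models sit from that extreme.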
There are enough examples of it regurgitating longish verbatim code out there, and not just comments or GPL license text.
If they are comfortable training it on code that isn't licensed for unrestricted copy/paste, I don't personally understand why they can't train it on their own code that's also not licensed for that.
Edit: They even added 'q rsqrt,' to their banned word list to squelch an example of long verbatim code passages.
Basically, it's not that I don't understand your explanation. It's that it does emit long passages of unchanged code in practice, for whatever real-world reason.
I'm going to make a bold prediction: no one will ever lose a copyright lawsuit due to usage of Github Copilot generated code. The code snippets it produces are too small or trivial to qualify for copyright infringement.
Copilot is a new technology, and smallish snippets of code are all it is capable of at this point. Microsoft will surely work to expand its capabilities to produce larger and more complex programs, don’t you think?
As the saying goes, "when a product is free to use, the real product is actually you".
In this case, our code is the product.
I'm now considering switching to another Git provider...
Copilot sells the service of finding the code that makes sense for what you write. Would be better if it could correctly attribute the source(s) though, I hope they will solve this problem at some point.
Beware geeks with gifts. This is Microsoft. The question isn't "is it good?" but "Why are Microsoft offering it and how is it undermining everyone else?"
If that’s their motive they should stop charging for it. Or, if they need to cover source costs, open source copilot code and allow people to host their own
What stops me from re-uploading copyrighted source, where I remove the notices and push it with an MIT license? If a model has been trained on such a data set, how do you get it out?
All I can think of is Steve Yegge [1]: "They have no right to do this. Open source does not mean the source is somehow 'open'."
My code is on Github so that people can read it, reuse it and learn from it. "The freedom to study how the program works", as the FSF says. If some of the people reading it are machines, why would that matter?
Because a lot of this code would be put into closed source software, which is against the licence and would prevent people from exercising the right to study how a program works.
But I don't care if closed source programmers read my GPL code! The freedom to learn is not copyleft. So long as they put independent effort into their work, they're good in my book. Shared knowledge is a vital commons, and I'm honored if I can contribute to it.
Maybe this goes back to that debunked paper that claimed that transformers were only remixing input samples?
Again, the paper that said that transformers only copypasted input samples was highly misleading.
It seems clear to me that Codex has true understanding.
(Yes, I know that people have gotten secrets to appear in the output by prompting it in clever ways. That this happens doesn't prove that Codex doesn't understand what it's doing, it just shows that Codex doesn't understand everything.)
what AI is showing is the fuzzy line between creating and copying. The truth is they are both always present in everything we do, we've just been trying to hide it.
So it should be as simple as if you're using other people's content for your own profit you should properly compensate them.
Or we could just abolish copyright law and assume that everything humans create emanates from culture, so it's always collectively built and everything should be open source.
Or we just do the same we've been doing. Create even more complex laws trying to define this fuzzy line in a way that companies can keep profiting from it a lot more than individuals.
I've been using it for a day now and I'm really impressed. It is so aware of stuff in old code that it is scary. I'm working in an old application with Zend Framework.
Isn't every programmer in history (except the gal who invents her own language and writes all her own code) simply an archeologist for other people's work?
We all Duck/Google for code anyway. Why not admit and make it easier?
You don't understand the difference between many open source licenses or the concept of crediting open source code authors... it does not mean that the code is free for everyone to just use as they please...
Most of the time, the code Copilot suggests from any given project is not enough to warrant crediting that project. When I look up code in some GitHub repo and copy all or part of it, I do not credit that project.
This does raise a point - do we now have to assume that all those services that provide free hosting/access/service to open source projects will be strip-mining the work of the open source community to sell them back to us all? I almost feel stupid believing it was an altruistic move to contribute back to the shoulders of giants they were already standing on...
I feel scammed too. At this point it should be obvious, but I’m finally savvy to the fact that every tech company that offers anything free, and you use it to create “your” content, is not your friend and you don’t even own the works you host with them. I feel scammed that GitHub was cool about 10 years ago. It was like the professional/cultural center of gravity in my career. GitHubbers were cool people. Everyone cool hosted their site on GitHub Pages. I didn’t want to see a resume; what’s your GitHub? Now I feel stupid for having contributed whatever tiny bit of brains I did to this AI by thinking that I was using the cool, developer-first code website.
No. You still have the option not to buy Copilot and still use GitHub's services for free on public projects. Or, if you're not comfortable with your open source code being perused by an AI, you can set up your own privately hosted public Git repo pretty easily.
I honestly don't understand the general outrage at what seems to me a fair deal.
I disagree. Copilot is selling content-aware code suggestions, which is a result of code that other people wrote on their platform, and which in no way affects the work of these people.
GitHub Copilot is a paid feature, but that's a red herring in this discussion: people are free to monetize free software, and none of the major licenses forbids this.
GitHub Copilot is an advanced autocomplete / code generation system, based on a machine learning model. The code used for training the model is taken from projects hosted on GitHub. These projects were published under different licenses.
The main questions are:
Some of the licenses need something from you if you create a derivative work. Does the Copilot training itself count as creating a derivative work?
Sometimes the autocomplete basically quotes the original code. Does the original license then apply to the autocompleted / generated code too? How much of verbatim code quoting does it need for the result to be considered a derivative work?
Those instances where people demonstrate verbatim copies are mostly either well-known snippets which have been copied a million times already, or obvious completions of a partial verbatim piece of the supposedly copied code that any coder could extrapolate.
I get the feeling this entire debate would have been non-existent had this been a Jetbrains product instead.
The whole thing is just bizarre when the vast majority of developers constantly look at OSS code daily and lift ideas/patterns/snippets from there regularly without once looking at whatever license is attached.
> I get the feeling this entire debate would have been non-existent had this been a Jetbrains product instead.
why so?
> The whole thing is just bizarre when the vast majority of developers constantly look at OSS code daily and lift ideas/patterns/snippets from there regularly without once looking at whatever license is attached.
well, yes, copying an idea or pattern is generally.. accepted, to be kosher. copy-pasting too, in small amounts (a function, a type). that said, i would (and have) attribute even a notional similarity when writing something open source.
i don’t think co-pilot even allows the user to find where the code came from.
So when you google a problem and it leads you to a code snippet that solves it that just happens to be OSS, you immediately scrub your brain and pretend you never saw it and instead come up with your own completely independent solution after the fact?
Google usage is outright forbidden for work in institutions that care about intellectual property rights, so the brain scrub issue is just arguing at the wrong level.
If you're googling around for solutions, you're already not taking intellectual property seriously enough to care about what happens after you lift ideas.
Maybe I live in a bubble, but the likes of Google/StackOverflow have been part and parcel of a developer's toolbox for many years now.
And in any case I wonder how that is enforced. E.g., someone goes home in the evening and visits GitHub, learns a new trick, and comes into the office the next day and implements it.
Can you name these institutions? I am surprised to hear that some institutions would prevent devs from viewing e.g. documentation of the APIs they are using or academic papers about algorithms for computing the multiplicative inverses of 64-bit integers, if they accessed those things via google
This is interesting. Is the internet completely cut off? Do they have internal libraries of documentation for third party stuff they are using (paper? digital?) Do you have any example institutions, or what domain they are working in? Thanks.
A concern, which I think is legit, is that it is quite easy for someone with a strong presence in search, web advertising, analytics and mobile to puzzle together what a company is investing in based on the aggregated research and web access from known locations
I am not a lawyer, but my legal intuition / common sense says that “code snippets” are not copyrightable. There’s some sliding scale on when a code snippet would become so non-trivial that a reasonable (!) judge would consider it copyrightable, but nothing Copilot does is anywhere close to that limit, IMO.
One of the main claims in Google LLC v. Oracle America [0] was based around a 9-line rangeCheck function. Whilst some code can be too simple and small to copyright, programmers and lawyers are probably not going to view snippets the same way. Copilot creates risk.
Well, this does invite an interesting comparison. If we imagine something like Copilot applied to music I believe the chances of ending up in court would be pretty high. There are a lot of examples of plagiarism lawsuits in popular music and the outcome seems to be entirely random.
One could argue that the information density in chord progressions, bass lines and beats is extremely small. And that any recognizable part of a musical idea that has been "borrowed" would necessarily make up a larger percentage of the complete work than would be the case for a typical application with borrowed snippets.
That's not a bad argument, but it is unsatisfactory because it means that at some point someone has to make a judgement on how much you can borrow.