One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation would be the answer to the following question: Was Copilot trained using Microsoft internal source code, or will it be in the future?
As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will it start spitting out Windows kernel code once in a while? :-)
And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on? IANAL...
Alternatively, wait for co-pilot to add support for C++, then start writing an operating system with a Win32-compatible API using co-pilot.
There is plenty of leaked Windows source code on Github, so chances are that co-pilot would give quite good suggestions for implementing a Win32-compatible kernel. Then watch and see if Microsoft will try to argue that you are violating their copyright using code generated by their AI.
For example, the AI tool that Microsoft's lawyers use ("Co-Counsel") will be filing the DMCA notices and subsequent lawsuits against Co-Pilot generated code.
This will result in a massive caseload for the courts, so naturally they'll turn to their AI tool ("DocketPlus Pro") to adjudicate all the cases.
Only thing left is to enter these AI-generated judgements into Ethereum smart contracts. Then it's just computers suing other computers, and being ordered to send the fruits of their hashing to one another.
Have you read Accelerando by 'cstross? It plays out kind of like this, only taken to a tangent. Notably, it was written before Ethereum or Bitcoin were conceived. Great storyline.
Don't forget settlements paid in AI-generated crypto-currencies backed by gold mined in a fully automated Australian mine. Run it all on solar and humans can just fuck right off.
Who could have predicted machines would be very good at multitasking? As of today they are STILL writing code AND creating more wealth through gold hoarding AND smart contracts at the same time!
Somehow, I find this a plausible and not entirely undesirable outcome for society.
The less time humans spend interfacing with machines, the more points humanity gets anyway.
The nice thing about co-pilot is that it will suggest the same mistakes as in other software. If you accept all autosuggestions in C++ you might end up with Windows.
This is such a ridiculous statement to me. If this were a real problem we would have noticed by now with stackoverflow. I truly believe the vast majority of capable developers read, understand and test code they copy from somewhere. This is even more obvious with an AI that will never suggest 100% correct code all the time.
Without weighing in on the overall question of “is this a license violation”, you’ve created a false dichotomy.
“GitHub included Microsoft proprietary code in the training set because they view the results as non-derivative” and “GitHub didn’t include Microsoft proprietary code because they view the results as derivative” are clearly not the only options. They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
Or: they used the entire open source corpus because they thought it was free for the taking, and when people point out that it is not (that there are licenses), they spin that (claiming that only 0.1% of output is directly copied, which would still mean 100 lines in a 100k-line program) and pass any risk onto the user (saying it is the user's responsibility to vet any code they produce). So they aren't saying that users are in the clear, just that it isn't their problem.
Use neural indexes to find the code that most closely matches the output. Explainable AI should be able to tell you where the autocompletion results came from, even if it is a weighted set of files.
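A minimal sketch of what such an index could look like, with a toy bag-of-tokens embedding standing in for a real learned code embedding (everything here is hypothetical, just to make the idea concrete):

    # Toy "neural index": embed training snippets, then rank them by cosine
    # similarity to a generated completion. A real system would use a learned
    # code embedding instead of this bag-of-tokens stand-in.
    import numpy as np

    def embed(snippet, vocab):
        vec = np.zeros(len(vocab))
        for tok in snippet.split():
            if tok in vocab:
                vec[vocab[tok]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    training_snippets = [
        "for i in range(n): total += values[i]",
        "while node is not None: node = node.next",
        "return sorted(items, key=lambda x: x.score)",
    ]
    tokens = sorted({t for s in training_snippets for t in s.split()})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    index = np.stack([embed(s, vocab) for s in training_snippets])

    completion = "for j in range(count): total += data[j]"
    scores = index @ embed(completion, vocab)
    best = int(np.argmax(scores))
    print("closest training snippet:", training_snippets[best])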
That's a good idea in theory, but the smarter the agent gets, the less direct the derivation and the harder to explain it (and to check the explanation). We're already a long way from a nearest-neighbor model.
Yet the equivalent problem for humans gets addressed by the clean-room approach. This seems unfair.
Yeah, also in principle. But the cleanroom approach isn't technically required for humans either -- it became standard because the legal notion of a derived work is very fuzzy and gradually changing, and lawsuits are expensive and chancy, so you want a process that's provably not infringing. "Yeah I learned some general ideas from this code, but I didn't derive any of my code from theirs" seems to be a logical rats-nest. With the explainable-AI approach to this particular problem, the more intelligent the AI, the more this solution is like analyzing brain scans of your engineers. If your engineers could have produced "derived work" without literal copying, why can't the AI?
I agree, but we aren't anywhere close to that level yet. For that to be true, I think the AI should possibly have the ability to explain the code it created. What we have now is basically a fancy markov adlib code completion tool.
A more intelligent agent should be able to tell you where it learned all of its knowledge from. I personally would like my AI to be above "gut level instincts" otherwise it reinforces blind trust.
>saying it is their responsibility to vet any code they produce
But, if some of the code produced is covered by copyright, isn't Microsoft in trouble for distributing software that distributes copyrighted code without a license? How would it be different from giving out bootleg DVDs and trying to avoid blame by reminding everyone that the recipients don't own the copyright?
this complicated copyright problem shows we're still using last century concepts on new and emerging technology that surpassed it; it's time to think hard about it because we need neural nets and they need training data
Some are more equal than others though, aren't they? I mean, if MS throws out licensed code from others, as if to say: "Ahh, software licensing, such an outdated concept ..." but then keeps its own code out of that loop. "Yeah, but that's our own code, no one is allowed to copy that!"
I doubt they will corner the market for AI code assistants. ML models are replicated or surpassed in a few months by the competition. We will all benefit from them, it won't remain concentrated in a few hands.
Also, "0.1% of output is directly copied" doesn't include the lines where the variable names were slightly changed, but the code was still copied.
If you got the Microsoft codebase and Ctrl+F'd all the variable names and renamed them, I bet they would still argue that the compiled program was still a copy.
How is designing a very large system even close to the same thing as writing a few small functions? That's like saying an architect designing a building is doing the same thing as a brick layer putting down cement.
“””Computers can already author documents at near human quality. Research is continuing to increase the accuracy and volume of these models.
Language processing research will not only help doctors, but will allow machine-based language translation, and eventually automated chat bots that can converse in our languages.
The next steps in human-machine collaboration are to allow people and machines to co-create. A recent Chinese report suggests that 50% of scientific papers in this field will be written without human intervention by 2033, compared with only 11% today.
One of the biggest challenges of machine learning is giving the machine what it lacks. This usually means gaining enough training data to teach the algorithm how to make inferences from data points it has never encountered before.
Many of the large organisations involved in advancing AI's ability to develop documents can improve how the algorithms learn by building on the knowledge and experience of human workers.”””
The above text was automatically written by
https://app.inferkit.com/demo . It uses a language model to predict the next word in a sequence. In other words, to use your example, it not only architects, but builds, the entire building simply by predicting where to put the next brick.
So to answer your question: Yes. That’s exactly how it’s done.
And such a thing has never been achieved with code. Besides, very often the texts such an AI creates are nonsensical. And they are very short. Writing a few pages of text would be equivalent to a small tool of a few hundred lines. Or about the same as building a wooden shed. You don't need much skill for that. Come back when an AI can write multiple internally consistent books such as LOTR and the Silmarillion or the Harry Potter series. That's the scale of architecting a system.
True, but I also think this is showing a lack of imagination about where things are going.
You're trying to say architecting is some big woo idea that's somehow different from writing code. Kind of, maybe. But I bet you could build a functional kernel without central design. Given that's how biological systems work, I'm sure it could be done. Then what say you?
> They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
They don't claim they used an “open source corpus” but “public code”, because such use is “fair use” and not subject to the exclusive rights under copyright.
> One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation
They don't claim it wouldn't be a license violation, they claim licensing is irrelevant because copyright protection doesn't apply.
> And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on?
No, that would just show that they don't want to expose their proprietary code. It doesn't prove anything about derivative works.
Also, their own claim is not that the results aren't a derivative work but that training an AI is fair use, which is an exception to the exclusive rights under copyright, including the exclusive right to create derivative works.
Late to the thread but: OpenAI has not been a non-profit since 2019 (technically they call it a capped-profit company [1], but until the singularity you can ignore the cap). I guess this does impact the dynamic with Microsoft.
That's an interesting employment detail, but what does it have to do with the other parts of the organization? I happen to know that they work together on security and contract areas, and it wouldn't surprise me if there were other similar arrangements in place.
It's not clear to me that verbatim would be the only issue. It might produce lines that are similar, but not identical.
The underlying question is whether the output is a derivative work of the training set. Sidestepping similar issues is why GCC and LLVM have compiler exemptions in their respective licenses.
If simple snippet similarity is enough to trigger the GPL copyright defense I think it goes too far. Seems like GPL has become an obstacle to invention. I learned to run away when I see it.
It's not limited to similar or identical code. The issue applies to anything 'derived' from copyrighted code. The issue is simply most visible with similar or identical code.
If you have code from an independent origin, this issue doesn't apply. That's how clean room designs bypass copyright. Similarly if the upstream code waives its copyright in certain types of derived works (compiler/runtime exemptions), it doesn't apply.
So if you work on an open source project and learn some techniques from it, and then in your day job you use a similar technique, is that a copyright violation?
Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
If so you should only ever read BSD code, not GPL.
> Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
It seems to me that some people believe it does. Some of the "clean room" projects specifically instructed developers to not even look at GPL code. Specific examples not at hand.
Microsoft appears to believe this (or maybe just MacBU) because I've met employees who tell me they're not allowed to read any public code including Stack Overflow answers.
If that's the case then GPL code should not have been used in the training set. OpenAI should have learned to run away when they saw it. The GPL is purposely designed to protect user freedom (it does not care about any special developer freedom), which is its biggest advantage.
It wasn't trained on internal Microsoft code because the training set is publicly available code. It has nothing to do with whether or not it suggests exactly identical, functionally identical, or similar code. MS internal isn't publicly available. Copilot is trained on publicly available code.
You stated a fact "Copilot is trained on publicly available code".
The question (and implication) is: why not train it on MS internal code, if the claim that the output isn't license-incompatible is true.
If the output doesn't conflict with any open-source license (i.e. it springs into existence from general principles, not from "copying" licensed code), then MS-internal (in fact, any closed-source) code should be open season.
I can imagine a few of the non-obvious segments of code I've written being "recognizable" methods to solve certain problems. And, they are certainly licensed (GPL + Commercial, in my case).
I think, at the very least, that a set of AIs should be trained on different compatible sets of code, eg. GPL, AGPL, BSD, etc. Then, you could select what amount of license-overlap is compatible with your project.
Honestly I think a large part of the value add of machine learning is going to be the ability for huge entities to launder intellectual property violations.
As an example, my grandfather (an old school EE who got his start on radar systems in the 50s, who then got his radiology MD when my jewish grandmother berated him enough with "engineer's not doctor though...") has some really cool patents around highlighting interesting parts of the frequency domain in MRIs that should make detection of cancer a whole lot easier. As an implementation he did a bunch of tensor calculus by hand to extract and highlight those features because he's an incredibly smart old school EE with 70 years experience cranking that kind of thing out with only his trusty slide rule. He hasn't gotten any uptake from MRI manufacturers, but they're all suddenly really into recurrent machine learning models to highlight the same sorts of stuff. Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
I'm personally pretty anti intellectual property (at least how it's implemented in the states), but a system where large entities that have the capital investment to compute the large ML models can launder IP violations, but little guys get stuck to the letter of the law certainly seems like the worst of both worlds to me.
I don't understand how your example relates to "launder intellectual property violations". What you're saying is that your grandfather hand wrote some feature extractors that look similar to the neurons that ML models have learned from backpropagation. There's no stealing of IP there at all.
He has a set of patents on certain types of highlighting frequency domain patterns in MRIs. In a lot of ways recurrent neural networks can be frequency domain feature extractors as the backwards data flows create sort of delay line memories tapped at interesting periods. The MRI manufacturers after refusing to license his patents, heavily invested in ML models that focus on using recurrent networks for frequency domain feature extraction. Patents aren't like copyright where independent reinvention is a way out; he has a monopoly on the concepts regardless of how someone came about them, even by growing them a bit organically like how ML works.
You can't patent a concept. You can only patent a process, a machine, an article of manufacture, or a composition of matter. And the invention must be described sufficiently such that a practitioner skilled in the relevant art can reproduce the subject matter.
You can patent a software process or machine running classes of software in the US as long as it doesn't conflict with the Alice Corp. test, which is analogous to what I mean by concept in this case. And his methods are extremely well documented in the patents so that they can be reproduced. I guess if someone manually cranked through the math for each video's set of pixels they wouldn't be infringing, but oncologists aren't really ok with waiting a year for results. Any practical implementation would be infringing.
And like I said, I'm pretty anti US structures around intellectual property (including software patents), but I'm not for the only ones being able to circumvent the legal process being entities with large banks of capital.
> Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
How many models are back-propagated first and then hand-tuned?
That's a great question. I had assumed that the workflow of an ML engineer consisted of managing the data and a relatively high level set of parameters around a search space of layers and connectivity, as the whole shtick of ML is that the hyperparameter space of the tensors themselves is too complex to grok or tweak when generated from training. But I only have a passing knowledge of the subject, pretty much just enough to get myself in trouble in these kinds of discussions.
Any chance some fantastic HNer could chime in there?
I'm no data scientist but many statistical methods rely on prior knowledge and even computed inputs.
Two examples I can think of: doing linear regression on the square of your input, and, for deep learning, improving visual representations by taking samples of the colors at various frequencies. [1]
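A toy sketch of the first example, just to make the engineered-input idea concrete (the data and coefficients are made up):

    # Linear regression on an engineered feature: we hand the model x**2
    # directly instead of expecting it to discover the quadratic shape.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 2.0 * x**2 + 1.0 + rng.normal(scale=0.5, size=x.size)  # quadratic ground truth

    X = np.column_stack([x**2, np.ones_like(x)])  # prior knowledge baked into the features
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"fitted y ~= {coef[0]:.2f} * x^2 + {coef[1]:.2f}")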
Yeah, that's a better way of saying what I meant by managing the data. Mentally projecting data through, massaging said data, and building reproducible pipelines rather than manually tweaking the learned weights after the fact.
I don't see the point of this tool, independent of the resulting code being derivative of GPL code or not.
being able to produce valid code is not the bottleneck of any developer effort. no projects fail because code can't be typed quickly enough.
the bottleneck is understanding how the code works, how to design things correctly, how to make changes in accordance with the existing design, how to troubleshoot existing code, etc.
this tool doesn't make anything any easier! it makes things harder, because now you have running software that was written by no one and is understood by no one.
It doesn't claim to solve the bottleneck either. On the contrary, it clearly states that its mission is to solve the easy parts better so developers can focus on the truly challenging engineering problems you mentioned.
This reminds me of a startup pitch where it’s always “oh we take care of x so you don’t have to,” but the problem is now I just have another thing to take care of. I cannot speak for people who use Copilot “fluently,” but I know for every chunk of code it spat out I would need to read every line and make sure “Is this right? Is the return type what I want? Will this loop terminate? Is ‘scan’ the right API? Is that string formatted properly? Can I optimize this?” etc. To me it’s hardly “solving the easy parts,” but rather putting the passenger’s hands on the wheel.
if it doesn't claim to help any code production bottlenecks, then what good is it? it's just piping in code that may or may not contain a subtle bug or three.
That doesn't help anyone!!
I am usually pretty pro-Microsoft, but this tool is a security nightmare and a bad idea all around. It will cause many (most? all?) who use it far more work than it saves them, long-term.
To me that - and really any form of common boilerplate - is just evidence that we're lacking abstractions. If your editor is generating code for you, that means that the 'real' programming language you're using 'in your head' has some metaprogramming facilities emulated by your IDE.
I think we should strive to improve our programming languages to make less of this boilerplate necessary, not to make generating boilerplate easier. The latter is just going to make software more and more unwieldy. Imagine the horror if, instead of (relatively) higher level programming languages like C, we were all just using assembly with code generation.
In a very real sense, we are all just using assembly with code generation.
I really like your point on symptoms of insufficient abstraction. I do worry that we always see abstraction as belonging in language. Which in turn we treat as a precious singleton, and fight about.
At least in my own hacking, I'm surprised how infrequently I see programmers write programs that write programs. I'm surprised how infrequently I see programmers programming their shell, editor, or IDE.
Completely agree. If anything, I see tools like this actually decreasing engineering speed. I don't see how it doesn't lead to shipping large quantities of code the team didn't vet carefully, which is a recipe for subtle and hard-to-find bugs. Those kinds of bugs are much more expensive to find and squash.
What we really need aren't tools that help us write code faster, but tools that help us understand the design of our systems and the interaction complexity of that design.
Have to fully agree; just seems like a "cool" tool where if you had to actually use it for real world projects, it's going to slow you down significantly, and you'll only admit it once the honeymoon period is over.
What happens when someone puts code up on GitHub with a license that says "This code may not be used for training a code generation model"?
- Is GitHub actually going to pay any attention to that, or are they just going to ingest the code and thus violate its license anyway?
- If they go ahead and violate the code's license, what are the legal repercussions for the resulting model? Can a model be "un-trained" from a particular piece of code, or would the whole thing need to be thrown out?
By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS[1].
That effectively “overrides” any license or term that you’ve specified for your repository, since you’ve already licensed the content to GitHub under different terms. Of course, people who are not GitHub are beholden to the terms you specify.
> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
But, it goes on to say:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
I'm not a lawyer, but it seems ambiguous to me if this ToS is sufficient to cover CoPilot's butt in corner cases; I bet at least one lawyer is going to make some money trying to answer the question.
IANAL, but I wouldn't read that as granting GitHub the right to do anything like this. There's definitely a reasonable argument to be had here, but I think limiting the grant of rights to incidental copies should trump "[...] or otherwise analyze it on our servers" and what they're allowed to do with the results of that analysis.
On the extreme end, "analysis" is so broad that it could arguably cover breaking down a file of code into its constituent methods and just saving the ASTs of those methods verbatim for Copilot to regurgitate. That's obviously not an acceptable outcome of these terms per se, but arguably isn't any different in principle from what they're already doing.
Ultimately, as I understand, courts tend to prefer a common sense outcome based on a reasonable human understanding of the law, rather than an outcome that may be defensible through some arcane technical logic but is absurd on its face and counter to the intent of the law. If a party were harmed by an instance of Copilot-generated copyright infringement, I don't see a court siding with this tenuous interpretation of the ToS over the explicit terms of the source code license. On the other hand, it would probably also be impossible to prove damages without something like a case of verbatim reproduction, similarly to how having a developer move from working on proprietary code for one company to another isn't automatically copyright infringement.
I doubt that GitHub is doing anything as blatantly malicious as copying snippets of (GPL or proprietary) code to explicitly reuse verbatim, but if they're learning from license-restricted code at all then I don't see how they wouldn't be subjecting themselves and/or consumers of Copilot to the same risk.
Why are developers so myopic around big tech? Of course they can. Facebook can use your private photos. It's in their terms of service. Cloud providers have more generous terms.
The response has always been they won't do that because they have a reputation to manage. The further they grow the further they control the narrative so the less this matters.
Wait until you find out they sell your data or use your data to sell products.
Why in 2021 are we giving Microsoft all of our code? It seems like the 90s, 2000s never happened and we all trust microsoft. They have a free editor and a free operating system that sends packets of activity the user does back to microsoft but that's okay.. we want to help improve their products? We trust them.
Of course. A "private" repo is still on their servers. It's only private from other GitHub users, not the actual site administrators. This is the same in any website, of course the admins can see everything. If you truly want privacy, use your own git servers.
Why do you think people care so much about end-to-end encrypted messaging?
Yes, the concept of a "private" repo is enforced only by GitHub's service. A bug in their auth code could lead to others having access. A warrant could lead to others having access. Etc.
yes, that's what that specific section means, but as always with these documents you can't just extract a single section, you need to take the document as a whole (and usually, more than one document - ToS privacy policy are usually different)
these documents are structured as granting the service provider extremely broad rights, and then the rest of the document takes away portions of those rights. so in this case they claim the right to share any code in any repo with anyone, and then somewhere else they specify which code they won't share, and with whom they won't share it.
Fun fact: Every major cloud provider has a similar blanket term. For example, Google doesn't need to license music to use for promotional content, because YouTube's terms grant them a worldwide license to use uploaded content for purposes including promoting their services, and music labels can't afford to not be on YouTube. (It's probable even uploading content to protect it, as in Content ID, would arguably cause this term to apply.)
It all comes down to the nuance of whether the usage counts as part of protecting or improving (or promoting) their services and what other terms are specified.
The use of the definition Your Content may make GitHub's own ToS legally invalid in a large number of cases as it implies that the uploader must be the sole author and "owner" of the code being uploaded.
From the definitions section in the same doc:
> "Your Content" is Content that you create or own.
That will definitely exclude any mirrored open-source projects, any open-source project that has ever migrated to Github from another platform, and also many forked projects.
How is this different from uploading a hollywood movie to youtube? Just because there is a passage in the terms that the uploader supposedly gave them those rights, this does not mean they actually have the power to do that.
You can't give Github or Youtube or anybody else copyright rights if you don't have them in the first place. This is what ultimately torpedoed "Happy Birthday" copyright claims: while it's pretty undisputed that the Hill sisters gave their copyright to (ultimately) Warner/Chapelle, it was the case that they actually didn't invent the lyrics, and thus Warner/Chapelle had no copyright over the lyrics.
So if someone uploads a Hollywood movie to Youtube, Youtube doesn't get the rights to play that movie from them because they didn't have the rights in the first place. Of course, if the actual copyright owner uploads it, it's now permissible for Youtube to play it, even if it's the copy that someone else provided. [This has torpedoed a few filesharing lawsuits.]
Not sure how much it would matter but the main difference I see is that if I upload my own code to GitHub I have the ability to give away the IP, but if I upload Avengers End Game to YouTube I don't have the right to give that away.
I wonder how it would work if we consider you flagged your code as GPL before it hits Github.
We could end up in the same situation as the Hollywood movie even if you are also the one setting the original license on the work. Basically you have a right to change the license, but it doesn’t mean you do.
> By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS.
That's nonsense because they could claim that for almost any reason.
E.g. assume Google put the source code of Google search in Github. Then Github copies that code and uses it in their own search, since that "improves the service". Would that be legal?
It's like selling a pen and claiming the rights to anything written with it.
If the pen was sold with a contract that said the seller has the rights to anything written with it, then yes. These types of contracts are actually quite common, for example an employment contract will almost certainly include an IP grant clause. Pretty much any website that hosts user-generated content as well. IANAL, but quite familiar with business law.
I rather suspect judges would not see "improving the Service over time" as permission to create derivative works without compensation.
The person uploading files to github is also not necessarily doing so with permission from the rights holder, which might be a violation of the terms of service, but would mean there's no agreement in place.
I sort of doubt that GitHub could include GPL code in a piece of closed-source program that they distribute that "improves the service" and claim that this gives them the right.
I would bet this is about as applicable as the Facebook posts of my parents' friends, something like: 'All my content on this page is mine alone and I expressly forbid Facebook Inc. usage of it for any purpose.'
It's not binding because the other party hasn't agreed. You agree to terms when you use the site. One party can't unilaterally change the agreement without consent of the other party.
I see where you're coming from but it's not quite the same thing; Facebook doesn't encourage people to choose a license for the content that they post there, so there's no expectation that there are any terms aside from those in Facebook's ToS. OTOH GitHub has historically very strongly encouraged users to add a LICENSE to their repositories, and also encouraged users to fork other people's code and and push it to GitHub. That GitHub would be exempt from the licensing terms of the code pushed to it, except for the obvious minimal extent they might need to be in order to provide their services, seems like an extremely surprising interpretation.
It has nothing to do with GitHub being exempt from anything. It's that users are bound by the terms they agreed to in a ToS. If there is a conflict between a user-created license and a site's ToS, the burden is on the user to resolve it.
To be clear, I'm not suggesting this is some kind of loophole GitHub is using to trample on users' licenses, even though maybe they could. It's probably completely legal for GitHub to use even the most super-extra-double-GPL-licensed code because copyright law allows it.
The author of the Twitter post's suggestion that Copilot's output must be a derivative work is based on a naive understanding of "derivative" as it's defined in copyright law. It's not hard to find clear explanations of how this stuff works, and it's obvious she didn't bother to do any homework. Several criteria would appear to rule out GitHub's use as infringement. e.g.:
'In essence, the comparison is an ad hoc determination of whether the protectable elements of the original program that are contained in the second work are significant or important parts of the original program.'
Why not? If it does not exist you treat it as proprietary (copyrighted by default) and if it does exist at least the author claims that the given license is an option (possibly their mistake, not mine)
Also, how would you know if your code was included in the training or not?
Then, let’s say the AI generates some new code for someone, and it is nearly identical to some bit of code that you wrote in your project.
If they didn’t use your code in the model, then the generated code is clearly not a copyright violation, since it was effectively a “clean room” recreation.
If your code was included in the model, is it therefore a violation?
But then again, it comes down to how can someone prove their code was included or not?
What if the creators don’t even know? If you wrote your model to say, randomly grab 50% of all public repos to use in the model, then no one would know if a specific repo was used in the training.
They "just" have to comply with all the licenses for all the code that the program was trained on?
I suppose that for most open source licences this at the very least involves attribution for all the people that produced the code that the program was trained on?
I post my code publicly, but with an "all rights reserved" licence. I don't mind everyone reading my code freely, but you can't use it for anything but learning. If I found out they were ingesting my code I would be angry. It's like training your replacement. I don't use GitHub, anyways, but now I'll definitely never even think about it.
Technically then I'm infringing as soon as I clone your repo, possibly even as soon as a webserver sends your files to my browser.
"All rights reserved" makes sense on final items, like books or physical records, that require no copy or change after owner-approved manufacturing has taken place. It doesn't really make sense on digital artefacts.
So don't clone it, read it online. I reserve all rights, but I do give license to my host to make a "copy" to let you view it. I do that specifically to prevent non-biological entities like corporations or AI from using my code. If you're a biological entity, I specify you can email me to get a license for my code for a specific, defined purpose. I have a conversation with that person, then I send them a record number and the terms of my license for them in which I grant some rights which I had reserved.
Also, in your example, the copyright for the book or dvd is for the content, not the physical item. You can do anything you want with that item but not the content. My code is similar, I'm licensing my provider to serve you a visual representation of the files so you can experience the content, not giving you a license to run that code or use it otherwise.
> possibly even as soon as a webserver sends your files to my browser.
Considering how it works for personal data with the GDPR, I doubt that this is even needed?
Also copyright is something you have by default, no licence terms necessary.
OTOH if they aren't a human, then copyright barely applies to them anyway (consider search engine crawlers indexing your website for instance), and I don't think that putting up a notice will legally change anything?
(You'll probably have better luck with robots.txt ...)
If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work, especially if that section of code was somehow novel or specific to a narrow domain, I suspect they'd have a case. I don't know much about OpenAI Codex, but if it's anything like GPT-3, or uses it under the hood, then it's very likely that certain sequences are simply memorized, which seems like the maximal case for claiming derivative works. On the other hand, if someone has GPL'd code that implements a simple counter, I doubt the courts would pay much attention.
I do wonder, though, if GPL owners worried about their code being shanghaied for this purpose could file arbitration claims and exploit some particularly consumer-friendly laws in California which force companies to pay fees like when free speech dissidents filed arbitrations against Patreon.[0] Patreon is being forced to arbitrate 72 claims individually (per its own terms) and pay all fees per JAMS rules. IANAL, so I don't know the exact contours of these rules, or if copyright claims could be raised in this way, or even if GitHub's agreements are vulnerable to this loophole, but it'd be interesting.
You don't need to have a winnable case, just enough of a case for a large company (hello Oracle) to sue a small one. Is any version of Oracle-owned Java in the corpus? Or any of the DBs they bought (MySQL)?
> If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work...
Under the right circumstances, Copilot will recite a GPL copyright header. It isn't a huge step from that to some other commonly repeated hunk of GPLed code -- I'd be particularly curious whether some protected portion of automake/autoconf code shows up often enough that it'd repeat that too.
But what would we think of the legal start-up that automatically checked all of GitHub to see whether the AI could be persuaded to spit out a significant amount of any project's code verbatim?
It doesn't matter, Copilot isn't human, therefore it isn't considered as an author, and therefore cannot do derivative works.
The issue is with the users of Copilot potentially violating copyright and licences (non-attribution for instance) and with Microsoft facilitating it. (See also : A&M Records, Inc. v. Napster, Inc.)
An interesting impact of this discussion is, for me: within my team at work, we're likely to forbid any use of Github co-pilot for our codebase, unless we can get a formal guarantee from Github that the generated code is actually valid for us to use.
By the way, code generated by Github co-pilot is likely incompatible with Microsoft's Contribution License Agreement [1]: "You represent that each of Your Submission is entirely Your original work".
This means that, for most open-source projects, code generated by Github co-pilot is, right now, NOT acceptable in the project.
I'd say that it depends on the license; for StackOverflow, it's CC-BY-SA 4.0 [1]. For sample code, that would depend on the license of the original documentation.
My point is: when I'm copying code from a source with an explicit license, I know whether I'm allowed to copy it. If I pick code from co-pilot, I have no idea (until tested by law in my jurisdiction) whether said code is public domain, AGPL, proprietary, infringing on some company's copyright.
> forbid any use of Github co-pilot for our codebase,
I have recommended as such to the CTO and other senior engineers at the startup I work at, pending some clear legal guidance about the specific licensing.
My casual read of Copilot suggests that certain outputs would be clear and visible derivatives of GPL code, which would be _very bad_ in court- probably? Some other company can have fun in court and make case law. We have stuff to build.
I'm not sure why I'm getting down voted?
"We'll forbid the use of copilot in our code base"
How????
How the fuck would anyone know how the code was written?
"We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"
If it's spitting out verbatim code 0.1% of the time, surely it's spitting out copied code where only trivial things are different at a much higher rate.
Trivial things meaning swapped order where order isn't important, variable/function names, equivalent ops like +=1 vs ++, etc.
Surely it's laundering some GPL code, for example, and effectively removing the license in a way that sounds fishy.
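A hypothetical before/after of the kind of "trivially different" copy meant here (both snippets are invented for illustration, not actual Copilot output):

    # "Original" (imagine it came from a licensed repository):
    def rolling_average(samples, window):
        total = 0.0
        for value in samples[-window:]:
            total += value
        return total / window

    # Near-verbatim variant: only identifier names and cosmetic details differ.
    def mean_of_tail(data, n):
        acc = 0.0
        for x in data[-n:]:
            acc = acc + x
        return acc / n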
You certainly did! But there are a lot of people who think "OSS license means there are no requirements" and think it's okay to do things like copy without attribution when the license requires attributions. I know you didn't say anything like that either, but some others might think it.
It seems to me an important question is, "is this like a human who learns from examples, or is this really a derivative work in the copyright sense?". I'm not sure how to answer that. I'm not a lawyer. I don't know if many lawyers can answer that question either!
Copilot isn't human and therefore what it does isn't a "work".
The usual issues still apply to users of Copilot - unwitting violations of license terms of the code it was trained on (like non-attribution) are still violations.
You could say a human is laundering GPL code if they learned programming from looking at Github repositories. Would you, though? The type of model they use isn't retrieving, it's having learned the syntax and the solutions that are used, just like a human would.
> You could say a human is laundering GPL code if they learned programming from looking at Github repositories.
I don't have photographic memory, so I largely don't memorize code. I learn general techniques, and memorize simple facts such as APIs. I can memorize some short snippets of code, but these probably aren't enough to be copyrightable anyway.
> The type of model they use isn't retrieving
How do we know? I think it's very likely that it is largely just retrieving code that it memorized, and doing minor adjustments to make the retrieved pieces fit the context. That wouldn't differ much from finding code that matches the problem (whether on SO or Github), copy-pasting the interesting bits, and fixing it until it satisfies the constraints of the surrounding code. It's impressive that AI can do that, but it doesn't sound like it's producing original code.
I think the alternative to retrieving would actually require a higher level understanding of the world, and the ability to reason from first principles; that would be much closer to AGI.
For example, if I want to implement a linked list, I'm not going to retrieve an implementation from memory (although given that linked lists are so simple, I probably could). I know what a linked list is and how it works, and therefore I can produce working code from scratch... for any programming language, even ones for which no prior implementations exist. I doubt co-pilot has anything remotely as advanced as this ability. No, it's fully reliant on just retrieving and reshaping pieces of memorized code; it needs a large corpus of code to memorize before it can do anything at all.
I don't need a large corpus of examples to copy, because I use my ability to reason in conjunction with some memorized general techniques and common APIs in order to produce original code.
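For what it's worth, here's roughly what I mean by producing a linked list from the concept alone rather than from a memorized corpus (a minimal sketch):

    # A linked list written from the concept alone: nodes that point at the
    # next node, with O(1) prepend.
    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    class LinkedList:
        def __init__(self):
            self.head = None

        def push(self, value):
            # Prepend by making the new node point at the old head.
            self.head = Node(value, self.head)

        def __iter__(self):
            node = self.head
            while node is not None:
                yield node.value
                node = node.next

    lst = LinkedList()
    for v in (3, 2, 1):
        lst.push(v)
    print(list(lst))  # [1, 2, 3]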
gonna develop my own linux-like kernel soon, with my own AI model trained on public repositories
wanna see the source code of my AI model? oh, it's closed source
it's just coincidence that nearly 100% of my future linux-like kernel code looks the same as linux the kernel, bear in mind that my closed-source AI model takes inspiration from GitHub Copilot, there is no way that it will copy any source code
It may be possible to use closed source code during training and delete it, leaving just a black box model that is hard to prove was derived from that closed source code.
You get to make changes without having to respect the GPL and thus no longer obligated to provide those changes to your end users, as you have effectively laundered the kernel source code by passing it through an "AI" and get to relicense the end result.
For years people have warned about hosting the majority of world's open source code in a proprietary platform that belongs to a for profit company. These people were called lunatics, fundamentalists, radicals, conspiracy theorists, and many other names.
Well, they were ignored and this is the result. A for-profit company built a proprietary system using all the code hosted on its platform without respecting the code licenses.
There will be a lot of people saying this is not a license violation but it is, and more, it is an exploitation of other people work.
Right now I'm asking myself when people will stop supporting this kind of company, which exploits people's work without giving anything in return to people and society while making a huge amount of profit.
If you read a book and use the instructions to build a bicycle you are learning a new skill and this is obviously not exploitation of people's work.
When you read a book and copy this book partially or entirely to create a new book or create a derivative work using this book without citation it's called plagiarism and copyright infringement. It is not only exploitation, it is against the law.
If you feed an entire library to an AI to generate new books without source citation and copyright agreements, it is not only exploitation, it is against the law. We can call this automated plagiarism and copyright infringement, and automated or not, it is against the law. Except if you use public domain books: that wouldn't be illegal, but it would be highly unethical, considering there are powerful companies with big pockets bending public-domain laws to keep their assets from becoming publicly available (I'm looking at you, Disney), but that is another story.
I think you are abstracting the matter by taking out the humanity. It's one thing to learn to do something by hand after purchasing the book. It's a totally different thing to read every single book in the world (humanly impossible) and then absorb some knowledge and train yourself to write exceptional books because you (the AI in this scenario) have learned that some words and sentence structures have led to books having higher ratings than others. It's not humanly possible.
Of course we generate the world around us and its rules, but I get angry every time we compare people to machines and say that it's the same thing. No it's not. We are constrained by time and space. I can't add more brain or more eyes to my body so that I can read more books, can I? Microsoft can have a small city of servers somewhere and that could replace lots of people's jobs.
People have trained ML models on code thats on Github before co-pilot. (lots of examples here: https://github.com/src-d/awesome-machine-learning-on-source-...) There's nothing proprietary that other interested people or companies couldn't easily replicate here.
It certainly seems to be a laundering enabler. Say that you want to un-GPL-ify some famous copylefted code that is in the training database. You type the first few innocuous characters of it, then the co-pilot keeps completing the rest of the same exact code, for it offers a perfect match. If the completion is not exact, you "twiddle" it a bit until it becomes exact. Bang! You have a non-GPL copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!
> Bang! you have a non-gpl copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!
Thinking that this would conveniently bypass the fact that your goal was to copy the code seems to be the most common legal fallacy amongst software developers. The law will see straight through you, and you will be found to have infringed copyright. The reason is well explained in "What Colour are your bits?" (https://ansuz.sooke.bc.ca/entry/23).
My message was sarcastic. I'm worried about accidental conversion of free software into proprietary. I mean, "accidental" locally, in each particular instance; but maybe non accidental in the grand scheme of things.
EDIT: I can write my worry, semi-jokingly, as a conspiracy theory: Microsoft is using thousands of unsuspecting (and unwilling) developers to turn a huge copylefted corpus of algorithms into non-copylefted implementations. Even assuming that developers that use the co-pilot use non-copyleft licenses only 50% of the time, there's still a constant trickle of un-copyleftization.
I suppose someone should make an OS-generating AI; conceptually it can just have Windows, OS X and some Linux distros in it and output one based on a question about favorite color or something.
You'd just have to wrap it in a nice complex model representation so it's a black box you fed example OS's with some meta-data into and it happens to output this very useful data.
After all, once you use something as input to a machine learning model apparently the license disappears. Sweet.
* Someone uses copilot to generate a Windows clone and starts selling it
I wonder how Microsoft would react to that. I wonder if they've manually blacklisted leaked source code from Windows (or other Microsoft products) so that it doesn't show up in Copilot's training data. If they have, that means Microsoft recognizes the IP risks of having your code in that data set, which would make this Copilot thing not just the result of poor planning/maybe a little incompetence, but something much more devious and malicious.
If Microsoft is going to defend this project, they should introduce all of their own source code into the training data.
why do you think it has to be source code? it could be the compiled code after all.
If what we're talking / fantasizing about here works in the way of `let x = 42`, it should work equally well with `loda 42` etc., so source code be damned. It was only ever an intermediate step, inserted between the idea and the working bits, to enable humans to helpfully interfere. Dispensable.
Come on, there is a huge gap between 1) writing a single function (potentially incorrectly) with a known prototype/interface and a description and 2) designing interfaces, datatypes and APIs themselves.
You probably won't get an entire operating system out of it, but I could totally see a project like Wine using it to implement missing parts of the Win32 API and improve their existing implementations.
That's what I was wondering. I've never been interested enough to steal anyone else's code, but with all the code transformers and processing tools nowadays, I imagine it's trivial to translate source code into a functionally equivalent but stylistically unique version?
Assuming ML models are causal, then bits of GPL code that fall out of the model have to have the color GPL, because the only way they could've gotten there was to train the ML using GPL-colored bits. It seems to me like the answer here is pretty obvious, it doesn't really matter how you copy a work.
I don’t think most of us are scared enough of being “tainted” by the sight of a GPL snippet that we’d bother. Besides, if you want to target a specific snippet so you can type the start to prime the recognition - you already saw it?
Why not just copy it and then edit it? If a snippet is changed both logically and syntactically to not resemble the original, then it’s no longer the original and you aren’t in any licensing trouble. There is no meaningful difference between that manual washing and a clean room implementation.
All the ML changes here is the accidental vs deliberate. But it will be a worse wash than your manual one.
Yes this is a concern, but I'm not sure if the AI is actually able to "generate" a non-trivial piece of code.
If you tell it to generate "a function for calculating the barycentric coordinates of a ray-triangle intersection", you might get a working implementation of a popular algorithm, adapted to your language and existing class/function/variable names.
But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.
And if that's true, it means that copilot isn't really generating anything. It's just a (high-tech) search engine that knows how to adapt the code it finds to fit your codebase. That's still a really cool technology and worth exploring, but it doesn't do enough to justify ignoring software licenses.
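For concreteness, the "popular algorithm" such a prompt would most likely surface for barycentric ray-triangle coordinates is something like Möller-Trumbore; a minimal sketch (written here for illustration, not taken from Copilot):

    # Möller-Trumbore: returns (t, u, v) where u and v are the barycentric
    # weights of the second and third vertices, or None if the ray misses.
    import numpy as np

    def ray_triangle_barycentric(origin, direction, v0, v1, v2, eps=1e-9):
        e1, e2 = v1 - v0, v2 - v0
        p = np.cross(direction, e2)
        det = np.dot(e1, p)
        if abs(det) < eps:           # ray is parallel to the triangle plane
            return None
        inv_det = 1.0 / det
        s = origin - v0
        u = np.dot(s, p) * inv_det
        if u < 0.0 or u > 1.0:
            return None
        q = np.cross(s, e1)
        v = np.dot(direction, q) * inv_det
        if v < 0.0 or u + v > 1.0:
            return None
        t = np.dot(e2, q) * inv_det  # distance along the ray to the hit point
        return t, u, v

    hit = ray_triangle_barycentric(
        np.array([0.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0]),
        np.array([-1.0, -1.0, 0.0]), np.array([1.0, -1.0, 0.0]), np.array([0.0, 1.0, 0.0]),
    )
    print(hit)  # t=1.0, u=0.25, v=0.5 for this ray and triangle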
>But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.
But since APIs are now unprotected, you could feed it all of the class structure and method signatures and have it fill in the blanks. I don't know if that gets you a working operating system, but it seems like it would get you quite a long way.
The second tweet in the thread seems badly off the mark in its understanding of copyright law.
> copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
Copyright law is very complicated (remember Google vs Oracle?) and involves a lot of balancing different factors [0]. Simply saying that something is a "derivative work" doesn't establish that it's copyright infringement. An important defense against infringement claims is arguing that the work is "transformative." Obviously "transformative" is a subjective term, but one example is the Supreme Court determining that Google copying Java's API's to a different platform is transformative [1]. There are a lot of other really interesting examples out there [2] involving things like if parodies are fair use (yes) or if satires are fair use (not necessarily). But one way or another, it's hard for me to believe that taking static code and using it to build a code-generating AI wouldn't meet that standard.
As I said, though, copyright law is really complicated, and I'm certainly not a lawyer. I'm sure someone out there could make an argument that Copilot is copyright infringement, but this thread isn't that argument.
Edit: Note that the other comments saying "I'm just going to wrap an entire operating system in 'AI' to do an end run around copyright" are proposing to do something that wouldn't be transformative and therefore probably wouldn't be fair use. Copyright law has a lot of shades of grey and balancing of factors that make it a lot less "hackable" than those of us who live in the world of code might imagine.
Google copied an interface (declarative), not code snippets/functions (implementation).
Copilot is capable of copying only implementation. IMO that is quite different, and easily a violation if it was copied verbatim.
I think the core argument has much more to do about plagiarism than learning.
Sure, if I use some code as inspiration for solving a problem at work, that seems fine.
But if I copy verbatim some licensed code then put it in my commercial product, that's the issue.
It's a lot easier to imagine for other applications like generating music. If I trained a music model on publicly available Youtube music videos, then my model generates music identical to Interstellar Love by The Avalanches and I use the "generated" music in my product, that's clearly a use that is against the intent of the law.
Many behaviors which are healthy and beneficial at human-level scale can easily become unhealthy and unethical at industrial automation scale. There's little universal harm in cutting down a tree for fire during the winter; there is significant harm in clear-cutting a forest to do the same for a thousand people.
Exactly. This comes up with personal data protection as well. There's no problem in me jotting down my acquaintances' names, phone numbers, and addresses and storing it in my computer. But a computer system that stores thousands of names, phone numbers, and addresses must get consent to do so.
The AI doesn't produce its own code or learn, it is just a search engine on existing code. Any result it gives exists in some form in the original dataset. That's why the original dataset needs to be massive in the first place, whereas actual learning uses very little data.
If I read something, "learn" it, and reproduce it word for word (or with trivial edits) even without referencing the original work at all, it is still copyright infringement.
As the original commenter said, you have the capability for abstract thought and generalized learning, which the "AI" lacks.
It is not uncommon to ask a person to "explain in your own words..." - as in, use your own abstract internal representation of the learned concepts to demonstrate that you have developed such an abstract internal concept of the topic, and are not merely regurgitating re-disorganized input snippets.
If you don't understand the difference...
edit: That said, if you can create a computer capable of such different abstract thought, congratulations, you've solved the problem of Artificial General Intelligence, and will be welcomed to the Trillionaires' Club
The AI most certainly does not lack the ability to generalize. Not as well as humans, but generalization is the key interesting result in deep learning, leading to papers like this one: https://arxiv.org/abs/1710.05468
The ability to generalize actually seems to keep increasing with the number of parameters, which is the key interesting result in the GPT-* line of work that Copilot is based on.
I've seen some very clever output from GPT-*, but zero indicating any kind of abstract generalized understanding of any topic in use.
Being able to predict the most likely succeeding string for a given input can be extremely useful. I've even used it with some success as a more sophisticated kind of search engine for some materials science questions.
But I'm under no illusions that it has the first shadow of a hint of minor understanding of the topics of materials science, nevermind any general understanding.
It seems we're discussing different meanings of the word "generalize".
I propose we as developers start a secret society where we let the AI write the code, but we still claim to write it manually. In combination with the new working-from-home policies, we can lie on the beach all day and still be as productive as before.
This would be the demise of the human race. I’m not entirely opposed to that, though. When AI inevitably outperforms humans on almost all tasks, who am I to say humans deserve to be given those tasks?
In this case we should be able to work less and enjoy the benefits of automation. We just need to live in an economic system where the economic value is captured by the people at large, and not a minority that owns capital.
> When AI inevitably outperforms humans on almost all tasks
Correct me if I'm wrong, but is that even possible? I kind of thought that AI is just a set of fancy statistical models that require some (preferably huge) data set in order to infer the best fit. These models can only outperform humans in scenarios where the parameters are well defined.
Many (most?) tasks humans regularly perform don't have clean and well-defined parameters, and there is no AI we can conceive of that is theoretically able to perform the task better than an average human with adequate training.
> Correct me if I’m wrong, but is that even possible?
It's not possible because of comparative advantage - someone being better than you at literally everything isn't enough to stop you from having a job, because they have better things to do than replace you. Plus "being a human" is a task that people can be employed at.
> Correct me if I’m wrong, but is that even possible?
Why should it be impossible? Arguing that it's impossible for an AI to outperform a human on almost all tasks is like arguing that it's impossible for flying machines to outperform birds.
There's nothing magical going on in our heads. It's just a set of chemical gradients and electrical signals that result in us doing or thinking particular things. Why can't we design a computer that does everything we do... only faster?
"Why can't we design a computer that does everything we do... only faster?"
I think the key word in that sentence might be "we". That is, you could hypothesize that while it's possible in principle for such a computer to exist, it might be beyond what humans and human civilization are capable of in this era. I don't know if this is true or not, but it's kind of intuitively plausible that it's difficult for a designer to design something as complex as the designer themselves, and the space of AI we can design is smaller than the space of theoretically conceivable AI.
> it's difficult for a designer to design something as complex as the designer themselves
AlphaGo ... hello? It beat its creators at Go, and a few months later the top players. I don't think supervised learning can ever surpass its creators in generalization capability, but RL can.
The key ingredient is learning in an environment, which is like a "dynamic dataset". Humans discovered science the same way - hypothesis, experiment, conclusion, rinse and repeat, all possible because we had access to the physical environment in all its glory.
It's like the difference between reading all books about swimming (supervised) and having a pool (RL). You learn to actually swim from the water, not the book.
A coding agent's environment is a compiler + cpu, pretty cheap and fast compared to robotics which require expensive hardware and dialogue agents which can't be evaluated outside their training data without humans in the loop. So I have high hopes for its future.
There might be a limit to how efficiently a general-purpose machine can perform a specific task, similar to the Heisenberg uncertainty principle in quantum physics. That is to say, there might be a natural law that dictates that the more generic a machine is, the more power it requires to perform specific tasks. Our brains are kind of specialized. If you want to build a machine that outperforms humans in a single task, no problem, we've done that many times over. But a machine that can outperform us in any task, that might just be impossible.
I'm not arguing that machines will be more efficient than human brains. An airplane isn't more efficient than a goose. But airplanes do fly faster, higher and with more cargo than any flock of geese could ever carry.
Similarly, there is no contradiction between AI being less efficient than a human brain, and AI being preferable to humans because it can deal with data sets that are two or three orders of magnitude too large for any human (or even team of humans).
Even so, such an AI doesn't exist. All the AIs that exist today operate by fitting data. And to be able to perform a useful task, an AI has to have well-defined parameters and fit the data according to them. I'm not sure an AI that operates outside of these confines has even been conceived of.
To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just the tasks which the airplane is optimized for.
I'm sorry, I confused the debate a little by talking about efficiency. My point was that there might be an inverse relation between the generality of a machine and its efficiency. This was my way of providing a mechanism by which building a machine that outperforms humans in any task could be impossible. This mechanism (if it exists) could be sufficient to prevent such machines from being theoretically possible, as at some point you would need all the energy in the universe to perform a task better than a specialized machine (such as an organism).
Perhaps this inverse relationship doesn't exist. The universe might conspire in a million other ways to make it impossible for us to build an AI that will outperform us in any task. The point is that "AI will outperform humans in any task" is far from inevitable.
> All the AIs that exist today operate by fitting data. And to be able to perform a useful task it has to have well defined parameters and fit the data according to them. I’m not sure an AI that operates outside of these confinements have even been conceived of.
Such an AI has absolutely been conceived of. In Superintelligence: Paths, Dangers, Strategies, Nick Bostrom goes over the ways such an AI could exist, and poses some scenarios about how a recursively self-improving AI could "take off" and exceed human intellectual capacity on its own.
Moreover, we're already building such AIs (in a limited fashion). Deepmind recently made an AI that can beat all Atari games [1]. The AI wasn't given "well defined parameters". It was just shown the game, and it figured out, on its own, how to map inputs to actions on the screen, and which actions resulted in progress towards winning the game. Then, the same AI went on to do this over and over again, eventually beating all 57 Atari games.
Yes, you can argue that this is still a limited example. However it is an example that shows that AIs are capable of generalized learning. There's nothing, in principle, that prevents a domain-specific AI from learning and improving at other problem domains. The AI that I'm conceiving of is a supersonic jet. This AI is closer to the Wright Flyer. However, once you have a Wright Flyer, supersonic jets aren't that far away.
> To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just the tasks which the airplane is optimized for.
That's fair, but besides the point. The AI doesn't have to be better than humans at everything that humans can do. The AI just has to beat humans at everything that's economically valuable. When all jobs get eaten by the AI, it's cold comfort to me that the AI is still worse than humans at, say, enjoying a nice cup of tea.
The second time around is easier. The hard part was evolution: it took billions of years and used huge resources and energy, but in a single run it produced nature and humans. AI agents can rely on humans to avoid the enormous costs of blind evolution, at least until they reach parity with us; then they have to pay the price and do extreme open-ended learning (solving all imaginable tasks, trying all strategies, giving up on simple objectives).
We know it's possible for a brain to outperform most other brains. Think Einstein et al. A smart AI can be replicated (unlike a super-smart human), so we could get it to outperform the human race, on average. That'd be enough to render people obsolete.
Hate to break it to you, but that wouldn’t lead to communism. The people it replaces are useless to the ruling class. At best we’d go back to feudalism, at worst we’d be deemed worthless and a drain on the planet.
I'm always confused when I see people talking about automated luxury communism. Whoever owns the "means of production" isn't going to obtain or develop them for free. Without some omnipotent benevolent world government to build it out for all, I just don't see it happening. It's a beautiful end goal for society, but I've never seen a remotely plausible set of intermediate steps to get there
The very concept of ownership is a social artifact, and as such, is not immutable. What does it mean for the 0.1% to own all the means of production? They can't physically possess them all. So what it means in practice is that our society recognizes the abstract notion of property ownership, distinct from physical possession or use - basically, the right to deny other people the use of that property, or allow it conditionally. This recognition is what reifies it - registries to keep track of owners, police and courts to enforce the right to exclude.
But, again, this is a construct. The only reason why it holds up is because most people support it. I very much doubt that's going to remain the case for long if we end up in a situation where the elites own all the (now automated) capital and don't need the workers to extract wealth from it anymore. The government doesn't even need to expropriate anything - just refuse to recognize such property rights, and withdraw its protection.
I hope that there are sufficiently many capitalists who are smart enough to understand this, and to manage a smooth transition. Because if they won't, it'll get to torches and pitchforks eventually, and there's always a lot of collateral damage from that. But, one way or another, things will change. You can't just tell several billion people that they're not needed anymore, and that they're welcome to starve to death.
The problem I see is that once the pitchforks come out, society will lose decades of progress. If we're somewhat close to the techno-utopia at the start, we won't be at the end. Who's going to rebuild on the promise that the next generation won't need to work?
Revolutions aren't great at building a sense of real community; there's a good reason that "successful" communist uprisings result in totalitarian monarchies.
What it means for the 0.01% to own the means of production is that they can offer access to privilege in a hierarchical manner. The same technology required for a techno-utopia can be used to implement a techno-dystopia which favors the 0.01% and their 0.1% cronies, and treats the rest of humanity as speedbumps.
There are already fully-automated murder drones, but my dishwasher still can't load or unload itself.
I suspect "the 0.01% own and run all production by themselves" isn't possible in the real world. My evidence is that this is the plot of Atlas Shrugged.
If they're not trading with the rest of the world, it doesn't mean they're the only ones with an economy. It means there's two different ones. And the one with the 99.9% is probably better, larger ones usually are.
Revolutions aren't great, period. But they happen when the system can no longer function, unless somebody carefully guides a transition to another stable state.
That said, wrt "communist" revolutions specifically - they result in totalitarian dictatorships because the Bolshevik/Marxist-Leninist ideology underpinning them is highly conducive to that: concepts like dictatorship of the proletariat (esp. in Lenin's interpretation of it), the vanguard party, and democratic centralism all combine to this inevitable end result.
But no other ideological strain of Marxism has ever carried out a successful revolution - perhaps because they simply weren't brutal enough. By means of example: the Bolsheviks violently suppressed the Russian Constituent Assembly within one day of its opening, as soon as they realized that they didn't have the majority there. In a similar way, despite all the talk of council democracy, they consistently suppressed councils controlled by their opposition (typically the peasant ones).
Bolsheviks were the first ones who succeeded, and thereafter, their support was crucial to the success of other revolutions - but that support came with ideological strings attached. So China, Korea, Vietnam, Cuba etc all hail from the same authoritarian tradition. Furthermore, where opposition leftist factions vied for dominance against Soviet-backed ones, Soviets actively suppressed them - the campaign against "social fascism" in 1930s, for example, or persecution of anarchists in Republican Spain.
Anyway, we don't really know what a revolution that would stick to democratic governance would look like, long term. There were some figures and factions in the revolutionary Marxist communist movement that were much more serious about democracy than Bolsheviks - e.g. Rosa Luxemburg. They just didn't survive for long.
idk. Countries used to build most of their infrastructure themselves. There are still countries in western Europe that run huge state-owned businesses, such as banks, oil companies, etc., that employ a bunch of people. The governments of these countries were (and still are) far from omnipotent. I personally don't see how building out automated production facilities is out of scope for the governments of the future when it hasn't been in the past.
Perhaps the only thing that is different today is the mentality. We take capitalism so much for granted that we cannot conceive of a world where collective funds are used to provide for the people (even though this world existed not too long ago). And today we see it as a natural law that the means of production must belong in private hands, that this is simply the order of things.
By submitting any textual content (GPL or otherwise) on the web, you are placing it in an environment where it will be consumed and digested (by human brains and machine learning algorithms alike). There is already legal precedent set for this which allows its use in training machine learning algorithms, specifically with heavily copyrighted material from books[1].
This does not mean that any GitHub Co-Pilot produced code is suddenly free of license or patent concerns. If the code produces something that matches too closely GPL or otherwise licensed code on a particularly notable algorithm (such as video encoder), you may still be in a difficult legal situation.
You are in essence using "not-your-own-code" by relying on CoPilot, which introduces a risk that the code may not be patent/license free, and you should be aware of the risk if you are using this tool to develop commercial software.
The main issue here is that many average developers may continue to stamp their libraries as MIT/BSD, even though the CoPilot-produced code may not adhere to that license. If the end result is that much of the OSS ecosystem becomes muddied and tainted, this could slowly erode trust in open licenses on GitHub (i.e. the implications would be that open source libraries could become less widely used in commercial applications).
I assume that google had legal access to those books. In the case of GPT-3 derived models, they contain common crawl and webtext2 corpuses, which may include large amounts of pirated content (books and magazines uploaded in random places, paywalled content that's been uploaded elsewhere).
Attempts to litigate any license violation are going to get precisely nowhere I bet, but I find the actual license violation argument persuasive.
This is an excellent example of how the AI singularity/revolution/whatever is a total distraction, and that a much bigger and more serious issue is how AI is becoming so effective at turning the output of cheap/free human mental labour into capital. If AI keeps getting better and better and the status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%.
It would be quite a turn-up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and it negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out too late.
I think programming is one of the many domains (including driving) that will never be totally solved by AI unless/until it's full AGI. The long tail of contextual understanding and messy edge-cases is intractable otherwise.
Will that happen one day? Maybe. Will some kinds of labor get fully automated before then? Probably. But I think the overall time-scale is longer than it seems.
64-bit floats should be fine; I think that tweet is only sort-of correct.
The problem with floats-storing-money is (a) you have to know how many digits of precision you want (e.g. cents, dollars, a tenth of a cent), and (b) you need to watch out if you're adding values together.
Even if certain values can't be represented exactly, that's ok, because you'd want to round to two decimal places before doing anything.
Is there a monetary value that you can't represent with a 64-bit float? E.g. some specific example where quantization ends up throwing off the value by at least 1/100th of whatever currency you're using?
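For what it's worth, a quick sketch (Python doubles, amounts assumed to be dollars and cents) of where that holds up and where quantization really does exceed a cent:

    # Storing an integer number of cents in a double is exact up to 2**53
    # cents (~$90 trillion); dollar amounts like 19.99 are approximate but
    # round back correctly at ordinary magnitudes:
    assert round(19.99 * 100) == 1999

    # (a) Repeated addition drifts unless you re-round to cents:
    total = sum(0.10 for _ in range(1000))
    print(total == 100.0)        # False - accumulated rounding error
    print(round(total, 2))       # 100.0 once rounded back to two places

    # (b) Near 2**53 cents, adjacent doubles are more than a cent apart,
    #     so a single cent can no longer be represented faithfully:
    big = 90_000_000_000_000.00  # $90 trillion
    print(big + 0.01 - big)      # 0.015625, not 0.01

So for everyday ledgers the rounding discipline is enough; the failure modes only show up with accumulation or absurdly large magnitudes.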
> "'Hey, that's our code you used to replace us!' we will cry out too late."
Are we in the software community not the ones who have frequently told other industries we have been disrupting to "adapt or die" along with smug remarks about others acting like buggy whip makers? Time to live up to our own words ... if we can.
>Are we in the software community not the ones who
No.
I'll politely clarify that for over a decade that I - and many others - have been asking not to be lumped in with the lukewarm takes of west coast software bubble asshats. We do not live there, we do not like them, I wish they would quit pretending to speak for us.
The idea that there is anything approaching a cohesive software "community" is a con people play on themselves.
To go on a bit of a tangent, I’m somewhat pessimistic that western societies will plateau and hit a “technofeudalism” in the next century or two. Combine what you mention with other aspects of capital efficiency. It’s not a unique idea, and is played out in a lot of “classic” sci-fi like Diamond Age.
Now it’s also not necessarily that bad of a state. That’s depending on ensuring a few ground elements are in place like people being able to grow their own food (or supplemental food) or still being free to design and build things on their own. If corporations restrict that then people will be at their mercy for all the essentials of life. My take from history is that I’d prefer to have been a peasant during much of the Middle Ages than a factory worker during the industrial revolution. [1] Then again Chinese people have been willing (seemingly) to leave farms in droves for the last decades to accept the modern version of factory life so perhaps farming peasant life isn’t as idyllic as it’d sound. [2]
But the rate at which machinery will produce products and services means that even a small tax on corporations producing everything autonomously will be enough to feed everyone and provide a decent quality of life through a UBI or part-time jobs.
You really want to push for high productivity across all industries, even if that means sacrificing jobs in the short term, because history demonstrates that, afterwards, new and more humane jobs emerge.
Every decade was supposed to see fewer hours working for higher pay and quality of life. It didn't happen, because business owners captured the gains (not just 1% fat cats; the owners of mom-and-pop shops are at least as guilty as anyone, they just sucked at scaling their avarice).
So the claim that this technological revolution will be different and that it will result in a broad social safety net, universal basic income, and substantive, well-paid part-time work is a joke but not a very good one. It will be more of the same - massive concentration of wealth among those who already hold enough capital to wield it effectively. A few lucky ones who manage to create their own wealth. And those left behind working more hours for less.
You are right that this won't happen by itself. We need another economic system, and not just hope that this time things will magically fix themselves.
I wasn't talking about the free market, but the state of the present economy. Unfortunately, those trillions of dollars aren't being distributed to the people, but are instead concentrated in the hands of the richest.
I'd agree that many business owners are blameworthy (specifically the ones who have sought monopolies for their product or monopsonies for their labour supply), but we shouldn't forget landlords. A huge fraction of people's income goes to paying rent, especially in urban areas, yet the property tax is relatively low. This leaves a fat profit margin for landlords, even subtracting off the capital cost of the building. The proliferation of "single family house" zoning hasn't helped either. Preventing the construction of high density housing drives up rents, and benefits landlords at the cost of everyone else.
Well as long as humans are more energy-efficient to deploy than robots you will always have a job. It might mean conditions for most humans will be like a century ago.
> as long as humans are more energy-efficient to deploy than robots
Energy efficiency isn't relevant. When switchboard operators were replaced by automatic telephone exchanges, it wasn't to reduce energy consumption.
The question is whether an automated solution can perform satisfactorily while offering upfront and ongoing costs that make them an economically viable replacement for human workers (i.e. paid employees).
Yeah, for sure, the corporations that already pay effectively $0 in tax today are going to suddenly decide in the future to be benevolent and usher in the era of UBI and prosperity for all of humankind. They definitely won't continue to accumulate capital at the expense of everything else and use that to solidify their grasp of the future.
It would be a lot easier if more people on this website would just be honest with themselves and everyone else and simply admit they think feudalism is good and that serfs shouldn't be so uppity. But not me, of course; I won't be a serf. Now if you'll excuse me, someone gave me a really good deal on a bridge that I'm going to go buy...
The problem with this is that you increasingly have to put your trust in the hands of a shrinking group of owners (people who have the rights to the automated productivity). At some point, those owners are just going to stop supporting everyone else (will probably happen when they have the ability to create everything they could ever want with automation - think robot farms, robot security forces, all encompassing automated monitoring, robot construction, etc.)
Just look at autocratic countries. That top 1% still needs something like 3-4% of the population to work in the bureaucracy and 3-5% in the armed and police forces. And there are always family connections and relatives of relatives who want better living. So fortunately no AI will ever replace corruption and other flaws of human society.
But yeah, the remaining 80-90% of the population will have some quality of life and bullshit jobs, because that's how the world is right now outside of the western-countries bubble.
If AI can replace us at difficult tasks, it can repress us. How are you going to agitate for a UBI when the AI has identified you as a likely agitator and sends in the robots to arrest you?
The current state of most wealthy countries does not show any hint of significant corporate taxation. Wealth will continue to accrue in the hands of the few.
In the current arrangement, capital by itself is useless - you need workers to utilize it to generate wealth. Owners of capital can then collect economic rent from that generated wealth, but they have to leave enough for the workers to sustain themselves. This is an unfair arrangement, obviously; but at least the workers get something out of it, so it can be fairly stable.
In the hypothetical fully-automated future, there's no need for workers anymore; automated capital can generate wealth directly, and its owners can trade the output between each other to fully satisfy all their needs. The only reason to give anything to the 99.99% at that point would be to keep them content enough to prevent a revolution, and that's less than you need to pay people to actually come and work for you.
I was debating bringing up disruptors when I made the grandparent comment. My 2 cents: they can shift the balance of power at the very small scale (e.g. "some random nobody" getting rich, or some rich person going bankrupt), but the large scale power structures almost always remain largely intact. For instance, that "random nobody" may well get rich through the sale of shares in their company - now the company is owned by the owner class, who were previously at the top of the power hierarchy.
Nothing new, certainly, but still worth examining. If we are not content with the current power structures, then we should be wary of changes that further intensify them.
We need not totally avoid such changes (i.e. shun technological advancements entirely because of their social ramifications), but we need to be mindful of their effects if we want to improve our current situation regarding the distribution/concentration of wealth and power in the world.
Exactly, in all cases the disruption was localized, and the broader power structures were largely unaffected. The richest among us - the owner class - were not significantly affected by all of these disruptions. They owned diversified portfolios, weathered the changes, and came out with an even greater share of wealth and power. Those who were most affected by the disruptions you listed were the employees of those companies/industries - not the owners/investors.
> If AI keeps getting better and better and status quo socio-economic structure don't change, trillions in capital will be captured by the 0.01%.
This is absolutely one of the things that keeps me up at night.
Much of the structure of the modern world hinges on the balance between forces towards consolidation and forces towards fragmentation. We need organizations (by this I mean corporations, governments, unions, etc.) big enough to do big things (like fix climate change) but small enough to not become totalitarian or decrepit.
The forces of consolidation have been winning basically since the 50s with the rise of the military-industrial complex, death of unions, unlimited corporate funding of elections (!), regulatory capture, etc. A short linear extrapolation of the current corporate/government environment in the US is pretty close to Demolition Man's dystopian, "After the franchise wars, all restaurants are Taco Bell."
Big data is a huge force towards consolidation. It's essentially a new form of real estate that can be farmed to grow useful information crops. But it's a strange form of soil that is only productive if you have enough acres of it and whose yield scales superlinearly with the size of your farm.
Imagine doing a self-funded AI startup with just you and a few friends. The idea is nearly unthinkable. How do you bootstrap a data corporation that needs terabytes of information to produce anything of value?
If we don't figure out a "data socialism" movement where people have ownership over the data derived from their life, we will keep careening towards an eventuality where a few giant corporations own the world.
> I would be quite a turn up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and it negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out too late.
And that's why I won't be using it, why give it intelligence so it can work me out of a job?
> This is an excellent example of how the AI singularity/revolution/whatever is a total distraction [...]
Umm, no it's not. It's possible we just have two problems - the economic problem you mention might be correct, but also that people who believe in the problems of the singularity are right as well. The existence of a certain problem doesn't negate the existence of the other problem.
The difference between this model and a human developer is quantitative rather than qualitative. Human developers also synthesize vast amounts of code and can't reference most of it when they use the derived knowledge. The scales are different, but it is the same principle.
> I find the actual license violation argument persuasive.
I'm curious as to why it seems persuasive. Open source licenses largely hinge on restrictions tied to distribution of the software, and training a model does not constitute distribution.
Unlikely. If this use counts as a derivative work, then it's already a violation, and no update is needed.
OTOH if laundering through machine learning is a fair use, then licenses can't do anything about this. Licenses can't override the copyright law, so the law would have to change.
Could this disincentivize open source? If I build black boxes that just work, no AI will "incorporate" my efforts into its repertoire, and I will still have made something valuable.
First it was land, then other means of production, and for the past 150 years capitalists have turned many types of intellectual creations into exclusively owned capital (art, inventions). Now some want to turn personal data into capital (the "right to monetize" personal data advertised by some is nothing else), and this aims to turn publicly available code into capital. This is simply the history of capitalism going on: the appropriation of the commons.
Google has the opposite problem. They make infinite money from ad platforms and hire people just for fun so nobody else can have them. They're working on AI because they need to stop them from getting bored.
As a human programmer, I've also been trained on thousands of lines of other people's code. Is there anything new here, from a code copying perspective? Aren't I liable if segments of my own code exactly match someone else's code, even if I didn't knowingly copy/paste it?
Well to me those are fundamental questions that need to be addressed one way or the other. Are systems like GPT-x basically plagiarising (doesn't matter the nature of the output, be it prose, code, or audio-visual) or are the results so transformative in nature that they can be considered to be "original work"?
In other words, are these systems to be treated like students that learned to perform the task they do from a collection of source material, or are they to be viewed as sophisticated databases that "just" perform context-sensitive retrieval?
These are interesting and important questions and I'm glad someone is publicly asking them and that many of us at least think about them.
I think the real issue being distracted from is how disconnected copyright/intellectual property regulations have become from reality.
It's still amazing to me that (US-centric context here), it's well established that instructions how to turn raw ingredients into a cake are not protectable but code that results in transforming one set of numbers into another are protectable.
AI is just making the silliness of that distinction more obvious.
Code is not the same as a recipe. Recipes are more like specifications. They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
>They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
That is really quite debatable in some contexts. Declarative languages like Prolog, SQL, etc. declare what they want and the system figures out how to produce it. Much like a recipe, really.
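To make the analogy concrete (made-up table and column names, not from any real schema): the declarative query states what result is wanted, while the imperative version spells out the structural details the query leaves to the engine.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

    # Declarative, recipe-like: state the desired result.
    per_customer = conn.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer").fetchall()

    # Imperative: spell out the accumulation the engine would otherwise choose for us.
    totals = {}
    for customer, total in conn.execute("SELECT customer, total FROM orders"):
        totals[customer] = totals.get(customer, 0.0) + total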
Humans are just sets of atoms, so protecting them is disconnected from reality?
These reductionist arguments lead nowhere. Fortunately, IP lawyers -- including Microsoft's who are fiercely pro IP when it suits them -- think in a more humanistic way and consider the years of work of the IP creator.
Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators. Not so in software.
Good idea, but if carved up into small enough chunks, it may be considered fair use.
What is confusing is that the neural net may take lots of small chunks and link them to one another, and then reproduce them in the same order verbatim.
With music sampling, copyright protects down to the sound of a kick drum. No doubt Microsoft has a good set of attorneys working on their arguments as we speak.
One of the examples pointed out in the reply threads was the suggestion in a new file to insert the GPL disclaimer header.
So, the length of the samples being drawn is not necessarily small: the chunk size is based on its commonality. It could easily be long enough to trigger a copyright violation.
That would be a legal no-op. Either their use is covered by copyright and they are violating your license, or it isn't covered by copyright and then any constraints that your license sets are meaningless.
Licenses hold no power outside of that granted to it by things being copyrighted by default.
I'd assume this: in the same way that you cannot forbid a human from learning concepts from your code, you cannot forbid an automated system from learning concepts from your code, regardless of the license. Also, if you did, it would make your code non-free.
At least as long as the system really learns concepts. If it just copy & pastes code, then that's a different story (same as with humans).
This feature is effectively impossible to replicate. Only Microsoft positioned itself to have:
- dataset (GitHub)
- tech (openai)
- training (azure)
- platform (vscode)
I'm impressed. They did an amazing job from a corporate strategy standpoint. Also, directionally, things are getting interesting.
I don't think that GH code is easily accessible, with rate limiting and TOS forbidding it. GPT is an open model (for the most part), but its training cost is in the order of tens of millions of $
I can think of no one but a handful of companies being able to compete there. And they won't be ok with extending a Microsoft IDE, nor breaking GitHub TOS.
When you start competing on R&D costs the game changes.
There's always the chance that training costs will significantly decrease. But even at orders of magnitude less (i.e. tens of thousands of dollars) it's still beyond reach for open projects and indie devs.
Is this really anything more than a curiosity toy and a marketing tool?
I took a look at their examples and they are not at all compelling. In one example it generated SQL and somehow knew the columns and tables in a database that it had no context on. So that's a lot of smoke and mirrors going on right there.
Do many developers actually want to work in this manner? That is, being interrupted every time they type with a robot interjection of some Frankenstein code that they now have to go through and review and understand. Personally, this is going to kick me out of the zone/flow too often to be useful. Coding isn't the hard part of my job. If this tool can somehow guess the business requirements of the task at hand, then I'll be impressed.
Even if the tool generates accurate code, if I don't fully understand what it wrote, then what? I'm still stuck digging through documentation and stackoverflow to verify that whatever is in my text editor is correct code. "Code confidently in unfamiliar territory" sounds like a Boeing 737 Max sized disaster in the making.
The dataset is all freely available open source code, right? Just because GH hosts it doesn’t mean the rest of the world can’t use it for the same purpose.
They'd find a way to keep it practically difficult to use, at the least, if that dataset is vital to the process. Hoarding datasets that should either be wholly public or unavailable for any kind of exploitation is the backbone of 21st century big tech. It's how they make money, and how they maintain (very, very deep) moats against competition.
[EDIT] actually, I suspect their play here will be to open up the public data but own the best and most low-friction implementation, then add terms that let them also feed their algo with proprietary code built using their editors. That part won't be freely available, and no free version will be able to provide that further-improved model, even assuming all the software to build it is open-source. Assuming using this thing ends up being a significant advantage (so, assuming this matters at all) your choice will be to either hamstring yourself in the market or to help Microsoft build their dataset.
I get the sense that GitHub wants this to be litigated so the case law can be established. Until then it’s just a bunch of internet lawyers arguing with each other.
In the discussion yesterday I pointed to the case of some students suing Turnitin for using their works in the Turnitin database, and the students lost [1]. I think an individual suing will not go anywhere. The way to create a precedent is someone feeding all the Harry Potter books and some additional popular books (Twilight?) to GPT-3 and letting it write about some kids at a sorcery school. The outcome of that case would look very different IMO.
Not a lawyer, but in that case it seemed to be a factor that turnitin was transformative, because it never sold the texts to others and thus didn't reduce the market value of them. But that wouldn't apply to copilot which might reduce the usage of libraries since you can "code" equivalent functionality with copilot now.
Would it be a stretch to assert that GPL'd libraries have a market value for their creator in terms of reputation etc.?
While we're worrying about ML learning to write our codes we should also break all the automated looms so people don't go without jobs. Do everything manually like God intended! /s
Maybe code that is easily recreated by GPT with a simple prompt is not worth copyrighting. The future is in making it more automated, not in protecting IP. If you compete against a company using it, you can't ignore the advantage.
If GitHub Copilot can sign my CLA, stating that it is the author of work, that it transfers the IP to me in exchange for the service subscription price and holds responsibility for copyright infringement, that would be acceptable. Otherwise it's a gray area I don't want to go.
If it's trained with GPL licensed code, doesn't that mean the network they use includes it somewhat? Then, someone could sue that their networks must be GPL licensed too, right?
The potential inclusion of GPL'd code, and potentially even unlicensed code, is making me wary of using it. Fair Use doesn't exist here and if someone was to accuse me of stealing code, saying "I pressed a button and some computer somewhere in the world, that has potentially seen your code as well, generated it for me" is probably not the greatest defense.
The core problem which would allow laundering (that there isn't a good way to draw a straight, attributive line between generated code and training examples) to me also presents a potential eventual threat to the viability of co-pilot/codex. It seems like the same thing would prevent it from knowing which published code was written by humans vs which was at least in part an output from the system. Training on an undifferentiated mix of your model's outputs and human-authored code seems like it could eventually lead the model into self-reinforcing over-confidence.
"But snippet proposals call out to GH, so they can know which bits of code they generated!".
Sometimes; but after Bob does a co-pilot assisted session, and Alice refactors to change a snippet's location and rename some variables and some other minor changes and then commits, can you still tell if it's 95% codex-generated?
While I think this will continue to amplify current problems around IP, aren't current applied-ML approaches to writing software the equivalent of automating the drawing of leaves on a tree? Maybe a few small branches? But the whole tree, all its roots, how it fits into the surrounding landscape, the overall composition, the intention? If I'm wrong about that, then I picked either a good or a bad time to finally learn programming. There are only so many ways you can do things in each language, though. Just like in the field of music, there are only so many "original" tunes. The concept of IP is incoherent; you don't own patterns (at least not at arbitrary depth), though you may be owed some form of compensation for the billions made off discovering them.
Musicians, artists, all kinds of athletes, all grow by watching, observing, and learning from others. As if all these open source projects got to where they are without looking at how others did things.
I don't think a single function, similar syntax, or a basic check function is worth arguing about; it's not like co-pilot is stealing an entire code base and just plopping it out by reading your mind and knowing what you want. I know developers that have certainly stolen code and implementation details from past employers, and that was just fine.
I mean this is already happening. When you hire a specialist in C# servers, you're copying code that they already wrote. I find people tend to write the same functions and classes again and again and again all the time.
We have a guy that brought his task manager codebase (he re-wrote it) but it's the same thing he used at 2 other companies.
I have written 3 MPIs (master person/patient index) at this point all with the same fundamental matching engine.
I mean, one thing we can all agree on is that ML is good at copying what we already do.
It may be hard to believe, but there are sick and twisted individuals in this dangerous world who copy from github without even a single glance at the license, and they live among us.
Yes, and those people are violating the licenses of the code when they do that. It's not unreasonable to expect a massive company like Microsoft not to do this at massive scale.
There are always exceptions (maybe they might even be the norm in this case), but it's still not 100%, still not all-encompassing. This "AI" seems to be. I think that is the entire concern: ALL the code is affected, in all instances.
But if I do it under a copyleft license like GPL, I expect those who copy to abide by the license and open source their own code too.
But sure, people shit on IP rights all the time, and I am guilty of it too. Let's say I didn't pay what I should have paid for every piece of software I have used.
If I read a lot of GPL code and absorb naming conventions, structures, patterns, and tricks, then later, when it comes to writing a P2P chat server, I happen to recall similar patterns, naming structures, and conventions, and many of the utility methods end up pretty much as they are in the GPL code bases out there.
Now, is my code also a GPL derivative, because I certainly did read through those code bases to learn how to write larger programs?
"but eevee, humans also learn by reading open source code, so isn't that the same thing"
- no
- humans are capable of abstract understanding and have a breadth of other knowledge to draw from
- statistical models do not
- you have fallen for marketing
> humans are capable of abstract understanding and have a breadth of other knowledge to draw from
this may be a matter of time and thus is not a fundamental objection.
If mankind should fail to answer the perennial question of exploitation of the other and the same, it will be doomed. And rightly so, for mankind must answer this question, it must answer to this question. Instead what we do is increase monetary output then go and brag about efficiency. Neither is this efficient, nor is it about efficiency, nor has the Universe ever cared about efficiency. It just happens to coincide with what Society has decided to be its most looked-upon elements have chosen to be their religion.
I agree that this is different from humans learning to code from examples and reproducing some individual snippets. However, I disagree with the author’s argument that it's because of humans’ ability to abstract. We actually know nothing about the AI’s ability to abstract.
The real difference is that if one human can learn to code from public sources, then so can anyone else. Nobody is explicitly barred from accessing the same material. The AI, however, is kept proprietary. Nobody else can recreate it because people are explicitly barred from doing so. People cannot access the source code of the training algorithm; people cannot access enough hardware to perform the training; and most people cannot even access the training data. It may consist of repos that are technically all publicly available, but try downloading all of GitHub and see if they let you do that quickly, and/or whether you have enough disk space.
This puts the owners of the AI at a significant advantage over everyone else. I think this is the core of the concern.
So it’s using a massive-scale public good (non-rivalrous and non-exclusionary access to source code) to create a private product that is rivalrous in the software labour pool? Or is the problem just that it’s not open-access?
> previous """AI""" generation has been trained on public text and photos, which are harder to make copyright claims on, but this is drawn from large bodies of work with very explicit court-tested licenses
This seems pretty backwards to me. A GPL licensed data point is more permissive than an unlicensed data point.
That said, I'm glad that these data points do have explicit licenses that say "if you use this, you must do XYZ", so that it's clear that our large ML projects are going counter to creators' intent when they made it open.
I’d love to start seeing licenses about use as training data. Then maybe we’d see more open access to these models that benefit from the openness of the web. I’d personally use licenses that say if you want to train on my work, you must publish the model. That goes for my code, my writing, and my photography.
Anyways GitHub is arguing that any use of publicly available data for training is fair use, but they also admit that it’s all new and unprecedented, regarding training data.
If I as an alleged human have learned purely from GPL code would that require code I write to be released under the GPL too?
We should probably start thinking about AI rights at some point. Personally, I'll be crediting GPT-3 like any other contributor, because it sounds cool, but maybe for moral reasons too in the future.
A machine learning isn't really the same as a person learning - people generally can code at a high level without having first read TBs of code, nor can you reasonably expect a person to have memorised GPL code to reproduce it on demand.
What you can expect a person to do is understand the principles behind that GPL code, and write something along the same lines. GitHub Co-Pilot is not a general ai, and it's not touted as one, so we shouldn't be considering whether it really knows code principles, only that it can reliably output code that fits a similar function to what came before, which could reasonably include entire blocks of GPL code.
Well, if it is actually straight up outputting blocks of existing code, then get it in the bin as a failed attempt to sprinkle AI on development, and use this instead.
That's what I wanted to ask, where do we draw the line of copyright when it comes to inputs of generative ML?
It's perfectly fine for me to develop programming skills by reading any code regardless of the license. When a corp snatches an employee from competitors, they get to keep their skills even if they signed an NDA and can't talk about what they worked on. On the other hand there's the non-compete agreement, where you can't. Good luck making a non-compete agreement with a neural network.
Even if someone feeds stolen or illegal data as an input dataset to gain advantage in ML, how do we even prove it if we're only given the trained model and it generalizes well?
Copyright is going to get very muddy in the next few decades. ML systems may be able to generate entire novels in the styles of books they have digested, with only some assist from human editors. True of artwork and music, and perhaps eventually video too. Determining "similarity" too, may soon have to be taken off the hands of the judge and given to another ML system.
> It's perfectly fine for me to develop programming skills by reading any code regardless of the license.
I'd be inclined to agree with this, but whenever a high-profile leak of source code happens, reading that code can have dire consequences for reverse engineers. It turns clean-room reverse engineering into something derivative, as if the code that was read had the ability to infect whatever the programmer wrote later.
>how do we even prove it if we're only given the trained model and it generalizes well?
Someone's going to have to audit the model the training and the data that does it. There's a documentary on black holes on Netflix that did something similar (no idea if it was AI) but each team wrote code to interpret the data independently and without collaboration or hints or information leakage, and they were all within a certain accuracy of one-another for interpreting the raw data at the end of it.
So, as an example, if I can't train something in parallel and get similar results to an already trained model, we know something is up and there is missing or altered data (at least I think that's how it works).
Take it further. You could easily imagine taking a service like this as invisible middleware behind a front-end and asking users to pay for the service. Some could argue it's code generation attributable to those who created the model, but the reality is that the model was trained on code written by thousands of passionate users, at no pay, with the intent of free usage.
Possibly. We won’t know until this is tested in court. Traditionally one would want to clean room [1] this sort of thing. Co-pilot is…really dirty by those standards.
Unless you were using structures directly from said code, probably not?
Compare if you had only learned writing from, say, the Bible. You would probably write in a very Biblical manner, but would you write the Psalms exactly? Most likely not.
That's super cool. As long as you do the things you specify at the bottom of that doc (provide attribution if copied so people can know if it's OK to use) then a lot of the concerns of people on these threads are going to be resolved.
Pretty much! There are only three major fears remaining:
* Co-pilot fails to detect it, and you have a potential lawsuit/ethical concern when someone finds out. Although the devil on my shoulder says that if Co-pilot didn't detect it, what's to say another tool will?
* Co-pilot reuses code in a way that still violates copyright, but is difficult to detect. I.e. If you checked via a syntax tree, you'd notice that the code was the same, but if you looked at it as raw text, you wouldn't.
* Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.
> Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.
It's gonna be really interesting to see how this plays out.
I get where they're coming from but they are kinda just handwaving it back the other way with the "u fell for marketing idiot" vibe. I wish someone smarter than me could simplify the legal ramifications around this but we'll probably have to wait till it kills someone (or at least costs someone a bunch of money) to get any actual laws set up.
I think you are missing the mark here with this comparison, Copilot and its network weights are already the derived work, not just the output it produces.
Perhaps someone at Github can chime in, but I suspect that open source code datasets (the kind they are trained on) should require relatively permissive licenses in the first place. Perhaps they filter for MIT licenses in Github projects and StackOverflow answers used to train the models?
So, I can't see how they can argue that the generated code is not a derivative of at least some of the code it was trained on, and therefore encumbered by complicated copyright claims that are, for anyone other than GitHub, impossible to disentangle. If they haven't even been careful to only use software under a single license that does not require the original author to be attributed, then I don't see how it can even be legal for them to be running the service.
All that said, I'm not confident that anyone will stop them in court anyway. This hasn't tended to be very easy when companies infringe other open source copyright terms.
Until it is cleared up though, it would seem extremely unwise for anyone to use any code from it.
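As a rough illustration of the license-filtering idea floated above, a corpus-building script could keep only repositories whose detected license is on a permissive allow-list. This is a minimal sketch assuming the public GitHub REST API's repository metadata (which exposes a detected `license.spdx_id` field); it is not a description of how Copilot's training set was actually assembled.

```python
import requests

# Hypothetical allow-list: only keep permissively licensed repos in the corpus.
PERMISSIVE = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0", "unlicense"}

def is_permissive(owner: str, repo: str) -> bool:
    """Check the license GitHub detected for a repository.

    Uses the public repository metadata endpoint; repos with no detected
    license (or a copyleft one) would be excluded from this hypothetical corpus.
    """
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    r.raise_for_status()
    license_info = r.json().get("license") or {}
    return (license_info.get("spdx_id") or "").lower() in PERMISSIVE

# Example: decide whether a repo would make it into the training set.
print(is_permissive("torvalds", "linux"))  # False: GPL-2.0, so it is filtered out
```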
You can automate this process by feeding CoPilot existing GPL source code and seeing what it comes up with next (a rough sketch of automating this follows below).
I am sure that at some point it WILL produce exactly the same code snippet as some GPL project, provided you attempt it enough times.
Not sure what the legal interpretation would be though, it is pretty gray-ish in that regard.
There will always be a risk for CoPilot, though: if it has digested certain PII and people find that out, the outcome would be much more interesting to see.
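A minimal sketch of how that probing could be automated, assuming a hypothetical `get_completion()` helper that captures whatever the editor plugin suggests for a given prefix (there is no public Copilot API being invoked here):

```python
import difflib

def get_completion(prefix: str) -> str:
    """Hypothetical stand-in for capturing the model's suggestion for a prefix
    (e.g. scraped from the editor plugin); not a real API."""
    raise NotImplementedError

def verbatim_score(gpl_source: str, split_at: int = 50) -> float:
    """Feed the first `split_at` lines of a GPL-licensed file as the prompt,
    then measure how closely the suggestion matches the file's real continuation.
    A ratio near 1.0 would suggest near-verbatim regurgitation."""
    lines = gpl_source.splitlines()
    prefix = "\n".join(lines[:split_at])
    actual_continuation = "\n".join(lines[split_at:])
    suggestion = get_completion(prefix)
    return difflib.SequenceMatcher(None, suggestion, actual_continuation).ratio()
```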
It doesn't have to be exact to be copyright infringement; see non-literal copying.
The basic idea behind it is that if you copy-paste code and rename the variables, that doesn't mean it's new code.
Yeah, you'd have to assume they are parsing and normalizing this data in some way. There would still be some AST patterns or something similar you could look for in the same way, but it would be much trickier.
Plus considering this is a legal issue ... good luck with "there is a statistically significant similarity in AST outputs related to the most unique sections of this code base" type arguments in court. We're currently at the "what's an API" stage of legal tech understanding.
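To make the AST comparison mentioned above concrete, here is a toy sketch using Python's standard `ast` module: two snippets that differ only in identifier names normalize to the same tree, so renaming alone would not hide the copying. Real non-literal-copying analysis is of course far more involved than this.

```python
import ast

class Normalize(ast.NodeTransformer):
    """Rename every identifier to a positional placeholder so that two
    snippets differing only in names produce identical ASTs."""
    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

def same_structure(a: str, b: str) -> bool:
    # Each snippet gets its own fresh name mapping before comparison.
    dump = lambda src: ast.dump(Normalize().visit(ast.parse(src)))
    return dump(a) == dump(b)

original = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n"
renamed  = "def euclid(x, y):\n    while y:\n        x, y = y, x % y\n    return x\n"
print(same_structure(original, renamed))  # True: renaming alone doesn't change the structure
```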
The real question is whether it constitutes derived work, though. And that is not a question of similarity so much so as provenance - if you start with a codebase that is GPL originally, and it gets gradually modified to the point where it doesn't really look anything like the original, it's still a derived work, and is still subject to the license.
Similarity can be used to prove derivation, but it's not the only way to do so. In this case, all the code that went into the model is (presumably) known, so you don't really need any sort of analysis to prove or disprove it. It is, rather, a legal question - whether the definition on the books applies here, or not.
Regarding PII, I think you have a very good point. I wouldn't be surprised to see working AWS_SECRET_KEY values appear in there. Indeed, given that copy-paste programmers may not understand the code they're given, it's entirely possible that someone may run code which uses remote resources without even realising it.
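As a sketch of the kind of guardrail that would help here, one could scan any accepted suggestion for credential-shaped strings before it ever lands in a repo. The patterns below are illustrative only (the `AKIA` prefix is the well-known AWS access key ID format) and nowhere near an exhaustive secret scanner:

```python
import re

# Illustrative patterns only, not an exhaustive secret scanner.
SUSPICIOUS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID format
    re.compile(r"(?i)aws_secret_access_key\s*=\s*\S{30,}"),   # assignment of a long secret
]

def looks_like_it_leaks(snippet: str) -> bool:
    """Return True if a suggested snippet contains credential-shaped strings."""
    return any(p.search(snippet) for p in SUSPICIOUS)

# The value below is the placeholder key ID from AWS documentation, not a real secret.
suggestion = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'
print(looks_like_it_leaks(suggestion))  # True: review or reject before accepting
```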
This question about how much code is required to be copyrightable starts to sound a lot like the copyright situation in music, where the legal bar for what counts as plagiarism currently seems to be set too low.
If I recall correctly, it has already been determined that using proprietary data to train a machine learning system is not a violation of intellectual property.
I don't think anyone is interested in stealing small code snippets. It's easy enough to rewrite them. Where the GPL does matter is in complete products. In other words, it's never "if only we could use this GPL-licensed function", it's almost always "if only we could link this GPL-licensed library or executable".
And this GitHub co-pilot in no way infringes on full codebases.
I think copyright is a problem for GPL-like licenses. They should have restricted the training data to MIT/BSD-like licenses.
Anyway, there is another problem, patents, and it is much bigger. I think the Apache license has a provision about patents, but code under most other licenses may be covered by patents, and if the AI generates something similar it may fall within a patent's claims.
I think the argument has merit. Unfortunately it won't be decided on technical merit, but likely in the manner expressed in this excellent response I saw on Twitter:
"Can't wait to see a case for this go in front of an 80 year old judge who rules something arbitrary and justifies it with an inaccurate comparison to something nontechnical."
This implies that by just changing the variable names, the snippets are classed as non-verbatim.
I don't buy that this number is anywhere close to the actual figure if you assume that you can't just change function names and variable names and suddenly say you have escaped both the letter and the spirit of the GPL.
There isn't that much enforcement of open source license violations anyway. I bet there are lots of places where open source code gets taken, copyright/license headers stripped off and the code used in something proprietary as well as the bog-standard "not releasing code for modified versions of Linux" violation.
Isn't most of modern coding just googling for someone who has already solved the problem you are currently facing and then copy/pasting from Stack Overflow?
To the extent that GPT-3 / co-pilot is just an over-fitted neural net, its primary value is as an automated search, copy, and paste.
That's brilliant: I would argue that since MS used code under GPL-type licenses to train the Co-Pilot algorithm, it should release the Co-Pilot model in its entirety. Those who differentiate between data and code missed their classes on Gödelization and functional programming.
> github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
I don't understand the second sentence, i.e. where's the proof?
That's like saying that making a blurry, shaky copy of Star Wars is not a derivative but an original work. Thing is, the 'verbatimness' of the generated code is positively correlated with the number of parameters they used to train the model.
So as I understand it, the AGPL was introduced to cover an unforeseen loophole in the GPL: adapted code could be used to power a web service without triggering the distribution requirements. Could another new version of the license block code from being used to train GitHub co-pilot-like models?
Just wanted to say that everything is ultimately "derivative", and this literal ordinary meaning differs from the legal meaning, which is informed by context, policy and what makes sense.
Where do you draw the line? That's for the courts to decide!
I'm getting a lot of suggestions that make no sense. What's worse, the suggested code has invalid types and won't compile. I'm surprised they didn't prune the solution tree via compiler validation.
Our software has violated the world and people's lives, legally and illegally, in many instances. I mean, none of us cared when GPT-3 did the same for text on the internet. :)
Reminder: software engineers, our code, and the GPL are not special.
The amount of people not knowing the difference between Open Source and Free Software is astonishing. With the amount of RMS memes I see regularly I would expect things to be settled by now.
> "but eevee, humans also learn by reading open source code, so isn't that the same thing"
- no
- humans are capable of abstract understanding and have a breadth of other knowledge to draw from
- statistical models do not
- you have fallen for marketing
Machines will draw on other sources of knowledge besides the GPL code. Whether they have the capacity for "abstract thought" is probably up for debate. There's not much else said in those bullets. It's not a good argument.
I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human I still own code that I later write. If copyright was enforced on the outputs of machine learning models based on all content they were trained on it would be incredibly stifling to innovation. Requiring obtaining legal access to data for training but full ownership of output seems like a sensible middle ground.
Reposting a summary of my reply: if you memorize a line of code and then write it down somewhere else without attribution, that is not fair use, you copied that line of code. If this model does the same, it is the same.
I was just musing about whether this kind of tool has been written (or is being written) for music composition, business letter writing, poetry, news copy.
Interesting copyright issues.
Anyone who thinks their profession will continue as-is for the long term is probably mistaken.
There are much bigger things in this world to worry about. I bet you that by the time that this AI has taken your job, it'll have taken many other jobs, completely rearranging entire industries if not society itself.
And even once that happens you shouldn't be worried about your job. Why? Because economically everything will be different and because your job isn't that important, it likely never was. The problems humanity faces are existential. Authoritarianism, ecosystem collapse and mass migration of billions of people.
So if you really want to "prepare", then try to make a difference in what actually matters.
It's astonishing to me that HN+Twitter believe that Github designed this entire project, without speaking to their legal team and confirming that training on GPL code would be possible.
The tone of the responses here is absurd. Guys, be grateful for some progress. Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you. This is primarily about reducing the need to re-type total boilerplate and/or copy/paste from Stackoverflow. If you were to let some of the people here run things we'd never have any form of progress with anything ever.
Questions like this go much deeper and illustrate issues that need to be addressed before the technology becomes standard and widely adopted.
It's not about progress or suppressing it; it's a fundamental question about whether it is OK for huge companies to profit from the work of others without so much as giving credit, and whether using AI this way represents an instance of doing so.
The latter aspect goes beyond productivity or licensing - the OP asserts that AI isn't equivalent to a student who learned from examples how to perform a task, but rather replicates (recalls) or reproduces the works of others (e.g. the training material).
It's a question that goes beyond this particular application: what about GAN-based generators? Do they merely reproduce slight variations of the training material? If so, wouldn't the authors of the training material have some kind of intellectual property rights to the generated works?
This doesn't just concern code snippets, it's a general question about AI, crediting creators, and circumventing licensing and intellectual property rights.
> Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you
We already invented something for that a couple decades ago, and it's called a "library". And unlike this thing, libraries don't launder appropriation of the public commons with total disregard for those who have actually built that commons.
This goes into one of my favorite philosophical topics: John Searle's Chinese Room. I won't go into it here, but the question of whether an AI is actually learning how to code or simply substituting information based on statistically common practices (or if there really is a difference between either) is going to be one hell of a problem for the next few decades as we start to approach fine points of what AI is and how it could be defined.
However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is an API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.
In the near-future, when we have AI-assisted reverse engineering along with Github co-pilot, then, with enough obfuscation there's nothing that can't be legally created or recreated on a computer, proprietary or not. This is simultaneously free software's greatest dream and worst nightmare.
Edit: changed Hilary Putnam to John Searle
Edit 2: spelling
> However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.
That's... a mind-bendingly bad take. Google took an API definition and duplicated it; Copilot is taking general code and (allegedly) duplicating it. This was not done in order to enable any sort of interoperability or compatibility.
The "API defense" would apply if Copilot only produced API-related code, or (against CP) if someone reproduced the interfaces copilot exposes to consumers.
> Microsoft has a good defense.
MS has many good defenses (transformative work, github agreements, etc etc), but this is not one of them.
> the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense
That's a wild misconstrual of what the courts actually ruled in Oracle v. Google.
(And to the reader: don't take cues from people banging out poorly reasoned quasi-legal arguments in off-the-cuff comments.)
'This case implicates two of the limits in the current Copyright Act. First, the Act provides that copyright protection cannot extend to “any idea, procedure, process, system, method of operation, concept, principle, or discovery . . . .” 17 U. S. C. §102(b). Second, the Act provides that a copyright holder may not prevent another person from making a “fair use” of a copyrighted work. §107. Google’s petition asks the Court to apply both provisions to the copying at issue here. To decide no more than is necessary to resolve this case, the Court assumes for argument’s sake that the copied lines can be copyrighted, and focuses on whether Google’s use of those lines was a “fair use.”
"any idea, procedure, process, system, method of operation, concept, principle, or discovery" sounds suspiciously like an API. Continuing:
Pg. 3-4
'To determine whether Google’s limited copying of the API here constitutes fair use, the Court examines the four guiding factors set forth in the Copyright Act’s fair use provision... '
(1) The nature of the work at issue favors fair use. The copied lines of code are part of a “user interface” that provides a way for programmers to access prewritten computer code through the use of simple commands. As a result, this code is different from many other types of code, such as the code that actually instructs the computer to execute a task. As part of an interface, the copied lines are inherently bound together with uncopyrightable ideas (the overall organization of the API) and the creation of new creative expression (the code independently written by Google)...
(2) The inquiry into the “the purpose and character” of the use turns in large measure on whether the copying at issue was “transformative,” i.e., whether it “adds something new, with a further purpose or different character.” Campbell, 510 U. S., at 579. Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language .... The record demonstrates numerous ways in which reimplementing an interface can further the development of computer programs. Google’s purpose was therefore consistent with that creative progress that is the basic constitutional objective of copyright itself.
(3) Google copied approximately 11,500 lines of declaring code from the API, which amounts to virtually all the declaring code needed to call up hundreds of different tasks. Those 11,500 lines, however, are only 0.4 percent of the entire API at issue, which consists of 2.86 million total lines. In considering “the amount and substantiality of the portion used” in this case, the 11,500 lines of code should be viewed as one small part of the considerably greater whole. As part of an interface, the copied lines of code are inextricably bound to other lines of code that are accessed by programmers. Google copied these lines not because of their creativity or beauty but because they would allow programmers to bring their skills to a new smartphone computing environment. The “substantiality” factor will generally weigh in favor of fair use where, as here, the amount of copying was tethered to a valid, and transformative, purpose.
(4) The fourth statutory factor focuses upon the “effect” of the copying in the “market for or value of the copyrighted work.” §107(4). Here the record showed that Google’s new smartphone platform is not a market substitute for Java SE. The record also showed that Java SE’s copyright holder would benefit from the reimplementation of its interface into a different market. Finally, enforcing the copyright on these facts risks causing creativity-related harms to the public. When taken together, these considerations demonstrate that the fourth factor—market effects—also weighs in favor of fair use.
'The fact that computer programs are primarily functional makes it difficult to apply traditional copyright concepts in that technological world. Applying the principles of the Court’s precedents and Congress’ codification of the fair use doctrine to the distinct copyrighted work here, the Court concludes that Google’s copying of the API to reimplement a user interface, taking only what was needed to allow users to put their accrued talents to work in a new and transformative program, constituted a fair use of that material as a matter of law. In reaching this result, the Court does not overturn or modify its earlier cases involving fair use.'
That's John Searle's thought experiment actually. Hilary Putnam had some thoughts in reference to it along the lines that a brain in a vat might think in a language similar to what we would speak, but the words of that language would necessarily encode different meanings due to the different experience of the external world and sensory isolation.
And this applies to everything, not just source code.
I’m just presuming we have a future where you can consume unique content indefinitely. Such as instead of binge watching Star Trek on Netflix you press play and new episodes are generated and played continuously, 24/7, and they are actually really good.
While headway has been made in photo algorithms like StyleGAN, GPT-3's scriptwriting, and AI voice replication, we aren't even close to having AI-generated stick cartoons or anime. At best, AI-generated Star Trek trained on old episodes would produce the live-action equivalent of limited animation; it would reuse the most-liked parts over and over again and rehash the same camerawork and lens focus that you got in the 60's and the 90's. There wouldn't be any new planets explored, no new species, no advances in cinematography, and certainly no self-insert character (in case you wanted to see a simulation of how you'd fare on the Enterprise). It wouldn't add anything new as far as I can see. Now if there was some way to recreate all the characters in photorealistic 3D with Unreal Engine, feed them a script, and use some form of intelligent creature and planet generation, you might get a little closer to creating a truly new episode.