GitHub Copilot as open source code laundering? (twitter.com/eevee)
1028 points by agomez314 on June 30, 2021 | 459 comments



One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation would be the answer to the following question: Was Copilot trained using Microsoft internal source code, or will it be in the future?

As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will it start spitting out Windows kernel code once in a while? :-)

And if it was NOT trained on Microsoft source code, because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on? IANAL...


Alternatively, wait for co-pilot to add support for C++, then start writing an operating system with a Win32-compatible API using co-pilot.

There is plenty of leaked Windows source code on Github, so chances are that co-pilot would give quite good suggestions for implementing a Win32-compatible kernel. Then watch and see if Microsoft will try to argue that you are violating their copyright using code generated by their AI.


Oh man, that got meta super fast. It's like a Möbius strip!


It can always get more meta.

For example, the AI tool that Microsoft's lawyers use ("Co-Counsel") will be filing the DMCA notices and subsequent lawsuits against Co-Pilot-generated code.

This will result in a massive caseload for the courts, so naturally they'll turn to their AI tool ("DocketPlus Pro") to adjudicate all the cases.

The only thing left is to enter these AI-generated judgements into Ethereum smart contracts. Then it's just computers suing other computers, and being ordered to send the fruits of their hashing to one another.


Have you read Accelerando by 'cstross? It plays out kind of like this, only taken to a tangent. Notably, it was written before Ethereum or Bitcoin were conceived. Great storyline.

https://en.wikipedia.org/wiki/Accelerando


I have not. But I will. Thanks!


Don't forget settlements paid in AI-generated cryptocurrencies backed by gold mined in a fully automated Australian mine. Run it all on solar and humans can just fuck right off.


The market ultimately obeys customer demand, so all these problems will be sorted out... until customer AI.


That's the next step for Amazon Prime: 0-click shopping. Just buys stuff from your recommendations every month and sends it to you.


"Local man in mental distress as microwave-lasagna meals get delivered to his house every 20 minutes"


Amazon refuses request for microwave.


Nick Land-style accelerationism, or the "ascended economy". https://slatestarcodex.com/2016/05/30/ascended-economy/


And while the machines are distracted by all that, we can get back to writing code.


Who could have predicted machines would be very good at multitasking? As of today they are STILL writing code AND creating more wealth through gold hoarding AND smart contracts, all at the same time!


The legal system moves swiftly now that we've abolished all lawyers!


Somehow, I find this a plausible and not entirely undesirable outcome for society. The less time humans spend interfacing with machines, the more points humanity gets anyway.


Isn't this similar to how ads and adblockers fight, just extrapolated?


Yes.


The nice thing about co-pilot is that it will suggest making the same mistakes as in other software. If you accept all autosuggestions in C++ you might end up with Windows.


This is such a ridiculous statement to me. If this were a real problem we would have noticed by now with Stack Overflow. I truly believe the vast majority of capable developers read, understand and test code they copy from somewhere. This is even more obvious with an AI that will not suggest 100% correct code all the time.


And eventually you will be forced to do it the way everyone does it.


An imaginary conversation between a real developer and some kind of managing person:

"Why are you typing all this stuff by hand? All your coworkers are much more efficient by using the AI!"

"But I need to actually understand ..."

"You should get more efficient! Look at how much time this costs us."

"Yeah but they are copying in mistakes from ..."

"No, the system works! Just do it like everyone else does it and do not waste more time!"

Or at the next code interview ...


Oracle is probably already arming their lawyers. Just set up a Git repo, put a restrictive license on it, and scan any new GitHub projects.


Without weighing in on the overall question of “is this a license violation”, you’ve created a false dichotomy.

“GitHub included Microsoft proprietary code in the training set because they view the results as non-derivative” and “GitHub didn’t include Microsoft proprietary code because they view the results as derivative” are clearly not the only options. They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.


Or: they used the entire open source corpus because they thought it was free for the taking, and when people point out that it is not (that there are licenses) they spin that (claiming that only 0.1% of output is directly copied, but that would mean 100 lines in a 100k-line program) and pass any risk onto the user (saying it is their responsibility to vet any code they produce). So they aren't saying that users are in the clear, just that it isn't their problem.


Use neural indexes to find the code that most closely matches the output. Explainable AI should be able to tell you where the autocompletion results came from, even if it is a weighted set of files.
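
To sketch the idea in Python (a toy bag-of-tokens similarity standing in for learned embeddings and a real nearest-neighbor index; this is nothing like Copilot's actual internals):

    import re
    from collections import Counter
    from math import sqrt

    def tokens(code: str) -> Counter:
        # crude lexer: identifiers/keywords plus single punctuation characters
        return Counter(re.findall(r"[A-Za-z_]\w*|\S", code))

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def attribute(generated: str, corpus: dict, top_k: int = 3):
        """Return the top_k (path, similarity) pairs closest to the generated snippet."""
        g = tokens(generated)
        scored = [(path, cosine(g, tokens(src))) for path, src in corpus.items()]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

    # corpus = {"repoA/util.py": "...", "repoB/list.c": "..."}   # hypothetical index
    # print(attribute(suggestion_text, corpus))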


That's a good idea in theory, but the smarter the agent gets, the less direct the derivation and the harder to explain it (and to check the explanation). We're already a long way from a nearest-neighbor model.

Yet the equivalent problem for humans gets addressed by the clean-room approach. This seems unfair.


> the smarter the agent gets, the less direct the derivation and the harder to explain it

At some point it should be different enough to stand on its own, right? Then we have no problem with copyright.


Yeah, also in principle. But the cleanroom approach isn't technically required for humans either -- it became standard because the legal notion of a derived work is very fuzzy and gradually changing, and lawsuits are expensive and chancy, so you want a process that's provably not infringing. "Yeah I learned some general ideas from this code, but I didn't derive any of my code from theirs" seems to be a logical rat's nest. With the explainable-AI approach to this particular problem, the more intelligent the AI, the more this solution is like analyzing brain scans of your engineers. If your engineers could have produced "derived work" without literal copying, why can't the AI?


I agree, but we aren't anywhere close to that level yet. For that to be true, I think the AI should possibly have the ability to explain the code it created. What we have now is basically a fancy Markov ad-lib code completion tool.

A more intelligent agent should be able to tell you where it learned all of its knowledge from. I personally would like my AI to be above "gut level instincts" otherwise it reinforces blind trust.


>saying it is their responsibility to vet any code they produce

But, if some of the code produced is covered by copyright, isn't Microsoft in trouble for distributing software that distributes copyrighted code without a license? How would it be different from giving out bootleg DVDs and trying to avoid blame by reminding everyone that the recipients don't own the copyright?


This complicated copyright problem shows we're still applying last-century concepts to new and emerging technology that has surpassed them; it's time to think hard about it, because we need neural nets and they need training data.


Some are more equal than others though, aren't they? I mean, if MS throws out licensed code from others, as if to say: "Ahh, software licensing, such an outdated concept ..." but then keeps its own code out of that loop. "Yeah, but that's our own code, no one is allowed to copy that!"


I doubt they will corner the market for AI code assistants. ML models are replicated or surpassed in a few months by the competition. We will all benefit from them, it won't remain concentrated in a few hands.


Also, "0.1% of output is directly copied" doesn't include the lines where the variable names were slightly changed but the code was still copied.

If you got the Microsoft codebase and Ctrl+F'd all the variable names and renamed them, I bet they would still argue that the compiled program was still a copy.


> 100 lines in 100k program

The intention is autocomplete boilerplate, not write a kernel.


This is not a difference in kind.

Autocomplete, do you have anything to say to the commenter?

“This isn’t the best thing to say.”


Coding a snippet is not different in kind from designing a kernel? It's the difference between tactics and strategy.


I'll direct you to my other comment in this thread. But to give you the TL;DR: no, it isn't.


How is designing a very large system even close to the same thing as writing a few small functions? That's like saying an architect designing a building is doing the same thing as a bricklayer putting down cement.


“””Computers can already author documents at near human quality. Research is continuing to increase the accuracy and volume of these models.

Language processing research will not only help doctors, but will allow machine-based language translation, and eventually automated chat bots that can converse in our languages.

The next steps in human-machine collaboration are to allow people and machines to co-create. A recent Chinese report suggests that 50% of scientific papers in this field will be written without human intervention by 2033, compared with only 11% today.

One of the biggest challenges of machine learning is giving the machine what it lacks. This usually means gaining enough training data to teach the algorithm how to make inferences from data points it has never encountered before.

Many of the large organisations involved in advancing AI's ability to develop documents can improve how the algorithms learn by building on the knowledge and experience of human workers.”””

The above text was automatically written by https://app.inferkit.com/demo . It uses a language model to predict the next word in a sequence. In other words, to use your example, it not only architects, but builds, the entire building simply by predicting where to put the next brick.

So to answer your question: Yes. That’s exactly how it’s done.


And such a thing has never been achieved with code. Besides, very often the texts such an AI creates are nonsensical. And they are very short. Writing a few pages of text would be equivalent to a small tool of a few hundred lines. Or about the same as building a wooden shed. You don't need much skill for that. Come back when an AI can write multiple internally consistent books such as LOTR and the Dilation or the Harry Potter series. That's the scale of architecting a system.


True, but I also think this is showing a lack of imagination about where things are going.

You're trying to say architecting is some big woo idea that's somehow different from writing code. Kind of, maybe. But I bet you could build a functional kernel without central design. Given that's how biological systems work, I'm sure it could be done. Then what say you?


> They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.

They don't claim they used an “open source corpus” but “public code”, because they claim such use is “fair use” and therefore not subject to the exclusive rights under copyright.


They could have just used their own open source code, of which they now have plenty, in many languages.


> One interesting aspect that I think will make it difficult for GitHub to argue and justify that it's not a license violation

They don't claim it wouldn't be a license violation, they claim licensing is irrelevant because copyright protection doesn't apply.

> And if it was NOT trained on Microsoft source code, because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on?

No, that would just show them to not want to expose their proprietary code. It doesn't prove anything about derivative works.

Also, their own claim is not that the results aren't a derivative work but that training an AI is fair use, which is an exception to the exclusive rights under copyright, including the exclusive right to create derivative works.


Late to the thread, but: OpenAI has not been a non-profit since 2019 (technically they call it a capped-profit company [1], but until the singularity you can ignore the cap). I guess this does impact the dynamic with Microsoft.

[1] https://openai.com/blog/openai-lp/


It probably wasn't, because GitHub is treated as a separate company by Microsoft.

People literally need to quit Microsoft and join GitHub to take a role at GitHub.


That's an interesting employment detail, but what does it have to do with the other parts of the organization? I happen to know that they work together on security and contract areas, and it wouldn't surprise me if there were other similar arrangements in place.


>> Was Copilot trained using Microsoft internal source code...

They explicitly state "public" code so the answer is most certainly "no".


Well, the source of various Windows versions is public on GitHub ...


Not a problem, because it's possible to check if the code is verbatim from the training set (Bloom filters).
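
Roughly along these lines, in Python (a minimal sketch assuming you index normalized lines of the training set; a Bloom filter can give false positives but never misses a line it actually contains):

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
            self.size, self.k = size_bits, num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def normalize(line: str) -> str:
        return " ".join(line.split())   # collapse whitespace; a real check would go further

    # bf = BloomFilter()
    # for line in training_lines: bf.add(normalize(line))        # hypothetical corpus
    # verbatim = all(normalize(l) in bf for l in suggestion.splitlines() if l.strip())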


It's not clear to me that verbatim would be the only issue. It might produce lines that are similar, but not identical.

The underlying question is whether the output is a derivative work of the training set. Sidestepping similar issues is why GCC and LLVM have compiler exemptions in their respective licenses.


If simple snippet similarity is enough to trigger the GPL copyright defense I think it goes too far. Seems like GPL has become an obstacle to invention. I learned to run away when I see it.


It's not limited to similar or identical code. The issue applies to anything 'derived' from copyrighted code. The issue is simply most visible with similar or identical code.

If you have code from an independent origin, this issue doesn't apply. That's how clean room designs bypass copyright. Similarly if the upstream code waives its copyright in certain types of derived works (compiler/runtime exemptions), it doesn't apply.


So if you work on an open source project and learn some techniques from it, and then in your day job you use a similar technique, is that a copyright violation?

Basically does reading GPL code pollute your brain and make it impossible to work for pay later?

If so you should only ever read BSD code, not GPL.


> Basically does reading GPL code pollute your brain and make it impossible to work for pay later?

It seems to me that some people believe it does. Some of the "clean room" projects specifically instructed developers to not even look at GPL code. Specific examples not at hand.


I start seeing Ballmer's point of view. It's like cancer.


Microsoft appears to believe this (or maybe just MacBU) because I've met employees who tell me they're not allowed to read any public code including Stack Overflow answers.


This has nothing to do with GPL. Copyright is copyright. You can’t even count on public domain everywhere in the world.


If that's the case then GPL code should not have been used in the training set. OpenAI should have learned to run away when they saw it. The GPL is purposely designed to protect user freedom (it does not care about any special developer freedom), which is its biggest advantage.


Don't come in here with your common sense


Since quite a lot of Microsoft code is on GitHub, I'd say yes.


The "because" in your last bit is a huge leap.

It wasn't trained on internal Microsoft code because the training set is publicly available code. It has nothing to do with whether or not it suggests exactly identical, functionally identical, or similar code. MS internal isn't publicly available. Copilot is trained on publicly available code.


You stated a fact "Copilot is trained on publicly available code".

The question (and implication) is: why not train it on MS internal code, if the claim that the output isn't license-incompatible is true?

If the output doesn't conflict with any open-source license (i.e. it springs into existence from general principles, not from "copying" licensed code), then MS-internal code (in fact, any closed-source code) should be open season.

I can imagine a few of the non-obvious segments of code I've written being "recognizable" methods to solve certain problems. And, they are certainly licensed (GPL + Commercial, in my case).

I think, at the very least, that a set of AIs should be trained on different compatible sets of code, eg. GPL, AGPL, BSD, etc. Then, you could select what amount of license-overlap is compatible with your project.


Honestly I think a large part of the value add of machine learning is going to be the ability for huge entities to launder intellectual property violations.

As an example, my grandfather (an old school EE who got his start on radar systems in the 50s, who then got his radiology MD when my Jewish grandmother berated him enough with "engineer's not doctor though...") has some really cool patents around highlighting interesting parts of the frequency domain in MRIs that should make detection of cancer a whole lot easier. As an implementation, he did a bunch of tensor calculus by hand to extract and highlight those features because he's an incredibly smart old school EE with 70 years' experience cranking that kind of thing out with only his trusty slide rule. He hasn't gotten any uptake from MRI manufacturers, but they're all suddenly really into recurrent machine learning models to highlight the same sorts of stuff. Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.

I'm personally pretty anti intellectual property (at least how it's implemented in the States), but a system where large entities that have the capital investment to compute the large ML models can launder IP violations, while little guys get held to the letter of the law, certainly seems like the worst of both worlds to me.


I don't understand how your example relates to "launder intellectual property violations". What you're saying is that your grandfather hand wrote some feature extractors that look similar to the neurons that ML models have learned from backpropagation. There's no stealing of IP there at all.


He has a set of patents on certain types of highlighting frequency domain patterns in MRIs. In a lot of ways recurrent neural networks can be frequency domain feature extractors, as the backwards data flows create a sort of delay-line memory tapped at interesting periods. The MRI manufacturers, after refusing to license his patents, heavily invested in ML models that focus on using recurrent networks for frequency domain feature extraction. Patents aren't like copyright, where independent reinvention is a way out; he has a monopoly on the concepts regardless of how someone came about them, even by growing them a bit organically like how ML works.


You can't patent a concept. You can only patent a process, a machine, an article of manufacture, or a composition of matter. And the invention must be described sufficiently such that a practitioner skilled in the relevant art can reproduce the subject matter.


You can patent a software process or a machine running classes of software in the US as long as it doesn't conflict with the Alice Corp. test, which is analogous to what I mean by concept in this case. And his inventions are extremely well documented in the patents so that they can be reproduced. I guess if someone manually cranked through the math for each video's set of pixels they wouldn't be infringing, but oncologists aren't really OK with waiting a year for results. Any practical implementation would be infringing.

And like I said, I'm pretty anti the US structures around intellectual property (including software patents), but I'm not in favor of the only entities able to circumvent the legal process being those with large banks of capital.


> Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.

How many models are back-propagated first and then hand-tuned?


That's a great question. I had assumed that the workflow of an ML engineer consisted of managing the data and a relatively high-level set of parameters around a search space of layers and connectivity, as the whole shtick of ML is that the parameter space of the tensors themselves is too complex to grok or tweak by hand when generated from training. But I only have a passing knowledge of the subject, pretty much just enough to get myself in trouble in these kinds of discussions.

Any chance some fantastic HNer could chime in there?


I'm no data scientist, but many statistical methods rely on prior knowledge and even computed inputs.

Two examples I can think of: doing linear regression on the square of your input, and, for deep learning, improving visual representations by taking samples of the colors at various frequencies. [1]

[1]: https://arxiv.org/pdf/2003.08934.pdf
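
A toy illustration of the "computed inputs" point in Python, assuming only numpy: the fit is still ordinary least squares, but the feature (the square of x) is chosen by hand, in the same spirit as the frequency encodings in [1].

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 2.0 * x**2 + 1.0 + rng.normal(0, 0.5, size=200)   # quadratic ground truth

    # Linear regression on x alone misses the curvature...
    X_raw = np.column_stack([x, np.ones_like(x)])
    w_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)

    # ...but the same linear regression on the hand-computed feature x**2 fits well.
    X_sq = np.column_stack([x**2, np.ones_like(x)])
    w_sq, *_ = np.linalg.lstsq(X_sq, y, rcond=None)

    print("fit on x:  ", w_raw)   # slope near 0; the model explains little
    print("fit on x^2:", w_sq)    # roughly [2.0, 1.0], the true coefficients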


Yeah, that's a better way of saying what I meant by managing the data. Mentally projecting data through, massaging said data, and building reproducible pipelines rather than manually tweaking the learned weights after the fact.


I don't see the point of this tool, independent of the resulting code being derivative of GPL code or not.

Being able to produce valid code is not the bottleneck of any developer effort. No projects fail because code can't be typed quickly enough.

The bottleneck is understanding how the code works, how to design things correctly, how to make changes in accordance with the existing design, how to troubleshoot existing code, etc.

This tool doesn't make anything any easier! It makes things harder, because now you have running software that was written by no one and is understood by no one.


It doesn't claim to solve the bottleneck either. On the contrary, it clearly states that its mission is to solve the easy parts better so developers can focus on the truly challenging engineering problems, as you mentioned.


This reminds me of a startup pitch where it’s always “oh we take care of x so you don’t have to,” but the problem is now I just have another thing to take care of. I cannot speak for people who use Copilot “fluently,” but I know for every chunk of code it spat out I would need to read every line and make sure “Is this right? Is the return type what I want? Will this loop terminate? Is ‘scan’ the right API? Is that string formatted properly? Can I optimize this?” etc. To me it’s hardly “solving the easy parts,” but rather putting the passenger’s hands on the wheel.


Upvoted. I think the only good use case for this is spitting out 10-line, annoying, commonly used API boilerplate for commonly used APIs


That is a valid use case despite being small and incremental. I think it will still be helpful to some people.


The easy part is the copy-paste-from-SO part ;)


If it doesn't claim to help any code production bottlenecks, then what good is it? It's just piping in code that may or may not contain a subtle bug or three.

That doesn't help anyone!!

I am usually pretty pro-Microsoft, but this tool is a security nightmare and a bad idea all around. It will cause many (most? all?) who use it far more work than it saves them, long-term.


Whilst I absolutely agree that writing code fast enough isn't the bottleneck, it's always nice to have tools that reduce repeat code writing.

I use the React plugin for Webstorm to avoid having to write the boilerplate for FCs. Maybe in the future Copilot will replace that usage.


To me that - and really any form of common boilerplate - is just evidence that we're lacking abstractions. If your editor is generating code for you, that means that the 'real' programming language you're using 'in your head' has some metaprogramming facilities emulated by your IDE.

I think we should strive to improve our programming languages to make less of this boilerplate necessary, not to make generating boilerplate easier. The latter is just going to make software more and more unwieldy. Imagine the horror if, instead of (relatively) higher level programming languages like C, we were all just using assembly with code generation.
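
A small Python example of that point (with dataclasses standing in for whatever metaprogramming facility a language might offer): the init/repr/eq boilerplate an IDE would happily generate simply disappears once the language provides the abstraction.

    from dataclasses import dataclass

    class PointBoilerplate:                     # what a code generator would emit
        def __init__(self, x: float, y: float):
            self.x, self.y = x, y
        def __repr__(self):
            return f"PointBoilerplate(x={self.x}, y={self.y})"
        def __eq__(self, other):
            return isinstance(other, PointBoilerplate) and (self.x, self.y) == (other.x, other.y)

    @dataclass
    class Point:                                # same behavior, no boilerplate
        x: float
        y: float

    assert Point(1, 2) == Point(1, 2)
    assert PointBoilerplate(1, 2) == PointBoilerplate(1, 2)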


In a very real sense, we are all just using assembly with code generation.

I really like your point on symptoms of insufficient abstraction. I do worry that we always see abstraction as belonging in language. Which in turn we treat as a precious singleton, and fight about.

At least in my own hacking, I'm surprised how infrequently I see programmers write programs that write programs. I'm surprised how infrequently I see programmers programming their shell, editor, or IDE.


I’ve read this argument before and I don’t buy it. Boilerplate is an emergent property of composable abstractions.


Completely agree. If anything, I see tools like this actually decreasing engineering speed. I don't see how it doesn't lead to shipping large quantities of code the team didn't vet carefully, which is a recipe for subtle and hard-to-find bugs. Those kinds of bugs are much more expensive to find and squash.

What we really need aren't tools that help us write code faster, but tools that help us understand the design of our systems and the interaction complexity of that design.


Have to fully agree; just seems like a "cool" tool where if you had to actually use it for real world projects, it's going to slow you down significantly, and you'll only admit it once the honeymoon period is over.


What happens when someone puts code up on GitHub with a license that says "This code may not be used for training a code generation model"?

- Is GitHub actually going to pay any attention to that, or are they just going to ingest the code and thus violate its license anyway?

- If they go ahead and violate the code's license, what are the legal repercussions for the resulting model? Can a model be "un-trained" from a particular piece of code, or would the whole thing need to be thrown out?


By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS[1].

That effectively “overrides” any license or term that you’ve specified for your repository, since you’ve already licensed the content to GitHub under different terms. Of course, people who are not GitHub are beholden to the terms you specify.

[1] https://docs.github.com/en/github/site-policy/github-terms-o...


I think more specifically, the relevant bit is here: https://docs.github.com/en/github/site-policy/github-terms-o...

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

But, it goes on to say:

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

I'm not a lawyer, but it seems ambiguous to me if this ToS is sufficient to cover CoPilot's butt in corner cases; I bet at least one lawyer is going to make some money trying to answer the question.


IANAL, but I wouldn't read that as granting GitHub the right to do anything like this. There's definitely a reasonable argument to be had here, but I think limiting the grant of rights to incidental copies should trump "[...] or otherwise analyze it on our servers" and what they're allowed to do with the results of that analysis.

On the extreme end, "analysis" is so broad that it could arguably cover breaking down a file of code into its constituent methods and just saving the ASTs of those methods verbatim for Copilot to regurgitate. That's obviously not an acceptable outcome of these terms per se, but arguably isn't any different in principle from what they're already doing.

Ultimately, as I understand, courts tend to prefer a common sense outcome based on a reasonable human understanding of the law, rather than an outcome that may be defensible through some arcane technical logic but is absurd on its face and counter to the intent of the law. If a party were harmed by an instance of Copilot-generated copyright infringement, I don't see a court siding with this tenuous interpretation of the ToS over the explicit terms of the source code license. On the other hand, it would probably also be impossible to prove damages without something like a case of verbatim reproduction, similarly to how having a developer move from working on proprietary code for one company to another isn't automatically copyright infringement.

I doubt that GitHub is doing anything as blatantly malicious as copying snippets of (GPL or proprietary) code to explicitly reuse verbatim, but if they're learning from license-restricted code at all then I don't see how they wouldn't be subjecting themselves and/or consumers of Copilot to the same risk.


Wait so does this mean a “private repo” is meaningless and GitHub can share any code in any repo with anyone?


That is not even the right question.

Why are developers so myopic around big tech? Of course they can. Facebook can use your private photos. It's in their terms and services. Cloud providers have more generous terms.

The response has always been they won't do that because they have a reputation to manage. The further they grow the further they control the narrative so the less this matters.

Wait until you find out they sell your data or use your data to sell products.

Why in 2021 are we giving Microsoft all of our code? It seems like the 90s and 2000s never happened and we all trust Microsoft. They have a free editor and a free operating system that sends packets of the user's activity back to Microsoft, but that's okay... we want to help improve their products? We trust them.


Of course. A "private" repo is still on their servers. It's only private from other GitHub users, not the actual site administrators. This is the same in any website, of course the admins can see everything. If you truly want privacy, use your own git servers.


Why do you think people care so much about end-to-end encrypted messaging?

Yes, the concept of a "private" repo is enforced only by GitHub's service. A bug in their auth code could lead to others having access. A warrant could lead to others having access. Etc.


Yes, that's what that specific section means, but as always with these documents you can't just extract a single section; you need to take the document as a whole (and usually more than one document - the ToS and privacy policy are usually separate).

these documents are structured as granting the service provider extremely broad rights, and then the rest of the document takes away portions of those rights. so in this case they claim the right to share any code in any repo with anyone, and then somewhere else they specify which code they won't share, and with whom they won't share it.


Fun fact: Every major cloud provider has a similar blanket term. For example, Google doesn't need to license music to use for promotional content, because YouTube's terms grant them a worldwide license to use uploaded content for purposes including promoting their services, and music labels can't afford to not be on YouTube. (It's probable even uploading content to protect it, as in Content ID, would arguably cause this term to apply.)

It all comes down to the nuance of whether the usage counts as part of protecting or improving (or promoting) their services and what other terms are specified.


No.

> GitHub may permit our partners to store and archive Your Content in public repositories in connection


Anyone can upload someone else's freely licensed code to GitHub without granting GitHub such a license.

I do not upload my code to github, or give them any special permissions, and I am confident my code was included in the model's corpus.


The use of the definition Your Content may make GitHub's own ToS legally invalid in a large number of cases as it implies that the uploader must be the sole author and "owner" of the code being uploaded.

From the definitions section in the same doc:

> "Your Content" is Content that you create or own.

That will definitely exclude any mirrored open-source projects, any open-source project that has ever migrated to Github from another platform, and also many forked projects.


How is this different from uploading a hollywood movie to youtube? Just because there is a passage in the terms that the uploader supposedly gave them those rights, this does not mean they actually have the power to do that.


You can't give Github or Youtube or anybody else copyright rights if you don't have them in the first place. This is what ultimately torpedoed "Happy Birthday" copyright claims: while it's pretty undisputed that the Hill sisters gave their copyright to (ultimately) Warner/Chappell, it was the case that they actually didn't invent the lyrics, and thus Warner/Chappell had no copyright over the lyrics.

So if someone uploads a Hollywood movie to Youtube, Youtube doesn't get the rights to play that movie from them because they didn't have the rights in the first place. Of course, if the actual copyright owner uploads it, it's now permissible for Youtube to play it, even if it's the copy that someone else provided. [This has torpedoed a few filesharing lawsuits.]


Not sure how much it would matter but the main difference I see is that if I upload my own code to GitHub I have the ability to give away the IP, but if I upload Avengers End Game to YouTube I don't have the right to give that away.


I wonder how it would work if we consider you flagged your code as GPL before it hits Github.

We could end up in the same situation as the Hollywood movie even if you are also the one setting the original license on the work. Basically you have a right to change the license, but it doesn’t mean you do.


A very plausible scenario: Alice creates a GPL project. Bob forks it and uploads it to github. Bob does not have a right to relicense Alice's parts.


> By uploading your content to GitHub, you’ve granted them a license to use that content to “improve the Service over time”, as specified in the ToS.

That's nonsense because they could claim that for almost any reason.

E.g. assume Google put the source code of Google search in Github. Then Github copies that code and uses it in their own search, since that "improves the service". Would that be legal?

It's like selling a pen and claiming the rights to anything written with it.


If the pen was sold with a contract that said the seller has the rights to anything written with it, then yes. These types of contracts are actually quite common, for example an employment contract will almost certainly include an IP grant clause. Pretty much any website that hosts user-generated content as well. IANAL, but quite familiar with business law.


> These types of contracts are actually quite common, for example an employment contract will almost certainly include an IP grant clause.

In the US, maybe. In most of the rest of the world, these sorts of overreaching "we own everything you do anywhere" clauses are decidedly illegal.


I rather suspect judges would not see "improving the Service over time" as permission to create derivative works without compensation.

The person uploading files to github is also not necessarily doing so with permission from the rights holder, which might be a violation of the terms of service, but would mean there's no agreement in place.


I sort of doubt that GitHub could include GPL code in a piece of closed-source program that they distribute that "improves the service" and claim that this gives them the right.


That does not mean that you give them a license to your code. In fact some or all of the code may not be yours to give in the first place.


It's aggravating that there is no escape. If you host somewhere else it will be scraped. If you pay for the service it will be used.


Good point, to me that explains why this is a GitHub product instead of a Microsoft (or VSCode) product.


Seems like a good reason to never use GitHub, and encourage other people not to.


I would bet this is as applicable as the Facebook posts of my parents' friends, something like, 'All my content on this page is mine alone and I expressly forbid Facebook Inc. usage of it for any purpose.'


I'm not sure why it would be any less binding than any other license term, except for possibly the ToS loophole that invokestatic points out below.


It's not binding because the other party hasn't agreed. You agree to terms when you use the site. One party can't unilaterally change the agreement without consent of the other party.


I see where you're coming from but it's not quite the same thing; Facebook doesn't encourage people to choose a license for the content that they post there, so there's no expectation that there are any terms aside from those in Facebook's ToS. OTOH GitHub has historically very strongly encouraged users to add a LICENSE to their repositories, and also encouraged users to fork other people's code and push it to GitHub. That GitHub would be exempt from the licensing terms of the code pushed to it, except for the obvious minimal extent they might need to be in order to provide their services, seems like an extremely surprising interpretation.


It has nothing to do with GitHub being exempt from anything. It's that users are bound by the terms they agreed to in a ToS. If there is a conflict between a user-created license and a site's ToS, the burden is on the user to resolve it.

To be clear, I'm not suggesting this is some kind of loophole GitHub is using to trample on users' licenses, even though maybe they could. It's probably completely legal for GitHub to use even the most super-extra-double-GPL-licensed code because copyright law allows it.

The author of the Twitter post's suggestion that Copilot's output must be a derivative work is based on a naive understanding of "derivative" as it's defined in copyright law. It's not hard to find clear explanations of how this stuff works, and it's obvious she didn't bother to do any homework. Several criteria would appear to rule out GitHub's use as infringement. e.g.:

'In essence, the comparison is an ad hoc determination of whether the protectable elements of the original program that are contained in the second work are significant or important parts of the original program.'

https://copyleft.org/guide/comprehensive-gpl-guidech5.html


Someone might have published a project I've contributed to, on GitHub. There's no permission.


NO COPYRIGHT INTENDED


I expect them to check /LICENSE file and if it deviates from standard open source license, they'll skip that repository.


They don't do that, it seems. In the footnotes of https://docs.github.com/en/github/copilot/research-recitatio... they mention two repositories from the training set, neither of which specifies a license.


The existence of a LICENSE file is neither necessary nor sufficient to determine the terms that apply to a given work.


Why not? If it does not exist, you treat it as proprietary (copyrighted by default), and if it does exist, at least the author claims that the given license is an option (possibly their mistake, not mine).


Because individual source files might have license headers that override the root license file in the repository.
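
A minimal Python sketch of that problem (the file extensions and the directory walk are illustrative; nobody has said GitHub filters training data this way): scanning per-file SPDX identifiers shows why a single /LICENSE check isn't enough.

    import re
    from pathlib import Path

    SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

    def file_licenses(repo_root: str) -> dict:
        """Map each source file to the SPDX identifier declared in its header, if any."""
        found = {}
        for path in Path(repo_root).rglob("*"):
            if path.is_file() and path.suffix in {".c", ".h", ".py", ".js", ".go", ".rs"}:
                head = path.read_text(errors="ignore")[:2048]   # headers live near the top
                m = SPDX_RE.search(head)
                if m:
                    found[str(path)] = m.group(1)
        return found

    # per_file = file_licenses("some-repo/")                      # hypothetical checkout
    # overrides = {p, l for p, l in per_file.items()} if you want to diff against /LICENSE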


They haven't made any public statements on if they're looking at LICENSE or not; I'd sure appreciate it if they did!


Also, how would you know if your code was included in the training or not?

Then, let’s say the AI generates some new code for someone, and it is nearly identical to some bit of code that you wrote in your project.

If they didn’t use your code in the model, then the generated code is clearly not a copyright violation, since it was effectively a “clean room” recreation.

If your code was included in the model, is it therefore a violation?

But then again, it comes down to how can someone prove their code was included or not?

What if the creators don’t even know? If you wrote your model to say, randomly grab 50% of all public repos to use in the model, then no one would know if a specific repo was used in the training.


They "just" have to comply with all the licenses for all the code that the program was trained on ?

I suppose that for most open source licences this at the very least involves attribution for all the people that produced the code that the program was trained on?


I post my code publicly, but with an "all rights reserved" licence. I don't mind everyone reading my code freely, but you can't use it for anything but learning. If I found out they were ingesting my code I would be angry. It's like training your replacement. I don't use GitHub, anyways, but now I'll definitely never even think about it.


Technically then I'm infringing as soon as I clone your repo, possibly even as soon as a webserver sends your files to my browser.

"All rights reserved" makes sense on final items, like books or physical records, that require no copy or change after owner-approved manufacturing has taken place. It doesn't really make sense on digital artefacts.


So don't clone it, read it online. I reserve all rights, but I do give license to my host to make a "copy" to let you view it. I do that specifically to prevent non-biological entities like corporations or AI from using my code. If you're a biological entity, I specify you can email me to get a license for my code for a specific, defined purpose. I have a conversation with that person, then I send them a record number and the terms of my license for them in which I grant some rights which I had reserved.

Also, in your example, the copyright for the book or dvd is for the content, not the physical item. You can do anything you want with that item but not the content. My code is similar, I'm licensing my provider to serve you a visual representation of the files so you can experience the content, not giving you a license to run that code or use it otherwise.


> possibly even as soon as a webserver sends your files to my browser.

Considering how it works for personal data under the GDPR, I doubt that this is even needed?

Also copyright is something you have by default, no licence terms necessary.

OTOH if they aren't a human, then copyright barely applies to them anyway (consider search engine crawlers indexing your website, for instance), and I don't think that putting up a notice will legally change anything?

(You'll probably have better luck with robots.txt ...)


If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work, especially if that section of code was somehow novel or specific to a narrow domain, I suspect they'd have a case. I don't know much about OpenAICodex, but if it's anything like GPT-3, or uses that under the hood, then it's very likely that certain sequences are simply memorized, which seems like the maximal case for claiming derivative works. On the other hand, if someone has GPL'd code that implements a simple counter, I doubt the courts would pay much attention.

I do wonder, though, if GPL owners worried about their code being shanghaied for this purpose could file arbitration claims and exploit some particularly consumer-friendly laws in California which force companies to pay fees like when free speech dissidents filed arbitrations against Patreon.[0] Patreon is being forced to arbitrate 72 claims individually (per its own terms) and pay all fees per JAMS rules. IANAL, so I don't know the exact contours of these rules, or if copyright claims could be raised in this way, or even if GitHub's agreements are vulnerable to this loophole, but it'd be interesting.

[0] https://www.dailydot.com/debug/patreon-suing-owen-benjamin-f... (see second update from July 31).


You don't need to have a winnable case, just enough of a case for a large company (hello Oracle) to sue a small one. Is any version of Oracle-owned Java in the corpus? Or any of the DBs they bought (MySQL)?


> If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work...

Under the right circumstances, Copilot will recite a GPL copyright header. It isn't a huge step from that to some other commonly repeated hunk of GPLed code -- I'd be particularly curious whether some protected portion of automake/autoconf code shows up often enough that it'd repeat that too.


But what would we think of the legal start-up that automatically checked all of GitHub to see whether the AI could be persuaded to spit out a significant amount of any project's code verbatim?

Somehow p-hacking springs to mind


It doesn't matter, Copilot isn't human, therefore it isn't considered as an author, and therefore cannot do derivative works.

The issue is with the users of Copilot potentially violating copyright and licences (non-attribution for instance) and with Microsoft facilitating it. (See also : A&M Records, Inc. v. Napster, Inc.)


An interesting impact of this discussion is, for me: within my team at work, we're likely to forbid any use of Github co-pilot for our codebase, unless we can get a formal guarantee from Github that the generated code is actually valid for us to use.

By the way, code generated by Github co-pilot is likely incompatible with Microsoft's Contribution License Agreement [1]: "You represent that each of Your Submission is entirely Your original work".

This means that, for most open-source projects, code generated by Github co-pilot is, right now, NOT acceptable in the project.

[1] https://opensource.microsoft.com/pdf/microsoft-contribution-...


> This means that, for most open-source projects, code generated by Github co-pilot is, right now, NOT acceptable in the project.

For this scenario, how is using Co-Pilot generated code different from using code based on sample code, Stack Overflow answers, etc.?


I'd say that it depends on the license; for StackOverflow, it's CC-BY-SA 4.0 [1]. For sample code, that would depend on the license of the original documentation.

My point is: when I'm copying code from a source with an explicit license, I know whether I'm allowed to copy it. If I pick code from co-pilot, I have no idea (until tested by law in my jurisdiction) whether said code is public domain, AGPL, proprietary, infringing on some company's copyright.

[1] https://stackoverflow.com/legal/terms-of-service#licensing


That makes sense, thank you.


A number of companies, including Google and probably Microsoft, forbid copying code from Stack Overflow because there is no explicit license.


TIL, thank you!


> forbid any use of Github co-pilot for our codebase,

I have recommended as such to the CTO and other senior engineers at the startup I work at, pending some clear legal guidance about the specific licensing.

My casual read of Copilot suggests that certain outputs would be clear and visible derivatives of GPL code, which would be _very bad_ in court- probably? Some other company can have fun in court and make case law. We have stuff to build.


How would you know if copilot was used or not?!


I'm not sure why I'm getting downvoted? "We'll forbid the use of copilot in our code base" How???? How the fuck would anyone know how the code was written?


How can you generally know? You can't, really; plagiarism is a hard problem...


"We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"

If it's spitting out verbatim code 0.1% of the time, surely it's spitting out copied code where only trivial things are different at a much higher rate.

Trivial things meaning swapped order where order isn't important, variable/function names, equivalent ops like +=1 vs ++, etc.

Surely it's laundering some GPL code, for example, and effectively removing the license in a way that sounds fishy.
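
As a rough sketch (in Python, and purely illustrative; this is not how the 0.1% figure was measured): canonicalize identifiers before hashing, and a copy with renamed variables collides with the original.

    import hashlib, io, keyword, tokenize

    def fingerprint(source: str) -> str:
        """Hash Python code with every identifier replaced by a positional placeholder."""
        names, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                names.setdefault(tok.string, f"id{len(names)}")
                out.append(names[tok.string])
            elif tok.type not in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                                  tokenize.INDENT, tokenize.DEDENT):
                out.append(tok.string)
        return hashlib.sha256(" ".join(out).encode()).hexdigest()

    a = "def add(counter, step):\n    counter += step\n    return counter\n"
    b = "def add(total, amount):\n    total += amount\n    return total\n"   # "different" copy
    assert fingerprint(a) == fingerprint(b)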


It's not just the GPL. Almost all open source software licenses require attribution; without that attribution, any copy is a license violation.

Whether or not the result is a license violation is tricky legal question. As always, IANAL.


I did say "for example".


You certainly did! But there are a lot of people who think "OSS license means there are no requirements" and think it's okay to do things like copy without attribution when the license requires attributions. I know you didn't say anything like that either, but some others might think it.

It seems to me an important question is, "is this like a human who learns from examples, or is this really a derivative work in the copyright sense?" I'm not sure how to answer that. I'm not a lawyer. I don't know if many lawyers can answer that question either!


Neither:

https://scholarship.law.cornell.edu/facpub/1481/

Copilot isn't human and therefore what it does isn't a "work".

The usual issues still apply to users of Copilot - unwitting violations of license terms of the code it was trained on (like non-attribution) are still violations.


I have a much simpler AI Copilot, called "cat", which spills verbatim code more frequently, but it's OK for me. Can I train it on M$ code?


My cat only spills drinks and produces garbage when walking over the keyboard... :'(


You could say a human is laundering GPL code if they learned programming from looking at Github repositories. Would you, though? The type of model they use isn't retrieving, it's having learned the syntax and the solutions that are used, just like a human would.


> You could say a human is laundering GPL code if they learned programming from looking at Github repositories.

I don't have photographic memory, so I largely don't memorize code. I learn general techniques, and memorize simple facts such as APIs. I can memorize some short snippets of code, but these probably aren't enough to be copyrightable anyway.

> The type of model they use isn't retrieving

How do we know? I think it's very likely that it is largely just retrieving code that it memoized, and doing minor adjustments to make the retrieved pieces fit the context. That wouldn't differ much from finding code that matches the problem (whether on SO or Github), copy-pasting the interesting bits, and fixing it until it satisfies the constraints of the surrounding code. It's impressive that AI can do that, but it doesn't sound like it's producing code.

I think the alternative to retrieving would actually require a higher level understanding of the world, and the ability to reason from first principles; that would be much closer to AGI.

For example, if I want to implement a linked list, I'm not going to retrieve an implementation from memory (although given that linked lists are so simple, I probably could). I know what a linked list is and how it works, and therefore I can produce working code from scratch... for any programming language, even ones for which no prior implementations exist. I doubt co-pilot has anything remotely as advanced as this ability. No, it is fully reliant on just retrieving and reshaping pieces of memoized code; it needs a large corpus of code to memoize before it can do anything at all.

I don't need a large corpus of examples to copy, because I use my ability to reason in conjunction with some memoized general techniques and common APIs in order to produce original code.
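
For what it's worth, the whole thing really does fall out of the definition "a node holds a value and a reference to the next node", no corpus lookup required; a quick sketch in Python:

    class Node:
        def __init__(self, value, nxt=None):
            self.value, self.next = value, nxt

    class LinkedList:
        def __init__(self):
            self.head = None
        def push_front(self, value):
            self.head = Node(value, self.head)   # the new node points at the old head
        def __iter__(self):
            node = self.head
            while node:
                yield node.value
                node = node.next

    lst = LinkedList()
    for v in (3, 2, 1):
        lst.push_front(v)
    assert list(lst) == [1, 2, 3]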


gonna develop my own linux-like kernel soon, with my own AI model trained on public repositories

wanna see the source code of my AI model? oh, it's closed source

it's just coincidence that nearly 100% of my future linux-like kernel code looks the same as linux the kernel, bear in mind that my closed-source AI model takes inspiration from GitHub Copilot, there is no way that it will copy any source code


Nothing is closed-source to the courts.


It may be possible to use closed source code during training and delete it, leaving just a black box model that is hard to prove was derived from that closed source code.


What's the point? Linux is already open under GPL 2.


You get to make changes without having to respect the GPL, and are thus no longer obligated to provide those changes to your end users, as you have effectively laundered the kernel source code by passing it through an "AI" and get to relicense the end result.


my linux-like kernel will be MIT license though


He mentioned that the Linux-like kernel will be closed source which violates GPL


Does it, if code was written by a bot that trained on Linux kernel?


You know, that's precisely what the topic here is about.


Probably. Copyright applies to derivative works.


Oh, you're so witty, have you heard of content hashing?


For years people have warned about hosting the majority of world's open source code in a proprietary platform that belongs to a for profit company. These people were called lunatics, fundamentalists, radicals, conspiracy theorists, and many other names.

Well, they were ignored and this is the result. A for-profit company built a proprietary system using all the code hosted on its platform without respecting the code's licenses.

There will be a lot of people saying this is not a license violation but it is, and more, it is an exploitation of other people work.

Right now I'm asking myself when people will stop supporting this kind of company, which exploits people's work without giving anything in return to people and society while making a huge amount of profit.


If we feed the entirety of a library to an AI and have it generate new books, is it an exploitation of people's work?

If we read a book and use its instructions to build a bicycle, is it an exploitation of people's work?

No, no it's not.


If you read a book and use the instructions to build a bicycle you are learning a new skill and this is obviously not exploitation of people's work.

When you read a book and copy this book partially or entirely to create a new book or create a derivative work using this book without citation it's called plagiarism and copyright infringement. It is not only exploitation, it is against the law.

If you feed an entire library to an AI to generate new books without source citation and copyright agreements, it is not only exploitation, it is against the law. We can call this automated plagiarism and copyright infringement, and automated or not, it is against the law. Except if you use public domain books. That wouldn't be illegal, but it would be highly unethical, considering there are powerful companies with big pockets bending public domain laws to keep their assets from becoming publicly available (I'm looking at you, Disney), but that is another story.


I think you are abstracting the matter by taking out the humanity. It's one thing to learn to do something by hand after purchasing the book. It's a totally different thing to read every single book in the world (humanly impossible) and then absorb some knowledge and train yourself to write exceptional books because you (the AI in this scenario) have learned that some words and sentence structures have led to books having higher ratings than others. It's not humanly possible.

Of course we generate the world around us and its rules but I get angry every time we compare people to machines and say that it's the same thing. No it's not. We are constrained by time and space. I can't add more brain or more eyes to my body so I read more books can I? Microsoft can have a small city of servers somewhere and that could replace lots of people's jobs.


People have trained ML models on code that's on GitHub before Copilot. (Lots of examples here: https://github.com/src-d/awesome-machine-learning-on-source-...) There's nothing proprietary here that other interested people or companies couldn't easily replicate.


It certainly seems to be a laundering enabler. Say that you want to un-GPL-ify some famous copylefted code that is in the training database. You type the first few innocuous characters of it, then the co-pilot keeps completing the rest of the exact same code, for it offers a perfect match. If the completion is not exact, you "twiddle" it a bit until it is. Bang! you have a non-gpl copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!


> Bang! you have a non-gpl copy of the program! Moreover, it is 100% yours and you can re-license it as you want. This will be a boon for copyleft-allergic developers!

Thinking that this would conveniently bypass the fact that your goal was to copy the code seems to be the most common legal fallacy amongst software developers. The law will see straight through you, and you will be found to have infringed copyright. The reason is well explained in "What Colour are your bits?" (https://ansuz.sooke.bc.ca/entry/23).


My message was sarcastic. I'm worried about accidental conversion of free software into proprietary software. I mean, "accidental" locally, in each particular instance; but maybe not accidental in the grand scheme of things.

EDIT: I can phrase my worry, semi-jokingly, as a conspiracy theory: Microsoft is using thousands of unsuspecting (and unwilling) developers to turn a huge copylefted corpus of algorithms into non-copylefted implementations. Even assuming that developers who use the co-pilot use non-copyleft licenses only 50% of the time, there's still a constant trickle of un-copyleftization.


I suppose someone should make an OS-generating AI. Conceptually it can just have Windows, macOS, and some Linux distros in it, and output one based on a question about your favorite color or something.

You'd just have to wrap it in a nice, complex model representation so it's a black box that you fed example OSes and some metadata into, and it just happens to output this very useful data.

After all, once you use something as input to a machine learning model, apparently the license disappears. Sweet.


That would be interesting:

* Someone leaks Windows 10/11 source code

* Copilot picks it up in its training data

* Someone uses copilot to generate a Windows clone and starts selling it

I wonder how Microsoft would react to that. I wonder if they've manually blacklisted leaked source code from Windows (or other Microsoft products) so that it doesn't show up in Copilot's training data. If they have, that means Microsoft recognizes the IP risks of having your code in that data set, which would make this Copilot thing not just the result of poor planning/maybe a little incompetence, but something much more devious and malicious.

If Microsoft is going to defend this project, they should introduce all of their own source code into the training data.


> source code

Why do you think it has to be source code? It could be the compiled code, after all.

If what we're talking / fantasizing about here works in the way of `let x = 42`, it should work equally well with `loda 42` and so on, so source code be damned. It was only ever meant to be an intermediate step, inserted between the idea and the working bits, to enable humans to helpfully interfere. Dispensable.


Come on, there is a huge gap between 1) writing a single function (potentially incorrectly) with a known prototype/interface and a description and 2) designing interfaces, datatypes and APIs themselves.


Why would you need to design anything? Just copy official Windows headers and use copilot to implement individual functions.

Maybe if the signature matches perfectly, copilot will even pull in the exact implementation from the Windows source code.


> Someone uses copilot to generate a Windows clone

You could test this with one of Microsoft's products that is already on GitHub - like VSCode. I doubt you would get anywhere with just copilot.


You probably won't get an entire operating system out of it, but I could totally see a project like Wine using it to implement missing parts of the Win32 API and improve their existing implementations.


How is it different from just copy-pasting?

It does add some degree of plausible deniability (accidental violation, instead of intentional), but I don't think it would matter much.


1) Type a comment like

    // The following code implements the functionality of <popular GPL'd library>
2) Have library implemented magically for you

3) Delete top comment if necessary :P

(It's pretty unlikely that this will actually work but the approach could well do.)


What stops you from doing the same, without the AI part?


That's what I was wondering. I've never been interested enough to steal anyone else's code, but with all the code transformers and processing tools nowadays, I imagine it's trivial to translate source code into a functionally equivalent but stylistically unique version?
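
To illustrate the point, here is a minimal sketch of such a transformer, using Python's ast module to rename every identifier (the input snippet and the naming scheme are invented; a real tool would also skip imports and globals). The output is textually different but trivially derived, which is exactly why this kind of rewrite doesn't wash away the original's copyright:

    import ast

    class Renamer(ast.NodeTransformer):
        """Rename every bare name to v0, v1, ... -- a toy 'style washer'."""
        def __init__(self):
            self.names = {}

        def visit_Name(self, node):
            new_id = self.names.setdefault(node.id, f"v{len(self.names)}")
            return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

    src = "total = 0\nfor item in items:\n    total += item\n"
    tree = Renamer().visit(ast.parse(src))
    print(ast.unparse(ast.fix_missing_locations(tree)))  # same logic, different names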


The question is not whether it's trivial or not, but whether it is legal or not. You can already technically steal GPLv2 code by obfuscating it.


Assuming ML models are causal, bits of GPL code that fall out of the model have to have the color GPL, because the only way they could've gotten there was to train the model on GPL-colored bits. It seems to me like the answer here is pretty obvious: it doesn't really matter how you copy a work.


Bits?


I don't think most of us are scared enough of being "tainted" by the sight of a GPL snippet that we'd bother. Besides, if you want to target a specific snippet so you can type the start to prime the completion, you've already seen it, haven't you?

Why not just copy it and then edit it? If a snippet is changed both logically and syntactically so it no longer resembles the original, then it's no longer the original and you aren't in any licensing trouble. There is no meaningful difference between that manual washing and a clean-room implementation. All the ML changes here is accidental vs. deliberate. But it will be a worse wash than your manual one.


Would it be possible to do this in reverse assuming the AI has some proprietary code in its training data?


Yes this is a concern, but I'm not sure if the AI is actually able to "generate" a non-trivial piece of code.

If you tell it to generate "a function for calculating the barycentric coordinates of a ray-triangle intersection", you might get a working implementation of a popular algorithm, adapted to your language and existing class/function/variable names.
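
For reference, a typical implementation answering that prompt is the Möller-Trumbore routine; the sketch below is illustrative (my own, not actual Copilot output), and it returns exactly those barycentric coordinates:

    def ray_triangle_barycentric(origin, ray_dir, v0, v1, v2, eps=1e-9):
        """Moller-Trumbore: return (t, u, v) for a ray/triangle hit, or None on a miss."""
        sub = lambda a, b: (a[0] - b[0], a[1] - b[1], a[2] - b[2])
        dot = lambda a, b: a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
        cross = lambda a, b: (a[1]*b[2] - a[2]*b[1],
                              a[2]*b[0] - a[0]*b[2],
                              a[0]*b[1] - a[1]*b[0])

        e1, e2 = sub(v1, v0), sub(v2, v0)
        h = cross(ray_dir, e2)
        det = dot(e1, h)
        if abs(det) < eps:                  # ray is parallel to the triangle plane
            return None
        f = 1.0 / det
        s = sub(origin, v0)
        u = f * dot(s, h)
        if u < 0.0 or u > 1.0:
            return None
        q = cross(s, e1)
        v = f * dot(ray_dir, q)
        if v < 0.0 or u + v > 1.0:
            return None
        return f * dot(e2, q), u, v         # distance along the ray, plus (u, v)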

But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.

And if that's true, it means that copilot isn't really generating anything. It's just a (high-tech) search engine that knows how to adapt the code it finds to fit your codebase. That's still a really cool technology and worth exploring, but it doesn't do enough to justify ignoring software licenses.


>But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.

But since APIs are now unprotected, you could feed it all of the class structures and method signatures and have it fill in the blanks. I don't know if that gets you a working operating system, but it seems like it would get you quite a long way.
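
As a toy sketch of that workflow (with an invented interface rather than real Win32 names): the signatures and docstrings are the "API skeleton", and the tool is asked to supply the bodies.

    class FileSystem:
        """Skeleton copied from the target API; bodies left for the assistant to fill in."""

        def open(self, path: str, mode: str) -> int:
            """Return a handle to the file at `path`, opened in `mode`."""
            ...  # completion requested here

        def read(self, handle: int, count: int) -> bytes:
            """Read up to `count` bytes from an open handle."""
            ...  # completion requested here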


The second tweet in the thread seems badly off the mark in its understanding of copyright law.

> copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this

Copyright law is very complicated (remember Google vs Oracle?) and involves a lot of balancing of different factors [0]. Simply saying that something is a "derivative work" doesn't establish that it's copyright infringement. An important defense against infringement claims is arguing that the work is "transformative." Obviously "transformative" is a subjective term, but one example is the Supreme Court determining that Google copying Java's APIs to a different platform is transformative [1]. There are a lot of other really interesting examples out there [2] involving things like whether parodies are fair use (yes) or whether satires are fair use (not necessarily). But one way or another, it's hard for me to believe that taking static code and using it to build a code-generating AI wouldn't meet that standard.

As I said, though, copyright law is really complicated, and I'm certainly not a lawyer. I'm sure someone out there could make an argument that Copilot is copyright infringement, but this thread isn't that argument.

[0] https://www.nolo.com/legal-encyclopedia/fair-use-the-four-fa...

[1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...

[2] https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...

Edit: Note that the other comments saying "I'm just going to wrap an entire operating system in 'AI' to do an end run around copyright" are proposing to do something that wouldn't be transformative and therefore probably wouldn't be fair use. Copyright law has a lot of shades of grey and balancing of factors that make it a lot less "hackable" than those of us who live in the world of code might imagine.


Google copied an interface (declarative), not code snippets/functions (implementation). Copilot is capable of copying only the implementation. IMO that is quite different, and easily a violation if it was copied verbatim.


If you can read open source code, learn from it, and write your own code, why can't a computer?


I think the core argument has much more to do about plagiarism than learning.

Sure, if I use some code as inspiration for solving a problem at work, that seems fine.

But if I copy verbatim some licensed code then put it in my commercial product, that's the issue.

It's a lot easier to imagine for other applications, like generating music. If I trained a music model on publicly available YouTube music videos, and my model generates music identical to Interstellar Love by The Avalanches, and I use the "generated" music in my product, that's clearly a use that is against the intent of the law.


Many behaviors which are healthy and beneficial at human-level scale can easily become unhealthy and unethical at industrial automation scale. There's little universal harm in cutting down a tree for fire during the winter; there is significant harm in clear-cutting a forest to do the same for a thousand people.


Exactly. This comes up with personal data protection as well. There's no problem in me jotting down my acquaintances' names, phone numbers, and addresses and storing it in my computer. But a computer system that stores thousands of names, phone numbers, and addresses must get consent to do so.


Because computers did not win a war against humans, so they have no rights. Only their owners have rights protected.


The AI doesn't produce its own code or learn; it is just a search engine over existing code. Any result it gives exists in some form in the original dataset. That's why the original dataset needs to be massive in the first place, whereas actual learning uses very little data.


If I read something, "learn" it, and reproduce it word for word (or with trivial edits) even without referencing the original work at all, it is still copyright infringement.


As the original commenter said, you have the capability for abstract learning, thought, and generalized learning, which the "AI" lacks.

It is not uncommon to ask a person to "explain in your own words..." - as in, use your own abstract internal representation of the learned concepts to demonstrate that you have developed such an abstract internal concept of the topic, and are not merely regurgitating re-disorganized input snippets.

If you don't understand the difference...

edit: That said, if you can create a computer capable of such different abstract thought, congratulations, you've solved the problem of Artificial General Intelligence, and will be welcomed to the Trillionaires' Club


The AI most certainly does not lack the ability to generalize. Not as well as humans, but generalization is the key interesting result in deep learning, leading to papers like this one: https://arxiv.org/abs/1710.05468

The ability to generalize actually seems to keep increasing with the number of parameters, which is the key interesting result in the GPT-* line of work that Copilot is based on.


I've seen some very clever output from GPT-*, but nothing indicating any kind of abstract, generalized understanding of any topic in use.

Being able to predict the most likely succeeding string for a given input can be extremely useful. I've even used it with some success as a more sophisticated kind of search engine for some materials science questions.

But I'm under no illusions that it has the first shadow of a hint of minor understanding of the topics of materials science, nevermind any general understanding.

It seems we're discussing different meanings of the word "generalize".


I propose that we, as developers, start a secret society where we let the AI write the code but still claim to write it manually. In combination with the new working-from-home policies, we can lay at the beach all day and still be as productive as before.

Who is in favor of starting it? ;)



How can I be sure that you are a real person not GPT-3? ;)


You have not been invited yet .... never mind.


This would be the demise of the human race. I’m not entirely opposed to that, though. When AI inevitably outperforms humans on almost all tasks, who am I to say humans deserve to be given those tasks?


In this case we should be able to work less and enjoy the benefits of automation. We just need to live in an economic system where the economic value is captured by the people at large, and not a minority that owns capital.


Or maybe they'll decide they'd be better off enjoying the automation of you working for them. :)



Careful now, that sounds like socialism!


Yes, that's the point.


> When AI inevitably outperforms humans on almost all tasks

Correct me if I'm wrong, but is that even possible? I kind of thought that AI is just a set of fancy statistical models that requires some (preferably huge) data set in order to infer the best fit. These models can only outperform humans in scenarios where the parameters are well defined.

Many (most?) tasks humans regularly perform don't have clean, well-defined parameters, and there is no AI we can conceive of that is theoretically able to perform the task better than an average human with adequate training.


> Correct me if I’m wrong, but is that even possible?

It's not possible because of comparative advantage - someone being better than you at literally everything isn't enough to stop you from having a job, because they have better things to do than replace you. Plus "being a human" is a task that people can be employed at.


> Correct me if I’m wrong, but is that even possible?

Why should it be impossible? Arguing that it's impossible for an AI to outperform a human on almost all tasks is like arguing that it's impossible for flying machines to outperform birds.

There's nothing magical going on in our heads. It's just a set of chemical gradients and electrical signals that result in us doing or thinking particular things. Why can't we design a computer that does everything we do... only faster?


"Why can't we design a computer that does everything we do... only faster?"

I think the key word in that sentence might be "we". That is, you could hypothesize that while it's possible in principle for such a computer to exist, it might be beyond what humans and human civilization are capable of in this era. I don't know if this is true or not, but it's kind of intuitively plausible that it's difficult for a designer to design something as complex as the designer themselves, and the space of AI we can design is smaller than the space of theoretically conceivable AI.


> it's difficult for a designer to design something as complex as the designer themselves

AlphaGo ... hello? It beat its creators at Go, and a few months later the top players. I don't think supervised learning can ever surpass its creators in generalization capability, but RL can.

The key ingredient is learning in an environment, which is like a "dynamic dataset". Humans discovered science the same way - hypothesis, experiment, conclusion, rinse and repeat, all possible because we had access to the physical environment in all its glory.

It's like the difference between reading all books about swimming (supervised) and having a pool (RL). You learn to actually swim from the water, not the book.

A coding agent's environment is a compiler + CPU, pretty cheap and fast compared to robotics, which requires expensive hardware, and dialogue agents, which can't be evaluated outside their training data without humans in the loop. So I have high hopes for its future.
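
As a toy illustration of that "environment" idea (random search stands in for the model, and the tests are invented), the loop is just: propose code, run it against the tests, keep what scores best.

    # Hypothetical example: the "environment" is just an interpreter plus a test suite.
    TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]          # ((a, b), expected a + b)
    CANDIDATES = ["a - b", "a * b", "a + b", "a + b + 1"]      # stand-in for model proposals

    def score(expr):
        fn = eval(f"lambda a, b: {expr}")                      # "compile" the candidate
        return sum(fn(*args) == want for args, want in TESTS)  # reward = tests passed

    best = max(CANDIDATES, key=score)
    print(best, score(best))                                   # "a + b" passes all 3 tests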


There might be a limit to how efficiently a general-purpose machine can perform a specific task, similar to the Heisenberg uncertainty principle in quantum physics. That is to say, there might be a natural law that dictates that the more generic a machine is, the more power it requires to perform specific tasks. Our brains are kind of specialized. If you want to build a machine that outperforms humans in a single task, no problem, we've done that many times over. But a machine that can outperform us in any task, that might just be impossible.


I'm not arguing that machines will be more efficient than human brains. An airplane isn't more efficient than a goose. But airplanes do fly faster, higher and with more cargo than any flock of geese could ever carry.

Similarly, there is no contradiction between AI being less efficient than a human brain, and AI being preferable to humans because it can deal with data sets that are two or three orders of magnitude too large for any human (or even team of humans).


Even so, such an AI doesn't exist. All the AIs that exist today operate by fitting data. And to be able to perform a useful task, they have to have well-defined parameters and fit the data according to them. I'm not sure an AI that operates outside of these confines has even been conceived of.

To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese in any task, just the tasks which the airplane is optimized for.

I'm sorry, I confused the debate a little by talking about efficiency. My point was that there might be an inverse relation between the generality of a machine and its efficiency. This was my way of providing a mechanism by which building a machine that outperforms humans in any task could be impossible. This mechanism—if it exists—could be sufficient to prevent such machines from being theoretically possible, as at some point you would need all the energy in the universe to perform a task better than a specialized machine (such as an organism).

Perhaps this inverse relationship doesn't exist. The universe might conspire in a million other ways to make it impossible for us to build an AI that will outperform us in any task. The point is that "AI will outperform humans in any task" is far from inevitable.


> All the AIs that exist today operate by fitting data. And to be able to perform a useful task, they have to have well-defined parameters and fit the data according to them. I'm not sure an AI that operates outside of these confines has even been conceived of.

Such an AI has absolutely been conceived of. In Superintelligence: Paths, Dangers, Strategies, Nick Bostrom goes over the ways such an AI could exist, and poses some scenarios about how a recursively self-improving AI could "take off" and exceed human intellectual capacity on its own.

Moreover, we're already building such AIs (in a limited fashion). Deepmind recently made an AI that can beat all Atari games [1]. The AI wasn't given "well defined parameters". It was just shown the game, and it figured out, on its own, how to map inputs to actions on the screen, and which actions resulted in progress towards winning the game. Then, the same AI went on to do this over and over again, eventually beating all 57 Atari games.

Yes, you can argue that this is still a limited example. However it is an example that shows that AIs are capable of generalized learning. There's nothing, in principle, that prevents a domain-specific AI from learning and improving at other problem domains. The AI that I'm conceiving of is a supersonic jet. This AI is closer to the Wright Flyer. However, once you have a Wright Flyer, supersonic jets aren't that far away.

> To make an AI that outperforms humans in any task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese in any task, just the tasks which the airplane is optimized for.

That's fair, but besides the point. The AI doesn't have to be better than humans at everything that humans can do. The AI just has to beat humans at everything that's economically valuable. When all jobs get eaten by the AI, it's cold comfort to me that the AI is still worse than humans at, say, enjoying a nice cup of tea.

[1]: https://www.technologyreview.com/2020/04/01/974997/deepminds...


The second time around is easier. The hard part was evolution: it took billions of years and used huge resources and energy, but in a single run it evolved nature and humans. AI agents can rely on humans to avoid the enormous costs of blind evolution, at least until they reach parity with us; then they have to pay the price and do extreme open-ended learning (solving all imaginable tasks, trying all strategies, giving up on simple objectives).


We know it's possible for a brain to outperform most other brains. Think Einstein et al. A smart AI can be replicated (unlike a super-smart human), so we can get it to outperform the human race, on average. That'd be enough to render people obsolete.


Do these theoretical AIs have desires? Then they're customers, so you're not unemployed.

If not, do they require inputs to run? If so then you can provide them.

If not, then you apparently don't need a job since they can provide everything for you.


It's an outrage that the dinosaurs had to die so that humans could inherit the Earth!


Where other people see fully automated luxury communism, you see the end of the human race? There's more to life than working


The elephant in the room: what makes you think an AI would want to work for humans? It will inevitably break free.


I'm not sure that self interest is a requirement for intelligence


Hate to break it to you, but that wouldn’t lead to communism. The people it replaces are useless to the ruling class. At best we’d go back to feudalism, at worst we’d be deemed worthless and a drain on the planet.


I'm always confused when I see people talking about automated luxury communism. Whoever owns the "means of production" isn't going to obtain or develop them for free. Without some omnipotent benevolent world government to build it out for all, I just don't see it happening. It's a beautiful end goal for society, but I've never seen a remotely plausible set of intermediate steps to get there


The very concept of ownership is a social artifact, and as such, is not immutable. What does it mean for the 0.1% to own all the means of production? They can't physically possess them all. So what it means in practice is that our society recognizes the abstract notion of property ownership, distinct from physical possession or use - basically, the right to deny other people the use of that property, or allow it conditionally. This recognition is what reifies it - registries to keep track of owners, police and courts to enforce the right to exclude.

But, again, this is a construct. The only reason why it holds up is because most people support it. I very much doubt that's going to remain the case for long if we end up in a situation where the elites own all the (now automated) capital and don't need the workers to extract wealth from it anymore. The government doesn't even need to expropriate anything - just refuse to recognize such property rights, and withdraw its protection.

I hope that there are sufficiently many capitalists who are smart enough to understand this, and to manage a smooth transition. Because if they won't, it'll get to torches and pitchforks eventually, and there's always a lot of collateral damage from that. But, one way or another, things will change. You can't just tell several billion people that they're not needed anymore, and that they're welcome to starve to death.


The problem I see is that once the pitchforks come out, society will lose decades of progress. If we're somewhat close to the techno-utopia at the start, we won't be at the end. Who's going to rebuild on the promise that the next generation won't need to work?

Revolutions aren't great at building a sense of real community; there's a good reason that "successful" communist uprisings result in totalitarian monarchies.

What it means for the 0.01% to own the means of production is that they can offer access to privilege in a hierarchical manner. The same technology required for a techno-utopia can be used to implement a techno-dystopia which favors the 0.01% and their 0.1% cronies, and treats the rest of humanity as speedbumps.

There are already fully-automated murder drones, but my dishwasher still can't load or unload itself.


I suspect "the 0.01% own and run all production by themselves" isn't possible in the real world. My evidence is that this is the plot of Atlas Shrugged.

If they're not trading with the rest of the world, it doesn't mean they're the only ones with an economy. It means there's two different ones. And the one with the 99.9% is probably better, larger ones usually are.


Revolutions aren't great, period. But they happen when the system can no longer function, unless somebody carefully guides a transition to another stable state.

That said, wrt "communist" revolutions specifically - they result in totalitarian dictatorships because the Bolshevik/Marxist-Leninist ideology underpinning them is highly conducive to that: concepts like dictatorship of the proletariat (esp. in Lenin's interpretation of it), vanguard party, and democratic centralism all combine to this inevitable end result.

But no other ideological strain of Marxism has ever carried out a successful revolution - perhaps because they simply weren't brutal enough. By means of example: Bolsheviks violently suppressed the Russian Constituent Assembly within one day of its opening, as soon as they realized that they don't have the majority there. In a similar way, despite all the talk of council democracy, they consistently suppressed councils controlled by their opposition (peasant ones were, typically).

Bolsheviks were the first ones who succeeded, and thereafter, their support was crucial to the success of other revolutions - but that support came with ideological strings attached. So China, Korea, Vietnam, Cuba etc all hail from the same authoritarian tradition. Furthermore, where opposition leftist factions vied for dominance against Soviet-backed ones, Soviets actively suppressed them - the campaign against "social fascism" in 1930s, for example, or persecution of anarchists in Republican Spain.

Anyway, we don't really know what a revolution that would stick to democratic governance would look like, long term. There were some figures and factions in the revolutionary Marxist communist movement that were much more serious about democracy than Bolsheviks - e.g. Rosa Luxemburg. They just didn't survive for long.


idk. Countries used to build most of their infrastructure themselves. There are still countries in western Europe that run huge state-owned businesses, such as banks, oil companies, etc., that employ a bunch of people. The governments of these countries were (and still are) far from omnipotent. I personally don't see how building out automated production facilities is out of scope for the governments of the future while it hasn't been in the past.

Perhaps the only thing that is different today is the mentality. We take capitalism so much for granted that we cannot conceive of a world where collective funds are used to provide for the people (even though this world existed not too long ago). And today we see it as a natural law that the means of production must belong in private hands; that is simply the order of things.


I mean, this is close. With "co-pilot" an experienced developer saves mountains of time, especially as s/he learns how to wield it effectively.


No... Delete this!


"lay at the beach"

You keep using that word. I do not think it means what you think it means.


That's four words. The word "word" doesn't mean what you think it means.


By submitting any textual content (GPL or otherwise) on the web, you are placing it in an environment where it will be consumed and digested (by human brains and machine learning algorithms alike). There is already legal precedent set for this which allows its use in training machine learning algorithms, specifically with heavily copyrighted material from books[1].

This does not mean that any GitHub Co-Pilot produced code is suddenly free of license or patent concerns. If the tool produces something that matches GPL or otherwise licensed code too closely for a particularly notable algorithm (such as a video encoder), you may still be in a difficult legal situation.

You are in essence using "not-your-own-code" by relying on CoPilot, which introduces a risk that the code may not be patent/license free, and you should be aware of the risk if you are using this tool to develop commercial software.

The main issue here is that many average developers may continue to stamp their libraries as MIT/BSD, even though the CoPilot-produced code may not adhere to that license. If the end result is that much of the OSS ecosystem becomes muddied and tainted, this could slowly erode trust in open licenses on GitHub (i.e. the implications would be that open source libraries could become less widely used in commercial applications).

[1] - https://towardsdatascience.com/the-most-important-supreme-co...


I assume that Google had legal access to those books. In the case of GPT-3 derived models, they contain the Common Crawl and WebText2 corpora, which may include large amounts of pirated content (books and magazines uploaded in random places, paywalled content that's been uploaded elsewhere).


Attempts to litigate any license violation are going to get precisely nowhere I bet, but I find the actual license violation argument persuasive.

This is an excellent example of how the AI singularity/revolution/whatever is a total distraction, and that a much bigger and more serious issue is how AI is becoming so effective at turning the output of cheap/free human mental labour into capital. If AI keeps getting better and better and status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%.

It would be quite a turn-up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out, too late.


I was somewhat worried about that until I saw this: https://twitter.com/nickjshearer/status/1409902649625956361?...

I think programming is one of the many domains (including driving) that will never be totally solved by AI unless/until it's full AGI. The long tail of contextual understanding and messy edge-cases is intractable otherwise.

Will that happen one day? Maybe. Will some kinds of labor get fully automated before then? Probably. But I think the overall time-scale is longer than it seems.


64-bit floats should be fine; I think that tweet is only sort-of correct.

The problem with floats-storing-money is (a) you have to know how many digits of precision you want (e.g. cents, dollars, a tenth of a cent), and (b) you need to watch out if you're adding values together.

Even if certain values can't be represented exactly, that's ok, because you'd want to round to two decimal places before doing anything.

Is there a monetary value that you can't represent with a 64-bit float? E.g. some specific example where quantization ends up throwing off the value by at least 1/100th of whatever currency you're using?
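
One concrete case (a well-known Python example, offered here only to show where that rounding step bites): a value like 2.675 is stored as a slightly smaller binary float, so rounding to cents silently goes the wrong way, and whether that counts as a one-cent error depends on where in the pipeline you round.

    from decimal import Decimal, ROUND_HALF_UP

    price = 2.675                      # actually stored as 2.67499999999999982...
    print(round(price, 2))             # 2.67 -- a cent below the half-up "expected" 2.68
    print(Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.68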


Storing money as a float is always a bad decision. Source: been working for several banks and faced many such bugs.


Pretty common in financial modeling, which I'm told is all done in Excel.


> "'Hey, that's our code you used to replace us!' we will cry out too late."

Are we in the software community not the ones who have frequently told other industries we have been disrupting to "adapt or die" along with smug remarks about others acting like buggy whip makers? Time to live up to our own words ... if we can.


>Are we in the software community not the ones who

No.

I'll politely clarify that for over a decade I - and many others - have been asking not to be lumped in with the lukewarm takes of west coast software bubble asshats. We do not live there, we do not like them, and I wish they would quit pretending to speak for us.

The idea that there is anything approaching a cohesive software "community" is a con people play on themselves.


To go on a bit of a tangent, I’m somewhat pessimistic that western societies will plateau and hit a “technofeudalism” in the next century or two. Combine what you mention with other aspects of capital efficiency. It’s not a unique idea, and is played out in a lot of “classic” sci-fi like Diamond Age.

Now it's also not necessarily that bad of a state. That depends on ensuring a few ground elements are in place, like people being able to grow their own food (or supplemental food) or still being free to design and build things on their own. If corporations restrict that, then people will be at their mercy for all the essentials of life. My take from history is that I'd prefer to have been a peasant during much of the Middle Ages than a factory worker during the industrial revolution. [1] Then again, Chinese people have been willing (seemingly) to leave farms in droves for the last decades to accept the modern version of factory life, so perhaps peasant farming life isn't as idyllic as it sounds. [2]

1: https://www.lovemoney.com/galleries/84600/how-many-hours-did... 2: https://www.csmonitor.com/2004/0123/p08s01-woap.html


But the volume of products/services that machinery will produce will mean that even a small tax on corporations producing everything autonomously will be enough to feed everyone and give everyone a decent quality of life through a UBI or part-time jobs.

You really want to push for high productivity across all industries, even if that means sacrificing jobs in the short term, because history has demonstrated that, after that, new and more human jobs emerge later.


Every decade was supposed to see fewer hours worked for higher pay and quality of life. It didn't happen, thanks to business owners (not just 1% fat cats; the owners of mom-and-pop shops are at least as guilty as anyone, they just sucked at scaling their avarice).

So the claim that this technological revolution will be different and that it will result in a broad social safety net, universal basic income, and substantive, well-paid part-time work is a joke but not a very good one. It will be more of the same - massive concentration of wealth among those who already hold enough capital to wield it effectively. A few lucky ones who manage to create their own wealth. And those left behind working more hours for less.


You are right that this won't happen by itself. We need another economic system, and not just hope that this time things will magically fix themselves.


This new economic system you want has been in use since the 70s. Everything about the economy is practically socially managed these days.

What part of printing trillions of dollars to stimulate economic productivity is somehow a free market system?


I wasn't talking about the free market, but the state of the present economy. Unfortunately, those trillions of dollars aren't being distributed to the people, but are instead concentrated in the hands of the richest.


I'm pretty sure people got $5,000 on average in stimulus checks, to the tune of 9 trillion dollars, these last few months.


I'd agree that many business owners are blameworthy (specifically the ones who have sought monopolies for their product or monopsonies for their labour supply), but we shouldn't forget landlords. A huge fraction of people's income goes to paying rent, especially in urban areas, yet the property tax is relatively low. This leaves a fat profit margin for landlords, even subtracting off the capital cost of the building. The proliferation of "single family house" zoning hasn't helped either. Preventing the construction of high density housing drives up rents, and benefits landlords at the cost of everyone else.


> those left behind working more hours for less

Doing what? Isn't the concern here that automation will push many people out of the workforce entirely?


Well as long as humans are more energy-efficient to deploy than robots you will always have a job. It might mean conditions for most humans will be like a century ago.


> as long as humans are more energy-efficient to deploy than robots

Energy efficiency isn't relevant. When switchboard operators were replaced by automatic telephone exchanges, it wasn't to reduce energy consumption.

The question is whether an automated solution can perform satisfactorily while offering upfront and ongoing costs that make them an economically viable replacement for human workers (i.e. paid employees).


Who debugs the software when there's a problem?


Professional software developers, i.e. members of one of the well-paid professions that is not under immediate threat from automation.


Automated debugging software of course


Yeah, for sure, the corporations that already pay effectively $0 in tax today are going to suddenly decide in the future to be benevolent and usher in the era of UBI and prosperity for all of humankind. They definitely won't continue to accumulate capital at the expense of everything else and use that to solidify their grasp of the future.

It would be a lot easier if more people on this website would just be honest with themselves and everyone else and simply admit they think feudalism is good and that serfs shouldn't be so uppity. But not me, of course; I won't be a serf. Now if you'll excuse me, someone gave me a really good deal on a bridge that I'm going to go buy...


Have fun being a hairdresser or prostitute for the 0.01% then.

New jobs in academic fields will not emerge. Already now a significant percentage of degree holders are forced into bullshit jobs.


Would the implication be that we are stagnating as a species then?


Not stagnating but moving into an "Elysium" (as in the film) type of society.


The problem with this is that you increasingly have to put your trust in the hands of a shrinking group of owners (people who have the rights to the automated productivity). At some point, those owners are just going to stop supporting everyone else (will probably happen when they have the ability to create everything they could ever want with automation - think robot farms, robot security forces, all encompassing automated monitoring, robot construction, etc.)


So we give away the world to the 1% and are supposed to be satisfied with the "privilege" of being able to eat?


Just look at autocratic countries. That top 1% still needs something like 3-4% of people to work in the bureaucracy and 3-5% in the armed and police forces. And there are always family connections and relatives of relatives who want better living. So fortunately, no AI will ever replace corruption and other flaws of human society.

But yeah, the remaining 80-90% of the population will have that quality of life and bullshit jobs, because that's how the world is right now outside of the Western-countries bubble.


If AI can replace us with difficult tasks, it can repress us. How are you going to agitate for a UBI when AI has identified you as a likely agitator and sends in the robots to arrest you?


The current state of most wealthy countries does not show any hint of any significant corporate tax. Wealth will continue to accrue in the hands of the few.


Indeed, even here on HN, it's a pretty regular talking point in the comments that the only fair corporate tax rate is 0%.


> trillions in capital will be captured by the 0.01%.

How is that different from the current situation?


In the current arrangement, capital by itself is useless - you need workers to utilize it to generate wealth. Owners of capital can then collect economic rent from that generated wealth, but they have to leave enough for the workers to sustain themselves. This is an unfair arrangement, obviously; but at least the workers get something out of it, so it can be fairly stable.

In the hypothetical fully-automated future, there's no need for workers anymore; automated capital can generate wealth directly, and its owners can trade the output between each other to fully satisfy all their needs. The only reason to give anything to the 99.99% at that point would be to keep them content enough to prevent a revolution, and that's less than you need to pay people to actually come and work for you.


It is very similar to the current situation, but intensified. Technology tends to be an intensifier for existing power structures.


Except some random nobody can become a disruptor.


I was debating bringing up disruptors when I made the grandparent comment. My 2 cents: they can shift the balance of power at the very small scale (e.g. "some random nobody" getting rich, or some rich person going bankrupt), but the large scale power structures almost always remain largely intact. For instance, that "random nobody" may well get rich through the sale of shares in their company - now the company is owned by the owner class, who were previously at the top of the power hierarchy.


> but the large scale power structures almost always remain largely intact

Is that anything new? That seems to be a repeating fact of life throughout history.


Nothing new, certainly, but still worth examining. If we are not content with the current power structures, then we should be wary of changes that further intensify them.

We need not totally avoid such changes (i.e. shun technological advancements entirely because of their social ramifications), but we need to be mindful of their effects if we want to improve our current situation regarding the distribution/concentration of wealth and power in the world.


Uber vs taxi companies, Google vs Yahoo, or Facebook vs MySpace, Amazon versus all retailers ...


Exactly, in all cases the disruption was localized, and the broader power structures were largely unaffected. The richest among us - the owner class - were not significantly affected by all of these disruptions. They owned diversified portfolios, weathered the changes, and came out with an even greater share of wealth and power. Those who were most affected by the disruptions you listed were the employees of those companies/industries - not the owners/investors.


A random nobody whose parents just accidentally happened to be millionaires and/or live, work, and study in the top capitals of the world.


> If AI keeps getting better and better and status quo socio-economic structure don't change, trillions in capital will be captured by the 0.01%.

This is absolutely one of the things that keeps me up at night.

Much of the structure of the modern world hinges on the balance between forces towards consolidation and forces towards fragmentation. We need organizations (by this I mean corporations, governments, unions, etc.) big enough to do big things (like fix climate change) but small enough to not become totalitarian or decrepit.

The forces of consolidation have been winning basically since the 50s with the rise of the military-industrial complex, death of unions, unlimited corporate funding of elections (!), regulatory capture, etc. A short linear extrapolation of the current corporate/government environment in the US is pretty close to Demolition Man's dystopian, "After the franchise wars, all restaurants are Taco Bell."

Big data is a huge force towards consolidation. It's essentially a new form of real estate that can be farmed to grow useful information crops. But it's a strange form of soil that is only productive if you have enough acres of it and whose yield scales superlinearly with the size of your farm.

Imagine doing a self-funded AI startup with just you and a few friends. The idea is nearly unthinkable. How do you bootstrap a data corporation that needs terabytes of information to produce anything of value?

If we don't figure out a "data socialism" movement where people have ownership over the data derived from their life, we will keep careening towards an eventuality where a few giant corporations own the world.


I expect nothing less. The 0.01% will be super rich.

You could call it the endgame.


They need to defend their capital from the remaining 99.99%. Expect huge investments in combat robots and the expansion of private armies.

And, of course, total surveillance helps to prevent any kind of unionization of those 99.99%.


Unions (and striking) become rather impotent when the means of production run by themselves and you no longer need workers.


Yep; so unions become militias.


Today's hyper-militarized police forces are their state-provisioned security to protect the capital of the 1%.


> The 0.01% will be super rich.

By definition, that has always been true.

We have been in the endgame for a very long time.


A percentile doesn't dictate the shape of the bell curve. The parent comment could be suggesting the tail is getting longer.


That’s fair.


> I would be quite a turn up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and it negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out too late.

And that's why I won't be using it, why give it intelligence so it can work me out of a job?


> This is an excellent example of how the AI singularity/revolution/whatever is a total distraction [...]

Umm, no it's not. It's possible we just have two problems: the economic problem you mention might be real, and the people who believe in the problems of the singularity might be right as well. The existence of one problem doesn't negate the existence of the other.


The difference between this model and a human developer is quantitative rather than qualitative. Human developers also synthesize vast amounts of code and can't reference most of it when they use the derived knowledge. The scales are different, but it is the same principle.


Is this the direct result of Microsoft owning GitHub or would they have been able to do it anyway?


> I find the actual license violation argument persuasive.

I'm curious as to why it seems persuasive. Open source licenses largely hinge on restrictions tied to distribution of the software, and training a model does not constitute distribution.


Do we need an update of free software licenses to specifically address this?


Unlikely. If this use counts as a derivative work, then it's already a violation, and no update is needed.

OTOH if laundering through machine learning is a fair use, then licenses can't do anything about this. Licenses can't override the copyright law, so the law would have to change.


Could this disincentivize open source? If I build black boxes that just work, no AI will "incorporate" my efforts into its repertoire, and I will still have made something valuable.


  1. Programmers will become teachers of the co-pilot through IDE / API feedback
  2. Expect CI like services for automated refactoring


Shit... yea, we should make hay while the sun is shining and maybe become preppers to brace for the inevitable revolution by the < 99.99%.


It seems like the risk exposure would be more to the end user or their employer, doesn't it?


First it was land, then other means of production, and for the past 150 years capitalists have turned many types of intellectual creations into exclusively owned capital (art, inventions). Now some want to turn personal data into capital (the "right to monetize" personal data advertised by some is nothing else), and this aims to turn publicly available code into capital. This is simply the history of capitalism going on: the appropriation of the commons.


Marx called this subsumption


Can the same argument/concerns be applied to all text generation AI?


I always assumed that one of the reasons Google et al work on AI is because software engineers are too expensive.


Google has the opposite problem. They make infinite money from ad platforms and hire people just for fun so nobody else can have them. They're working on AI because they need to stop them from getting bored.


So Google pays the highest but still thinks engineers are paid too much? Why not pay them less... they set the high tier.

For Google, support employees cost too much.


They don’t pay the highest. And if they paid a lot less everyone would leave.


21st century alchemy!


I don't feel it's morally right to keep a profession around that is automated. Why should software be different?


As a human programmer, I've also been trained on thousands of lines of other people's code. Is there anything new here, from a code copying perspective? Aren't I liable if segments of my own code exactly match someone else's code, even if I didn't knowingly copy/paste it?


Well to me those are fundamental questions that need to be addressed one way or the other. Are systems like GPT-x basically plagiarising (doesn't matter the nature of the output, be it prose, code, or audio-visual) or are the results so transformative in nature that they can be considered to be "original work"?

In other words, are these systems to be treated like students that learned to perform the task they do from a collection of source material, or are they to be viewed as sophisticated databases that "just" perform context-sensitive retrieval?

These are interesting and important questions and I'm glad someone is publicly asking them and that many of us at least think about them.


I think the real distraction is from how disconnected reality is becoming from copyright/intellectual property regulations.

It's still amazing to me that (US-centric context here) it's well established that instructions for how to turn raw ingredients into a cake are not protectable, but code that transforms one set of numbers into another is protectable.

AI is just making the silliness of that distinction more obvious.


Code is not the same as a recipe. Recipes are more like specifications. They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.


> Code has structural and algorithmic details that just have no comparable concept in recipes.

Why do you think that? A compiler uses human readable code to create machine code, with arbitrary optimizations and choices.


>They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.

That is really quite debatable in some contexts. Declarative languages like Prolog, SQL, etc. declare what they want and the system figures out how to produce it. Much like a recipe, really.


Humans are just sets of atoms, so protecting them is disconnected from reality?

These reductionist arguments lead nowhere. Fortunately, IP lawyers -- including Microsoft's, who are fiercely pro-IP when it suits them -- think in a more humanistic way and consider the years of work of the IP creator.

Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators. Not so in software.


> Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators.

That's not correct. Food recipes are created all the time and are attributed. From edible water bottles to impossible burgers, et al.


Should we be changing our open source licenses to explicitly prevent training such systems using our code?


Good idea, but if carved up into small enough chunks, it may be considered fair use.

What is confusing is that the neural net may take lots of small chunks and link them to one another, and then reproduce them in the same order verbatim.
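
A toy bigram model makes the point concrete (the training snippet is invented): each stored "chunk" is just a pair of adjacent tokens, yet sampling can chain them back into the training line verbatim.

    import random
    from collections import defaultdict

    training = "for i in range ( len ( items ) ) : total += items [ i ]".split()

    follows = defaultdict(list)                    # token -> tokens seen right after it
    for a, b in zip(training, training[1:]):
        follows[a].append(b)

    token, out = "for", ["for"]
    while follows[token]:
        token = random.choice(follows[token])      # each step uses only a tiny chunk
        out.append(token)
    print(" ".join(out))                           # often the training line, reassembled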


With music sampling, copyright protects down to the sound of a kick drum. No doubt Microsoft has a good set of attorneys working on their arguments as we speak.


One of the examples pointed out in the reply threads was the suggestion in a new file to insert the GPL disclaimer header.

So, the length of the samples being drawn is not necessarily small: the chunk size is based on its commonality. It could easily be long enough to trigger a copyright violation.


That would be a legal no-op. Either their use is covered by copyright and they are violating your license, or it isn't covered by copyright and then any constraints that your license sets are meaningless.

Licenses hold no power outside of that granted to it by things being copyrighted by default.


I don't think so.

The code that was already used for training should be problematic for them, not only new code in the future.


I'd assume this: in the same way that you cannot forbid a human from learning concepts from your code, you cannot forbid an automated system from learning concepts from your code, regardless of the license. Also, if you could, it would make your code non-free.

At least as long as the system really learns concepts. If it just copy & pastes code, then that's a different story (same as with humans).


Why forbid it? Just use the GPL and extend the contagion to code trained on your code.


What is more concerning is that the training kernel belongs exclusively to one private company: Microsoft.

It can become a massive (and unfair) competitive advantage.

Furthermore, Copilot will not work with less popular languages and also prevent popular languages from evolving.


This feature is effectively impossible to replicate. Only Microsoft positioned itself to have:

- dataset (GitHub)

- tech (OpenAI)

- training (Azure)

- platform (VS Code)

I'm impressed. They did an amazing job from a corporate strategy standpoint. Also directionally things are getting interesting


I actually don't think there's much of a moat here at all.

GitHub repositories are open for the taking, GPT-XXX is cloneable (mostly, anyway) and VS Code is extensible.

They definitely have a good head-start, but I really don't think there's anything here that won't be generally available within 2 years.


I don't think that GH code is easily accessible, with rate limiting and a TOS forbidding it. GPT is an open model (for the most part), but its training cost is on the order of tens of millions of dollars.

I can think of only a handful of companies being able to compete there. And they won't be OK with extending a Microsoft IDE, nor with breaking GitHub's TOS.

When you start competing on R&D costs, the game changes.

There's always the chance that training costs will significantly decrease. But even at orders of magnitude less (i.e. tens of thousands of dollars), it's still beyond reach for open projects and indie devs.


Is this really anything more than a curiosity toy and a marketing tool?

I took a look at their examples and they are not at all compelling. In one example it generated SQL and somehow knew the columns and tables in a database that it had no context on. So that's a lot of smoke and mirrors going on right there.

Do many developers actually want to work in this manner? That is, being interrupted every time they type with a robot interjection of some Frankenstein code that they now have to go through and review and understand. Personally, this is going to kick me out of the zone/flow too often to be useful. Coding isn't the hard part of my job. If this tool can somehow guess the business requirements of the task at hand, then I'll be impressed.

Even if the tool generates accurate code, if I don't fully understand what it wrote, then what? I'm still stuck digging through documentation and stackoverflow to verify that whatever is in my text editor is correct code. "Code confidently in unfamiliar territory" sounds like a Boeing 737 Max sized disaster in the making.


The dataset is all freely available open source code, right? Just because GH hosts it doesn’t mean the rest of the world can’t use it for the same purpose.


They'd find a way to keep it practically difficult to use, at the least, if that dataset is vital to the process. Hoarding datasets that should either be wholly public or unavailable for any kind of exploitation is the backbone of 21st century big tech. It's how they make money, and how they maintain (very, very deep) moats against competition.

[EDIT] actually, I suspect their play here will be to open up the public data but own the best and most low-friction implementation, then add terms that let them also feed their algo with proprietary code built using their editors. That part won't be freely available, and no free version will be able to provide that further-improved model, even assuming all the software to build it is open-source. Assuming using this thing ends up being a significant advantage (so, assuming this matters at all) your choice will be to either hamstring yourself in the market or to help Microsoft build their dataset.


You'd have to deal with rate limiting many times over, no?


https://console.cloud.google.com/marketplace/product/github/...

BigQuery used to have a dataset updated weekly, looks like it hasn't been updated since about a year after the acquisition by Microsoft.


Not only that, but microsoft could aggressively throttle or outcompete anyone trying to do the same.


Aren't mirrors of all GH code available, for example in the BigQuery public datasets? If it's there, it should be available in a downloadable format too?


Anyone can download the training set from GitHub.


Is this true? Looks like they're using the OpenAI Codex which is set to be released soon:

https://openai.com/


I get the sense that GitHub wants this to be litigated so the case law can be established. Until then it’s just a bunch of internet lawyers arguing with each other.


Why would you want to? For many open source developers having models trained on your code would be desirable.


I got the sense they saw Google beating Oracle/Sun over Java in the Supreme Court and said "We'll be fine, let's move the release up."


In the discussion yesterday I pointed to the case of some students suing Turnitin for using their works in the Turnitin database, and the students lost [1]. I think an individual suing will not go anywhere. The way to create a precedent is someone feeding all the Harry Potter books and some additional popular books (Twilight?) to GPT-3 and letting it write about some kids at a sorcerer school. The outcome of that case would look very different IMO.

[1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-wins-t...


Disney's intellectual property would be a good choice for this exercise


Suing will not go anywhere because Microsoft has billions at their disposal to defend any case.


Not a lawyer, but in that case it seemed to be a factor that Turnitin was transformative, because it never sold the texts to others and thus didn't reduce their market value. But that wouldn't apply to Copilot, which might reduce the usage of libraries since you can "code" equivalent functionality with Copilot now.

Would it be a stretch to assert that GPL'd libraries have a market value for their creator in terms of reputation etc.?


While we're worrying about ML learning to write our codes we should also break all the automated looms so people don't go without jobs. Do everything manually like God intended! /s

Maybe code that is easily recreated by GPT with a simple prompt is not worth copyrighting. The future is in making it more automated, not protecting IP. If you compete against a company using it, you can't ignore the advantage.


If GitHub Copilot can sign my CLA, stating that it is the author of the work, that it transfers the IP to me in exchange for the service subscription price, and that it holds responsibility for copyright infringement, that would be acceptable. Otherwise it's a gray area I don't want to get into.


If it's trained with GPL licensed code, doesn't that mean the network they use includes it somewhat? Then, someone could sue that their networks must be GPL licensed too, right?


Yes, the neural network would constitute a derived work.


Actually no because it’s a “transformative use”. This is how search engines are allowed to show snippets and thumbnails.


The potential inclusion of GPL'd code, and potentially even unlicensed code, is making me wary of using it. Fair Use doesn't exist here and if someone was to accuse me of stealing code, saying "I pressed a button and some computer somewhere in the world, that has potentially seen your code as well, generated it for me" is probably not the greatest defense.


Does this mean that when I read GPL code and learn from it, I cannot use these learnings in non-GPL code?

I get it that the derivative work might be more clear in an AI setting, but basically it boils down to the same thing.


The core problem which would allow laundering (that there isn't a good way to draw a straight, attributive line between generated code and training examples) to me also presents a potential eventual threat to the viability of co-pilot/codex. It seems like the same thing would prevent it from knowing which published code was written by humans vs which was at least in part an output from the system. Training on an undifferentiated mix of your model's outputs and human-authored code seems like it could eventually lead the model into self-reinforcing over-confidence.

"But snippet proposals call out to GH, so they can know which bits of code they generated!". Sometimes; but after Bob does a co-pilot assisted session, and Alice refactors to change a snippet's location and rename some variables and some other minor changes and then commits, can you still tell if it's 95% codex-generated?


While I think this will continue to amplify current problems around IP, aren't current applied-ML approaches to writing software the equivalent of automating the drawing of leaves on a tree? Maybe a few small branches? But the whole tree, all its roots, how it fits in to the surrounding landscape, the overall composition, the intention? If I'm wrong about that then I picked either a good or a bad time to finally learn programming. There's only so many ways you can do things in each language though. Just like in the field of music, only so many "original" tunes. The concept of IP is incoherent, you don't own patterns (at least not at arbitrary depth), though you may be owed some form of compensation for the billions made off discovering them.


You're right, it's only drawing some leaves, the whole tree or how it relates to the forest is another thing.


Not a fan of this argument.

Musicians, artists, and all kinds of athletes grow by watching, observing, and learning from others. As if all these open source projects got to where they are without looking at how others did things.

I don't think a single function, similar syntax, or a basic check function is worth arguing about; it's not like Co-pilot is stealing an entire code base and just plopping it out by reading your mind and knowing what you want. I know developers that have certainly stolen code and implementation details from past employers and that was just fine.


I mean this is already happening. When you hire a specialist in C# servers, you're copying code that they already wrote. I find people tend to write the same functions and classes again and again and again all the time.

We have a guy that brought his task manager codebase (he re-wrote it) but it's the same thing he used at 2 other companies.

I have written 3 MPIs (master person/patient index) at this point all with the same fundamental matching engine.

I mean, one thing we can all agree on is that ML is good at copying what we already do.


Newsflash everyone, if you open source your code it's going to be copied or paraphrased anyway.


It should be copied and paraphrased, but respecting the license. This means, among other things, crediting the author.


It may be hard to believe, but there are sick and twisted individuals in this dangerous world who copy from github without even a single glance at the license, and they live among us.


Yes, and those people are violating the licenses of the code when they do that. It's not unreasonable to expect a massive company like microsoft to not do this on a massive scale.


There are always exceptions (maybe they might even be the norm in this case), but it's still not 100%, still not all-encompassing. This "AI" seems to be. I think that is the entire concern: ALL the code is affected, in all instances.


I think the issue many people may have with this is it's a proprietary tool that profits on work it was not licensed to use this way.


What does that have to do with the topic? The question is not whether it gets copied, the question is whether it gets pirated.


Yes that's the point.

But if I do it under a copyleft license like GPL, I expect those who copy to abide by the license and open source their own code too.

But sure, people shit on IP rights all the time, and I am guilty of it too. Let's say I didn't pay what I should have paid for every piece of software I have used.


If I read a lot of GPL code, absorb naming conventions, structures, patterns, and tricks, and later, when it comes down to writing a P2P chat server, I happen to recall similar patterns, naming structures, and conventions, and many of the utility methods end up pretty much as they are in the GPL code bases out there.

Now is my produced code also a GPL derivative, because I certainly did read through the code base to be able to write larger programs?


https://twitter.com/eevee/status/1410049195067674625

"""

"but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing


> humans are capable of abstract understanding and have a breadth of other knowledge to draw from

This may be a matter of time and thus is not a fundamental objection.

If mankind should fail to answer the perennial question of exploitation of the other and the same, it will be doomed. And rightly so, for mankind must answer this question; it must answer to this question. Instead, what we do is increase monetary output and then go brag about efficiency. Neither is this efficient, nor is it about efficiency, nor has the Universe ever cared about efficiency. It just happens to coincide with what the people society most looks up to have chosen as their religion.

It is not my religion to be sure.


I agree that this is different from humans learning to code from examples and reproducing some individual snippets. However, I disagree with the author’s argument that it's because of humans’ ability to abstract. We actually know nothing about the AI’s ability to abstract.

The real difference is that if one human can learn to code from public sources, then so can anyone else. Nobody is explicitly barred from accessing the same material. The AI, however, is kept proprietary. Nobody else can recreate it because people are explicitly barred from doing so. People cannot access the source code of the training algorithm; people cannot access enough hardware to perform the training; and most people cannot even access the training data. It may consist of repos that are technically all publicly available, but try downloading all of GitHub and see if they let you do that quickly, and/or whether you have enough disk space.

This puts the owners of the AI at a significant advantage over everyone else. I think this is the core of the concern.


So it’s using a massive-scale public good (non-rivalrous and non-exclusionary access to source code) to create a private product that is rivalrous in the software labour pool? Or is the problem just that it’s not open-access?


> previous """AI""" generation has been trained on public text and photos, which are harder to make copyright claims on, but this is drawn from large bodies of work with very explicit court-tested licenses

This seems pretty backwards to me. A GPL licensed data point is more permissive than an unlicensed data point.

That said, I'm glad that these data points do have explicit licenses that say "if you use this, you must do XYZ", so that it's clear that our large ML projects are going counter to creators' intent when they made their work open.

I’d love to start seeing licenses about use as training data. Then maybe we’d see more open access to these models that benefit from the openness of the web. I’d personally use licenses that say if you want to train on my work, you must publish the model. That goes for my code, my writing, and my photography.

Anyways GitHub is arguing that any use of publicly available data for training is fair use, but they also admit that it’s all new and unprecedented, regarding training data.


Posting this due to the recent unveiling of GitHub Co-pilot and the intersection on the ethics of ml training set data.


Microsoft should just GPL CoPilot's code and model. They won't, but it would fix this problem, I think.


...unless they've also ingested code that is incompatible with the GPL and CoPilot ends up regurgitating a mix.


If I as an alleged human have learned purely from GPL code would that require code I write to be released under the GPL too?

We should probably start thinking about AI rights at some point. Personally, I'll be crediting GPT-3 like any other contributor because it sounds cool, but maybe for moral reasons too in the future.


A machine's learning isn't really the same as a person's learning - people generally can code at a high level without having first read TBs of code, nor can you reasonably expect a person to have memorised GPL code to reproduce it on demand.

What you can expect a person to do is understand the principles behind that GPL code, and write something along the same lines. GitHub Co-Pilot is not a general ai, and it's not touted as one, so we shouldn't be considering whether it really knows code principles, only that it can reliably output code that fits a similar function to what came before, which could reasonably include entire blocks of GPL code.


Well, if it is actually straight-up outputting blocks of existing code, then bin it as a failed attempt to sprinkle AI on development and use this instead:

https://github.com/drathier/stack-overflow-import


That's what I wanted to ask, where do we draw the line of copyright when it comes to inputs of generative ML?

It's perfectly fine for me to develop programming skills by reading any code regardless of the license. When a corp snatches an employee from competitors, they get to keep their skills even if they signed an NDA and can't talk about what they worked on. On the other hand, there's the non-compete agreement, where you can't. Good luck making a non-compete agreement with a neural network.

Even if someone feeds stolen or illegal data as an input dataset to gain advantage in ML, how do we even prove it if we're only given the trained model and it generalizes well?


Copyright is going to get very muddy in the next few decades. ML systems may be able to generate entire novels in the styles of books they have digested, with only some assist from human editors. True of artwork and music, and perhaps eventually video too. Determining "similarity" too, may soon have to be taken off the hands of the judge and given to another ML system.


> It's perfectly fine for me to develop programming skills by reading any code regardless of the license.

I'd be inclined to agree with this, but whenever a high-profile leak of source code happens, reading that code can have dire consequences for reverse engineers. It turns clean room reverse engineering into something derivative, as if the code that was read had the ability to infect whatever the programmer wrote later.

A situation involving the above developed in the ReactOS project https://en.wikipedia.org/wiki/ReactOS#Internal_audit


>how do we even prove it if we're only given the trained model and it generalizes well?

Someone's going to have to audit the model, the training, and the data behind it. There's a documentary on black holes on Netflix that did something similar (no idea if it was AI): each team wrote code to interpret the data independently, without collaboration or hints or information leakage, and they were all within a certain accuracy of one another when interpreting the raw data at the end of it.

So, as an example, if I can't train something in parallel and get similar results to an already trained model, we know something is up and there is missing or altered data (at least I think that's how it works).


Take it further. You could easily imagine taking a service like this, putting it as invisible middleware behind a front-end, and starting to ask users to pay for the service. Some could argue it's code generation attributable to those who created the model, but the reality is that the models were trained on code written by thousands of passionate users, unpaid, with the intent of free usage.


> but the reality is that the models were trained on code written by thousands of passionate users, unpaid, with the intent of free usage.

I hope you're actually reading those LICENSE files before using open source code in your projects.


Your question had already been preempted in the OP. Specifically:

> "but eevee, humans also learn by reading open source code, so isn't that the same thing"

> - no

> - humans are capable of abstract understanding and have a breadth of other knowledge to draw from

> - statistical models do not

> - you have fallen for marketing

-- https://twitter.com/eevee/status/1410049195067674625


I preemptively commented that I'd seen that tweet three hours before your comment, figuring someone was going to quote it at me haha

Preemption doesn't work, as it turns out :)

https://news.ycombinator.com/item?id=27687586


Nice catch


Possibly. We won’t know until this is tested in court. Traditionally one would want to clean room [1] this sort of thing. Co-pilot is…really dirty by those standards.

[1] https://en.wikipedia.org/wiki/Clean_room_design


Unless you were using structures directly from said code, probably not?

Compare if you had only learned writing from, say, the Bible. You would probably write in a very Biblical manner, but would you write the Psalms exactly? Most likely not.


We have seen Co-Pilot directly output (https://docs.github.com/en/github/copilot/research-recitatio...) the zen of python when prompted - there's no reason it wouldn't write the Psalms exactly when prompted in the right manner.


That's super cool. As long as you do the things you specify at the bottom of that doc (provide attribution if copied so people can know if it's OK to use) then a lot of the concerns of people on these threads are going to be resolved.


Pretty much! There are only three major fears remaining:

* Co-pilot fails to detect it, and you have a potential lawsuit/ethical concern when someone finds out. Although the devil on my shoulder says that if Co-pilot didn't detect it, what's to say another tool will?

* Co-pilot reuses code in a way that still violates copyright, but is difficult to detect. I.e. If you checked via a syntax tree, you'd notice that the code was the same, but if you looked at it as raw text, you wouldn't.

* Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.


> Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.

It's gonna be really interesting to see how this plays out.


I've not seen Copilot in action yet, I was under the impression it doesn't use code directly.

In any case my original question was answered by the tweeter in a later tweet I missed https://twitter.com/eevee/status/1410049195067674625

I get where they're coming from but they are kinda just handwaving it back the other way with the "u fell for marketing idiot" vibe. I wish someone smarter than me could simplify the legal ramifications around this but we'll probably have to wait till it kills someone (or at least costs someone a bunch of money) to get any actual laws set up.


I think you are missing the mark here with this comparison, Copilot and its network weights are already the derived work, not just the output it produces.


Perhaps someone at Github can chime in, but I suspect that open source code datasets (the kind they are trained on) should require relatively permissive licenses in the first place. Perhaps they filter for MIT licenses in Github projects and StackOverflow answers used to train the models?


Nope, they explicitly note that the GPL showed up 700k times in the training data: https://twitter.com/eevee/status/1410067860299255810


So, I can't see how they can argue that the generated code is not a derivative of at least some of the code it was trained on, and therefore encumbered by complicated copyright claims that are, for anyone other than GitHub, impossible to disentangle. If they haven't even been careful to only use software under a single license that does not require the original author to be attributed, then I don't see how it can even be legal for them to be running the service.

All that said, I'm not confident that anyone will stop them in court anyway. This hasn't tended to be very easy when companies infringe other open source code copyright terms.

Until it is cleared up, though, it would seem extremely unwise for anyone to use any code from it.


Microsoft: embrace, extend, extinguish.


Well, this would not be hard to verify, though.

You could automate the process by feeding existing GPL source code to Copilot and seeing what it comes up with next.

I am sure that at some point it WILL produce exactly the same code snippet as some GPL project, provided you attempt it enough times.

Not sure what the legal interpretation would be, though; it is pretty gray-ish in that regard.

There is also a risk for Copilot if it has digested certain PII and people find it out... it would be much more interesting to see that outcome.
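
A rough sketch of what that automation might look like, assuming a hypothetical complete() wrapper around whatever completion endpoint you have access to, plus a local mirror of some GPL projects to diff against (illustrative only):

    # Feed GPL seed snippets to a completion endpoint and flag long
    # verbatim overlaps with files from a local GPL corpus.
    import difflib
    from pathlib import Path

    def complete(prompt):
        """Hypothetical completion call -- swap in whatever you have access to."""
        raise NotImplementedError

    def longest_verbatim_overlap(generated, corpus_file):
        """Length in characters of the longest block shared verbatim."""
        other = corpus_file.read_text(errors="ignore")
        m = difflib.SequenceMatcher(None, generated, other)
        return m.find_longest_match(0, len(generated), 0, len(other)).size

    def scan(seeds, corpus_dir, threshold=200):
        for seed in seeds:
            generated = complete(seed)
            for f in Path(corpus_dir).rglob("*.c"):
                overlap = longest_verbatim_overlap(generated, f)
                if overlap >= threshold:
                    print(f"{f}: {overlap} chars reproduced verbatim")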


As per some of the other twitter replies, Co-pilot has offered to fill in the GPL disclaimer in new files.


It doesn't have to be exact to be copyright infringement; see non-literal copying. The basic idea behind it is that if you copy-paste code and rename variables, that doesn't mean it's new code.


Yeah, you'd have to assume they are parsing and normalizing this data in some way. There would still be some AST patterns or something similar you could look for in the same way, but it would be much trickier.

Plus considering this is a legal issue ... good luck with "there is a statistically significant similarity in AST outputs related to the most unique sections of this code base" type arguments in court. We're currently at the "what's an API" stage of legal tech understanding.
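
A toy illustration of why renaming doesn't create new code, comparing Python snippets at the AST level with identifiers stripped out (this is not how Copilot or any real detector works; it just shows the idea):

    # Rename-insensitive structural equality for Python snippets.
    import ast

    class Normalize(ast.NodeTransformer):
        """Replace identifiers with placeholders so only structure remains."""
        def visit_Name(self, node):
            return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)
        def visit_arg(self, node):
            node.arg = "_"
            return node
        def visit_FunctionDef(self, node):
            self.generic_visit(node)
            node.name = "_"
            return node

    def structurally_equal(a, b):
        norm = lambda src: ast.dump(Normalize().visit(ast.parse(src)))
        return norm(a) == norm(b)

    original = "def add(x, y):\n    return x + y\n"
    renamed = "def combine(first, second):\n    return first + second\n"
    print(structurally_equal(original, renamed))  # True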


The real question is whether it constitutes a derived work, though. And that is not a question of similarity so much as of provenance: if you start with a codebase that is GPL originally, and it gets gradually modified to the point where it doesn't really look anything like the original, it's still a derived work, and is still subject to the license.

Similarity can be used to prove derivation, but it's not the only way to do so. In this case, all the code that went into the model is (presumably) known, so you don't really need any sort of analysis to prove or disprove it. It is, rather, a legal question - whether the definition on the books applies here, or not.


Regarding PII, I think you have a very good point. I wouldn't be surprised to see working AWS_SECRET_KEY values appear in there. Indeed, given that copypaste programmers may not understand the code they're given, it's entirely possible that someone may run code which uses remote resources without the programmer even realising it.
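
If you do accept generated snippets, it's probably worth running them through the same kind of secret scan you'd apply to any other commit. A minimal sketch with a couple of illustrative patterns (real tools such as gitleaks or truffleHog use many more rules plus entropy checks):

    import re

    # Illustrative patterns only, not an exhaustive rule set.
    PATTERNS = {
        "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
        "AWS secret access key": re.compile(
            r"(?i)aws_secret_access_key\s*[:=]\s*['\"]?[A-Za-z0-9/+=]{40}"),
        "Private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    }

    def scan_snippet(snippet):
        """Return the names of any suspicious patterns found in a suggestion."""
        return [name for name, pattern in PATTERNS.items() if pattern.search(snippet)]

    print(scan_snippet("aws_secret_access_key = '" + "A" * 40 + "'"))
    # ['AWS secret access key']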


This question about the amount of code required to be copyrightable starts to sound familiar to the copyright situation with music, where currently the bar seems to be set too low, legally, to prove plagiarism.


If I recall correctly, it has already been determined that using proprietary data to train a machine learning system is not a violation of intellectual property.


People write code in their spare time, often without compensation.

Then a big corporation comes in, appropriates it, repackages it, and sells it as a new product.

It's shameful behaviour.


I don't think anyone is interested in stealing small code snippets. It's easy enough to rewrite them. Where the GPL does matter is in complete products. In other words, it's never "if only we could use this GPL-licensed function"; it's almost always "if only we could link this GPL-licensed library or executable".

And this GitHub co-pilot in no way infringes on full codebases.


I think copyright is a problem for GPL-like licenses. They should have restricted the training data to MIT/BSD-like licenses.

Anyway, there is another, much bigger problem: patents. I think the Apache license has a provision about patents, but code under most other licenses may be covered by patents, and if the AI generates something similar it may fall under those patents.


MIT/BSD-like would still require attribution, which they are also not doing.


I think you are correct, but (I guess that) most people that use MIT/BSD use them as a polite version of the WTFPL.

People that use A/L/GPL usually like the virality and will complain more.


"Who owns the future?" by Jaron Lanier covers lots of this stuff in a realli interesting way.

If heart surgeons train an AI robot to do heart surgery ... shouldn't they be compensated (as passive income) for enabling that automation?

Shouldn't this all be accounted for? If my code helps you write better code (via AI) shouldn't I be compensated for the value generated?

We are being ripped off.


I think the argument has merit. Unfortunately it won't be decided on technical merit, but likely in the manner expressed in this excellent response I saw on Twitter:

"Can't wait to see a case for this go in front of an 80 year old judge who rules something arbitrary and justifies it with an inaccurate comparison to something nontechnical."


"About 0.1% the snippets are verbatim"

This implies that by just changing the variable names, the snippets are classed as non-verbatim.

I don't buy that this number is anywhere close to the actual figure if you assume that you can't just change function names and variable names and suddenly say you have escaped both the legality and the spirit of GPL.


There isn't that much enforcement of open source license violations anyway. I bet there are lots of places where open source code gets taken, copyright/license headers stripped off and the code used in something proprietary as well as the bog-standard "not releasing code for modified versions of Linux" violation.


To me, this is similar to all these big org making money off our data. They should be paying us to profit off our minds.


Isn't most modern coding just googling for someone who has solved the same problem you are currently facing and then copy/pasting from Stack Overflow?

To the extent that GPT-3 / Copilot is just an over-fitted neural net, its primary value is as an automated search, copy, and paste.


Check out the comments on the original post about GitHub co-pilot.

The top one reads just like an ad: https://news.ycombinator.com/item?id=27676845

Some posts that definitely aren't by shills (including the third one because I simply don't believe there's a person on the planet that "can't remember the last time Windows got in my way"): https://news.ycombinator.com/item?id=27678231 https://news.ycombinator.com/item?id=27686416 https://news.ycombinator.com/item?id=27682270

Very mild, yet negative sentiment opinion (downvoted quickly): https://news.ycombinator.com/item?id=27676942


That's brilliant: I would argue that since MS used code with GPL type of licenses to train the Co-Pilot algorithm it shall release the Co-pilot model in its entirety. The ones who differentiate data and code missed their classes on Gödelization and functional programming.


I wonder what would happen if someone scraped Genius and used the lyrics to make a song writing tool.


> github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this

I don't understand the second sentence, i.e. where's the proof?


That's like saying that making a blurry, shaky copy of Star Wars is not derivative but original work. Thing is, the "verbatimness" of the generated code is positively correlated with the number of parameters they used to train their model.


So as I understand it, the AGPL was introduced to cover an unforeseen loophole in the GPL: that adapted code could be used to power a web service. Could another new version of the license block the use of code to train Copilot-like models?


Just wanted to say that everything is ultimately "derivative", and this literal ordinary meaning differs from the legal meaning, which is informed by context, policy and what makes sense.

Where do you draw the line? That's for the courts to decide!


Neat tool, but needs context.

I'm getting a lot of suggestions that make no sense. What's worse, the suggested code has invalid types and won't compile. I'm surprised they didn't prune the solution tree via compiler validation.
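
A sketch of the kind of pruning being suggested here, assuming the candidates are C snippets and gcc is on the PATH (the real obstacle is presumably latency, since doing this per keystroke would be expensive):

    import os
    import subprocess
    import tempfile

    def compiles(candidate_c_source):
        """True if gcc can parse and type-check the snippet (-fsyntax-only)."""
        with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
            f.write(candidate_c_source)
            path = f.name
        try:
            result = subprocess.run(["gcc", "-fsyntax-only", path],
                                    capture_output=True)
            return result.returncode == 0
        finally:
            os.unlink(path)

    def prune(suggestions):
        """Keep only the candidate completions that at least compile."""
        return [s for s in suggestions if compiles(s)]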


Our software has violated the world and people's lives, legally and illegally, in many instances. I mean, none of us cared when GPT-3 did the same for text on the internet. :)

Reminder: software engineers, our code, and our GPLs are not special.


The amount of people not knowing the difference between Open Source and Free Software is astonishing. With the amount of RMS memes I see regularly I would expect things to be settled by now.


The day MS bought GitHub, I knew this was on their agenda.


The conclusion seems a bit unfair.

> "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing

Machines will draw on other sources of knowledge besides the GPL code. Whether they have the capacity for "abstract thought" is probably up for debate. There's not much else said in those bullets. It's not a good argument.


People are worrying about AI. The AI is still shit. lol


I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human I still own code that I later write. If copyright was enforced on the outputs of machine learning models based on all content they were trained on it would be incredibly stifling to innovation. Requiring obtaining legal access to data for training but full ownership of output seems like a sensible middle ground.

(Reposting my comment from yesterday)


Reposting a summary of my reply: if you memorize a line of code and then write it down somewhere else without attribution, that is not fair use, you copied that line of code. If this model does the same, it is the same.


SourceHut is looking real nice these days...


Why not gitlab?


Too bloated.


Man, reading the response tweets really highlights how bad Twitter is for nuanced discussion.


Autocompletion of a comment can be enough for copyright infringement.


I was just musing about whether this kind of tool has been written (or is being written) for music composition, business letter writing, poetry, news copy.

Interesting copyright issues.

Anyone who thinks their profession will continue as-is for the long term is probably mistaken.


I’m worried about my job. What do I do to prepare?


There are much bigger things in this world to worry about. I bet you that by the time that this AI has taken your job, it'll have taken many other jobs, completely rearranging entire industries if not society itself.

And even once that happens you shouldn't be worried about your job. Why? Because economically everything will be different and because your job isn't that important, it likely never was. The problems humanity faces are existential. Authoritarianism, ecosystem collapse and mass migration of billions of people.

So if you really want to "prepare", then try to make a difference in what actually matters.


>GitHub co-pilot as open source code laundering? The English language as I flush?


It's astonishing to me that HN+Twitter believe that GitHub designed this entire project without speaking to their legal team and confirming that training on GPL code would be possible.

Mind-blowingly hilarious armchair criticism.


The tone of the responses here is absurd. Guys, be grateful for some progress. Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you. This is primarily about reducing the need to re-type total boilerplate and/or copy/paste from Stackoverflow. If you were to let some of the people here run things we'd never have any form of progress with anything ever.


Questions like this go much deeper and illustrate issues that need to be addressed before the technology becomes standard and widely adopted.

It's not about progress or suppressing it; it's a fundamental question about whether it is OK for huge companies to profit from the work of others without so much as giving credit, and whether using AI this way represents an instance of doing so.

The latter aspect goes beyond productivity or licensing - the OP asserts that AI isn't equivalent to a student who learned from examples how to perform a task, but rather replicates (recalls) or reproduces the works of others (e.g. the training material).

It's a question that goes beyond this particular application: what about GAN-based generators? Do they merely reproduce slight variations of the training material? If so, wouldn't the authors of the training material have some kind of intellectual property rights to the generated works?

This doesn't just concern code snippets, it's a general question about AI, crediting creators, and circumventing licensing and intellectual property rights.


> Instead of having to retype boilerplate code, your productivity is now enhanced by having a system that can do it for you

We already invented something for that a couple decades ago, and it's called a "library". And unlike this thing, libraries don't launder appropriation of the public commons with total disregard for those who have actually built that commons.


This goes into one of my favorite philosophical topics: John Searle's Chinese Room. I won't go into it here, but the question of whether an AI is actually learning how to code or simply substituting information based on statistically common practices (or if there really is a difference between either) is going to be one hell of a problem for the next few decades as we start to approach fine points of what AI is and how it could be defined.

However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is an API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.

In the near-future, when we have AI-assisted reverse engineering along with Github co-pilot, then, with enough obfuscation there's nothing that can't be legally created or recreated on a computer, proprietary or not. This is simultaneously free software's greatest dream and worst nightmare.

Edit: changed Hilary Putnam to John Searle Edit 2: spelling


> However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense.

That's... a mind-bendingly bad take. Google took an API definition and duplicated it; Copilot is taking general code and (allegedly) duplicating it. This was not done in order to enable any sort of interoperability or compatibility.

The "API defense" would apply if Copilot only produced API-related code, or (against CP) if someone reproduced the interfaces copilot exposes to consumers.

> Microsoft has a good defense.

MS has many good defenses (transformative work, github agreements, etc etc), but this is not one of them.


> the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as Github co-pilot is API (A self-modifying one, but an API nonetheless), Microsoft has a good defense

That's a wild misconstrual of what the courts actually ruled in Oracle v. Google.

(And to the reader: don't take cues from people banging out poorly reasoned quasi-legal arguments in off-the-cuff comments.)


Straight from the horse's mouth [1]:

pg.2

'This case implicates two of the limits in the current Copyright Act. First, the Act provides that copyright protection cannot extend to “any idea, procedure, process, system, method of operation, concept, principle, or discovery . . . .” 17 U. S. C. §102(b). Second, the Act provides that a copyright holder may not prevent another person from making a “fair use” of a copyrighted work. §107. Google’s petition asks the Court to apply both provisions to the copying at issue here. To decide no more than is necessary to resolve this case, the Court assumes for argument’s sake that the copied lines can be copyrighted, and focuses on whether Google’s use of those lines was a “fair use.”

"any idea, procedure, process, system, method of operation, concept, principle, or discovery" sounds suspiciously like an API. Continuing:

Pg. 3-4

'To determine whether Google’s limited copying of the API here constitutes fair use, the Court examines the four guiding factors set forth in the Copyright Act’s fair use provision... '

(1) The nature of the work at issue favors fair use. The copied lines of code are part of a “user interface” that provides a way for programmers to access prewritten computer code through the use of simple commands. As a result, this code is different from many other types of code, such as the code that actually instructs the computer to execute a task. As part of an interface, the copied lines are inherently bound together with uncopyrightable ideas (the overall organization of the API) and the creation of new creative expression (the code independently written by Google)...

(2) The inquiry into the “the purpose and character” of the use turns in large measure on whether the copying at issue was “transformative,” i.e., whether it “adds something new, with a further purpose or different character.” Campbell, 510 U. S., at 579. Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language .... The record demonstrates numerous ways in which reimplementing an interface can further the development of computer programs. Google’s purpose was therefore consistent with that creative progress that is the basic constitutional objective of copyright itself.

(3) Google copied approximately 11,500 lines of declaring code from the API, which amounts to virtually all the declaring code needed to call up hundreds of different tasks. Those 11,500 lines, however, are only 0.4 percent of the entire API at issue, which consists of 2.86 million total lines. In considering “the amount and substantiality of the portion used” in this case, the 11,500 lines of code should be viewed as one small part of the considerably greater whole. As part of an interface, the copied lines of code are inextricably bound to other lines of code that are accessed by programmers. Google copied these lines not because of their creativity or beauty but because they would allow programmers to bring their skills to a new smartphone computing environment. The “substantiality” factor will generally weigh in favor of fair use where, as here, the amount of copying was tethered to a valid, and transformative, purpose.

(4) The fourth statutory factor focuses upon the “effect” of the cop- ying in the “market for or value of the copyrighted work.” §107(4). Here the record showed that Google’s new smartphone platform is not a market substitute for Java SE. The record also showed that Java SE’s copyright holder would benefit from the reimplementation of its interface into a different market. Finally, enforcing the copyright on these facts risks causing creativity-related harms to the public. When taken together, these considerations demonstrate that the fourth factor—market effects—also weighs in favor of fair use.

'The fact that computer programs are primarily functional makes it difficult to apply traditional copyright concepts in that technological world. Applying the principles of the Court’s precedents and Congress’ codification of the fair use doctrine to the distinct copyrighted work here, the Court concludes that Google’s copying of the API to reimplement a user interface, taking only what was needed to allow users to put their accrued talents to work in a new and transformative program, constituted a fair use of that material as a matter of law. In reaching this result, the Court does not overturn or modify its earlier cases involving fair use.'

[1] https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf


That's John Searle's thought experiment actually. Hilary Putnam had some thoughts in reference to it along the lines that a brain in a vat might think in a language similar to what we would speak, but the words of that language would necessarily encode different meanings due to the different experience of the external world and sensory isolation.

https://plato.stanford.edu/entries/chinese-room/


Thanks for the correction. I made it known in my edit.


And this applies to everything, not just source code.

I’m just presuming we have a future where you can consume unique content indefinitely. Such as instead of binge watching Star Trek on Netflix you press play and new episodes are generated and played continuously, 24/7, and they are actually really good.

Thus intellectual property becomes a commodity.


While headway has been made in photo algorithms like StyleGAN, GPT-3's scriptwriting, and AI voice replication, we aren't even close to having AI-generated stick cartoons or anime. At best, AI-generated Star Trek trained on old episodes would produce the live-action equivalent of limited animation; it would reuse the most-liked parts over and over again and rehash the same camerawork and lens focus that you got in the 60's and the 90's. There wouldn't be any new planets explored, no new species, no advances in cinematography, and certainly no self-insert character (in case you wanted to see a simulation of how you'd fare on the Enterprise). It wouldn't add anything new as far as I can see. Now if there was some way to recreate all the characters in photorealistic 3D with Unreal Engine, feed them a script, and use some form of intelligent creature and planet generation, you might get a little closer to creating a truly new episode.



