
On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).



But the thing is that we explicitly allow humans to develop their own skills by learning from other humans, yet we have our own taboos around directly copying people's work without permission and passing it off as your own. The debate is that copilot isn't a human; it's a machine that outputs copied work on a statistical basis.

Humans are allowed to be unoriginal, uncreative, boring, mediocre, and all sorts of things. But they’re not copying whole cloth the way copilot is.


> But they’re not copying whole cloth the way copilot is.

Stack Overflow content is CC BY-SA 4.0, yet I'd bet most corporate codebases include tons of snippets from it with no link or citation to the original answer.


Don't you have to work pretty hard to get copilot to reproduce snippets verbatim? My understanding is that, while it's possible to make copilot reproduce snippets as long as they appear in a large number of files, this basically never happens under normal usage.


This argument doesn't work.

I don't know whether copilot is giving me infringing content or not. I'm always at risk that my question was one of the ones that trigger infringing replies.


Whenever I've used Copilot it never seems to copy whole sections of code. Can you provide examples of this?

From what I've seen it is producing fairly generic boilerplate that has been modified based on the rest of the code in my repo so that it works with the other functions and even incorporates other pieces of my code in the same style that I'm using. The boilerplate aspect makes sense because this would be the most common sequence of tokens that it observed during training. It's somewhat miraculous that it can incorporate code on the fly from my repo. I've never seen anything that looks like a direct copy paste from elsewhere though. If you have a different observation I'd love to see it.


Behold: https://twitter.com/StefanKarpinski/status/14109710611816816...

Probably helps that this is from a codebase that's been forked quite a bit.
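
(For reference, and assuming the linked tweet is the widely shared demo from that wave: the snippet Copilot reproduced, comments and all, was Quake III's fast inverse square root.)

    float Q_rsqrt( float number )
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;                       // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration

        return y;
    }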


Yeah, I wanted an example from a real project, not a one-file demo. The high fork count, and probably also the snippet's existence in thousands of other projects, likely produces this behaviour when you have no surrounding context.

This is also easily solved by checking the box in Copilot that says not to produce any code matching public code.


Can I circumvent the new OGL revocation by training an AI on 100 copies of the D&D rulebook, and using its output?


You can't even use code search in forked repos, so maybe forks were excluded (apart from commits added on top of the fork)?


Most forks probably happened before GitHub existed.


> On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet

And that would make sense and it would be argued on its own merit. The judge/jury will decide if this argument is correct and legal.

But the article implies that there are lawyers and infringers out there who are arguing that they could not have possibly afforded the cost of not infringing, so they were justified in their infringement. Since when did the massive cost of avoiding infringement become a valid reason to carry on with infringement? This seems just plain absurd by common sense. How do lawyers and infringers make this argument? How is it even entertained in court? What am I missing?


So if I make a script that automatically downloads every torrent in existence it's suddenly ok, since it is infeasible to check the copyright of them all?


We allow humans to do what copilot does because we take into account that the human brain is very limited in this regard. If we could scan all of GitHub in under a week and recall perfectly what we saw, we would already have different laws. Now that machines are able to somewhat learn like humans, but 1,000,000x faster, we need new laws.

That's why I don't believe "but that's like humans doing X" is a strong argument.


>On the other side, they could argue that it's like a human learning how to code over a decade of looking at the internet, and that human doesn't need to DM every code author to ask if they can learn from their content (and the risk for the author is similar given the human might one day recall some author's code verbatim and not give attribution).

It might not be that easy. I think Wine developers are not allowed to read code related to Windows, even if that code is published on GitHub. The mere fact that you looked at the code was deemed a risk.

You also have cases of a NN producing identical output, so you either prove your NN NEVER produces copyrighted code, or you have a second process that is 100% correct and double-checks the NN output for plagiarism.
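
A minimal sketch of what that second process might look like (purely hypothetical, nothing like Copilot's actual filter): flag any generated output that shares a long enough verbatim run with an indexed corpus of licensed code. In C:

    #include <stdio.h>
    #include <string.h>

    /* Toy check: does `generated` share a verbatim run of at least
       `window` characters with `corpus`? A real filter would have to
       fingerprint billions of files and normalize whitespace and
       identifiers; this only shows the shape of the check. */
    static int shares_verbatim_run(const char *generated,
                                   const char *corpus, size_t window)
    {
        char chunk[256];
        size_t glen = strlen(generated);

        if (window >= sizeof chunk || glen < window)
            return 0;
        for (size_t i = 0; i + window <= glen; i++) {
            memcpy(chunk, generated + i, window);  /* slide a window over the output */
            chunk[window] = '\0';
            if (strstr(corpus, chunk) != NULL)     /* verbatim run found in corpus */
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* `corpus` stands in for an index of licensed code. */
        const char *corpus    = "i  = 0x5f3759df - ( i >> 1 );  // what the fuck?";
        const char *generated = "i  = 0x5f3759df - ( i >> 1 );";
        printf("%s\n", shares_verbatim_run(generated, corpus, 20) ? "flagged" : "clean");
        return 0;
    }

The catch is exactly the one raised above: such a filter is only useful if it has essentially zero false negatives, and proving that is as hard as the original problem.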

I am against Microsoft in this case because they decided not to put their own proprietary code in the NN. It would have been funny to have the AI write an open source Windows re-implementation when you feed it the Win API documentation.


The Wine developers are allowed to read whatever they want. They may choose to have a policy not to, because it makes it easier for them to prove that they didn't make unlawful use of proprietary code: they can't have copied something that they never read. If you reproduce something independent of knowledge of the original then that is a defense against copyright infringement. This is essentially the "Clean Room" tactic: https://en.wikipedia.org/wiki/Clean_room_design


If copilot is so advanced that we need to grant it the rights a human has, then it has a right to freedom, and owning copilot is a crime.

I don't think Microsoft wants to go down this path.


It is not a human though. It is a function approximated from inputs and outputs. The laws are different and the licenses call out derivative works.


If enough of my code is recalled verbatim, I can sue the person who reproduced it. That seems to fit entirely with this case -- they are suing the owner of Copilot, partially because it reproduces chunks of code.


Why are wine developers and similar required to do clean room implementations to not be sued then?

Simply having read the leaked Windows source code makes you ineligible to contribute to Wine.

Why is Windows source code so much more important than mine?

The other thing is that copilot is not a human, so it doesn't matter anyways.

Humans are a special exception with laws, because they are intended to protect and benefit humans while also being fair. I don't think you can just substitute something in and assume that the same rules apply.


> Why are wine developers and similar required to do clean room implementations to not be sued then?

That's the neat part, they don't. It's essentially a self-imposed limitation which contradicts actual court rulings on the matter such as Sony v. Connectix, in which the court commented on clean-room being "inefficient" and the kind of inefficiency that fair use was "designed to prevent".



