That's really no different than somebody uploading proprietary code they don't own (stolen, leaked, whatever the reason) to GitHub. GitHub has to assume that you are allowed to do so. What are they going to do otherwise, somehow manually verify that each repository is legit?
Now you might say, what about GPL code you don't own? You are allowed to redistribute it (upload it to GitHub). But because you are not the owner, you can't license it to GitHub under new terms (ones that allow them to use it for ML training). The question still is: is there anything in the GPL that forbids its code being used for ML training? Even if the generated model is proprietary, has no attributions, etc.?
Ok, takedown requests exist. Say Qualcomm finally wises up and asks GitHub to take down a copy of the millions of lines of their super proprietary 4G modem firmware implementation. Will GitHub retrain the model after each such takedown? :D
If not, then it's kinda stupid to argue the point about lack of knowledge, since whether GitHub knows or not clearly doesn't matter. GitHub will happily continue using confidential code for Copilot, even code from trigger-happy companies like Qualcomm.
I guess they would add some kind of filter to Copilot's output that removes results that clearly come from code that was DMCA'd.
It's kind of like an employee who worked at Qualcomm and has seen the code. Do you retrain him (aka hit his head until he forgets) after he leaves the company?
The comparison might seem silly, but as AI advances I expect more and more arguments (especially in court) to rest on analogies between humans and AIs.
What kind of filter? I thought Copilot does not output the input data verbatim.
Creating an output filter based on millions of lines of DMCA'd code that doesn't cripple the Copilot output completely at the same time sounds like one of those hard problems. Especially if there's no agreed-upon definition of copyright "violation" here.
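To make "what kind of filter" concrete, here is a naive sketch of what such an output filter could look like: index token n-grams of the taken-down code and reject a suggestion when too many of its n-grams match. Everything in it is an assumption for illustration (the function names, the tokenizer, `n=8`, the 0.5 threshold); nothing here is known about how Copilot actually works. It only catches near-verbatim copies, and picking the threshold is exactly where the "what counts as a violation" question comes back in.

```python
import re

def shingles(code, n=8):
    """Split code into crude tokens and return the set of n-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_blocklist(removed_files, n=8):
    """Collect the n-grams of every taken-down file (hypothetical input)."""
    blocked = set()
    for text in removed_files:
        blocked |= shingles(text, n)
    return blocked

def overlap_ratio(suggestion, blocked, n=8):
    """Fraction of the suggestion's n-grams that also occur in blocked code."""
    grams = shingles(suggestion, n)
    if not grams:
        return 0.0  # too short to judge
    return len(grams & blocked) / len(grams)

def allow(suggestion, blocked, n=8, threshold=0.5):
    """Let a suggestion through only if its overlap stays below the threshold."""
    return overlap_ratio(suggestion, blocked, n) < threshold
```

Set `n` too small and every `for (i = 0; i < n; i++)` gets blocked; set it too large and renaming one variable slips past the filter.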