I agree that AI usage of code is somewhat murky under current licenses, which obviously don't mention it either way.
Free software has a principle of "freedom to run, to do whatever you wish" (freedom 0), so arguably it has already said that training AI is OK. (We could quibble over the word "run", but gnu.org and RMS clearly say "freedom 0 does not restrict how you use it.")
GPL code can be used by the military to develop nuclear weapons. Given that this is a guiding principle of the FSF, it's hard to argue that the current usage is not OK.
I have no problem with Copilot being trained on AGPL code and then being released under an AGPL-compatible license. Then they'd be free to do whatever they want with it.
The problem is Copilot training on source code and then discarding all the restrictions of the licenses. Maybe it's legal right now, but I'm sure this case will find its way into open source licenses pretty soon.
Even if the usage is legal right now, the other obligations of the license need to be adhered to as well. You can't just pick and choose one tiny aspect of FSF philosophy and run with it. The AGPL is clearly about sharing and spreading free/libre software as well.
Do we know whether Copilot X was trained on AGPL code, not just GPL?
Additionally, I'm not sure the AGPL does anything here.
I suspect the ethics of licensing, when large fractions of the work are training and using AI, need to be worked out, rather than getting mad at any one individual.
There is indeed a transparency problem right now. AFAIK the companies have not released their complete training data sets. That might even be intentional: they don't want to risk having trained on material they shouldn't have used without building license and attribution handling into the output of their models. Or they might already know that to be a fact.
I can only hope that lawmakers hurry to catch up with reality and impose transparency obligations on AI models.
I largely agree with you, but I think there is one question that hasn't been addressed yet: Are the weights learned by an LLM a derivative work?
When a person learns from GPL code, this question doesn't arise. The state of a person's brain is outside of copyright. But is the state of an LLM also outside of copyright, or outside of the terms covered by the GPL? I'm not sure.
An LLM can not only emit source code derived from code published under the GPL; it can also potentially execute a program, and its weights could therefore be considered object code.
This isn't necessarily a problem as long as the model isn't distributed and does not include any AGPL code.
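To illustrate what "execute a program" could mean: nothing stops you from wiring a model up as an interpreter. A hypothetical sketch (in Python; `generate` is a placeholder for whatever completion API the model exposes, assumed here rather than any real library call):

    # Hypothetical sketch of an LLM "executing" a program. `generate`
    # stands in for any prompt-to-text completion function.
    def run_via_llm(generate, program: str, stdin: str) -> str:
        prompt = (
            "Act as a Python interpreter. Run the program on the "
            "input and reply with only its output.\n"
            f"Program:\n{program}\n"
            f"Input:\n{stdin}\n"
            "Output:\n"
        )
        return generate(prompt)

If the weights can do the work an interpreter does, the analogy to object code is at least plausible.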
> the state of a person's brain is outside of copyright
It clearly isn't. That's why clean-room reverse engineering always requires at least two people, and why a musician who accidentally recreates a chord progression they heard years ago but doesn't remember the source might still get sued.
No, you're missing the very distinction I'm trying to highlight.
When I read and remember some text, possibly also learning from it, I'm not making a copy and I'm not creating a derivative work. The state of my brain is outside of copyright. Only at the point where I create a new representation based on what I have read might I be violating someone's copyright.
But is it the same for an AI? Is the act of reading, remembering and learning (i.e. adjusting model weights) not in itself tantamount to creating a derivative work?
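To make that concrete, here is a deliberately degenerate toy sketch (my own hypothetical illustration in Python, nothing like Copilot's actual architecture): a "model" whose weights are derived entirely from the training text and end up being a lossless re-encoding of it.

    # Hypothetical toy "model", not how a real LLM works: the weights
    # are built solely from the text, and the text can be read back out.
    import numpy as np

    text = "int main(void) { return 0; }"   # stand-in for GPL'd source
    vocab = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(vocab)}

    # "Training": push weight mass toward the observed character at
    # each position.
    weights = np.zeros((len(text), len(vocab)))
    for pos, ch in enumerate(text):
        weights[pos, char_to_id[ch]] += 1.0

    # "Inference": recover the training text verbatim from the weights.
    recovered = "".join(vocab[int(np.argmax(row))] for row in weights)
    assert recovered == text

Real models are lossy and vastly more indirect, of course, but the question stands: if the weights are a function of the text, and the text can sometimes be recovered from them verbatim, at what point do they stop being a transformed copy?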
Is it actually? If we could fully read out the state of your brain and see that you had stored a copy of a copyrighted work, I think you could be on the hook for licensing it, paying fees every time you remember the work, as a performance of it.
The state of your brain is moot wrt copyright as you cannot distribute your brain.
Copyright is not the exclusive right to make copies; it is the exclusive right to distribute them.
As a simple example, reading a book aloud in your home or singing in the shower is not copyright infringement, not even if you record it.
If you sell tickets to these performances or stream them on twitch it becomes copyright infringing.
Similarly, it cannot be a violation of copyright for GitHub to train Copilot on any code they can legally access. It can be a violation to sell access to the model trained in this way.