Hacker News

That's what I wanted to ask, where do we draw the line of copyright when it comes to inputs of generative ML?

It's perfectly fine for me to develop programming skills by reading any code, regardless of the license. When a corporation snatches an employee from a competitor, the employee gets to keep their skills even if they signed an NDA and can't talk about what they worked on. The exception is the non-compete agreement, under which you can't take those skills to a rival at all. Good luck making a non-compete agreement with a neural network.

Even if someone feeds stolen or illegal data as an input dataset to gain advantage in ML, how do we even prove it if we're only given the trained model and it generalizes well?




Copyright is going to get very muddy in the next few decades. ML systems may be able to generate entire novels in the styles of books they have digested, with only some assistance from human editors. The same goes for artwork and music, and perhaps eventually video too. Even determining "similarity" may soon have to be taken out of the hands of a judge and given to another ML system.
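To make the "similarity" point concrete: the crudest automated signal a system like that might start from is a bag-of-words cosine similarity. This is a minimal illustrative sketch, not a claim about how any real court or model would do it; real systems would need semantic features, not raw word counts.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts (0.0 to 1.0)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

original  = "the quick brown fox jumps over the lazy dog"
suspect   = "the quick brown fox leaps over a lazy dog"
unrelated = "import numpy as np"

# A lightly paraphrased text scores far higher than an unrelated one.
print(cosine_similarity(original, suspect))
print(cosine_similarity(original, unrelated))
```

The obvious weakness (and why a judge still matters) is that surface similarity says nothing about whether the overlap is protectable expression or just common phrasing.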


> It's perfectly fine for me to develop programming skills by reading any code regardless of the license.

I'd be inclined to agree with this, but whenever a high-profile leak of source code happens, reading that code can have dire consequences for reverse engineers. It turns clean-room reverse engineering into something derivative, as if the code that was read had the ability to infect whatever the programmer wrote later.

A situation along these lines developed in the ReactOS project: https://en.wikipedia.org/wiki/ReactOS#Internal_audit


>how do we even prove it if we're only given the trained model and it generalizes well?

Someone's going to have to audit the model, the training process, and the data that went into it. There's a documentary about black holes on Netflix that did something similar (no idea if it involved AI): each team wrote code to interpret the raw data independently, without collaboration, hints, or information leakage, and at the end their interpretations were all within a certain accuracy of one another.

So, as an example, if I can't train something in parallel and get results similar to the already trained model's, we know something is up and there is missing or altered data (at least, I think that's how it would work).
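The independent-reimplementation idea above can be sketched in a toy form: two "teams" separately implement the same estimate over the same raw data, then check that the answers agree within a tolerance. The team names and the trivial pipeline are made up for illustration; a real model audit would compare far richer artifacts than a single mean.

```python
import random
import statistics

def team_a_estimate(samples):
    # Team A: hand-rolled mean of the raw data.
    return sum(samples) / len(samples)

def team_b_estimate(samples):
    # Team B: independent implementation via the stdlib.
    return statistics.fmean(samples)

random.seed(42)
raw_data = [random.gauss(mu=5.0, sigma=1.0) for _ in range(10_000)]

a, b = team_a_estimate(raw_data), team_b_estimate(raw_data)
tolerance = 1e-6
# If the independent results diverge beyond tolerance, "something is up"
# with one of the pipelines (or with the data one of them was given).
print(abs(a - b) <= tolerance)
```

The catch for ML models is that training is far less deterministic than a mean, so "agree within a tolerance" has to be defined over model behavior (predictions on a test set), not over raw weights.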


Take it further. You could easily imagine using a service like this as invisible middleware behind a front end and asking users to pay for the service. Some would argue the code generation is attributable to those who created the model, but the reality is that the models were trained on code written by thousands of passionate users, unpaid, with the intent of free usage.


> but the reality is that the models were trained on code written by thousands of passionate users, unpaid, with the intent of free usage.

I hope you're actually reading those LICENSE files before using open source code in your projects.





