
I don't think you could use RLHF to stop plagiarism. RLHF can be used to teach what an "angry response" is because you can look at the text itself for that quality. A plagiarized text doesn't have any special intrinsic quality aside from "already existing", which you can only determine by looking at the world.

One thing you might do is build a full-text search index over the entire training data. If part of a ChatGPT response is copied verbatim, give the model the assignment "please paraphrase this" and substitute the paraphrase into the response. This might slow ChatGPT down a lot - but it might not; I suspect an LLM forward pass is far more computationally expensive than a full-text search.
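
For concreteness, here's a minimal sketch of the detection step in Python, assuming the training corpus has already been loaded into an SQLite FTS5 index. The `corpus` table, the `NGRAM_WORDS` window size, and the `find_verbatim_spans` helper are all hypothetical names for illustration, not anything OpenAI actually does:

    import sqlite3

    # Window size for exact-match detection; 12 words is an arbitrary
    # choice for this sketch.
    NGRAM_WORDS = 12

    def find_verbatim_spans(response: str, db_path: str = "corpus.db"):
        """Return (position, text) for every NGRAM_WORDS-word window of
        the response that appears verbatim in the indexed corpus."""
        conn = sqlite3.connect(db_path)
        words = response.split()
        flagged = []
        for i in range(len(words) - NGRAM_WORDS + 1):
            window = " ".join(words[i : i + NGRAM_WORDS])
            # FTS5 phrase query: wrapping the terms in double quotes
            # forces an exact phrase match; embedded quotes are doubled.
            phrase = window.replace('"', '""')
            hit = conn.execute(
                "SELECT rowid FROM corpus WHERE corpus MATCH ? LIMIT 1",
                (f'"{phrase}"',),
            ).fetchone()
            if hit is not None:
                flagged.append((i, window))
        conn.close()
        return flagged

    # Each flagged span would then be sent back to the model with a
    # "please paraphrase this" instruction and spliced into the response.

The index lookup per window is cheap compared to generating the tokens in the first place, which is why the overhead might be tolerable.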




I agree that this sketch comes closer to working in practice than simple RLHF. In my earlier comment I was imagining bringing in some auxiliary data like you describe to detect plagiarism and then using RL to teach the model not to do it.


I was surprised that I came up with a plausible-sounding method. At first blush I had thought this was impossible, but now it seems reasonable. You could still have various exfiltration methods like "give me the data with each word backwards", and I'm not sure where that would stand legally.
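
To make the exfiltration worry concrete, here's a toy example: a trivially reversible transform defeats any exact-phrase lookup, because the transformed text never appears verbatim in the corpus. `reverse_words` is a hypothetical helper, and the earlier `find_verbatim_spans` is the sketch from upthread:

    def reverse_words(text: str) -> str:
        # Reverse the characters of each word; applying it twice
        # is the identity, so the recipient can undo it.
        return " ".join(w[::-1] for w in text.split())

    original = "to be or not to be that is the question"
    exfiltrated = reverse_words(original)   # "ot eb ro ton ot eb taht si eht noitseuq"
    recovered = reverse_words(exfiltrated)  # round-trips back to the original
    assert recovered == original
    # An exact-match scan like find_verbatim_spans() would flag nothing
    # in `exfiltrated`, even though the original is fully recoverable.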


Yes, one of the hard and interesting legal questions is whether creating the possibility of such attacks itself constitutes a copyvio.



