
How about mandating that the big players feed SHA sums into a HaveIBeenPwned-style service? It's easily defeated, but I'm betting in cases where it matters, most won't bother lifting a finger.



Watermarking [0] is a better solution. It keeps working even after the generated output is edited, and anyone can independently check for a watermark. Computerphile did a video on it [1].
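
Roughly, the scheme in [0] biases each sampled token toward a pseudorandom "green" subset of the vocabulary seeded by a hash of the preceding token; a detector re-derives those subsets and counts how often the text lands in them. A toy detection sketch in Python (the vocabulary size, GAMMA, and seeding scheme here are my own stand-ins, not the paper's actual implementation):

    import hashlib
    import random

    VOCAB_SIZE = 50_000   # assumed tokenizer vocabulary size
    GAMMA = 0.5           # assumed fraction of the vocabulary marked "green"

    def green_list(prev_token: int) -> set:
        # Re-derive the pseudorandom green list the generator would
        # have used at this position, seeded by the previous token.
        seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
        rng = random.Random(seed)
        return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

    def watermark_z_score(tokens: list) -> float:
        # Under the null (unwatermarked text), each token lands in its
        # green list with probability GAMMA; watermarked text scores high.
        hits = sum(tok in green_list(prev)
                   for prev, tok in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        return (hits - GAMMA * n) / (n * GAMMA * (1 - GAMMA)) ** 0.5

Because the test is statistical over many tokens, light edits only degrade the z-score gradually instead of destroying it outright, which is why watermarking survives changes that break a checksum.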

But of course, watermarking or checksums stop working once the general public runs LLMs on personal computers. And it's only a matter of time before that happens.

So in the long run, we have three options:

1. take away users' control over their personal computers with 'AI DRM' (I strongly oppose this option), or

2. legislate: legally require a disclosure for each text on how it was created, or

3. stop assuming that texts are written by humans, and accept that we often will not know how a given text was created

[0]: Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. arXiv preprint arXiv:2301.10226. Online: https://arxiv.org/pdf/2301.10226.pdf

[1]: https://www.youtube.com/watch?v=XZJc1p6RE78


Will the general public be running LLMs on their own hardware, or will it be like where we are today with self-hosting? Despite what I've written above, I would like to think it won't end up that way. But at the same time, this is something big tech companies will work very hard to centralise.


In the short term, I think it's very likely that companies (including smaller companies) integrating LLMs into their products will want to run an open-source LLM locally instead of relying on an external service, because it gives them more independence and control.

Technical enthusiasts will also run LLMs locally, as they do with image-generation models.

In the long term, when smartphones are faster and open-source LLMs are better (including more efficient), I can imagine LLMs running locally on smartphones.

Self-hosting, which I would define as hosting by individuals for their own use or for others within social structures (friends/family/communities), like the hosting of internet forums, is quite small and seems to be shrinking. So it seems unlikely that this form of hosting will become relevant for LLMs.


As of today you can download LLaMa/Alpaca and run it offline on commodity hardware (if you don't mind having someone else do the quantisation for you). The cat's out of the bag on this one.


Why?

First, for this to work at all, you'd need fuzzy fingerprints. Just changing a linebreak would alter the SHA sum.
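
To illustrate, a throwaway Python check:

    import hashlib

    a = hashlib.sha256(b"Hello world\n").hexdigest()
    b = hashlib.sha256(b"Hello world").hexdigest()
    print(a == b)  # False: one linebreak, a completely different digest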

Secondly, why?


Please explain how this would work. The SHA sum would be different 100% of the time. In other words, you would never get the same SHA sum twice.


Fair enough. It might work as follows:

1. I generate some text using ChatGPT.

2. ChatGPT sends HaveIBeenGenerated a checksum.

3. I publish a press release using the text verbatim.

4. Someone pastes my press release into HaveIBeenGenerated.
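
A minimal sketch of that flow in Python, with an in-memory set standing in for the hypothetical HaveIBeenGenerated service (the names and the whitespace normalisation are made up for illustration):

    import hashlib

    INDEX = set()  # stand-in for the service's database

    def fingerprint(text: str) -> str:
        # Collapse whitespace so a stray linebreak doesn't change the
        # digest; any edit beyond that still defeats the scheme.
        return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()

    def register(generated_text: str):
        # Called by the LLM provider at generation time.
        INDEX.add(fingerprint(generated_text))

    def check(suspect_text: str) -> bool:
        # Called by anyone pasting a press release into the service.
        return fingerprint(suspect_text) in INDEX

    register("Our Q3 results exceeded expectations across all segments.")
    print(check("Our Q3 results exceeded\nexpectations across all segments."))  # True
    print(check("Our Q3 results exceeded expectations."))                       # False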


Is there something like perceptual fingerprinting but for text?


It's called an embedding; OpenAI does these ;)
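
To sketch what checking would look like: embed both texts and compare vectors instead of exact hashes. The functions below take precomputed vectors from any sentence-embedding model (OpenAI's API, or a local one); the threshold is arbitrary:

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def probably_same_text(suspect_vec, stored_vec, threshold=0.95):
        # Unlike a SHA sum, nearby vectors survive small edits; the
        # threshold trades false positives against missed matches.
        return cosine_similarity(suspect_vec, stored_vec) >= threshold

The catch is that the service would then have to store and search vectors (nearest-neighbour lookup) rather than doing an exact hash match.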


Eventually we would have to call ChatGPT to the witness stand, and ask it whether it remembers telling these specific words to that man over there.


Tweaking 1 char would change the checksum


Which IMV is fine, since at that point you were arguably using ChatGPT as an assistant rather than as a tool for brazen plagiarism.


But you can automate that too, with a different tool.



