
How about mandating that the big players feed SHA sums into a HaveIBeenPwned-style service? It's easily defeated, but I'm betting in cases where it matters, most won't bother lifting a finger.



Watermarking [0] is a better solution. It keeps working even after the generated output is edited, and anyone can independently check for a watermark. Computerphile did a video on it [1].
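
Roughly, the scheme in [0] biases each sampled token toward a pseudorandom "green" subset of the vocabulary seeded by a hash of the preceding token; a detector re-derives those subsets and counts how often the text lands in them. A toy detection sketch in Python (the vocabulary size, GAMMA, and seeding scheme here are my own stand-ins, not the paper's actual implementation):

    import hashlib
    import random

    VOCAB_SIZE = 50_000   # assumed tokenizer vocabulary size
    GAMMA = 0.5           # assumed fraction of the vocabulary marked "green"

    def green_list(prev_token: int) -> set:
        # Re-derive the pseudorandom green list the generator would
        # have used at this position, seeded by the previous token.
        seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
        rng = random.Random(seed)
        return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

    def watermark_z_score(tokens: list) -> float:
        # Under the null (unwatermarked text), each token lands in its
        # green list with probability GAMMA; watermarked text scores high.
        hits = sum(tok in green_list(prev)
                   for prev, tok in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        return (hits - GAMMA * n) / (n * GAMMA * (1 - GAMMA)) ** 0.5

Because the test is statistical over many tokens, light edits only degrade the z-score gradually instead of destroying it outright, which is why watermarking survives changes that break a checksum.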

But of course, watermarking or checksums stop working once the general public runs LLMs on personal computers. And it's only a matter of time before that happens.

So in the long run, we have three options:

1. take away users' control over their personal computers with 'AI DRM' (I strongly oppose this option), or

2. legislate: legally require a disclosure for each text on how it was created, or

3. stop assuming that texts are written by humans, and accept that we often will not know how a given text was created

[0]: Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. arXiv preprint arXiv:2301.10226. Online: https://arxiv.org/pdf/2301.10226.pdf

[1]: https://www.youtube.com/watch?v=XZJc1p6RE78


Will the general public be running LLMs on their own hardware, or will it be like where we are today with self-hosting? Despite what I've written above, I would like to think it won't end up that way. But at the same time, this is something big tech companies will work very hard to centralise.


In the short term, I think it's very likely that companies (including smaller companies) integrating LLMs into their products will want to run an open-source LLM locally instead of relying on an external service, because it gives them more independence and control.

Technical enthusiasts will also run LLMs locally, as they do with image-generation models.

In the long term, when smartphones are faster and open-source LLMs are better (including more efficient), I can imagine LLMs running locally on smartphones.

Self-hosting, which I would define as hosting by individuals for their own use or for others within social structures (friends/family/communities), like the hosting of internet forums, is quite small and seems to be shrinking. So it seems unlikely that this form of hosting will become relevant for LLMs.


As of today you can download LLaMa/Alpaca and run it offline on commodity hardware (if you don't mind having someone else do the quantisation for you). The cat's out of the bag on this one.


Why?

First, for this to work at all, you'd need fuzzy fingerprints. Just changing a linebreak would alter the SHA sum.
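
To illustrate, a throwaway Python check:

    import hashlib

    a = hashlib.sha256(b"Hello world\n").hexdigest()
    b = hashlib.sha256(b"Hello world").hexdigest()
    print(a == b)  # False: one linebreak, a completely different digest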

Secondly, why?


Please explain how this would work. The SHA sum would be different 100% of the time. In other words, you would never get the same SHA sum twice.


Fair enough. It might work as follows:

1. I generate some text using ChatGPT.

2. ChatGPT sends HaveIBeenGenerated a checksum.

3. I publish a press release using the text verbatim.

4. Someone pastes my press release into HaveIBeenGenerated.
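
A minimal sketch of that flow in Python, with an in-memory set standing in for the hypothetical HaveIBeenGenerated service (the names and the whitespace normalisation are made up for illustration):

    import hashlib

    INDEX = set()  # stand-in for the service's database

    def fingerprint(text: str) -> str:
        # Collapse whitespace so a stray linebreak doesn't change the
        # digest; any edit beyond that still defeats the scheme.
        return hashlib.sha256(" ".join(text.split()).encode()).hexdigest()

    def register(generated_text: str):
        # Called by the LLM provider at generation time.
        INDEX.add(fingerprint(generated_text))

    def check(suspect_text: str) -> bool:
        # Called by anyone pasting a press release into the service.
        return fingerprint(suspect_text) in INDEX

    register("Our Q3 results exceeded expectations across all segments.")
    print(check("Our Q3 results exceeded\nexpectations across all segments."))  # True
    print(check("Our Q3 results exceeded expectations."))                       # False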


Is there something like perceptual fingerprinting but for text?


It's called an embedding; OpenAI does these ;)
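
To sketch what checking would look like: embed both texts and compare vectors instead of exact hashes. The functions below take precomputed vectors from any sentence-embedding model (OpenAI's API, or a local one); the threshold is arbitrary:

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def probably_same_text(suspect_vec, stored_vec, threshold=0.95):
        # Unlike a SHA sum, nearby vectors survive small edits; the
        # threshold trades false positives against missed matches.
        return cosine_similarity(suspect_vec, stored_vec) >= threshold

The catch is that the service would then have to store and search vectors (nearest-neighbour lookup) rather than doing an exact hash match.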


Eventually we would have to call ChatGPT to the witness stand, and ask it whether it remembers telling these specific words to that man over there.


Tweaking 1 char would change the checksum


Which IMV is fine, since at that point you were arguably using ChatGPT as an assistant rather than as a tool for brazen plagiarism.


But you can automate that too, with a different tool.



