Hacker News

That is a sort of understood fact even with models like Copilot & ChatGPT. With the amount of information being churned, not all PII gets scrubbed. And these LLMs are often trained on unsanitized data - like a cached crawl of the web on Archive.org, Getty Images & the likes.

I feel this is an unavoidable consequence of using LLMs. We cannot ensure all data is free of any markers. I am not an expert on databases/data engineering, so please take this as an informed opinion.
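To illustrate why "we cannot ensure all data is free of any markers": a minimal sketch of a naive regex-based PII scrubber (the patterns and names here are hypothetical, not from any real pipeline). It catches the obvious formats but misses a trivially obfuscated email, which would then flow into training data verbatim.

```python
import re

# Illustrative patterns only: plain email addresses and US-style phone numbers.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),      # jane.doe@example.com
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # 555-123-4567
]

def scrub(text):
    """Replace anything matching a known PII pattern with a placeholder."""
    for pat in PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

caught = scrub("Reach me at jane.doe@example.com or 555-123-4567, thanks.")
missed = scrub("Reach me at jane dot doe at example dot com, thanks.")

print(caught)  # both markers replaced with [REDACTED]
print(missed)  # the spelled-out address slips straight through
```

The point isn't that better regexes exist (they do); it's that any finite pattern list has blind spots, so at web scale some PII always survives scrubbing.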




Copilot has a ton of well-publicised examples of verbatim code being reproduced, but I didn't realize it was as trivial as that to go plumbing for it directly.




