This is more or less an understood fact, even with models like Copilot & ChatGPT. Given the sheer volume of data being churned through, not all PII is going to get scrubbed. And these LLMs may well be trained on unsanitized data - a cached copy of the web from Archive.org, Getty Images & the like.
I feel this is an unavoidable consequence of using LLMs. We cannot ensure all data is free of identifying markers. I'm not an expert on databases/data engineering, so please take this as an informed opinion.
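As a rough illustration of why scrubbing can't be guaranteed: here's a minimal Python sketch of the kind of regex-based PII filter a data pipeline might apply (the patterns and names are hypothetical, not any real pipeline's). It catches the obvious forms and misses trivially obfuscated ones.

```python
import re

# Illustrative regex-based PII scrubber -- patterns are deliberately
# simple and nowhere near exhaustive.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),       # plain email addresses
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),   # US-style phone numbers
]

def scrub(text: str) -> str:
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

# Catches the obvious form...
print(scrub("Contact jane.doe@example.com or 555-123-4567"))
# ...but an obfuscated marker sails straight through unredacted:
print(scrub("Contact jane dot doe at example dot com"))
```

At web-crawl scale, every pattern you don't anticipate is PII that stays in the training set.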
Copilot has a ton of well-publicised examples of verbatim code showing up in completions, but I didn't realize it was as trivial as this to go plumbing for it directly.
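For anyone curious what "plumbing for it" looks like mechanically, here's a minimal sketch of one common heuristic: checking model output against a suspected source file for long verbatim n-gram overlaps. All names here are made up for illustration; this is not Copilot's API or any particular paper's method.

```python
def verbatim_ngrams(model_output: str, source_code: str, n: int = 12) -> set[str]:
    """Return n-token spans that appear verbatim in both texts.

    Long shared spans (12+ tokens) are very unlikely to arise by chance,
    which makes them a cheap signal for memorised training data.
    """
    def ngrams(text: str) -> set[str]:
        tokens = text.split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    return ngrams(model_output) & ngrams(source_code)

# Hypothetical usage: compare a completion against a file you suspect it copies.
# overlaps = verbatim_ngrams(completion_text, open("suspected_source.c").read())
# if overlaps:
#     print("verbatim spans found:", len(overlaps))
```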