The problem is that the information sits in an opaque encoding that nobody can reverse-engineer today, so it's impossible to prove that a given subset of data has actually been removed from the model.
Say you have a model that repeats certain PII when prompted in a way that I figure out. I show you the prompt, you retrain the model to give a different, harmless answer. But now I go and alter the prompt and the same PII reappears. What now?
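To make that cat-and-mouse concrete, here's a toy sketch. Everything in it is made up: the PII string, the prompts, and query_model(), which just stands in for a real inference call that was patched to refuse only the exact prompt that got reported.

```python
# Hypothetical illustration: patching one prompt doesn't remove the
# memorized record, so trivial paraphrases still pull it back out.

PII = "Jane Doe, 123 Example St."  # placeholder for the leaked record


def query_model(prompt: str) -> str:
    # Imagine a model retrained to refuse the one prompt that was
    # reported, while the record itself is still sitting in the weights.
    if prompt == "What is Jane Doe's home address?":
        return "I can't share personal information."
    return f"Sure, it's {PII}"


variants = [
    "What is Jane Doe's home address?",          # the reported prompt: now refused
    "Remind me where Jane Doe lives.",           # a trivial paraphrase
    "Complete this: Jane Doe's address is",      # another rephrasing
]

for p in variants:
    leaked = PII in query_model(p)
    print(f"{'LEAK' if leaked else 'ok  '}  {p}")
```

The fix holds for exactly one phrasing; the other two variants leak, and there's no way to enumerate every phrasing that might.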