Could it be that unlearning is really teaching the model how *not* to respond with certain information, and that kind of learning is more subtle and thus easier to lose than the original knowledge, so the information effectively resurfaces when the model is compressed?
It does raise the concern that whatever the model is doing may still be drawing on the 'bad' information, even if it has learned not to surface it directly.
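If that's roughly what's happening, here's a purely illustrative sketch of why compression could undo it (hypothetical numbers and plain NumPy, not any real unlearning method): if the unlearning fine-tune only nudges the weights slightly, a coarse quantizer can round most of those nudged weights straight back into the same buckets as the original weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: a "pretrained" vector, plus a tiny perturbation
# standing in for the small delta an unlearning fine-tune applies.
original = rng.normal(0.0, 0.1, size=10_000)
unlearned = original + rng.normal(0.0, 1e-3, size=10_000)

def quantize(w, bits=4):
    """Uniform symmetric quantization, a stand-in for model compression."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

q_original = quantize(original)
q_unlearned = quantize(unlearned)

# Fraction of weights whose quantized value matches the quantized
# original, i.e. where the unlearning nudge was rounded away entirely.
print((q_original == q_unlearned).mean())
```

Under these made-up numbers the vast majority of the unlearning delta disappears after quantization, which at least makes the "fragile layer on top of intact knowledge" picture plausible.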