I feel like the biggest takeaway here is that a classifier trained on samples could only predict whether or not ChatGPT would refuse a given prompt 76% of the time, which seems very low to me (given that they tried BERT, regression, and a random forest as their classifiers).
That probably means there's still a lot we can't predict about how LLMs work internally, even when we try to apply classification to the problem.