But on a serious note, I think it's the ambiguity. What if the model refuses one prompt but then accepts another that is essentially the same, just worded differently?
Or what if it refuses a prompt at one point, then accepts the exact same one on the next run, for some fuzzy reason that can't be debugged?
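That second case isn't even hypothetical: with any nonzero sampling temperature, the same prompt can take a different token path on every run. Here's a minimal toy sketch of why (the two-token distribution and the `sample_response` helper are invented for illustration, not any real model's API):

```python
import random

# Toy model: after reading the prompt, the model puts probability mass
# on a "comply" continuation and a "refuse" continuation.
# These numbers are made up for illustration.
NEXT_TOKEN_PROBS = {"comply": 0.55, "refuse": 0.45}

def sample_response(prompt: str, temperature: float, rng: random.Random) -> str:
    """Sample one continuation. At temperature 0 we always take the
    argmax; at temperature > 0 we sample, so identical prompts can
    diverge between runs."""
    if temperature == 0:
        return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)
    # Temperature reshapes the distribution: p_i ** (1/T), renormalized.
    weights = {t: p ** (1.0 / temperature) for t, p in NEXT_TOKEN_PROBS.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # float-rounding fallback: return the last token

rng = random.Random()  # unseeded, like a production sampler
prompt = "the exact same prompt"
print([sample_response(prompt, temperature=1.0, rng=rng) for _ in range(5)])
# e.g. ['comply', 'refuse', 'comply', 'comply', 'refuse'] -- same input,
# different outcomes, with nothing in the prompt itself to debug.
```

Same input, different answers, and no amount of staring at the prompt will explain which one you'll get next time.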