I think 2 & 3 should be combined. The AI should just finish the current sentence internally, before the speaker has finished saying it, and once its confidence in that prediction is high enough, commit to the response. That's what humans do, too: we gather context and start thinking of a response while the other person is still talking.
You'd use a smaller model for the confidence check because small models return results quickly. It also keeps the main AI from getting confused by trying to do too many things at once.
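To make the idea concrete, here is a rough Python sketch of how I picture it. Everything in it is made up for illustration: `respond_early`, `toy_predict`, the `KNOWN` sentence list, and the threshold are my own assumptions, and the toy predictor is just a prefix lookup standing in for a real small model.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: a small, fast model that takes the partial
# transcript and returns (predicted full sentence, confidence in [0, 1]).
Predictor = Callable[[str], Tuple[str, float]]

# Toy stand-in for the small model: a fixed list of known sentences.
KNOWN = ["what is the weather today", "what time is it"]


def toy_predict(partial: str) -> Tuple[str, float]:
    """Guess the full sentence from a prefix (illustrative only)."""
    matches = [s for s in KNOWN if s.startswith(partial)]
    if len(matches) == 1:
        # Confidence grows as more of the single matching sentence is heard.
        return matches[0], len(partial) / len(matches[0])
    # Ambiguous or unknown prefix: no confident guess yet.
    return partial, 0.0


def respond_early(
    partials: Iterable[str],
    predict: Predictor,
    generate: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, int]:
    """Commit to a response as soon as the predictor is confident enough.

    Returns the response and how many partial transcripts were consumed.
    """
    seen = 0
    last = ""
    for seen, partial in enumerate(partials, start=1):
        last = partial
        completion, confidence = predict(partial)
        if confidence >= threshold:
            # Stick with this prediction; stop waiting for the speaker.
            return generate(completion), seen
    # Never confident enough: respond to whatever was actually said.
    return generate(last), seen
```

The small predictor runs on every partial transcript, while the expensive `generate` call (the big model) runs only once, after the confidence gate is passed. A real system would probably also need a way to back out if the speaker ends the sentence differently than predicted.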