They don't have the dataset needed to train the speech recognition engine, nor do they seem to be willing to deploy some of that $250b in cash to hire engineers smart enough to make a model which only needs the local user's input for training.
You misunderstand what the internet connection is for.
Right now most machine learning models run on server farms: your voice snippet is sent to the server, the model processes it there, and the interpretation is sent back.
"Local model" means the voice snippet is processed on your device. Never beamed off to a server farm.
The interpretation might be:
Action: internetSearch
QueryString: movie times for "Avengers" [near me || {userPreferences.movieTheater}]
UseLocation: true
Which then kicks off whatever process Siri has for handling internet search actions.
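If you modeled that dispatch in code, it might look something like this. Every name here is made up for illustration; Apple's internal intent format isn't public:

```swift
// Hypothetical shape of the interpretation payload plus a dispatcher.
struct Interpretation {
    enum Action: String {
        case internetSearch
        case playMedia
        case setTimer
    }
    let action: Action
    let queryString: String
    let useLocation: Bool
}

func handle(_ interpretation: Interpretation) {
    switch interpretation.action {
    case .internetSearch:
        // Hand off to whatever pipeline performs web searches,
        // attaching the user's location only if the model asked for it.
        runSearch(query: interpretation.queryString,
                  withLocation: interpretation.useLocation)
    default:
        break
    }
}

// Placeholder for the actual search handler.
func runSearch(query: String, withLocation: Bool) {
    print("Searching: \(query) (useLocation=\(withLocation))")
}

handle(Interpretation(action: .internetSearch,
                      queryString: "movie times for \"Avengers\" near me",
                      useLocation: true))
```

The point being: only that small structured payload needs to leave the interpretation step, not your audio.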