During my time at Apple the bigger issue with personalized, on-device models was the file size. At the time, each model was a significant amount of data to push to a device, and with lots of teams wanting an on-device model and the desire to update them regularly, it was definitely a big discussion.
They’ve gone with a single 3B model and a separate “adapter” for each use case: one adapter is good at summarising, while another is good at generating message replies.
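For the curious, here's roughly what that base-model-plus-adapters pattern looks like in code, sketched with Hugging Face PEFT. The model and adapter paths are placeholders, not Apple's actual stack; the point is just that the large base weights are shared and only the small adapter weights get swapped per task:

```python
# Minimal sketch of "one base model + per-task LoRA adapters" using Hugging Face PEFT.
# All names/paths below are placeholders for illustration, not Apple's models.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "your-org/some-3b-base-model"  # placeholder: any ~3B causal LM

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# Attach one adapter per use case; the base weights stay shared in memory,
# only the comparatively tiny adapter weights differ between tasks.
model = PeftModel.from_pretrained(base_model, "adapters/summarization",
                                  adapter_name="summarize")
model.load_adapter("adapters/message-replies", adapter_name="reply")

def run(task: str, prompt: str) -> str:
    model.set_adapter(task)  # switch which adapter is active for this request
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(run("summarize", "Summarize: ..."))
print(run("reply", "Draft a reply to: ..."))
```

Shipping updates also gets cheaper this way: pushing a new adapter is a few tens of megabytes instead of re-downloading the whole base model.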
AI noob here. Is every single model in iOS really just a thin adapter on top of one base model? Can everything they announced today really be built on top of one base LLM with a specific architecture? What about image generation? What about text-to-speech?
If they’re obviously different models, they can’t load them all at once into RAM. If they have to load from storage every time an app is opened, how will they do this fast enough to maintain low latency?
They'll have plenty of time to load the model; it still needs to wait for the user to actually voice or type their request. Invoking Siri happens well before the request is ready.
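In code terms, the trick is just to kick off the load as soon as the assistant is invoked and only wait on it once the request text is actually in. A toy asyncio sketch (timings and function names are made up to show the overlap, not any real API):

```python
# Toy illustration of hiding model-load latency behind user input.
import asyncio

async def load_model_from_storage():
    await asyncio.sleep(1.5)   # stand-in for reading weights off flash
    return "model"

async def wait_for_user_request():
    await asyncio.sleep(3.0)   # stand-in for the user speaking/typing
    return "summarize my unread emails"

async def handle_invocation():
    load_task = asyncio.create_task(load_model_from_storage())  # start loading immediately
    request = await wait_for_user_request()                     # the user is usually slower
    model = await load_task                                     # typically already finished
    print(f"running {request!r} on {model}")

asyncio.run(handle_invocation())
```

By the time the user has finished dictating or typing, the weights are usually already resident, so the perceived latency is dominated by generation, not by loading from storage.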