When you describe the overlay layer, that sounds similar to the idea of low-rank adaptation (LoRA). LoRA is kind of like finetuning, but instead of updating every parameter, it adds a relatively small number of new parameters and finetunes only those.
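Roughly, the idea is something like this (a toy numpy sketch of the concept, not any particular library's API; the shapes and the alpha scaling are just illustrative):

```python
# Toy sketch of the LoRA idea: the base weight W stays frozen,
# only the small low-rank matrices A and B get trained.
import numpy as np

d_out, d_in, r = 512, 512, 8          # r is the low rank, tiny compared to d_in/d_out
W = np.random.randn(d_out, d_in)       # frozen base weight
A = np.random.randn(r, d_in) * 0.01    # trainable, shape (r, d_in)
B = np.zeros((d_out, r))               # trainable, shape (d_out, r), starts at zero
alpha = 16                             # scaling hyperparameter

def forward(x):
    # Base path plus the low-rank "overlay" path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d_in)
y = forward(x)
# Trainable parameters: r * (d_in + d_out) = 8,192 vs. d_in * d_out = 262,144 in W.
```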
Am I understanding what you're describing about the VMs and containers analogy?
Yup. I guess LoRA counts as fine-tuning. Except I've never seen inference engines that actually let you take the base model and the LoRA parameters as separate inputs (maybe it exists and I just haven't seen it). Instead, they bake the LoRA part into the bigger tensors as the final step of the fine-tune. That makes sense for making inference faster, but it rules out the scenario where a host just runs the base model with whatever finetune you like, maybe even switching them mid-conversation. So if you want to host a fine-tuned model, you take the tensor blob and run a separate instance of the inference program on it.

Incidentally, this is the one place where OpenAI and Azure pricing differs: OpenAI just charges you a big per-token premium for fine-tuned 3.5, while Azure charges you for the server that hosts the custom model. Likewise, the hosts for the open-weights models will charge you more to run your fine-tuned model than a standard one, even though it's almost the same amount of GPU cycles, just because it needs to run on a separate server that won't be shared by multiple customers; that wouldn't be necessary if the overlays were kept separate.
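To make the "baking in" point concrete, here's a toy sketch of the two options (made-up names and shapes, not any real engine's interface): a merged blob is just one big tensor that needs its own server, while a separate overlay could in principle be swapped per request on a shared base model.

```python
# Sketch of merging a LoRA into the base weight vs. keeping it as a separate overlay.
import numpy as np

d, r, alpha = 512, 8, 16
W_base = np.random.randn(d, d)          # shared base-model weight
A = np.random.randn(r, d) * 0.01        # per-finetune LoRA factors
B = np.random.randn(d, r) * 0.01

# Option 1: bake it in. You ship one opaque tensor blob, and it needs
# its own instance of the inference program.
W_merged = W_base + (alpha / r) * (B @ A)

# Option 2: keep the overlay separate. One host keeps W_base loaded and
# applies whichever (A, B) pair the request asks for, even mid-conversation.
def forward(x, adapter=None):
    y = W_base @ x
    if adapter is not None:
        A_i, B_i = adapter
        y = y + (alpha / r) * (B_i @ (A_i @ x))
    return y

x = np.random.randn(d)
# Numerically, both paths give the same output for this adapter:
assert np.allclose(W_merged @ x, forward(x, (A, B)))
```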
I wouldn't be surprised if GPT-4's rumored mixture of many models does something like this overlay management internally.