To me, the crazy thing about LoRAs is that they work perfectly well at adapting model checkpoints that were themselves derived from the base model on which the LoRA was trained. So you can take the LCM LoRA for SD1.5 and it works perfectly well on, say, RealisticVision 5.1, a fine-tuned derivative of SD1.5.
You’d think that the fine-tuning would make the LCM LoRA not work, but it does. Apparently the changes in weights introduced by even pretty heavy fine-tuning do not wreck the transformations the LoRA needs to make in order for LCM or other LoRA adaptations to work.
Finetuning and LoRAs both involve additive modifications to the model weights. Addition is commutative, so the order in which you apply them doesn't matter for the resulting weights. Moreover, neural networks are designed to be differentiable, i.e. to behave approximately linearly with respect to small additive modifications of the weights. So as long as your finetuning and your LoRA each change the weights only a little, you can finetune with or without the LoRA, or equivalently train the LoRA on the finetuned model or on its base, and get mostly the same result.
So this is something that can be somewhat explained using not terribly handwavy mathematics. Picking hyperparameters on the other hand...
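Here’s a tiny numpy sketch of that argument (the toy layer, the weight scales, and the rank are all made up for illustration): the merged weights are literally identical in either order, and because the deltas are small, the output change contributed by the LoRA delta is nearly the same whether it lands on the base weights or the fine-tuned ones.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "layer": y = tanh(W @ x), standing in for any differentiable network layer.
    def layer(W, x):
        return np.tanh(W @ x)

    d = 64
    W_base = rng.normal(0, 0.1, (d, d))            # base model weights
    delta_finetune = rng.normal(0, 0.002, (d, d))  # small fine-tuning update
    A = rng.normal(0, 0.05, (4, d))                # low-rank LoRA factors (rank 4)
    B = rng.normal(0, 0.05, (d, 4))
    delta_lora = B @ A                             # the LoRA's additive weight update

    x = rng.normal(0, 1, d)

    # 1) Addition commutes: the merged weights are identical either way.
    print(np.allclose(W_base + delta_finetune + delta_lora,
                      W_base + delta_lora + delta_finetune))   # True

    # 2) Local linearity: the output change caused by the LoRA delta is almost
    #    the same on the base model as on the fine-tuned model.
    effect_on_base = layer(W_base + delta_lora, x) - layer(W_base, x)
    effect_on_ft = (layer(W_base + delta_finetune + delta_lora, x)
                    - layer(W_base + delta_finetune, x))
    print(np.linalg.norm(effect_on_base - effect_on_ft)
          / np.linalg.norm(effect_on_base))                    # small relative difference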
If you want to use a counterexample to refute the general idea about additions, you need to pick one that fulfills the preconditions, like being differentiable. x → sin(1/x) isn't even defined at 0, and at any other point, where it is differentiable, there's a small ε and a linear function L such that for all |a|, |b| < ε, sin(1/(x + a + b)) = sin(1/x) + L(a + b) + O(ε²), and because L is linear, L(a + b) = L(a) + L(b). The wrinkle is that ε might have to be extremely small indeed.
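To put a number on how small that ε can be, here's a quick sketch (x0 = 0.01 and the step sizes are arbitrary choices of mine): near x0 the function oscillates with a period of roughly 2π·x0², so only perturbations well below that stay in the approximately linear regime.

    import numpy as np

    x0 = 0.01                       # a point where sin(1/x) is perfectly differentiable
    f = lambda x: np.sin(1.0 / x)
    df = -np.cos(1.0 / x0) / x0**2  # exact derivative of sin(1/x) at x0

    # Compare f(x0 + h) against the linear approximation f(x0) + df * h.
    for h in [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]:
        linear = f(x0) + df * h
        print(f"h = {h:.0e}   linearization error = {abs(f(x0 + h) - linear):.3e}")

    # Near x0 the function completes a full oscillation every ~2*pi*x0^2 ≈ 6e-4,
    # so only h far below that is in the "approximately linear" regime.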
Recalling the definition of exact differentiability is irrelevant.
Instead, take for example the smallest interval you can represent in fp32 not too far away from zero, i.e. the gap between two adjacent representable values. Take a few values in that interval (which still contains infinitely many reals) and check the behaviour of said monstrous function.
This is a “trivial” example when studying, e.g., distribution theory.
Said differently, you need to assess how smooth the differential operator itself is.
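A rough sketch of that experiment (the choice of x ≈ 1e-7 is mine, picked so the effect is visible): between one fp32-representable value and the next, 1/x already moves by the better part of a radian, so sin(1/x) jumps around with no representable points in between.

    import numpy as np

    # Around x ≈ 1e-7 the gap to the next fp32 value is ~7e-15, which shifts 1/x
    # by roughly 0.7 radians per step: the function swings wildly between
    # neighbouring representable floats.
    x = np.float32(1e-7)
    for _ in range(8):
        print(f"x = {float(x):.9e}   sin(1/x) = {np.sin(1.0 / float(x)):+.4f}")
        x = np.nextafter(x, np.float32(np.inf))  # next fp32-representable value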
To me this is alchemy.