The general idea is to exploit the structure of the system: use it to pre-wire connections between different parts (e.g. covariates, information flow encoded as layer connections) rather than learning them from data. Make your network resemble the physical system it is modeling.
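
As a rough sketch of what that pre-wiring can look like (PyTorch here; the class name and adjacency matrix are made up for illustration): a linear layer masked by the system's connectivity, so information and gradients only flow along edges that exist in the physical system.

    import torch
    import torch.nn as nn

    class StructuredLinear(nn.Module):
        """Linear layer whose connectivity mirrors the physical system:
        weights are masked by an adjacency matrix, so gradients only
        flow along edges that exist in the real system."""
        def __init__(self, adjacency: torch.Tensor):  # shape (out, in), 1 = connected
            super().__init__()
            self.register_buffer("mask", adjacency.float())
            self.weight = nn.Parameter(torch.randn_like(self.mask) * 0.01)
            self.bias = nn.Parameter(torch.zeros(adjacency.shape[0]))

        def forward(self, x):
            return x @ (self.weight * self.mask).t() + self.bias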

You can try pre-training a transformer to capture the behavior of a common part that is replicated, then make replicas (sharing weights or not, depending on the problem) to train the whole ensemble. This works for existing systems as well as for sufficiently high-fidelity simulations, or proxy systems that exhibit the same range of behaviors (e.g. staging environments).
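
For example (a hedged PyTorch sketch; the component and replica count are placeholders), once a single component is pre-trained, replicating it with or without weight sharing is mostly a question of reusing the module instance vs. deep-copying it:

    import copy
    import torch.nn as nn

    # 'component' stands in for a part pre-trained on data or high-fidelity
    # simulations of one replicated unit (e.g. a small transformer layer).
    component = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)

    # Shared weights: every replica is literally the same module instance.
    shared_replicas = nn.ModuleList([component] * 4)

    # Independent replicas: same pre-trained starting point, free to diverge.
    free_replicas = nn.ModuleList([copy.deepcopy(component) for _ in range(4)])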

Even if the pre-trained part doesn't converge fully or capture everything, it can pre-condition the network and help the whole ensemble train faster.

---

For simple components, you can even just write your own simulations as custom NN layers (e.g. recurrent layers that take the current state and an input, and return the output and the next state). This avoids the performance bottleneck of leaving the accelerator to run simulations, or having to train too many small networks.

I'd generally just write my own recurrent layer if the behavior is simple enough.
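
As an illustration (all names and the dt/tau parameterization are mine, not a standard API), a hand-written recurrent cell for a first-order lag component might look like this:

    import torch
    import torch.nn as nn

    class FirstOrderCell(nn.Module):
        """Hand-written recurrent cell for a simple physical component:
        a first-order lag, next = state + dt/tau * (input - state).
        tau is learnable, so the 'simulation' can be fit to data while
        staying on the accelerator with the rest of the network."""
        def __init__(self, dt: float = 0.1):
            super().__init__()
            self.dt = dt
            self.log_tau = nn.Parameter(torch.zeros(1))  # tau = exp(log_tau) > 0

        def forward(self, u, state):
            tau = self.log_tau.exp()
            next_state = state + (self.dt / tau) * (u - state)
            return next_state, next_state  # (output, next state)

    x = torch.randn(8, 100, 1)  # (batch, time, features)
    cell, state = FirstOrderCell(), torch.zeros(8, 1)
    outputs = []
    for t in range(x.shape[1]):  # scan over time like any recurrent layer
        y, state = cell(x[:, t], state)
        outputs.append(y)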

But you can also use existing code and tweak it cleverly: e.g. LSTM cells can be pre-initialized to implement continuous-time Markov chains, such as a birth/death renewal process.
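
The LSTM initialization details are fiddly, so here is a sketch of the target behavior only: a discretized birth/death chain written as a plain linear recurrent update on a probability vector, which is what you'd pre-initialize or pre-train the LSTM cell to reproduce (rates and state count are arbitrary).

    import torch

    def birth_death_transition(n_states=10, birth=0.3, death=0.2, dt=0.1):
        """One-step transition matrix of a discretized birth/death CTMC:
        P = I + dt * Q, valid while dt * (birth + death) <= 1."""
        Q = torch.zeros(n_states, n_states)
        for i in range(n_states):
            if i + 1 < n_states:
                Q[i, i + 1] = birth   # birth: i -> i+1
            if i - 1 >= 0:
                Q[i, i - 1] = death   # death: i -> i-1
            Q[i, i] = -Q[i].sum()     # generator rows sum to zero
        return torch.eye(n_states) + dt * Q

    P = birth_death_transition()
    p = torch.zeros(10)
    p[0] = 1.0                        # start in state 0
    for _ in range(100):              # recurrent update: p_{t+1} = p_t @ P
        p = p @ P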

You can capture the behavior of a simple component in isolation, then use it in the whole system: either freeze it and add an error-correction layer (e.g. if the frozen part is quite big and replicated, you can share its weights more efficiently), or leave it unfrozen and let it train further.
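
A minimal sketch of the frozen variant (assuming an additive correction; the head architecture is arbitrary):

    import torch.nn as nn

    class CorrectedComponent(nn.Module):
        """Frozen pre-trained component plus a small trainable
        additive error-correction head."""
        def __init__(self, pretrained: nn.Module, dim: int):
            super().__init__()
            self.base = pretrained
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the physics-like behavior
            self.correction = nn.Sequential(
                nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.base(x) + self.correction(x)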

You can impose bounds on the complexity of the error correction: much like LoRA, you can design it as a low-rank matrix decomposition. Together with the right loss (e.g. L1 or Huber), it's another technique to ensure the error correction doesn't drift too far from the behavior you'd expect from the physics of the system. And when it no longer converges, that's a good indicator of model drift and new behavior emerging... which is a way to implement robust anomaly detection.
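
A sketch of that low-rank variant, with a Huber loss and a crude drift signal (rank, init, and function names are illustrative):

    import torch
    import torch.nn as nn

    class LowRankCorrection(nn.Module):
        """LoRA-style error correction: delta(x) = x @ (A @ B), with
        rank r << dim bounding how complex the correction can get."""
        def __init__(self, dim: int, rank: int = 4):
            super().__init__()
            self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
            self.B = nn.Parameter(torch.zeros(rank, dim))  # zero init: starts as identity

        def forward(self, x):
            return x + x @ self.A @ self.B

    loss_fn = nn.HuberLoss()  # robust loss keeps outliers from dragging the correction

    def correction_norm(m: LowRankCorrection) -> float:
        """Rough anomaly signal: if this keeps growing, the system's
        behavior is drifting away from the frozen physics model."""
        return (m.A @ m.B).norm().item()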

---

PS: I do know about the bitter lesson... the problem is that it assumes you can throw more data and more training time at a problem, and that the problem is stable or similar to what is in your data. This is not always the case.



