You can segment the validation to be data after a certain date, and train on data before that date. You get an accurate sense of how well the model will perform in the real world, as long as you make sure the data never borrows from the future.
That only ensures your model is accurate assuming real world parameters remain the same, which again, is prone to overfitting.
To use a real world example, financial models on mortgage backed securities were the root cause of the financial crisis, because they were based on decades of mortgages that were fundamentally different than the ones they were actually trying to model. Even if someone was constructing a model by training on data from say, 1957-1996, and validating using 1997-2006, they would have failed to accurately predict the collapse because the underlying factors that caused the recession (the housing bubble, prevalence of adjustable rate mortgages, lack of verification in applications) were essentially unseen in the decades of data prior to that.
Validation protects against overfitting only to a certain degree, and only to the extent that the underlying data generating phenomena don't ever change, which, in the real world, is generally a terrible assumption.
That's not hard and fast, though. While no model is perfect, robust models can "handle" outliers. Worst case, you know when it happens and train with more a priori.
It's not about outliers. Let's say you're at a startup and you fit some model to your first 30 customers. It works great for your next 10 customers, but fails dramatically for your first enterprise client. Why? Because the enterprise client was fundamentally different from your previous 40 customers. If you fit your model on a population in which the relationship looks one way, then try to apply your model to a population with a different relationship, it will fail.
Machine learning and statistics are both application of the same principles of probability and information theory. They work (for the most part) by modeling the world capturing the relationships between random variables. A random variable can be any natural process that we can't express in precise terms, so we express it in probabilistic terms.
This is the same principle underlying the premise that "past results do not guarantee future success." The relationships between random variables in the world that affect success in anything -- stock market performance, legal outcomes, etc. -- might not be the same tomorrow as they are today.
And that's not even a matter of overfitting. That's just your ever-present real-world threat of having all your modeling work invalidated by forces outside your control. Overfitting happens when you, the data scientist, fit your model to random noise in the training data. An overfitted model will have bad generalization performance on held-out samples, even from the same population. It's not always easy or possible to detect overfitting, especially with small training sets.
What's the problem with that, though? Startups are usually advised to service one market, not several. If your first 40 customers were prosumers but then you have a prospective enterprise client, the logical response is say no to the enterprise client and go after another 60 (or 60,000) prosumers.
Or at least understand that you're entering a new market and budget appropriately for development. Usually, if you're switching from between prosumer -> enterprise, you are very, very lucky if the sum total of changes you need to make is training a new machine learning model. To start out with, you usually need to get used to sales cycles that take 6-18 months, hiring a dedicated sales guy to manage the relationship, and handling custom development requests.
There's no problem with it, but some very intelligent people don't seem to realize that you can't just "use machine learning" and predict whatever you want. It's gotten better over the last few years, now that it's less new and magical than it used to be, but I still see it happen now and then.
Hopefully your analysts (which in this case includes your lawyers, accountants and statisticians) will tell you that the new client is different to the others and your models may not hold up and may need revision.
Only if your structural theory is not-wrong enough.
Even if you KNOW that your model is not-wrong in the right direction and within acceptable orders of magnitude, how do you fit the parameters for that structural model? You need some kind of data, even if you're just using anecdata to pick magic constants.
That's my whole point. You just asserted that you can extrapolate outside a training set with a structural model. I am asserting that those "many contexts" and "metastudies" amount to a bigger, more representative training set.
It means cross validation. It essentially means is a way of simulating how well your model will do when it encounters real world data.
When building a model, you divide your data into two parts, the training set and the testing set. The training set is usually larger (~80% of your original data set, although this can vary), and is used to fit your model. Then, you use the remaining data you set aside for the testing set by using your model to generate predictions for that data, and comparing it to the actual values for that data.
You can then compare the accuracy of the model for the training and testing sets to get an idea if your model generalizes well to the real world. If, for example, you find that your model has an accuracy of 95% on the training data, but 60% on your testing data, that means your model is overly tuned into features of the data used to build the model that may not actually be helpful for prediction in the real world.
I assumed Code Versioning so that if you have robust data segmentation you have less uncertainty about the impact of change. However, I'm a tourist here and hope OP comes back to share.