Your argument reminds me of the joke about the man searching for his keys under ...

Your argument reminds me of the joke about the man searching for his keys under the streetlamp.

"A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, 'this is where the light is.'" [https://en.wikipedia.org/wiki/Streetlight_effect]

I agree that this is a well-designed study given its constraints. And it's admirable that it's a replication study.

That doesn't change the fact that it's largely irrelevant to professionals. It doesn't test the claims made by TDD proponents (TDD leads to better design, reduces long-term maintenance, allows for team coordination, etc.), nor does it address any of the interesting questions about TDD:

* Is TDD more effective in a professional setting than commonly-used alternatives?

* Is a mock-heavy approach to TDD more effective than a mock-light approach?

* Do people using TDD refactor their code more or less than people using a different but equally rigorous approach?

* Is the code done with TDD more maintainable than code done rigorously in another way?

* Is TDD easier or harder to sustain than equivalently-effective alternatives?

As a study, it's fine, if only of interest to academics. The problem isn't the study. It's the credulous response on the part of industry developers who then turn the false authority of the study into statements like "TDD doesn't lead to higher quality or productivity."