Alright, now suppose someone does. Doesn't that mean Microsoft has to redo everything they built on these codebases under this legal argument rather than a fair use one? Doesn't even doing this set them up for a potentially very expensive compliance action?
I think what you say is true: either they trained on any open source code under fair use, regardless of whether it was published on GitHub or elsewhere and regardless of its license, OR they trained on data that potentially doesn't comply with their ToS (e.g. code uploaded by someone who is not the author: whatever the license, that person couldn't legally agree to a ToS that grants additional rights to the work).
However, the reality is that this is all extremely muddy. It's nothing like proving that software A copied some code from software B, where you can just compare the source code. There are too many murky steps in between, and you can bet that Microsoft will just get away with it.