Not sure about this, but training a model on the web pages that display the code is not quite the same as training it specifically on the code itself. Moreover, raw repo content files might not even be included in crawled datasets (e.g., look at https://gitlab.com/robots.txt). I think there is something specific to GitHub, it being owned by Microsoft, that makes processing that data much easier.
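
For what it's worth, here's a quick sketch of how you could check whether a robots.txt-compliant crawler would even be allowed to fetch a raw file from GitLab, using Python's standard urllib.robotparser (the project path below is just a made-up example, not a real repo):

```python
from urllib.robotparser import RobotFileParser

# Load GitLab's robots.txt and ask whether a crawler that honors it
# may fetch a raw repository file.
rp = RobotFileParser()
rp.set_url("https://gitlab.com/robots.txt")
rp.read()

# Hypothetical raw-file URL, purely for illustration.
raw_url = "https://gitlab.com/some-group/some-project/-/raw/main/README.md"

# False here would mean raw content is off-limits to compliant crawlers,
# and so likely absent from datasets built from polite web crawls.
print(rp.can_fetch("*", raw_url))
```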