This is really nifty. Unfortunately, it includes comments, so with thousands of files all carrying copyright notices, 'the' is the 3rd most popular word in C++ files.
That's good to hear. I didn't look into it in much depth; I just thought it was strange that 'the' ranked so high for C++, so I clicked on it to see example usage and got things like:
** use the contact form at http://qt.digia.co/contact-us.
furnished to do so, subject to the following conditions:
* This file is part of the LibreOffice project.
// with this library; see the file COPYING3. If not see
So I assumed licenses had not been excluded.
Having had a brief look at the source, I think the licence-marking approach still leaves in quite a few lines from each licence (see above for examples).
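For what it's worth, one alternative to marking licence lines would be to drop all comments before counting. Here is a rough sketch in Python of what I mean. It is not how the tool does it; the names `strip_comments` and `word_counts` are made up for illustration, and it deliberately ignores awkward cases such as raw string literals, digit separators like `1'000`, and backslash line continuations.

```python
import re
from collections import Counter

# One alternation: string/char literals are matched so they can be kept,
# while // and /* */ comments are matched so they can be dropped.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

def strip_comments(source: str) -> str:
    """Replace each comment with a space; leave literals untouched."""
    def keep_literals(match):
        text = match.group(0)
        # Only comments start with '/'; literals start with a quote.
        return " " if text.startswith("/") else text
    return _TOKEN.sub(keep_literals, source)

def word_counts(source: str) -> Counter:
    """Count identifier-like words in the code that remains."""
    return Counter(re.findall(r"[A-Za-z_]\w*", strip_comments(source)))

# Hypothetical sample file with a licence header and an inline comment.
sample = """/* This file is part of the Example project.
 * See the file COPYING for the licence. */
#include <string>
// return the answer
int answer() { return 42; }
"""
counts = word_counts(sample)
```

With the header and the inline comment gone, 'the' no longer shows up in `counts` for the sample, and `return` is only counted where it appears in code. The downside is that any comment text you might actually want to analyse is lost too, so whether this beats the licence-marking approach depends on what the word counts are meant to show.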