TLDR: Campbell's methodology is flawed and ignores edge cases (one of which, equating apex-only and www-prefixed domains, I consider reckless), and the author didn't understand how Majestic collects and processes its data.
Longer version: This isn't comprehensive, but I can think of two main reasons why:
- The Majestic Million lists only the registrable part of a domain (with some exceptions), and this sometimes leads to central CDN domains being listed. For example, the Majestic Million lists wixsite.com (for those unaware, a CDN domain used by Wix.com, with sites hosted on separate subdomains), but if you visit wixsite.com itself you won't get anything. Same with Azure: subdomains of azureedge.net and azurewebsites.net do exist (for example https://peering.azurewebsites.net/), but azureedge.net and azurewebsites.net themselves don't. Without similar filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...) would quickly lead you to this precise problem (mainly because its number one entry is "com", but phew, at least http://ai./ does exist!).
- Also, shame on the author for treating www-prefixed and apex-only domains as one and the same. For some websites, they aren't. Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) won't respond at all, but www.beian.gov.cn will. And as for ncbi.nlm.nih.gov (National Center for Biotechnology Information)? I can't blame Majestic: https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't redirect to a canonical domain, and unless you've compared the pages they serve there's no way you would know that they are the same website!
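To make the apex vs. www point concrete, here is a minimal sketch (not Campbell's or Majestic's code) that probes both variants of a listed domain instead of assuming they are interchangeable. The names `candidate_hosts`, `probe` and `classify` are made up for illustration, and the check only tests whether a name resolves, which is a weaker condition than "serves a website":

```python
import socket
from typing import Callable, Dict, List


def candidate_hosts(domain: str) -> List[str]:
    """Return the name as listed plus its www/apex counterpart."""
    domain = domain.strip().rstrip(".").lower()
    if domain.startswith("www."):
        return [domain, domain[len("www."):]]
    return [domain, "www." + domain]


def system_resolves(host: str) -> bool:
    """True if the system resolver returns at least one address for host."""
    try:
        socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def probe(domain: str,
          resolves: Callable[[str], bool] = system_resolves) -> Dict[str, bool]:
    """Check every candidate host separately; never merge the results."""
    return {host: resolves(host) for host in candidate_hosts(domain)}


def classify(results: Dict[str, bool]) -> str:
    """Summarise a probe result as 'both', 'unresolvable' or 'only:<host>'."""
    up = [host for host, ok in results.items() if ok]
    if not up:
        return "unresolvable"
    if len(up) == len(results):
        return "both"
    return "only:" + up[0]
```

Calling `classify(probe("jma.go.jp"))` would be the typical use; the `resolves` parameter exists so the logic can be exercised with a stub resolver instead of live DNS. A real crawler would follow up with an HTTP(S) request per resolvable host.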
Edit: I've downloaded the CSV to check my claims, and it shows:
wixsite.com 0
beian.gov.cn 0
Please, for the love of sanity, consider what the inclusion criteria of the Majestic Million (and similar lists) actually are. I can't believe I'm saying this, but can we crowd-source "Falsehoods programmers believe about domains"?
Also, an addendum on crawling, which I consider "probably forgivable":
- Some websites are only available in certain countries (for example, internal Russian websites don't respond at all from outside Russia). This can skew the numbers a little bit.
> Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) wouldn't respond at all but www.beian.gov.cn will.
I can confirm stuff like that. I'm writing a crawler and indexer program (prototype in Python, now writing the final version in Rust), and assuming anything while crawling is not OK. I ended up adding URLs to my "to-index" list only when they are explicitly linked from other websites (or from pages within the same site).
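As a rough illustration of that "only explicitly linked URLs" rule (a sketch under my own assumptions, not the actual crawler), the frontier below is fed solely from `href` values found in fetched pages, and no apex or www variant is ever guessed. `LinkCollector`, `extract_links` and `enqueue_discovered` are hypothetical names:

```python
from collections import deque
from html.parser import HTMLParser
from typing import Deque, List, Set
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href=...> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if urlsplit(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url: str, html: str) -> List[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def enqueue_discovered(frontier: Deque[str], seen: Set[str],
                       base_url: str, html: str) -> None:
    """Add only URLs that the page links to, exactly as written."""
    for url in extract_links(base_url, html):
        if url not in seen:
            seen.add(url)
            frontier.append(url)
```

With this design, https://www.jma.go.jp/ enters the queue only if some page links to it, and http://jma.go.jp/ is never tried unless a page links to that as well.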
It even says right at the top of the Majestic Million site "The million domains we find with the most referring subnets", not implying anything about reachability for http(s) requests.