TLDR: Campbell's methodology is flawed, does not consider edge cases (one of whi...

zepearl · on July 15, 2022

> Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) wouldn't respond at all but www.beian.gov.cn will.

I can confirm stuff like that. I'm writing a crawler and indexer (prototype in Python, final version now being written in Rust), and assuming anything while crawling is not OK. I ended up adding URLs to my "to-index" list only when they are explicitly linked by other websites (or by pages within the same site).
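To make the "only explicitly linked URLs" idea concrete, here is a rough Python sketch using only the standard library. It is not the actual crawler: the names `LinkExtractor`, `discovered_urls`, `enqueue`, and the sample page are all made up for illustration. The point is that a URL enters the list exactly as some page linked it, and no variant (adding or stripping `www.`, switching http/https) is ever guessed.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags exactly as the page gives them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discovered_urls(page_url, html):
    """Return absolute http(s) URLs explicitly linked from the page."""
    parser = LinkExtractor()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop the fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlsplit(absolute).scheme in ("http", "https") and absolute not in found:
            found.append(absolute)
    return found


def enqueue(to_index, seen, page_url, html):
    """Append newly discovered URLs to the to-index list, without guessing variants."""
    for url in discovered_urls(page_url, html):
        if url not in seen:
            seen.add(url)
            to_index.append(url)


# Hypothetical page content, just for the example.
sample_html = """
<a href="https://www.jma.go.jp/">JMA</a>
<a href="/about#top">About</a>
<a href="mailto:someone@example.org">Mail</a>
"""

to_index, seen = [], set()
enqueue(to_index, seen, "https://example.org/index.html", sample_html)
print(to_index)
```

With this approach, `http://jma.go.jp/` only ever gets crawled if some page actually links to it in that form.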

cratermoon · on July 15, 2022

It even says right at the top of the Majestic Million site: "The million domains we find with the most referring subnets". That implies nothing about reachability for http(s) requests.