10% of the top million sites are dead (ccampbell.io)
375 points by Soupy on July 15, 2022 | 143 comments



Many issues with this analysis, some of which others have already mentioned, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (Author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions – depending on what Majestic means by 'subnets'.)

• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied with some error response) can be a more aggressive drop-connection.

• `curl` given a naked hostname likely attempts a plain HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on plain-HTTP port anymore.

• Author's burst of activity could've triggered other rate-limits/failures - either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening. (A sketch of a more forgiving per-domain check follows this list.)
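To make a few of these concrete: a more forgiving check than a bare `curl domain` would try HTTPS, follow redirects, and fall back to the `www.` host before declaring a domain dead. A rough sketch (the helper name and timeout are arbitrary):

    # try the apex and the www. host over HTTPS, following redirects;
    # %{http_code} is 000 when no HTTP response came back at all
    check() {
      for host in "$1" "www.$1"; do
        code=$(curl -sL -o /dev/null --max-time 15 -w '%{http_code}' "https://$host")
        [ "$code" != "000" ] && { echo "$1: alive ($host -> $code)"; return 0; }
      done
      echo "$1: no HTTP(S) response"
    }

    check example.com

This still won't catch sites that drop connections from non-browser user agents, but it removes the plain-HTTP and missing-www failure modes.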

If you want to probe if domains are still active (a rough sketch follows this list):

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
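A rough sketch of that kind of domain-level probing with stock tools (placeholder domain; the SMTP verification step is left out, and the whois grep is only a heuristic):

    d=example.com
    whois "$d" | grep -iq 'domain name:' && echo "$d: registered"
    dig +short NS "$d"          # delegated nameservers?
    dig +short "$d" A           # address record on the apex?
    dig +short "www.$d" A       # ...or on www?
    dig +short "$d" MX          # mail service configured?
    ping -c 1 -W 2 "$d" >/dev/null 2>&1 && echo "$d: answers ping"   # -W is Linux ping's timeout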

If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.


> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the search can't tell apart an actual website from a blank domain parking page.


> Majestic promotes their list as the "top 1 million websites of the world"

Well, the source URL provided by the article author initially claims, “The million domains we find with the most referring subnets”. Then it makes a contradictory comment mentioning ‘websites’. At best we can say Majestic is vague and/or confused about what they’re providing – but given the author’s results, I suspect this list contains domains with no guarantee Majestic ever saw a live HTTP service on these domains.

> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

How about I cite HN user ~gojomo, who for nearly a decade wrote & managed web crawling software for the Internet Archive. He says: “Sites that don’t want to be crawled use every tactic you can imagine to repel unwanted crawlers, including unceremoniously instant-dropping open connections from disfavored IPs and User-Agents. Sadly, given Google’s dominance, many give a free pass to only Google IPs & User-Agents, and maybe a few other search-engines.”


> Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

Most major search engines have dedicated blocks of addresses and use unique user-agents. If you literally just send wget or curl requests, you will be identified as a "bad" crawler almost immediately.


> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

We use stage.www.domain.tld for the staging/testing site, but that's about it ;)


looks like you’re not alone!

https://crt.sh/?q=stage.www.%25

(warning: will take a while to load)



It dawned on me when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list, I had some issues with their bot [2] and they blocked themselves from crawling my site.

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12


As near as I can tell, these are the top 1,000,000 domains referred to by other websites they crawled.

The report is described as "The million domains we find with the most referring subnets"[1] and a referring subnet is a host with a webpage which points at the domain.

So to the grandparent, presumably if something is "linking" to these domains, they probably were meant to be websites.

[1] https://majestic.com/reports/majestic-million [2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet


I downloaded the file and looked at the second 000 entry in his file, which refers to wixsite.com.

It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.

It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked, such as wixstatic.com, don't appear to be valid), but his methodology is also bad, or at least describing the sites as dead is misleading.


wixsite.com is a domain for free sites built on Wix, so if your username on Wix is smugma, and your site name is mysite, then you'll have a URL like smugma.wixsite.com/mysite for your Home page.

That's why this domain is in the top million.


Correct, that's why it's in the top. Your example further confirms why the author's methodology is broken.


> other sites I checked such as wixstatic.com don't appear to be valid

But docs.wixstatic.com is valid.


100% agree his methodology is broken. Another example like this is googleapis.com. If I remember correctly there are quite a number of domains like this in the Majestic Million.

Not to mention a number of his requests may have been blocked.


He takes this into account by generously considering any returned response code as “not dead”.

> there’s a longtail of sites that had a variety of non-200 response codes but just to be conservative we’ll assume that they are all valid


That doesn't take this into account, no. `curl wixsite.com` returns a "Could not resolve host" error; it doesn't return a response code, so the author would consider it invalid, even though `curl www.wixsite.com` does return a response (a 301 redirect to www.wix.com).
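curl's exit code makes the difference visible: a DNS failure is exit code 6 ("could not resolve host") with no HTTP status at all, whereas a host that answers gives exit code 0 and a real status. A quick sketch (behaviour of wixsite.com as described above):

    for host in wixsite.com www.wixsite.com; do
      code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "http://$host")
      echo "$host: exit=$? http_code=$code"   # $? here is curl's exit status
    done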


Oh how does that work then? How does the browser get to the redirect when curl doesn't get any response at all? Is this a DNS thing?


Browsers sometimes try adding things to URLs to try and make them work. Firefox tends to add https:// if http:// fails, perhaps some browsers are adding www.


Funny, if true (doing some research...) I'll add this to the warchest for that classic interview question about what happens when you type a url into your browser.


Definitely was true. Browsers (at least FF) also used to add .com at the end. I think these days they all just send you to their ad-laden funding source instead if there is no TLD.(*)

https://www.thewindowsclub.com/browser-automatically-adds-ww...

(*) Apparently at least firefox still does the domain fixup instead of search if you type the http:// or https:// prefix so e.g. http://example/ will have you end up on http://www.example.com/


apex domain is different from www cname


I just don't get how the browser gets a response like this (below), and then figures out what to do next. Sister comment said it might just try the common "www." prefix.

        $ dig wixsite.com

        ; <<>> DiG 9.16.1-Ubuntu <<>> wixsite.com
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65168
        ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 65494
        ;; QUESTION SECTION:
        ;wixsite.com.   IN A

        ;; Query time: 3 msec
        ;; SERVER: 127.0.0.53#53(127.0.0.53)
        ;; WHEN: Sat Jul 16 09:12:42 AEST 2022
        ;; MSG SIZE  rcvd: 40

        $ nslookup wixsite.com
        Server:  127.0.0.53
        Address: 127.0.0.53#53

        Non-authoritative answer:
        *** Can't find wixsite.com: No answer

Does this mean that the WIX SEO team should really resurrect the record and do a 301, otherwise they are wasting their inbound links?


I think there isn't an inbound link, the crawler is choosing to hit http://wix.com/ if links to wix.com subdomains are common enough. It might be that there are millions of links to www.wix.com and docs.wix.com and user.api.wix.com and not a single (broken) link to wix.com, and they will crawl http://wix.com/ anyway and decide that "the site is dead". This is a problem with their methodology.


Yes this goes to “what is a site?” and “who/what is controlling what sub domains”. Especially with things like GitHub.io, and indeed wix. I think ignoring dead apex domains when a subdomain worked would have been a good extra pass for the methodology.


Perhaps that is the reason for the apex domain to be dead in the first place - to communicate that the subdomains are the real roots of separate sites. Similarly, TLDs themselves are not supposed to have any A records (although there are some that do).


As someone hypothesizes above, it’s common for browsers to add www. to domains that don’t resolve.


I'm not sure if it's because I have my browsers set to generally "do what I say" or because I'm using a filtering proxy, but Firefox doesn't seem to try again if I just put "wixsite.com" in the address bar --- it gets a Host Not Found from the proxy and stops.


That is weird - it does the fixup from http://wixsite/ to http://www.wixsite.com/ but leaves http://wixsite.com/ alone.


I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.


Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain much more than what can be called a "site," especially in 2022 when the internet has been all but destroyed and all that's left is a corporate oligopoly.


How many are placeholder pages thrown up by registrars like Network Solutions?


If they're placeholder pages, they're not dead. Those 10% are not responding at all; the requests aren't reaching any HTTP server.


Not all placeholder pages will forever stay placeholder pages though. Some may get sold, become a site, then stop being a site again. Some may not get sold, come up for renewal and be deemed unlikely to be worth trying to sell anymore (renewal is cheap for a registrar but the registry will still charge a small fee).

Of course the vast majority with enough interest to make this list will either be sold and be an active page, or still be an active placeholder, but I wouldn't rule out a good number of pages towards the lower end of the top million being placeholders that were eventually deemed not worth trying for anymore.


Exactly. Afaik, there's even a whole auction based side industry trading expired domains. When you see a placeholder page, the original site is for all intents and purposes, dead. The domain just happens to be interesting enough that someone wants to stick an ad on it rather than let it resolve to nothing.


Yabbut a placeholder page returns status 200. We're discussing URLs that don't return any status at all; they are unresponsive. They have no placeholder; they aren't a "place".

A site that is occupied by a placeholder page is a 200, as far as that "top million" database is concerned. It might as well be a working shopfront or blog.


at least from his computer/script. A number could have been blocked by simply detecting him as a bot.


How is "top" defined here? If they were dead, wouldn't they fairly quickly stop being "top"?

EDIT: the article uses a list sorted by inlinks, and I guess other websites don't necessarily update broken links, but that may be less true in the modern age where we have tools and automated services to automatically warn us about dead links on our websites.


I can expect large SEO spam clusters of "sites" with many links inside a cluster to make them look legit. For some time such bits of SEO spam were on top of certain google searches and enjoyed significant traffic, putting them firmly into "top 1M".

Once a particular SEO trick is understood and "deoptimized" by Google, these "sites" no longer make money, and get abandoned.


Blows my mind that my blog is 210863rd on that list. That makes the web feel somehow smaller than I thought it was.


Eyeing you jealously from my position at 237,014 on the list... We're almost neighbors, I guess.


The biggest problem I find is that it seems to be pretty "outdated" to keep redirects in place, if you move stuff. So many links to news websites, etc. will cause a redirect to either / or a 404 (which is a very odd thing to redirect to in my opinion).

If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.

I also wonder what will happen to the text on websites that use some AJAX, once the JavaScript breaks because a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.

Another worry is the ever-increasing size of these scripts making archiving more expensive.


You can often pop the URL into the Wayback Machine to bring up the last live copy. It's better at handling dynamic stuff the more recent it is. Older stuff, especially early AJAX pages, is just gone because the crawler couldn't handle it at the time. It's far from a perfect solution, especially in light of the big publishers finally getting their excuse to go after the Internet Archive legally. It's a good silo, but just as vulnerable as any other.


ArchiveWeb.page + ReplayWeb.page are the best I've found at handling ajax loaded content.


> Domain normalization is a bitch

I’m a no-www advocate. All my sites can be accessed from the apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache’s .htaccess to rewrite the www to the apex.

Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...
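For reference, the rule in question usually boils down to a couple of mod_rewrite lines; something like this sketch (not the tutorial's exact text, and the domain handling here is generic):

    # .htaccess: send www.<anything> to the apex with a permanent redirect
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
    RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]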


> I’m a no-www advocate.

I used to feel the same way. — Until the arrival of so many new TLDs.

Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band


I hear you. But a redirect is a good solution in that case.


Yes it is.

I just redirect the other way round, so those ever rarer individuals typing in domains are also served fine on my websites. And also to automatically grab the https.

I just find it ever so slightly more "honest" to have the server name I mention also be the one that's actually being served. -- And that's also because I'm quite annoyed at URL shorteners and all kinds of redirect trickery having been weaponized over the years.

So I optimize for honesty and facilitate convenience.

But this is pretty subtle stuff and I'm not advocating anymore. -- I don't think it's that big of a deal either way and I'm just expressing my little personal vote and priorities on the big Internet. :-)

So my post wasn't intended to change your mind, but more as a bit of an alternative view and what made me get there.


25 years ago I added a rule to my employer’s firewall to allow the bare domain to work on our web server.

Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.

A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.

Been much too long to have any clue now what we did wrong.


You probably created a cname from the apex to www? This problem still exists today.

From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME record is present at a node, no other data should be present; this ensures that the data for a canonical name and its aliases cannot be different."

So if you're looking up the MX record for domain, but happen to find a cname for domain to www.domain, it will follow that and won't find any MX records for www.domain.

The correct approach is to create a cname record from www.domain to domain, and have the A record (and MX and other records) on the apex.

Most DNS providers have a proprietary workaround to create dns-redirects on the apex (such as AWS Route53 Alias records) and serve them as A records, but those rarely play nice with external resources.
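Spelled out as zone records, the safe layout described above looks roughly like this (placeholder names and addresses):

    example.com.       3600  IN  A      203.0.113.10
    example.com.       3600  IN  MX 10  mail.example.com.
    mail.example.com.  3600  IN  A      203.0.113.20
    www.example.com.   3600  IN  CNAME  example.com.
    ; an apex CNAME (example.com. IN CNAME www.example.com.) would shadow the MX above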


> You probably created a cname from the apex to

You can't do that, period.

A lot of "cloud" and other GUI interfaces deceive people into thinking it's possible; they just do A record fuckery behind the scenes (clever in its own right, but it causes misunderstanding).


I'm a www advocate and reroute my domains from apex domain to www. When you use an apex domain, you have to use an A record which means if you have a server outage it is going to take time to update the record to point at a new IP address. If you use www with a CNAME, the final server IP can be quickly switched assuming you've set the CNAME and network up for that functionality - you can't do that with an apex domain.


That doesn't make any sense at all - the CNAME just points to somewhere else with an A (and AAAA in $current_year) record. It adds another point you can switch around but doesn't let you switch it any quicker. How quickly you can effectively change what the domain points to is determined by the TTL of the record (within limits), which can be lowered for any record.


Just saw this response. In my comment above, I didn't want to spend a lot of time responding to the original post so I handwaved a lot with this "assuming you've set the CNAME and network up for that functionality"

When you have a www you ultimately have more flexibility. For example, you can point a CNAME at another CNAME. This answer on ServerFault mentions the additional options (and downsides of doing that): https://serverfault.com/a/223634 https://serverfault.com/questions/223560/www-a-record-vs-cna...

Heroku vaguely mentions the benefits under the "Limitations" section of this link: https://devcenter.heroku.com/articles/apex-domains

After a DDoS attack, they were much more explicit in their recommendations: "We strongly recommend against using root domains. Use a subdomain that can be CNAME aliased to proxy.heroku.com, and avoid ever manually entering IPs into your DNS configuration." https://web.archive.org/web/20110609095616/https://status.he...

Here is an old post about someone who initially used an apex domain and then had issues (that they hacked around): https://web.archive.org/web/20110718170757/http://blog.y3xz....

I believe that some larger providers are providing some workarounds which make it easier to hack around the issue these days, but I still firmly believe that if you set your site up using "www" (even if it is initially an A record - most of mine are A records right now), you will have more flexibility in the long run than if you set your site up on an apex domain.


Free.fr, one of the biggest ISPs in France a while back, and perhaps still today, still runs all the old-school websites it was hosting for people (for free). It's quite insane, but a lot of the French web 1.0 is still alive today thanks to them. Truly an ISP run by passionate technical people.


Good on them. Last year I randomly discovered an ancient email to my old Hotmail address from free website host Tripod, owned at the time by Lycos, that old search engine. As an 11 year old I had a website with them and wanted to dig it out to see what I had put there. I managed to convince them I was the owner and got my access back, only to discover nothing there. I guess at some point in the ~20 years since I made one they nuked their dormant sites.


All these top million lists are very good at telling you the top most 10K-50K sites on the web. After that, you're going into 'crapshoot' land, where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.


> where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

That's literally the phenomenon the article is describing.


Ok let me reword it differently: the 500,000th most popular site on these lists most likely isn't the 500,000th most visited, and it might not even be in the top 5 million. These data sources are so bad at capturing popularity after 50k sites or so simply because they don't have enough data.


I haven't tested this, but the "Cisco Umbrella 1 Million" is generated daily from DNS requests made to the Cisco Umbrella DNS service. That seems to be a very good and recent dataset.

It does count more than just visiting websites though. If all Windows computers query the IP of microsoft.com once a day that'll move them up quite a bit. And things in their top 10 like googleapis.com and doubleclick.net are obviously not visited directly.

So while it is quite a reliable and recent dataset, it is not a good test of popularity.


How are people determining the "top" sites? We do some of this at work and we pay SimilarWeb a giant sum of money, are people able to find site traffic in inexpensive ways which allow for these analyses?


By what possible criteria are these the "top" million sites, if 10% are dead? I'd start with questioning that data.


Dude, it's the second sentence of the first paragraph:

> For my purposes, the Majestic Million dataset felt like the perfect fit as it is ranked by the number of links that point to that domain (as well as taking into account diversity of the origin domains as well).


And moreover, the author’s conclusion is that the dataset is bad.

> While I had expected some cleanliness issues, I wasn’t expecting to see this level of quality problems from a dataset that I’ve seen referenced pretty extensively across the web


Yeah, but they're still providing a dataset that's just plain bad. It's hardly relevant how many sites link to some other site, if it's dead.


It's only bad data if it does not include what it claims to include.

If the dataset is defined as inlinks, and it is inlinks, then the data is good.


part of the problem is it's not the number of links, it's referring subnets. Fairly certain this includes script tags.


Exactly!

Garbage In == Garbage Out


Last time I tried to crawl that many domains, I ran into problems with my ISP's DNS server. I ended up using a pool of public DNS servers to spread out all the requests. I'm surprised that wasn't an issue for the author?


You have to run your own resolver. Crawling 101.


This is of course the correct answer. It just felt like shaving a big yak at the time.


Yes. There's a lot of yak, and every crawling task reveals new ones.

As an example, all of blogger is behind a single load balancer, with a rate limit. If you don't crawl blogs, you'd never know. Or the top million, plenty of blogger blogs in the top million.

Ditto for Shopify.


Same with some large registrars that sell cheap, template / WordPress based sites as an add-on.


A properly configured unbound running locally can be a decent compromise.


That is running your own resolver. Unbound is a resolver.


well, yes, but I guess I think of unbound in a different category from setting up (e.g.) bind. but, my experience configuring bind is probably more than 20 years out of date.

you're right to make that correction though, so thank you. :)


BIND is odd in that it combines a recursive resolver with an authoritative name server, and this has actually led to a number of security vulnerabilities over the years. Other alternatives, such as djb's dnscache/tinydns and NLNet Labs' Unbound/nsd, separate the two to avoid this entirely.


Yeah, BIND isn't just a resolver.

Setting up Unbound as a recursive, caching resolver is pretty straightforward; a million times more straightforward than doing the same with BIND. You don't need to configure much in a recursor; it just has to accept requests, and recurse until it finds an answer or NXDOMAIN; and then respond.
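For anyone wondering what "not much to configure" means in practice, a minimal local recursive Unbound setup is only a few lines (a sketch; the defaults do most of the work):

    # /etc/unbound/unbound.conf
    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow
        # no forward-zone block, so Unbound recurses from the roots itself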

An authoritative nameserver has a lot more going on; primaries/secondaries, permissions, zone transfers and so on. BIND was the devil to configure, and TBH I have never needed an authoritative nameserver. But I think anyone who can should run their own recursor.

> more than 20 years out of date

Hah! I reckon it's about 20 years since I touched BIND.


I've been working lately on trying to migrate sites I ran in 2008 or so into my new preferred hosting strategy: I know zero people look at them, since many are functionally broken at present, but I don't like the idea of actually removing them from the web. So I'm patching them up, migrating them to a more maintainable setting, and keeping them going. Maybe someday some historian will get something out of it.


Title is misleading: that’s the outcome, but the bulk of the story is the data processing to reach that conclusion.


It happens. Most of the stuff we do these days invokes a number of disciplines. I forget sometimes that maybe ten percent of us just play with random CS domains for “fun” and that most people are coming into big problems blind, even sometimes the explorers (though having comfort with exploring random fields is a skill set unto itself).

Before the Cloud, when people would ask for a book on distributed computing, which wasn’t that often, I would tell them seriously “Practical Parallel Rendering”. That book was almost ten years old by then. 20 now. It’s ostensibly a book about CGI, but CGI is about distributed work pools, so half the book is a whirlwind tour of distributed computing and queuing theory. Once they start talking at length about raytracing, you can stop reading if CGI isn’t your thing, but that’s more than halfway through the book.

I still have to explain some of that stuff to people, and it catches them off guard because they think surely this little task is not so sophisticated as that…

I think this is where the art comes in. You can make something fiddly that takes constant supervision, so much so that you get frustrated trying to explain it to others, or you can make something where you push a button and magic comes out.


Read that again folks:

"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”. "

This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.


The flip side is anyone can run these kinds of tools against your site easily and cheaply.


your point has a truth behind it for sure, but there's a large difference between serving requests and making requests. Many sites are simple html and css pages, but many others also have complex backends. It's those that often are hard to scale, and why the cloud is hugely popular: maintaining and scaling the backend is hard


Oh absolutely, but he also said this:

I found that my local system could easily handle 512 parallel processes, with my CPU @ ~35% utilization, 2GB of RAM usage, and a constant 1.5MB down on the network.

Another thing that happened in the early web days was Apache. People needed a web server and it did the job correctly. Nobody ever really noticed that it had terrible performance, so early on infrastructure went to multiple servers and load balancers and all that jazz. Now with nginx, fast multi-core, and speedy networks even at home, it's possible to run sites with a hundred thousand users a day at home on a laptop. Not that you'd really want to do exactly that but it could be done.

Because of this I think an alternative to github would be open source projects hosted on people's home machines. CI/CD might require distributing work to those with the right hardware variants though.


> you don't need to rent or provision "big infrastructure" unless you're actually quite big.

Or if you have hard response-time requirements. I really don't think it would be good to, for example, wait an hour to process the data from 800K earthquake sensors and send out an alert to nearby affected areas.


whenever i go through my bookmarks, i tend to find maybe 5-10% are now 404.

this is why i like the archive.ph project so much and using it more as a kind of bookmarking service.


What’s the benefit to using archive.ph instead of archive.org (Internet Archive)? Seems like the latter is much more likely to be around for awhile.


i find archive.ph does a better job of preserving the page as is (it also takes a screenshot) compared to internet archive which can be flaky at best.

i also find archive.ph much faster at searching, and the browser extension is really useful too.

the faq does a great job of explaining too https://archive.ph/faq


archive.today does that by rewriting the page to mostly static HTML at the time of capture.

archive.org indexes all URLs first-class and presents as close to what was originally served as possible. It also stores arbitrary binary files and captures JS and Flash interactivity with remarkable fidelity.

When logged in, the archive.org Save Page Now interface gains the options of taking a screenshot and non-recursively saving all linked pages. I cannot reason why—the more saved, the better, right?

archive.org has a browser extension too


Isn't archive.ph/today the one with questionable funding sources and backing? Who is behind it and can it be trusted for longevity?


In this case the less we know, the longer it will last. Notice how this site ignores robots.txt and copyright claims by litigious companies that would like to see their past erased.

The data saved on your NAS will outlast this site regardless of who owns/funds it.


Their explanation for ignoring robots makes sense - they say they ignore it because their crawler only runs when a human enters a URL and archives it; they also point to Google, which does the same.


How do you figure?


What do you mean? There’s a line of companies waiting to sue anyone involved with that site. That’s been the case for many years.


A site devoted to duplicating content from elsewhere online, and with a significant use-case of defeating paywalls, would be a very likely candidate for lawsuits.

Concealing ownership would tend to help avoid this / minimise consequences.

That might still be a brittle defence.


yeah funding is a grey area...

fwiw the website is only accessible by VPN in a lot of countries, which says a lot for me... and i don't think they've taken down any content, although i can't say for sure.


So ... what is known about the operator(s) / funding?


Likely Slavic:

• WHOIS points to a "Denis Petrov" in Prague

• Share menu has buttons for Reddit, VKontakte, Twitter, Pinboard, and Livejournal. Eyebrow raising. VK is Russian, and so is LJ nowadays. Pinboard (notable as successor of del.icio.us) is American, coincidentally founded by a Polish immigrant.

• With a sizable dose of confirmation bias, the mistakes in the English of the site and blog do feel appropriately Slavic

It's stated to be privately funded with costs around US$4000/month, began accepting donations in 2016 (https://wiki.archiveteam.org/index.php/Archive.today#Vital_S...)


Thanks. That's largely the sense I've had.

Motivation / utility is a question that's occurred to me more than once.


archive.ph = Russian federation website. Blocked by most firewalls by default.


Tangential, but I love the format for your site. Any plans to do a "How I built this blog" post?


Likely using Hugo with the congo theme


Yup, nailed it. Hugo with Congo theme (and a few minor layout tweaks). Hosted on cloudflare pages for free


Nice work.

Just one thing: analyzing sites by total referring domains is not accurate, as your result showed. A backlink can be easily faked and you can literally spam 1 million links within 1 day for any domain. Thus, this data source is not of much use.

For a more accurate result, try to use Ahrefs' top 1 million domains, ranked by their traffic. Ahrefs ranks sites by their ranking keywords and thus infers traffic numbers, meaning these websites are live and ranking for some keywords.

You will see the result is much more accurate then; maybe not even a single website will be offline, because they are earning good cash.


I don't have any particular opinions on the author's conclusions, but I learned a thing or two about the power of terminal commands by reading through the article. I had no idea that xargs had a parallel mode.
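For anyone else who missed it: the relevant flag is -P (max parallel processes). A stripped-down version of that kind of check looks something like this (file names are made up):

    # up to 64 curls in parallel over a one-domain-per-line domains.txt
    xargs -P 64 -I{} curl -s -o /dev/null --max-time 10 \
        -w '{},%{http_code}\n' 'https://{}' < domains.txt | tee results.csv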


Probably not news to anyone who works with big data™, but I learned, after additional searches, that using (something like) duckdb as a CSV parser makes sense, especially if the alternative is loading the entire thing to memory with (something like) base R. This was informative for me: https://hbs-rcs.github.io/large_data_in_R/.


Having the luxury of scrutinizing the method and retesting: "normalizing" domains by skipping the www skewed the results - not all websites set up redirects between the apex and www (and across schemes). Some servers also weren't answering requests with the default curl Accept header and needed encouragement.

I retested the 000 class of the .de ccTLD (1227 domains) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives - whether this is representative I cannot tell, just a hint to retest.


The takeaway from this is slightly off. There aren't 107776 sites that are dead, there are 107776 sites that don't run an HTTP server, or are otherwise dead.

If you try to connect via HTTP or HTTPS, then a quick run yields 91106 sites that are dead, or 9.11%

(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me)


This looks surprisingly similar to the unfinished research that I did: https://github.com/ClickHouse/ClickHouse/issues/18842


Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And the cities have their own websites, so I guess it's OK.

But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!


Those are called Public Suffixes or effective TLDs (eTLDs): https://en.wikipedia.org/wiki/Public_Suffix_List

And you're entirely correct that the author should've referred to such a list.


I think the problem is that the original source needs to use that list as well. Just looked through .nz, and they list several sites (govt.nz, school.nz, gen.nz) that don't exist, since all the actual domains are one level below.

They even list govt.nz as the top site. In fact that doesn't exist (although www.govt.nz does, since it is a kind of government portal).

I see they list an old employer of mine who got bought 15 years ago and whose website has been redirecting for 10 years.


Wow, I would not have suspected `tee` is able to handle multiple processes writing to the same file. Doesn't seem to be mentioned on the man-page, either.


All tee does is write its standard input (a single file descriptor) to a file (a single one) and its own output. xargs is the thing running multiple processes (and they inherit the same standard output, your shell's).

What you're seeing is Linux being able to handle multiple processes writing to the same file.
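A toy demo of that structure: the parallel children launched by xargs all write into the same pipe, and a single tee process is the only thing writing the file.

    # 8 parallel writers share one pipe; one tee writes out.txt; wc confirms 100 lines arrived
    seq 1 100 | xargs -P 8 -I{} sh -c 'echo "job {} done"' | tee out.txt | wc -l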


Well, then that's a Linux feature I was unaware of. I found this SO[1] question with two conflicting answers that have almost the same number of votes, and even the "yes you can do this" answer seems to have enough caveats that it doesn't sound like a great idea.

1. https://stackoverflow.com/questions/7842511/safe-to-have-mul...


zombo.com still kicking!


The png rotates with this:

.rotate {animation: rotation .5s infinite linear;}

I think it wasn't like this before. They must've updated it at one point.


Yes, when Flash went end of life, they were forced to adopt a new tech strategy.


Majestic is a shit list. Mystery solved.


Are there more cycles/CPU/work involved in `cat verylargefile | awk` vs `awk verylargefile`?


His 'www' logic is flawed: https://www.example.com and https://example.com need not return the same results, but his checking code sends the output straight to /dev/null so he has no way of knowing.


In theory, sure.

In practice, how many orgs serve on both example.com and www.example.com yet operate each as entirely separate sites?

I cannot think of any example.


MIT was, for decades, though they seem to have changed.


No they're not.


How does a dead site make it into the top million?


Typically, during its pre-death phase.


wouldn't this imply that either the ranking system is broken.....or there are less than 1 million active sites.....


My current beliefs about how people use and trust information on the Web.

First, trust is _everything_ on the Web; it is the first thing people think of when arriving at some information. But how people evaluate trust has changed dramatically over the last 10 years.

- Trust now comes almost exclusively from social proof. Searching reddit, youtube, etc and other extremely _moderated_ sources of information, where the most work is done to ensure content comes from actual human beings. How many of us now google `<topic> reddit` instead of just `<topic>`?

- Of course a lot of this trust is misplaced. There's a very thin line between influencers and cult leaders / snake oil salesmen. Our last President used this hack really effectively.

- Few trust Google's definition of trust anymore -- essentially PageRank. This made more sense when the Web essentially was social, where inbound links were very organic. Now, with trust in general Web sites evaporated, the main 'inbound links' anyone cares about come from individuals or communities they trust or identify with. They don't trust Google's algorithm (it's too opaque, and too easily gamed).

This of course means the fracturing of truth away from elites. Sometimes this could be a good thing, but in many cases cough Covid cough it might be pretty disastrous in terms of misinformation


> How many of us now google `<topic> reddit` instead of just `<topic>`

I sure hope not, Reddit is a horrible place for information


When I have a specific technical question, I append "stackoverflow" to my search queries. When I want to read a discussion, I add "reddit" (or "hacker news").


I use the strategy for a few things - including when I want to get reviews of a product or service. There's still potential for manipulation there, but you can judge the replies based on the user history - and you know that businesses aren't able to delete or hide bad reviews there.

But in general I agree with you - reddit is full of misinformation, propaganda and astroturfing


> How many of us now google `<topic> reddit` instead of just `<topic>`?

One of us lives in a bubble. I don't trust Reddit for anything, or YouTube or any social media. IME, it's mis/disinformation - not only a lack of information, but a negative; it leaves me believing something false. My experience is, and plenty of research shows, that we have no way to sort truth from fiction without prior expertise in the domain. The misinformation and disinformation on social media, and its persuasiveness, is very well known. The results are evident before us, in the madness and disasters, in dead people, in threats to freedom, prosperity, and stability.

Why would people in this community, who are aware of these issues, trust social media? How is that working out?

> This of course means the fracturing of truth away from elites. Sometimes this could be a good thing

I think that's mis/disinformation. 'Elite' is a loaded, negative (in this context) word. It makes the question about power and the conclusion inevitable.

Making it about power distracts from the core issue of knowledge, which is truth. I want to hear from the one person, or one of the few people, with real knowledge about a topic; I don't want to hear from others.

In matters of science the authority of thousands is not worth the humble reasoning of one single person.


They already acknowledge the problem of trusting the crowd, but you seem to not acknowledge the problem of trusting a central dispensary. In fact it's unwise to trust either one. Everything has to be evaluated case by case. The same source should be trusted for one thing today, and not for some other thing tomorrow.


irony that the site is not responding?


TLDR: Campbell's methodology is flawed, does not consider edge cases (one of which, equating apex-only and www-prefixed domains, I consider reckless), and shows no understanding of how Majestic collects and processes its data.

Longer version: This isn't comprehensive, but I can think of two main reasons why:

- The Majestic Million lists only the registrable part (with some exceptions), and this sometimes leads to central CDN domains being listed. For example, the Majestic Million lists wixsite.com (for those who are unaware, a CDN domain used by Wix.com with separate subdomains), but if you visit wixsite.com you wouldn't get anything. Same with Azure: subdomains of azureedge.net and azurewebsites.net do exist (for example https://peering.azurewebsites.net/) but azureedge.net and azurewebsites.net themselves don't. Without similar filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...) would quickly lead you to this precise problem (mainly because the number one entry is "com", but phew, at least http://ai./ does exist!)

- Also, shame on the author for considering www-prefixed and apex-only domains as one and the same. For some websites, they aren't. Take this example: jma.go.jp (Japan Meteorological Agency) doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) won't respond at all but www.beian.gov.cn will. And ncbi.nlm.nih.gov (National Center for Biotechnology Information)? I can't blame Majestic: https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't redirect to a canonical domain, and unless you've compared the HTTP pages there's no way you would know that they are the same website!

Edit: I've downloaded the CSV to check my claims, and it shows:

  wixsite.com 0
  beian.gov.cn 0

Please, for the love of sanity, consider what the Majestic Million's (and similar lists') criterion for inclusion actually is. I can't believe I'm saying it, but can we crowd-source "Falsehoods programmers believe about domains"?

Also addendum to crawling but I consider "probably forgivable":

- Some websites are only available in certain countries (internal Russian websites don't respond at all outside Russia for example). This can skew the numbers a little bit.


> Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) wouldn't respond at all but www.beian.gov.cn will.

I can confirm stuff like that - I'm writing a crawler & indexer program (prototype in Python, now writing the final version in Rust), and assuming anything while crawling is not OK. I ended up adding URLs to my "to-index" list by considering only links explicitly mentioned by other websites (or by pages within the same site).


It even says right at the top of the Majestic Million site "The million domains we find with the most referring subnets", not implying anything about reachability for http(s) requests.


Not surprising. We're far away from the glory days of the vibrant, chaotic web.

In countries like India that onboarded most users through smartphones instead of computers, websites are not even necessary. There's a huge dearth of local-focused web content as well since there just isn't enough demand.


One of the few things I like about blockchain is the promise of a less ephemeral web.


Is that actually true? Don’t most nodes hold only heavily compressed pointers, while only a percentage of nodes host the entire blockchain? I mean, if what you’re saying is true then each node needs to have a copy of the entire internet, which isn’t reasonable.


I'm thinking about things like Filecoin, a blockchain which is meant to power IPFS. To be fair though, IPFS itself is not a blockchain


One of many things I dislike about cryptoscams is making promises which are lies.


spoken like someone who is clueless about Blockchain



