Wayback Machine Downloader (github.com/hartator)
212 points by pabs3 on July 12, 2021 | 41 comments



I'm pretty fond of using this tool to take trips down memory lane, revisiting lost content I used to enjoy.

Browsing through crawls has this neat side-effect of being able to serendipitously discover things that I missed back in the day just by having everything laid out on the file system.

PSA: There are a lot of holes in most crawls, even for popular stuff. A good way to ensure that you can revisit content later is to submit links to the Wayback Machine with the "Save Page Now" [1] functionality. Some local archivers like ArchiveBox [2] let you automate this. I highly recommend making a habit of it.

[1] https://web.archive.org/

[2] https://github.com/ArchiveBox/ArchiveBox
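
If you want to script Save Page Now, a plain GET to the /save/ endpoint also seems to trigger a capture; this is a rough sketch based on how I've seen it used, not official documentation:

# Ask the Wayback Machine to capture a single URL (the target URL here is just an example)
curl -sL "https://web.archive.org/save/https://example.com/some/page" > /dev/null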


Another convenient way to interact with "Save Page Now" is to email a bunch of links to the savepagenow address at archive.org. I especially like to copy all the HTML of a page and paste it into an HTML email to get all the links.


There are two things to note, neither of which are well-advertised:

1. The parent comment you're replying to links to the main page for the Wayback Machine, which includes a Save Page Now widget, but Save Page Now actually has a dedicated page <https://web.archive.org/save/>

2. If you have an archive.org account (lets you submit and comment on collections; the library is bigger than just the Wayback Machine) and you visit the Save Page Now page while logged in, you get more options, including the option "Save outlinks"



Yeah, I use that API from the browser, but I found the bulk asynchronous zero-download email API more convenient, since for a while the save API stopped supporting HEAD requests (although it seems to support them again now).


I made this!

I had an old website of mine (my old video game portfolio) that I wanted to bring back to life. I had no sources and no backups. But it was still on the Wayback Machine! I first wrote a quick wrapper in Ruby and it worked fine. I then decided to open source it and publish it. It was a fun adventure to see this being used by so many! <3


Kudos for taking the time to write up a nice, clean, and concise readme. I don't know about everyone, but this makes all the difference in the world to me.


Thanks very much for this! I used it a few years ago to recover some lost content from my blog: https://simonwillison.net/2017/Oct/8/missing-content/


OMG. Thank you for making this (and thanks to the archive.org folks for scraping in the first place)! I used it to recover my old blog which is probably lying around somewhere on some HDD in a closet, but it's so nice to have it on GitHub now!


Thanks for writing it, I used it to recover part of one of the Debian conference websites.

PS: could you merge my two pull requests? ;)


> Tip: If you run into permission errors, you might have to add sudo in front of this command.

That is not a tip, it's a dirty workaround.


And also a fantastic source of Stack Overflow questions when trying to troubleshoot why some old dependency starts showing up "randomly".

Heh, I just realized SO is a potential source of energy if we could harness the copy-paste to SO question to copy-paste cycle. Blockchain! :-P


> it's a dirty workaround

Yeah, but at the same time, if an attacker has non-sudo access to a machine, everything interesting is most likely already compromised. Sudo does seem like a hardly justifiable complication in most cases.


It would be handy to have an option to walk back in time without starting the download until a certain site size is met, to exclude 404s, domain-for-sale placeholders, etc., which aren't uncommon among old sites, so that Archive.org's precious bandwidth isn't wasted.


It already has an option to include/exclude 404s. Please file issues for the other requests, except for the domains-for-sale filter; that is a hard problem that probably isn't in scope for this tool.
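
For reference, the flags I mean look roughly like this; I'm going from memory of the README, so double-check with wayback_machine_downloader --help:

# Default run skips error pages (40x/50x) and redirects
wayback_machine_downloader http://example.com
# --all expands the download to include those error files and redirects
wayback_machine_downloader http://example.com --all
# --exclude takes a regex, which can strip out some obvious placeholder junk
wayback_machine_downloader http://example.com --exclude "buy-this-domain"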


The parent's suggestion to filter on size seems on the surface like it would work. What makes you think it's harder than that?


I'm not sure how filtering based on size relates to filtering out domains for sale, spam domains, etc. Legit domains can be small or large, and spam domains vary in size just as much.

Filtering out unwanted domains (sale, spam, etc.) is a problem for a bunch of regexes, Bayesian classification, or machine learning.

Edit: I think I misread the original post quite badly and I don't understand the proposed feature.


My bad, I didn't specify what sizes I was referring to; I meant the page size, including graphics, number of links, etc. If there's a way to extract a rough estimate of a page's "weight", it could be used to filter out empty (as in clearly expired) pages without downloading them. I'm not a web dev, so I'm not sure if that is possible.


I think the best way to provide what you want would be pre-download and post-download hooks, so you could write some code to prevent certain downloads and detect whether downloaded files are spammy. Then you could write your size heuristic as a plugin, and others could use regexes or machine learning to do something similar.


Thinking it over again, maybe JSON input/output is enough; then you could filter that and change the list of pages to download, or delete downloaded files.
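
A rough sketch of what I have in mind, assuming the --list flag I remember from the README (I haven't memorized the field names in its JSON output, so inspect it first):

# List every file URL the tool would download, as JSON, without downloading anything
wayback_machine_downloader http://example.com --list > files.json
# Inspect and filter the list with jq, then translate the result into --only/--exclude regexes
jq . files.json | less
wayback_machine_downloader http://example.com --exclude "/tag/"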


Unfortunately a lot of ad-farm sites that hoover up abandoned/popular domains scrape content from other sites and aggregate it to keep Google indexing them and driving clicks to their ad-laced pages.

On the surface, a Yelp-like system to rate domains as legit vs. clickbait seems logical, until you realize the scammers would just work at gaming that system too :p

This is where a universal ID would really help - but the other ways something like that could be used make me even more uncomfortable so here we are with no real good solution :(


I had a look into this as part of a research project.

To reduce the burden of writing specific scraping software, I investigated the software listed by the Archive Team[0].

The command line tools and libraries aside (as they would require much more specific tailoring to make them work), I was particularly interested in HTTrack and Warrick.

Warrick[1] is defined as a tool to recover lost websites using various online archives and caches. Warrick[2] requires some expert knowledge in order to get it up and running.

I found Warrick to be a bit outdated, so I decided to try something similar but more up to date, and came across this (hartator/wayback_machine_downloader).

I found this to be a bit easier to work with as it would allow me to download snapshots within a time period, which is what I needed for this project.

After running it for ~12 hours on my local machine, it still had not completed, having downloaded only 11,768 of 94,518 files.

Instead, I found myself writing a Python-based tool that could fetch with much more accuracy using the CDX server[3], filtering by date, targeting only certain files, and allowing for multithreading.

In order to improve the process and narrow down the data to what was needed, I scoped it to just the July and December timeframes of each year, from 2010 to this year, targeting only HTML files.

For example: http://web.archive.org/cdx/search/cdx?url=%s&matchType=prefi...

Hopefully someone finds this useful.

[0] https://archiveteam.org/index.php?title=Software

[1] https://github.com/oduwsdl/warrick

[2] https://code.google.com/archive/p/warrick/wikis/About_Warric...

[3] https://github.com/internetarchive/wayback/tree/master/wayba...
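
For anyone who wants to reproduce the approach, the query looked roughly like this (the parameter names come from the public CDX server docs; the dates and domain below are just illustrative):

# Archived HTML pages under example.com/, captured in July 2015, one row per unique URL
curl -G "http://web.archive.org/cdx/search/cdx" \
  --data-urlencode "url=example.com/" \
  --data-urlencode "matchType=prefix" \
  --data-urlencode "from=20150701" \
  --data-urlencode "to=20150731" \
  --data-urlencode "filter=mimetype:text/html" \
  --data-urlencode "collapse=urlkey"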


wayback-machine-downloader uses the CDX API too btw. IIRC it has some rate limiting to avoid overloading the Wayback server.


I would love a successor to HTTrack. It worked wonderfully back in the day, but modern websites have so much media and dynamic content that you barely get anything useful.


I will be using this to restore a frequently disappearing webcomic called Newspaper Comic Strip. About a man who realizes he’s a Newspaper Comic Strip.[0]

[0] http://web.archive.org/web/20210119093354/http://riotfish.co...



I tried the Wayback Machine Downloader and it works great if I want to download the whole page or the index of all versions. But in my case the page has been taken down, that taken-down version is what got archived, and it's what gets downloaded if I just run the tool without any additional options. Alternatively I can download the index of all pages, but then I am missing all the images and the sites are just text files. How can I download a whole older website with this tool?


Seems like it would be easier to just use wget to download the site from the Wayback Machine URL - been doing this for years.

wget -r -np -k "https://web.archive.org/web/DATENUMBER/URL"


The tool lets you do more, for example it disables all the modifications that the Wayback web interface does to the original files. It also lets you download the whole history of the site.
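
Roughly, from memory of the README (so verify against --help):

# Latest snapshot of every file, saved as unmodified originals under ./websites/example.com
wayback_machine_downloader http://example.com
# Every timestamped snapshot instead of just the latest
wayback_machine_downloader http://example.com --all-timestamps
# Or just the snapshots from a given window
wayback_machine_downloader http://example.com --from 20150101 --to 20151231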


What's the advantage to, say, using something like wget --mirror ?


This tool appears to be using the Wayback Machine's CDX index to assist with the download.

The CDX basically lists all the pages of the site that are archived in the wayback, as well as when each page was archived (e.g: "page x was archived 3 times, on these dates"). Using the CDX allows the tool to download a specific copy of the site (e.g: the latest) rather than trying to download every copy of the site that the wayback machine has.

This is important because for most sites, the wayback has multiple copies, and they're all interlinking. For example, the copy from May 2020 might not be complete so one of the links in that copy will take you to the January 2018 copy. Not a problem for a human viewer, but a bot / crawler will see pages in the January 2018 copy as separate from those in the May 2020 copy, so will begin downloading the January 2018 copy (because wayback URLs are of the form web.archive.org/<timestamp>/<archive-url> rather than web.archive.org/<archive-url>/<timestamp>). This copy will (inevitably) lead to other copies made at different dates, and before you know it you're downloading hundreds or even thousands of copies of the same site.

[source: tried to download a site from the wayback machine several years ago using wget - it didn't end well!]


It can also download every copy of the site that the Wayback Machine has, which is sometimes useful.


It doesn’t add Wayback Machine’s navigation bar, presumably?


Correct, it appends "id_" to the timestamp in the wayback URLs, which gives you the unmodified file instead of one marked up by the Wayback Machine.
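
In practice that's the difference between these two URL forms (same DATENUMBER/URL placeholders as the wget example upthread):

wget "https://web.archive.org/web/DATENUMBER/URL"      # rewritten copy: Wayback toolbar injected, links rewritten
wget "https://web.archive.org/web/DATENUMBERid_/URL"   # raw original bytes as captured, no toolbar or rewriting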


> Default is one file at a time (ie. 20)

I don’t understand this. Did they mean “e.g.” instead of “i.e.”?


Even with e.g. it doesn't make much sense.


1=20?


Lol, just a crazy thought: can we use this tool to download archives and create a replica of existing/old sites?


Definitely, that is basically what it is for.


Finally a free version. People have ripped off others with paid services for years (I say ripped off because they used open source projects without attribution).


Which paid services are you referring to? These services likely aren't distributing the projects they are based on; if so, they are in compliance with the licenses of those open source projects, which probably don't require attribution unless you distribute them.

This project started in 2015 btw. Another similar project called waybackpack started in 2016. There are probably more projects. IMO wayback-machine-downloader is the better project though.

https://github.com/jsvine/waybackpack

The Wayback CDX Server API these projects are based on is quite simple to use btw, just some JSON responses to decode.

https://archive.org/help/wayback_api.php https://github.com/internetarchive/wayback/blob/master/wayba...
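
e.g. something like this (from memory of the docs, the first row of the JSON output is the list of field names: urlkey, timestamp, original, mimetype, statuscode, digest, length):

curl "http://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=5"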



