The tool captures screenshots in addition to text.

doodlesdev · on June 29, 2022

That seems awfully similar to archive.org's Wayback Machine. I do like to see all these archival projects, though; they are certainly worthwhile.

lazyjeff · on June 29, 2022

irchiver captures the text on the page, and separately OCRs the screenshots (specifically, the screenshot of your viewport). So you can search either what was actually shown on screen or what was in the page as a whole. Both techniques have pros and cons.
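To make the distinction concrete, here is a rough sketch of keeping the two text sources separate and searching one or the other. This is not irchiver's actual code; the `Capture` record, its field names, and the `search` function are made up for illustration, and the OCR step is assumed to have already produced `viewport_text`.

```python
from dataclasses import dataclass


@dataclass
class Capture:
    """One archived page view (hypothetical structure, for illustration only)."""
    url: str
    page_text: str      # text extracted from the page itself
    viewport_text: str  # text OCR'd from the viewport screenshot


def search(captures, query, field="either"):
    """Return URLs of captures whose chosen text source contains the query.

    field is "page", "viewport", or "either" (case-insensitive substring match).
    """
    q = query.lower()
    hits = []
    for c in captures:
        in_page = q in c.page_text.lower()
        in_view = q in c.viewport_text.lower()
        if field == "page":
            matched = in_page
        elif field == "viewport":
            matched = in_view
        else:
            matched = in_page or in_view
        if matched:
            hits.append(c.url)
    return hits
```

Searching the viewport text finds words that only existed as pixels (e.g. text inside an image), while searching the page text finds content that was present but scrolled out of view.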

While archive.org is fantastic, it can only capture pages that are both 1) publicly accessible (i.e. no social media content) and something it happens to crawl, and 2) static (you're out of luck if the content you want is loaded dynamically, or changes depending on user input).

pabs3 · on June 30, 2022

IIRC archive.org does save the JS and things it downloads, so you can replay them when you visit the archived site later.

doodlesdev · on July 1, 2022

I guess the difference in this case is that the JS on the web archive relies on future browsers being backwards compatible, whereas irchiver relies on much less to stay timeless, which is good. Although I don't think JavaScript will ever get a major update (as in one breaking compatibility), I believe relying on that is not a perfect way to archive web content. This kind of backwards-compatibility breakage is something we have seen before with the deprecation of Adobe Flash, and it could theoretically happen elsewhere in the web stack.

pabs3 · on July 1, 2022

Agreed. I wish they would archive the DOM too, like archive.is does, instead of just the requests.