I'd be interested to know how you fully resolve external dependencies. For example, do you pull in JS libraries that are loaded dynamically from within other JS files (as opposed to those simply referenced statically as script includes in the HTML)? If so, are you rendering the page in a headless browser to do this?
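To make the question concrete, here's a rough sketch of the kind of thing I mean, using Selenium with headless Chrome (I have no idea what the service actually runs): render the page, then enumerate whatever script tags exist after the JS has executed, including the dynamically injected ones.

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com")
    time.sleep(5)  # crude; a real crawler would wait for the network to go idle

    # Dynamically injected <script> tags sit in the DOM just like static ones,
    # so after rendering they can be enumerated and fetched.
    script_urls = [
        el.get_attribute("src")
        for el in driver.find_elements(By.TAG_NAME, "script")
        if el.get_attribute("src")
    ]
    print(script_urls)

    driver.quit()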
By the way, I think the idea for the service is great, although a little too pricey for me to start using yet ;-) I always need to search my bookmarks. As a proxy for doing this, I currently use Google's "search visited pages" feature: when you're logged in to Google and search, you now get the option to constrain the search to only those pages that you have visited in the past - a superset of bookmarks, but useful nonetheless.
Connected to this: a lot of page content is often pulled in via JS.
For example, Facebook's page as seen via links is basically a long list of script tags without any actual content.
Without JavaScript evaluation it seems that a lot of content would be lost.
Based on my use of Firebug, I'm under the impression that even if content was put onto the page with JS (or the HTML source is badly broken), the browser will still build a valid DOM for the page and should be able to save it back out as HTML/CSS.
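For example (and this is just me poking at it with Selenium, not anything the service necessarily does), you can ask the browser for the DOM it built rather than the raw source:

    # Assumes a Selenium `driver` like the one in the snippet above, after the
    # page has finished loading and its JS has run.
    rendered_html = "<!DOCTYPE html>\n" + driver.execute_script(
        "return document.documentElement.outerHTML;"
    )

    # The serialized tree includes JS-injected content and any markup the
    # browser repaired while parsing.
    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(rendered_html)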
Surely the service is exposed to copyright claims. If the developers/business owners are reading, I would be interested to hear about what issues have arisen so far.
I've tried making PDFs of interesting webpages before, for personal archiving, using various PDF printers for Windows. I always ended up with files that looked weird and not at all like the webpage.
I eventually found a Firefox extension that would save pages perfectly as a JPG or PNG, but then the text was no longer selectable or searchable.
I suppose a big part of the storage problem he talks about is solved by aggressively looking for duplicate and similar files across users. I mean, it's a given that a lot of people will be bookmarking the same sites.
Yes, but they will also be bookmarking a lot of stuff that only they bookmark.
In our dataset of over a billion bookmarks, 80% of URLs were bookmarked by only a single user. Those URLs account for about 50% of all bookmarks (user-document pairs).
Incidentally, worio.com (my startup) offers full-text search of your bookmarks (though not a viewable cached copy, like these services).
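For anyone who wants a quick local stand-in for full-text bookmark search, SQLite's FTS5 gets you surprisingly far. This is just a toy, not how worio.com actually works:

    # Toy full-text bookmark index using SQLite FTS5 (requires an SQLite build
    # with FTS5 enabled, which most Python distributions ship with).
    import sqlite3

    conn = sqlite3.connect("bookmarks.db")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)"
    )

    conn.execute(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        ("https://example.com", "Example", "full text of the saved page ..."),
    )
    conn.commit()

    # Rank matches by relevance; bm25() is built into FTS5 (lower is better).
    for url, title in conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
        ("search terms",),
    ):
        print(url, title)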
I'd bet the percentage of external resources referenced by bookmarked pages that are unique to a single user would be a lot lower... which is where the aggressive de-duplication would come in handy.
It's not only the same sites. A lot of included content is probably shared as well: JavaScript libraries are very often the same, as are header images, backgrounds, stylesheets, and the Facebook/Google/Yahoo/Twitter icons for the various connect-like services, and the same YouTube videos get embedded in many places, etc.
There is a lot in common across different pages, I believe.
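A content-addressed store makes that kind of dedup almost free: hash the bytes, store each blob once, and the millionth copy of jquery.js costs nothing extra. A toy sketch (paths and names are made up):

    # Content-addressed blob store: identical resources collapse to one file
    # on disk no matter how many users' saved pages reference them.
    import hashlib
    import os

    STORE = "blob_store"
    os.makedirs(STORE, exist_ok=True)

    def store_blob(data: bytes) -> str:
        """Save `data` once; return the hash that saved pages can reference."""
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(STORE, digest)
        if not os.path.exists(path):  # duplicate content: skip the write
            with open(path, "wb") as f:
                f.write(data)
        return digest

    # Two users bookmark pages pulling in the same library: one blob on disk.
    ref_a = store_blob(b"/* jquery 1.4.2 ... */")
    ref_b = store_blob(b"/* jquery 1.4.2 ... */")
    assert ref_a == ref_b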
I use zotero to save full pages; I like it. It's software that runs locally in ff.
If someone is interested, I'd love to talk about some improvements -- you could start with
1 - remove ads
2 - create a single-file website archive format like the one IE used (uses?). It's stupid to litter your file system with JS and images when all you want is one file that includes the HTML and all its dependencies (see the sketch below)
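Something like the following would get you most of the way there for a personal tool: inline the dependencies as data: URIs so the archive is one self-contained HTML file, which is roughly what IE's .mht format achieves via MIME. Just a sketch using requests and BeautifulSoup, handling images only (CSS and scripts could be inlined the same way):

    import base64
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def archive(url: str, out_path: str) -> None:
        """Fetch a page and write it out as one self-contained HTML file."""
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")

        for img in soup.find_all("img", src=True):
            resource = requests.get(urljoin(url, img["src"]))
            mime = resource.headers.get("Content-Type", "image/png")
            payload = base64.b64encode(resource.content).decode("ascii")
            img["src"] = f"data:{mime};base64,{payload}"  # embed the image inline

        with open(out_path, "w", encoding="utf-8") as f:
            f.write(str(soup))

    archive("https://example.com", "example_archive.html")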