I'd be interested to know how you fully resolve external dependencies. For example, do you pull in JS libraries that are loaded dynamically from within other JS files (as opposed to those simply referenced statically as script includes in the HTML)? If so, are you rendering the page in a headless browser to do this?
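To make the question concrete, here's a rough sketch of the kind of thing I mean, using Selenium with headless Chrome (I have no idea what the service actually runs): render the page, then enumerate whatever script tags exist after the JS has executed, including the dynamically injected ones.

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com")
    time.sleep(5)  # crude; a real crawler would wait for the network to go idle

    # Dynamically injected <script> tags sit in the DOM just like static ones,
    # so after rendering they can be enumerated and fetched.
    script_urls = [
        el.get_attribute("src")
        for el in driver.find_elements(By.TAG_NAME, "script")
        if el.get_attribute("src")
    ]
    print(script_urls)

    driver.quit()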
By the way, I think the idea for the service is great, although a little too pricey for me to start using yet ;-) I always need to search my bookmarks. As a proxy for doing this, I currently use Google's "search visited pages" feature: when you're logged in to Google and search, you now get the option to constrain the search to only those pages that you have visited in the past - a superset of bookmarks, but useful nonetheless.
Connected to this: a lot of page content is often pulled in via JS.
For example, Facebook's page as seen via links is basically a long list of script tags without any actual content.
Without JavaScript evaluation it seems that a lot of content would be lost.
Based on my use of Firebug, I'm under the impression that even if content was put onto the page with JS (or the HTML source is badly broken), the browser will still build a valid DOM for the page and should be able to save it back out as HTML/CSS.
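For example (and this is just me poking at it with Selenium, not anything the service necessarily does), you can ask the browser for the DOM it built rather than the raw source:

    # Assumes a Selenium `driver` like the one in the snippet above, after the
    # page has finished loading and its JS has run.
    rendered_html = "<!DOCTYPE html>\n" + driver.execute_script(
        "return document.documentElement.outerHTML;"
    )

    # The serialized tree includes JS-injected content and any markup the
    # browser repaired while parsing.
    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(rendered_html)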
Surely the service is exposed to copyright claims. If the developers/business owners are reading, I would be interested to hear about what issues have arisen so far.
I've tried making PDFs of interesting webpages before, for personal archiving, using various PDF printers for Windows. I always ended up with files that looked weird and not at all like the webpage.
I eventually found a Firefox extension that would save pages perfectly as a JPG or PNG, but then the text was no longer selectable or searchable.
I suppose a big part of the storage problem he talks about is solved by aggressively looking for duplicate and similar files across users. I mean, it's a given that a lot of people will be bookmarking the same sites.
Yes, but they will also be bookmarking a lot of stuff that only they bookmark.
In our dataset of over a billion bookmarks, 80% of URLs were bookmarked by only a single user. Those URLs account for about 50% of all bookmarks (user-document pairs).
Incidentally, worio.com (my startup) offers full-text search of your bookmarks (though not a viewable cached copy, like these services).
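For anyone who wants a quick local stand-in for full-text bookmark search, SQLite's FTS5 gets you surprisingly far. This is just a toy, not how worio.com actually works:

    # Toy full-text bookmark index using SQLite FTS5 (requires an SQLite build
    # with FTS5 enabled, which most Python distributions ship with).
    import sqlite3

    conn = sqlite3.connect("bookmarks.db")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)"
    )

    conn.execute(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        ("https://example.com", "Example", "full text of the saved page ..."),
    )
    conn.commit()

    # Rank matches by relevance; bm25() is built into FTS5 (lower is better).
    for url, title in conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
        ("search terms",),
    ):
        print(url, title)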
I'd bet the percentage of external resources referenced by bookmarked pages that are unique to a single user would be a lot lower... which is where the aggressive de-duplication would come in handy.
It's not only the same sites. A lot of included content is probably shared as well: JavaScript libraries are very often the same, as are header images, backgrounds, stylesheets, and the Facebook/Google/Yahoo/Twitter icons for the various connect-like services, and the same YouTube videos get embedded in many places, etc.
There is a lot in common across different pages, I believe.
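A content-addressed store makes that kind of dedup almost free: hash the bytes, store each blob once, and the millionth copy of jquery.js costs nothing extra. A toy sketch (paths and names are made up):

    # Content-addressed blob store: identical resources collapse to one file
    # on disk no matter how many users' saved pages reference them.
    import hashlib
    import os

    STORE = "blob_store"
    os.makedirs(STORE, exist_ok=True)

    def store_blob(data: bytes) -> str:
        """Save `data` once; return the hash that saved pages can reference."""
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(STORE, digest)
        if not os.path.exists(path):  # duplicate content: skip the write
            with open(path, "wb") as f:
                f.write(data)
        return digest

    # Two users bookmark pages pulling in the same library: one blob on disk.
    ref_a = store_blob(b"/* jquery 1.4.2 ... */")
    ref_b = store_blob(b"/* jquery 1.4.2 ... */")
    assert ref_a == ref_b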
I use zotero to save full pages; I like it. It's software that runs locally in ff.
If someone is interested, I'd love to talk about some improvements -- you could start with
1 - remove ads
2 - create a single-file website archive format like the one IE used (uses?). It's stupid to litter your file system with JS and images when all you want is one file that includes the HTML and all its dependencies (see the sketch below)
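Something like the following would get you most of the way there for a personal tool: inline the dependencies as data: URIs so the archive is one self-contained HTML file, which is roughly what IE's .mht format achieves via MIME. Just a sketch using requests and BeautifulSoup, handling images only (CSS and scripts could be inlined the same way):

    import base64
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def archive(url: str, out_path: str) -> None:
        """Fetch a page and write it out as one self-contained HTML file."""
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")

        for img in soup.find_all("img", src=True):
            resource = requests.get(urljoin(url, img["src"]))
            mime = resource.headers.get("Content-Type", "image/png")
            payload = base64.b64encode(resource.content).decode("ascii")
            img["src"] = f"data:{mime};base64,{payload}"  # embed the image inline

        with open(out_path, "w", encoding="utf-8") as f:
            f.write(str(soup))

    archive("https://example.com", "example_archive.html")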