Hashify: what becomes possible when one is able to store documents in URLs? (hashify.me)
186 points by potomak on Dec 30, 2011 | 77 comments



I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document. That'd change the semantics of hyperlinks from "link to document at this internet address" to "link to document with these contents", just like Hashify does, but it could handle arbitrarily large documents.

The tricky part with that system would be that you'd also need some new mechanism to retrieve the files. Instead of the regular WWW stack, you'd need something like a massive distributed hash table that could handle distributed querying and transfer of the hashed files. Many P2P file sharing systems already do this, but a sparse collection of end-user machines containing a few hashed files each isn't a very efficient service cloud. If every ISP had this sort of thing in their service stack, or if Amazon and Google decided to run the service, all of them dynamically caching documents in greater demand on more nodes, things might look very different.

This would mean that very old hypertext documents would still be trivially readable with working links, as long as a few copies of the page documents were still hashed somewhere, even if the original hosting servers were long gone. It would also make it easy to do distributed page caching, so that pages that get a sudden large influx of traffic wouldn't create massive load on a single server.

On the other hand, any sort of news site where the contents at the URL are expected to change wouldn't work, nor would URLs expected to point to the latest version of a document instead of the one at the time of linking. Once a hash URL was published, no revision to the hashed document would be visible to anyone following that URL without some additional protocol layer. The URL strings would also be opaque to humans and too long and random to be committed to memory or typed by hand. The web would probably need to be somehow split into human-readable URLs for dynamic pages and hash URLs for the static pieces of content served by those pages.
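
A minimal sketch of what's being described, assuming a made-up hash:// scheme (none of this is an existing protocol): the digest of the bytes is the address, and whoever fetches the bytes can verify them by re-hashing.

    import { createHash } from "crypto";

    // The address of a document is just the SHA-1 of its bytes, so any node
    // that holds the bytes can serve them, and the fetcher can check what it
    // received by hashing it again.
    function contentAddress(doc: Buffer): string {
      const digest = createHash("sha1").update(doc).digest("hex");
      return `hash://sha1/${digest}`; // illustrative scheme, not a registered one
    }

    function verify(address: string, received: Buffer): boolean {
      return contentAddress(received) === address;
    }

    const doc = Buffer.from("# A page on the content-addressed web\n");
    console.log(contentAddress(doc));              // hash://sha1/<40 hex chars>
    console.log(verify(contentAddress(doc), doc)); // true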

I'm probably reinventing the wheel here, and someone has already worked out a more thought-out version of this idea.


I think Freenet already does this: https://en.wikipedia.org/wiki/Freenet#Keys

Edit: I should point out that it's a separate network from "the web".


> I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document

Git.


It may be of interest to view this duality as an analog to the duality of location addressing (iterative) vs. value addressing (functional) in the context of memory managers. The general (hand-wavy as of now) idea is a distributed memory system with a functional front-end (e.g. Scala/Haskell).




Right. Turns out my use of 'URL' everywhere in the grandparent comment is a misnomer then. Should've used URN or URI.

I'm not quite sure if URN is exactly right for the hash thing either, given that it both fails to unify things humans would probably assign the same URN to, such as two image files of the same picture using different encodings, and has a theoretical chance of assigning the same hash to two entirely different things.


I think these issues are clearly answered by the RFC: http://tools.ietf.org/html/rfc1737

* Global uniqueness: The same URN will never be assigned to two different resources. ((the encoding would be part of the URN))

* Independence: It is solely the responsibility of a name issuing authority to determine the conditions under which it will issue a name. ((a URN wouldn't necessarily be a hash of the resource in question))

The second point makes it pretty clear that the assignment of URNs would be done by authoritative parties, which makes sense if you consider that in the initial view URNs would have been useful for linking citations and references in research papers. It's just that the Internet branched away from that scope long ago.


This (non-canonical) interpretation of the UR<x> schemata works for me (in terms of dealing with ambiguity of the canonical specification):

Names: universally unique and fully scoping the life-cycle of the (logical) object. 1:1.

Identifiers: unique in context of an authority with a life-cycle that is maximally (but not necessarily) bounded by the life-cycle of the named entity (and of course, the authority that assigns it). e.g. http://www.ssa.gov/history/ssn/geocard.html is the authority that issues SSN identifiers. An entity can potentially have multiple such identifiers. 1:N

Locations: The location of an image or representation of the entity. 1:N (e.g. CDNs)


The creator of Freenet made something called Dijjer, which mirrors HTTP files in a P2P network, accessible by prepending http://dijjer.org/get/ to the original URL. But it looks like he's no longer maintaining it.

http://code.google.com/p/dijjer/


Check out http://en.wikipedia.org/wiki/Magnet_URI_scheme

A single link for a file can contain multiple hashes, enabling multiple means of retrieval.
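
For illustration, a single magnet link can carry several exact-topic ("xt") hashes side by side, so a client can fetch the same file over whichever network it supports (the hash values here are placeholders):

    magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a&xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C&dn=example.txt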


Very neat idea, but I think the reliance on bit.ly is self-defeating. This kind of approach would allow people to distribute documents using the web without having to entrust them to a particular server, which can be very convenient if your target audience is in a country where access to the server storing your documents can be blocked. For this to work you need to be able to recover the document from the URL locally.

Some years ago a friend and I wrote http://notamap.com, a very similar idea for sharing/storing/embed geotagged notes fully encoded on a URL, without having to rely on a server. Looking at it now I wish we had not put all the crazy animations. Maybe I should recover it and simplify the UI.


> Very neat idea, but I think the reliance on bit.ly is self-defeating. This kind of approach would allow people to distribute documents using the web without having to entrust them to a particular server, which can be very convenient if your target audience is in a country where access to the server storing your documents can be blocked.

I just can't see the gain here. You need a server to distribute the URLs in any case. You are just moving the data from the server that served the document to the server that serves the URLs. It is still the same data, just in a different form.

> For this to work you need to be able to recover the document from the URL locally.

How about saving the document?


Right, exactly. It's only really cool for magnet links because the law is different for "linking to content" vs "hosting content" right now.

One step further in this direction would just mean "including all the data for one or more documents on one page". You've just invented...a document, possibly with more than one media type.

Think about it this way: You have a 10K document which contains 200 bytes of link data; a URI to another hypertext document that is 5K in size...vs...you have a 15K document which contains 5K of "link data"; the other document.


I don't get it. Why do they claim that this is in any way better than a data: URL? (http://es.wikipedia.org/wiki/Data:_URL)


This lets you use the data as a piece of a URL, so you can pass it as a CGI query string to another web page.


I still don't get it. Passing data as pieces of URLs is what normal query parameters are for, and data: URLs are for generating a "virtual file", i.e. a link that contains all the information of the linked file.

With those 2 things, everything should be covered.
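
For concreteness, the two mechanisms being contrasted look roughly like this; the base64 payload decodes to <h1>Hello</h1>, and example.com is just a placeholder host:

    data:text/html;base64,PGgxPkhlbGxvPC9oMT4=              <- a self-contained "virtual file"
    http://example.com/view?doc=PGgxPkhlbGxvPC9oMT4=        <- the same data passed as a query parameter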


Some limitations: you can't redirect to a data: URL, e.g. <img src=http://tinyurl.com/44c8ctt > or <a href=http://tinyurl.com/44c8ctt >link</a> gets stopped by Chrome as an injection attack. So you can't use data: URLs to (ab)use a link shortener as a CDN.



I looked at URL shortener limits some time ago and found these approximate limits by trial-and-error:

* TinyURL: 65,536 characters and probably more, but requests timed out; there isn't an explicit limit, apparently.

* Bit.ly: 2,000 characters.

* Is.gd: 2,000 characters.

* Twurl.nl: 255 characters.

This was 2.5 years ago; I'm not sure how many of these have changed (other than bit.ly, which the linked article confirms is now 2048, probably the same as when I tested it).

http://softwareas.com/the-url-shortener-as-a-cloud-database


2000 is roughly the maximum length of a URL that IE can handle, incidentally.



Boiling it down, it's a new file format with a built in viewer. You need to find a way to store the data.

Interesting, but I can't think of any practical application, apart from the service provider not having to worry about storage (maybe that's key ... more thinking needed).


It would be, except it's not new; it's a dupe of a post from not too long ago that got a large number of points: http://news.ycombinator.com/item?id=2464213


I really like this approach, in fact that's what I used for http://www.patternify.com/

This way the whole tool can be 100% client-side JavaScript, without the need for any back-end.
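
A rough sketch of that pattern (not Patternify's actual code, just the general idea): serialize the state, base64-encode it, and keep it in the URL fragment so no server round-trip is needed.

    // Sketch only: keep the tool's state entirely client-side, in the URL fragment.
    function saveStateToUrl(state: object): void {
      const bytes = new TextEncoder().encode(JSON.stringify(state));
      window.location.hash = btoa(String.fromCharCode(...bytes));
    }

    function loadStateFromUrl<T>(): T | null {
      const encoded = window.location.hash.slice(1);
      if (!encoded) return null;
      const bytes = Uint8Array.from(atob(encoded), c => c.charCodeAt(0));
      return JSON.parse(new TextDecoder().decode(bytes)) as T;
    }

    // e.g. saveStateToUrl({ pixels: [[0, 0], [1, 1]], color: "#333" });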


The original version of Mr Doob's GLSL Sandbox at http://mrdoob.com/projects/glsl_sandbox/ used the same approach, but increased the maximum possible size of the document by doing LZMA compression before base64.

The project later moved to http://glsl.heroku.com/ with an app-driven gallery, and that particular feature went away. I think that is a pretty natural evolution of any such idea, so I'm not convinced of hashify's longevity, but hey, sometimes simple really is enough.


Cool to see, but a stupid idea. Who in their right mind would use this for production?! By using such a "technology", you lose SEO strength due to urls-not-being-like-this.html, and even worse, what can stop me from publishing a fake press release on their site/spamming porn and getting that URL indexed? And what are the benefits? To also bring SOPA into this, couldn't I share copyrighted material on someone's site like this? How could they control that?! Besides blocking each URL manually. Just seems dumb. As a concept, cool, but for production.... Yikes?!


Obviously you could also have a database of all the content URL strings that you publish - but that makes this technology worth nothing at all.


"Internet Explorer cannot display the webpage" is what happens here (IE 8).


Not really a good reply, but I think that hashify.me's potential for an IE audience was probably small to start with. But consider this: if this idea took off, wouldn't this press MS into keeping IE more modern?


My mobile browser of choice does not support cross-origin resource sharing, according to the article...or rather the error message I get in lieu of the article.


Same with Opera and Chrome on my desktop. Firefox works at least.


Yeah, his message was 5548 characters (so the URL is at least 7380), which is way too long for the generally accepted 2000-character limit. So this particular protocol could use some enhancements - maybe a feature to break up messages into several parts represented by other bit.ly addresses. That would keep URL length under control.


Works happily enough with mobile Safari on iOS 5.


I took a similar approach to this with http://cueyoutube.com and recently found Snapbird, which gives extended Twitter search capabilities. The URL contains the playlist and Twitter becomes the database, so I just tweet my playlists and they're "saved". You can see all the lists I've created by searching the account iaindooley with the search term cueyoutube in Snapbird.


What becomes possible? The entire internet could effectively get rid of hosting account providers, with each page in every site being contained in a hashify URL, and with each page linking to other pages using other hashify URLs.

Trouble is, there might need to be a DNS-like system to match hashify URLs to more human-readable strings (or a way for existing DNS to resolve to hashify-style URLs).

Neat idea.


The data needs to be stored somewhere. In their implementation they in effect use bit.ly as the hosting provider for the data by shortening the URLs, so while it's a fun little experiment, it boils down to a content-addressable system. We already have good examples of content-addressable systems; Git, for example, is built on content-addressable storage.
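
For instance, Git addresses a blob by the SHA-1 of a short, documented header plus the contents, so identical content always maps to the same object id. A quick sketch:

    import { createHash } from "crypto";

    // Git object id for a blob: sha1("blob <size>\0" + content)
    function gitBlobId(content: Buffer): string {
      const header = Buffer.from(`blob ${content.length}\0`);
      return createHash("sha1")
        .update(Buffer.concat([header, content]))
        .digest("hex");
    }

    // Produces the same id as `git hash-object` would for the same bytes.
    console.log(gitBlobId(Buffer.from("hello, content-addressed world\n")));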


It also assumes that there's no limit on the URL space that bit.ly provides. Tomorrow they could just cap the "long_url" field, or whatever they call it, to accept only 1500 chars or something.


Wouldn't the document also be stored in your URL browsing history?


There is:

    "Storing a document in a URL is nifty, but not terribly 
    practical. Hashify uses the [bit.ly API][4] to shorten 
    URLs from as many as 30,000 characters to just 20 or so.
    In essence, bit.ly acts as a document store!"
http://tinyurl.com/3n6h8px


This was my point: they're passing their 30,000-char URL to bit.ly to get back a short URL (20 chars).

But what happens if bit.ly just says "Sorry, incoming URLs (long URLs) can only be a maximum of 1500 chars"?


And then the next paragraph:

  While the HTTP specification does not define an upper limit
  on the length of a URL that a user agent should accept,
  bit.ly imposes a 2048-character limit.


hence the "bit.ly – "rate limit exceeded" :\" message on top...

maybe switch to goo.gl?


The real trouble is that when you link to a hashified URL, you are actually embedding in your web page (an encoding of) the content of the page you are linking to. Think matryoshka.


Right. Even without link cycles, any link would have to contain (nearly) the whole internet.


Not to mention that makes link cycles impossible.


Nah! I am pretty sure that a quine can be made using this! http://en.wikipedia.org/wiki/Quine_(computing)


In essence, this would be moving away from a model of "large networks of connected pages/sites" to "a large number of single documents with no meaningful mechanism of interconnectedness".

Think about this like a PDF where stuff is embedded instead of in separate files.


Clever? Yes.

But URL shortening services are a public good, and hacking one to be your personal cloud storage platform is kind of a dick move.


agreed


This is cool, but I wouldn't use it for any real documents. I care about versioning, edit history, etc.


Great idea, this. But it saves on each edit and is likely to hit the rate limit on bit.ly :(


This is an ancient idea. I read a 2600 article back in the early 2000s or possibly late 1990s that did essentially this same thing using a bash script and one of the first URL shortening services available at the time.


   What has been will be again,
   what has been done will be done again;
   there is nothing new under the sun.

   - Ecclesiastes 1:9


...and the general concept of "embedding one type of data inside another" predates computers and modern civilization.


Check out https://neko.io/ ... we are scrambling/encrypting messages into URLs which you can then share on Facebook, Twitter, or wherever.




This is very cool. Not the tasydrive, your "keynav"

http://www.semicomplete.com/projects/keynav/


Maybe this would work well in an email? Especially if you want to get content past filtering.


Yes, but so would any pastebin website. You could say the advantage is that the content is not available to the server (since it's transferred in the URL itself), but then it is when you actually read it, so it's not any more private.


Ought to gzip the string before it goes to base64, while you're at it.


We actually do this exact thing to send dynamic parameters to a chart-generating backend server. It works great; you get a surprising amount of compression using gzip (2-4x space savings) and the URLs are naturally cached by proxies without any magic!
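
A hedged sketch of that kind of pipeline in Node (the endpoint and parameter name below are made up): gzip the payload, base64url-encode it so it is URL-safe, and reverse the steps on the other end.

    import { gzipSync, gunzipSync } from "zlib";

    // Compress, then base64url-encode so the payload can ride in a query string.
    // ("base64url" needs a reasonably recent Node; otherwise swap +/ for -_ by hand.)
    function encodeParams(params: object): string {
      return gzipSync(Buffer.from(JSON.stringify(params))).toString("base64url");
    }

    function decodeParams<T>(encoded: string): T {
      const json = gunzipSync(Buffer.from(encoded, "base64url")).toString("utf8");
      return JSON.parse(json) as T;
    }

    // e.g. https://charts.example.com/render?d=<encoded>   (hypothetical endpoint)
    const d = encodeParams({ type: "line", series: [[1, 2], [2, 4], [3, 9]] });
    console.log(d.length, decodeParams(d));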


If it's mostly ASCII, compression is something you should almost automatically think about.

There's also Snappy: http://code.google.com/p/snappy/


Some observations: 1. When content changes, every hyperlink to that content must change along with it. 2. Pass by reference (URL) is no longer possible.


The first use case that comes to mind is anonymity.


Solutions like pastebin.com provide the same, with a short URL.


These messages are SOPA-proof. They can never be "taken down" since they don't actually reside on the server.


Considering that SOPA considers linking to copyright infringing content to be as bad as publishing it yourself, this project unifies those two actions nicely!


They do, however, reside on the server hosting the link.


No, that's not entirely true. I don't have to use a service like bit.ly to send one of these messages. And further, I could just as easily use _any_ or _many_ services. Since the technology is fundamentally a browser-to-browser kind of distributed concept, it's only the URL shortening that's not SOPA-compliant.

There are also several ways to obscure the impact of SOPA on the URL shortening anyway. For instance, if several services use the same hash algorithm for representing URLs, they can be used interchangeably (if you post the URL to all of them). Further, you can always set up your own temporary shortening service as well.


* No, that's not entirely true. I don't have to use a service like thepiratebay.org to send one of these files. And further, I could just as easily use _any_ or _many_ trackers inside my torrent. *

Altered to convey another point. Naturally, it would be quite difficult to "embed" a feature-length movie into a single URL, but if one were to split the file into chunks, as torrent transfers do, or simply into a multi-part RAR, as newsgroups still do, each chunk becomes more manageable.

I do agree with you, though. But I think a service like this, if changed to be user-friendly for file sharing and not just document sharing, would be able to get around a lot of the pitfalls (which aren't many) that a torrent tracker, for example, would face if its DNS lookup were blocked. The reason is simply that SOPA is written in a way that assumes all IP addresses and DNS names are statically tied together and slow to alter, not that I can have a new domain name in a matter of minutes that resolves to my existing server. Even more so if the final URL encoding were nothing more than a common, known algorithm, like base64, that one could easily plug into a basic desktop app and get the same result.


How is this substantially better than just sending someone a file?

Plaintext: the past, present and future.


Well, for one thing, you can't tweet a link.


Link shortening service?! Twitter automatically shortens it for you :)


Damn, so cool!


Fuckify this -ify trend to emulate or ride the Spotify (anti)fame.

Is it just me, or does anyone else also just back away whenever there is a project which turns nouns into verbs with -ify? Spotify is a sockpuppet of


-ify is a pretty common suffix in English. It means to turn something into something else. https://en.wiktionary.org/wiki/-ify and https://en.wiktionary.org/wiki/Category:English_words_suffix...



