
I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document. That'd change the semantics of hyperlinks from "link to the document at this internet address" to "link to the document with these contents", just like Hashify does, but it could handle arbitrarily large documents.
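To make that concrete, here's a minimal sketch (the function name and sample page are made up) of deriving a content address from a document's bytes:

    import hashlib

    def content_address(document: bytes) -> str:
        # SHA-1 as in the paragraph above; a stronger hash (e.g. SHA-256)
        # would be preferable in practice.
        return hashlib.sha1(document).hexdigest()

    page = b"<html><body>Hello, permanent web.</body></html>"
    print(content_address(page))
    # Any copy of these exact bytes, fetched from anywhere, gets the same address.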

The tricky part with that system would be that you'd also need some new mechanism to retrieve the files. Instead of the regular WWW stack, you'd need something like a massive distributed hash table that could handle querying for and transferring the hashed files at scale. Many P2P file sharing systems already do this, but a sparse collection of end-user machines, each holding a few hashed files, isn't a very efficient service cloud. If every ISP ran this sort of thing in its service stack, or if Amazon and Google decided to run the service, all of them dynamically caching in-demand documents across more nodes, things might look very different.
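As a toy sketch of that retrieval layer, with the whole service cloud collapsed into a single in-memory table (all names are illustrative, not a real DHT):

    import hashlib

    class ContentStore:
        def __init__(self):
            self._blobs = {}  # content address -> document bytes

        def put(self, document: bytes) -> str:
            key = hashlib.sha1(document).hexdigest()
            self._blobs[key] = document
            return key

        def get(self, key: str) -> bytes:
            doc = self._blobs[key]
            # Self-verifying: the retriever can check that the bytes match the
            # address, so it doesn't matter which node or cache served them.
            assert hashlib.sha1(doc).hexdigest() == key
            return doc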

This would mean that very old hypertext documents would still be trivially readable with working links, as long as a few copies of the page documents were still stored somewhere, even if the original hosting servers were long gone. It would also make distributed page caching easy, so that a page getting a sudden large influx of traffic wouldn't create massive load on a single server.

On the other hand, any sort of news site where the contents at the URL are expected to change wouldn't work, nor would URLs expected to point to the latest version of a document instead of the one current at the time of linking. Once a hash URL was out, no revision of the hashed document could be made visible through that URL without some additional protocol layer (sketched below). The URL strings would also be opaque to humans and too long and random to be committed to memory or typed by hand. The web would probably need to be split somehow into human-readable URLs for dynamic pages and hash URLs for the static pieces of content served by those pages.
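One hypothetical shape for that additional protocol layer: a mutable, human-readable name that points at the latest immutable hash. All names below are made up for illustration:

    import hashlib

    store = {}   # content address -> bytes (the immutable layer)
    latest = {}  # human-readable name -> current content address

    def publish(name: str, document: bytes) -> str:
        key = hashlib.sha1(document).hexdigest()
        store[key] = document
        latest[name] = key  # only this pointer changes on revision
        return key

    v1 = publish("news/frontpage", b"Tuesday's headlines")
    v2 = publish("news/frontpage", b"Wednesday's headlines")
    assert latest["news/frontpage"] == v2       # the name follows the newest version
    assert store[v1] == b"Tuesday's headlines"  # old hash links still resolve to old content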

I'm probably reinventing the wheel here, and someone has already worked out a more thorough version of this idea.




I think Freenet already does this: https://en.wikipedia.org/wiki/Freenet#Keys

Edit: I should point out that it's a separate network from "the web".


> I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document

Git.
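Git's object store works exactly this way: a blob's ID is the SHA-1 of a small header plus the file's contents, so identical content always gets the same name no matter where the repository lives. A quick check in Python:

    import hashlib

    def git_blob_hash(content: bytes) -> str:
        # Git hashes "blob <length>\0" followed by the raw content.
        header = b"blob " + str(len(content)).encode() + b"\0"
        return hashlib.sha1(header + content).hexdigest()

    # Should match `git hash-object <file>` for the same bytes.
    print(git_blob_hash(b"hello\n"))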

-

It may be of interest to view this duality as an analog to the duality of location addressing (iterative) vs value addressing (functional) in the context of memory managers. The general (hand-wavy as of now) idea is a distributed memory system with a functional front-end (e.g. Scala/Haskell).




Right. Turns out my use of 'URL' everywhere in the grandparent comment is a misnomer, then. I should've used URN or URI.

I'm not quite sure URN is exactly right for the hash thing either: it fails to unify things humans would probably assign the same URN, such as two image files of the same picture in different encodings, and it has a theoretical chance of assigning the same hash to two entirely different things.
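To illustrate the first limitation: two encodings of the "same" picture are different byte strings, so a pure content hash gives them completely unrelated names (the byte strings below are placeholders, not real image data):

    import hashlib

    png_bytes = b"\x89PNG...same picture, PNG encoding..."
    jpg_bytes = b"\xff\xd8\xff...same picture, JPEG encoding..."

    print(hashlib.sha1(png_bytes).hexdigest())
    print(hashlib.sha1(jpg_bytes).hexdigest())  # entirely different address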


I think these issues are clearly answered by the RFC: http://tools.ietf.org/html/rfc1737

* Global uniqueness: The same URN will never be assigned to two different resources. ((the encoding would be part of the URN))

* Independence: It is solely the responsibility of a name issuing authority to determine the conditions under which it will issue a name. ((a URN wouldn't necessarily be a hash of the resource in question))

The second point makes it pretty clear that the assignment of URNs would be done by authoritative parties, which makes sense when you consider that URNs were initially envisioned for linking citations and references in research papers. It's just that the Internet branched far beyond that scope a long time ago.


This (non-canonical) interpretation of the UR<x> schemata works for me (in terms of dealing with the ambiguity of the canonical specification); a rough sketch follows the list:

Names: universally unique and fully scoping the life-cycle of the (logical) object. 1:1.

Identifiers: unique in the context of an authority, with a life-cycle that is at most (but not necessarily) bounded by the life-cycle of the named entity (and, of course, of the authority that assigns it). E.g. the SSA (http://www.ssa.gov/history/ssn/geocard.html) is the authority that issues SSN identifiers. An entity can potentially have multiple such identifiers. 1:N.

Locations: The location of an image or representation of the entity. 1:N (e.g. CDNs)
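Here is that sketch of the 1:1 / 1:N relationships as a record type (the field names are invented for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Resource:
        name: str                                         # 1:1, spans the object's whole life-cycle
        identifiers: list = field(default_factory=list)   # 1:N, each issued by some authority (e.g. an SSN)
        locations: list = field(default_factory=list)     # 1:N, where representations live (e.g. CDN URLs)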


The creator of Freenet made something called dijjer, which mirrors HTTP files in a P2P network; a mirrored file is accessible by prepending http://dijjer.org/get/ to its URL. But it looks like he's no longer maintaining it.

http://code.google.com/p/dijjer/
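As described above, the mirror URL is just the Dijjer prefix prepended to the original HTTP URL; a trivial sketch (the example URL is a placeholder):

    def dijjer_url(original: str) -> str:
        return "http://dijjer.org/get/" + original

    print(dijjer_url("http://example.com/some/file.iso"))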


Check out http://en.wikipedia.org/wiki/Magnet_URI_scheme

A single link for a file can contain multiple hashes, supporting multiple means of retrieval.
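A sketch of building such a link for one file identified by its SHA-1 (Magnet's urn:sha1 form uses a Base32-encoded digest; the file contents and display name are illustrative):

    import base64
    import hashlib
    from urllib.parse import quote

    data = b"...the file's bytes..."
    sha1_b32 = base64.b32encode(hashlib.sha1(data).digest()).decode()

    magnet = "magnet:?xt=urn:sha1:" + sha1_b32 + "&dn=" + quote("example-file.iso")
    print(magnet)
    # Clients that support it can carry additional "xt" hashes (e.g. urn:btih:...)
    # in the same link, which is the multi-hash property mentioned above.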



