I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document. That'd change the semantics of hyperlinks from "link to document at this internet address" to "link to document with these contents", just like Hashify does, but it could handle arbitrarily large documents.
The tricky part with that system would be that you'd also need some new mechanism to retrieve the files. Instead of the regular WWW stack, you'd need something like a massive distributed hash table that could handle querying and transferring the hashed files at scale. Many P2P file sharing systems already do this, but a sparse collection of end-user machines containing a few hashed files each isn't a very efficient service cloud. If every ISP had this sort of thing in their service stack, or if Amazon and Google decided to run the service, all of them dynamically caching in-demand documents on more nodes, things might look very different.
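A minimal sketch of that addressing-plus-retrieval split (the `hash://sha1/` scheme is purely illustrative, and the in-memory dict stands in for what would really be a DHT spread across many nodes):

```python
import hashlib

STORE = {}  # stand-in for a distributed hash table spread across many nodes

def put(document: bytes) -> str:
    """Store a document and return its content-derived address."""
    digest = hashlib.sha1(document).hexdigest()
    STORE[digest] = document
    return "hash://sha1/" + digest  # illustrative scheme, not a real one

def get(address: str) -> bytes:
    """Fetch by address and verify the bytes actually match the hash."""
    digest = address.rsplit("/", 1)[-1]
    document = STORE[digest]  # a real client would query peers/caches here
    assert hashlib.sha1(document).hexdigest() == digest, "content corrupted"
    return document

url = put(b"<p>an immutable, content-addressed page</p>")
print(url)
print(get(url))
```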
This would mean that very old hypertext documents would still be trivially readable with working links, as long as a few copies of the page documents were still hashed somewhere, even if the original hosting servers were long gone. It would also make it easy to do distributed page caching, so that pages that get a sudden large influx of traffic wouldn't create massive load on a single server.
On the other hand, any sort of news site where the content behind the URL is expected to change wouldn't work, nor would URLs expected to point to the latest version of a document instead of the one current at the time of linking. Once the hash URL was out, no revision to the hashed document would be visible from following the URL without some additional protocol layer. The URL strings would also be opaque to humans and too long and random to be committed to memory or typed by hand. The web would probably need to be split somehow into human-readable URLs for dynamic pages and hash URLs for the static pieces of content served by those pages.
I'm probably reinventing the wheel here, and someone's already worked out a more thought-out version of this idea.
> I've sometimes wondered about a system where the URL of a document is an actual hash, like SHA-1, of the document
Git.
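Right, Git is exactly this: every object is addressed by the SHA-1 of its contents plus a small header. A quick sketch of how a blob hash is computed, which should match `git hash-object` for the same bytes:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the raw content
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches: echo 'hello' | git hash-object --stdin
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```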
It may be of interest to view this duality as an analog to the duality of location addressing (iterative) vs value addressing (functional) in the context of memory managers. The general (hand-wavy as of now) idea is a distributed memory system with a functional front-end (e.g. Scala/Haskell).
Right. Turns out my use of 'URL' everywhere in the grandparent comment is a misnomer then. Should've used URN or URI.
I'm not quite sure if URN is exactly right for the hash thing either, given that it both fails to unify things humans would probably assign the same URN to, such as two image files of the same picture using different encodings, and has a theoretical chance of assigning the same hash to two entirely different things.
* Global uniqueness: The same URN will never be assigned to two different resources. ((the encoding would be part of the URN))
* Independence: It is solely the responsibility of a name issuing authority to determine the conditions under which it will issue a name. ((a URN wouldn't necessarily be a hash of the resource in question))
The second point makes it pretty clear that the assignment of URNs would be done by authoritative parties, which makes sense when you consider that in the original vision URNs would have been useful for linking citations and references in research papers. It's just that the Internet branched away from that scope a long time ago.
This (non-canonical) interpretation of the UR<x> schemata works for me (in terms of dealing with the ambiguity of the canonical specification):
Names: universally unique and fully scoping the life-cycle of the (logical) object. 1:1.
Identifiers: unique in context of an authority with a life-cycle that is maximally (but not necessarily) bounded by the life-cycle of the named entity (and of course, the authority that assigns it). e.g. http://www.ssa.gov/history/ssn/geocard.html is the authority that issues SSN identifiers. An entity can potentially have multiple such identifiers. 1:N
Locations: The location of an image or representation of the entity. 1:N (e.g. CDNs)
The creator of Freenet made something called dijjer, which mirrors HTTP files in a P2P network, accessible by prepending http://dijjer.org/get/ to the original URL. But it looks like he's no longer maintaining it.
Very neat idea, but I think the reliance on bit.ly is self-defeating. This kind of approach would allow people to distribute documents using the web without having to trust them to a particular server, which can be very convenient if your target audience is in a country where access to the server storing your documents can be blocked. For this to work you need to be able to recover the document from the URL locally.
Some years ago a friend and I wrote http://notamap.com, a very similar idea for sharing/storing/embedding geotagged notes fully encoded in a URL, without having to rely on a server. Looking at it now I wish we hadn't put in all the crazy animations. Maybe I should revive it and simplify the UI.
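For the curious, the general trick looks something like this (a rough sketch of encoding notes into a URL fragment; this is not notamap's actual format):

```python
import base64, json

def notes_to_url(notes, base="http://example.com/view"):
    """Pack geotagged notes into the URL fragment so no server has to store them."""
    payload = json.dumps(notes, separators=(",", ":")).encode()
    return base + "#" + base64.urlsafe_b64encode(payload).decode()

def notes_from_url(url):
    payload = base64.urlsafe_b64decode(url.split("#", 1)[1])
    return json.loads(payload)

notes = [{"lat": 40.4168, "lon": -3.7038, "text": "meet here at noon"}]
url = notes_to_url(notes)
assert notes_from_url(url) == notes
print(url)
```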
> Very neat idea, but I think the reliance on bit.ly is self-defeating. This kind of approach would allow people to distribute documents using the web without having to trust them to a particular server, which can be very convenient if your target audience is in a country where access to the server storing your documents can be blocked.
I just can't see the gain here. You need a server to distribute the URLs in any case. You are just moving the data from the server that served the document to the server that serves the URLs. It is still the same data, just in a different form.
> For this to work you need to be able to recover the document from the URL locally.
Right, exactly. It's only really cool for magnet links because the law is different for "linking to content" vs "hosting content" right now.
One step further in this direction would just mean "including all the data for one or more documents on one page". You've just invented...a document, possibly with more than one media type.
Think about it this way: You have a 10K document which contains 200 bytes of link data; a URI to another hypertext document that is 5K in size...vs...you have a 15K document which contains 5K of "link data"; the other document.
I still don't get it. Passing data as pieces of URLs is what normal query parameters are for, and data URLs are for generating a "virtual file", i.e. a link that contains all the information of the linked file.
With those 2 things, everything should be covered.
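For reference, a data URL really is just the file inlined into the link itself; a minimal sketch (paste the printed string into a browser's address bar to see the "virtual file"):

```python
import base64

html = b"<h1>No server involved</h1><p>This page lives entirely inside the link.</p>"
data_url = "data:text/html;base64," + base64.b64encode(html).decode()
print(data_url)
```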
Some limitations: you can't redirect to a data: URL, e.g. <img src=http://tinyurl.com/44c8ctt > or <a href=http://tinyurl.com/44c8ctt >link</a> gets stopped by Chrome as an injection attack. So you can't use data: URLs to (ab)use a link shortener as a CDN.
I looked at URL shortener limits some time ago and found these approximate limits by trial-and-error:
* TinyURL: 65,536 characters and probably more, but requests timed out; there apparently isn't an explicit limit.
* Bit.ly: 2000 characters.
* Is.Gd: 2000 characters.
* Twurl.nl: 255 characters.
This was 2.5 years ago, and I'm not sure how many of these have changed (other than bit.ly, which the linked article confirms is 2048, probably the same as when I tested it).
Boiling it down, it's a new file format with a built-in viewer. You still need to find a way to store the data.
Interesting, but I can't think of any practical application, apart from the service provider not having to worry about storage (maybe that's key ... more thinking needed).
The original version of Mr Doob's GLSL Sandbox at http://mrdoob.com/projects/glsl_sandbox/ used the same approach, but increased the maximum possible size of the document by applying LZMA compression before the base64 encoding.
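Something along these lines, presumably (a sketch of the compress-then-encode trick in Python, not the sandbox's actual JavaScript or exact parameters):

```python
import base64, lzma

def encode_fragment(source: str) -> str:
    """LZMA-compress the source, then base64-encode it for use in a URL fragment."""
    return base64.urlsafe_b64encode(lzma.compress(source.encode())).decode()

def decode_fragment(fragment: str) -> str:
    return lzma.decompress(base64.urlsafe_b64decode(fragment)).decode()

shader = "void main() { gl_FragColor = vec4(abs(sin(time)), 0.0, 0.0, 1.0); }\n" * 20
fragment = encode_fragment(shader)
print(len(shader), "chars of source ->", len(fragment), "chars of URL fragment")
assert decode_fragment(fragment) == shader
```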
The project later moved to http://glsl.heroku.com/ with an app-driven gallery, and that particular feature went away. I think that's a pretty natural evolution of any such idea, so I'm not convinced of hashify's longevity, but hey, sometimes simple really is enough.
Cool to see, but a stupid idea: who in their right mind would use this for production?! By using such a "technology", you lose SEO strength due to urls-not-being-like-this.html, and even worse, what can stop me from publishing a fake press release on their site or spamming porn and getting that URL indexed? And what are the benefits? To also bring SOPA into this, couldn't I share copyrighted material on someone's site like this? How could they control that, besides blocking each URL manually? Just seems dumb. As a concept, cool, but for production.... Yikes?!
Not really a good reply, but I think that hashify.me's potential for an IE audience was probably small to start with. But consider this: if this idea took off, wouldn't this press MS into keeping IE more modern?
My mobile browser of choice does not support cross-origin resource sharing, according to the article... or rather the error message I get in lieu of the article.
Yeah, his message was 5548 characters (so the URL is at least 7380), which is way too long for the generally accepted 2000-character limit. So this particular protocol could use some enhancements, maybe a feature to break messages up into several parts represented by other bit.ly addresses. This would keep URL length under control.
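One way such an enhancement might look (a sketch only: `shorten()` is a stub standing in for a real shortener API call, and the `#index:` convention and sho.rt domain are made up):

```python
import base64, hashlib

CHUNK_LIMIT = 2000  # stay under the commonly accepted URL length ceiling

def shorten(long_url: str) -> str:
    # Stub: a real implementation would call a shortener API (e.g. bit.ly) here.
    return "http://sho.rt/" + hashlib.sha1(long_url.encode()).hexdigest()[:7]

def publish(document: str) -> str:
    payload = base64.urlsafe_b64encode(document.encode()).decode()
    chunks = [payload[i:i + CHUNK_LIMIT] for i in range(0, len(payload), CHUNK_LIMIT)]
    part_links = [shorten("http://example.invalid/#" + chunk) for chunk in chunks]
    # The index URL lists the part links in order; a reader expands each one,
    # strips the fragments, concatenates them, and base64-decodes the result.
    return shorten("http://example.invalid/#index:" + ",".join(part_links))

print(publish("# A message far longer than any single shortener would accept\n" * 200))
```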
I took a similar approach to this with http://cueyoutube.com, and recently found snapbird, which gives extended twitter search capabilities. The URL contains the playlist and twitter becomes the database, so I just tweet my playlists and they're "saved". You can see all the lists I've created by searching for the account iaindooley and the search term cueyoutube in snapbird.
What becomes possible? The entire internet could effectively get rid of hosting account providers, with each page in every site being contained in a hashify URL, and with each page linking to other pages using other hashify URLs.
Trouble is, there might need to be a DNS-like system to match hashify URLs to more human-readable strings (or a way for the existing DNS to resolve to hashify-style URLs).
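That layer could be as thin as a mutable name-to-hash mapping on top of the immutable content; a toy sketch (all names and records here are hypothetical):

```python
# Human-readable names resolve to content hashes. Updating a page only
# changes the registry entry; old hash URLs keep pointing at the old content.
REGISTRY = {
    "news.example/frontpage": "hash://sha1/<digest of today's front page>",
}

def resolve(name: str) -> str:
    return REGISTRY[name]

def publish_new_version(name: str, new_hash_url: str) -> None:
    REGISTRY[name] = new_hash_url
```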
The data needs to be stored somewhere. In their implementation they in effect use bit.ly as the hosting provider for the data by shortening the URLs, so while it's a fun little experiment, it boils down to a content-addressable system. We already have good examples of content-addressable systems; Git, for example, is built on content-addressable storage.
It also assumes that there's no limit on the URL space bit.ly provides. Tomorrow they could just cap the "long_url" field, or whatever they call it, to accept 1500 chars or something.
"Storing a document in a URL is nifty, but not terribly
practical. Hashify uses the [bit.ly API][4] to shorten
URLs from as many as 30,000 characters to just 20 or so.
In essence, bit.ly acts as a document store!"
While the HTTP specification does not define an upper limit
on the length of a URL that a user agent should accept,
bit.ly imposes a 2048-character limit.
The real trouble is that when you link to a hashified URL, you are actually embedding in your web page (an encoding of) the content of the page you are linking to. Think matryoshka.
In essence, this would mean moving away from a model of "large networks of connected pages/sites" to "a large number of single documents with no meaningful mechanism of interconnectedness".
Think about this like a PDF where stuff is embedded instead of in separate files.
This is an ancient idea. I read a 2600 article back in the early 2000s or possibly late 1990s that did essentially this same thing using a bash script and one of the first URL shortening services available at the time.
Yes, but so would any pastebin website. You could say the advantage is that the content is not available to the server (since it's transferred in the URL itself), but then it is when you actually read it, so it's not any more private.
We actually do this exact thing to send dynamic parameters to a chart-generating backend server. It works great; you get a surprising amount of compression using gzip (2-4x space savings) and the URLs are naturally cached by proxies without any magic!
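A rough sketch of that pattern (assuming JSON parameters and gzip, since the comment doesn't spell out the backend's actual format):

```python
import base64, gzip, json

def encode_params(params: dict) -> str:
    raw = json.dumps(params, separators=(",", ":")).encode()
    packed = base64.urlsafe_b64encode(gzip.compress(raw)).decode()
    print(len(raw), "bytes of JSON ->", len(packed), "chars in the URL")
    return packed

def decode_params(packed: str) -> dict:
    return json.loads(gzip.decompress(base64.urlsafe_b64decode(packed)))

params = {"type": "line", "title": "Latency", "series": [[i, i % 7] for i in range(200)]}
url = "https://charts.example/render?d=" + encode_params(params)
assert decode_params(url.split("d=", 1)[1]) == params
```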
Some observations:
1. When content changes, every hyperlink to that content must change along with it.
2. Pass-by-reference (URL) is no longer possible.
Considering that SOPA considers linking to copyright infringing content to be as bad as publishing it yourself, this project unifies those two actions nicely!
No, that's not entirely true. I don't have to use a service like bit.ly to send one of these messages. And further, I could just as easily use _any_ or _many_ services. Since the technology is fundamentally a browser-to-browser kind of distributed concept, it's just the URL shortening that's not SOPA-compliant.
There are also several ways to lessen the impact of SOPA on the URL shortening anyway. For instance, if several services use the same hash algorithm for representing URLs, they can be used interchangeably (if you post the URL to all of them). Further, you can always set up your own temporary shortening service.
* No, that's not entirely true. I don't have to use a service like thepiratebay.org to send one of these files. And further, I could just as easily use _any_ or _many_ trackers inside my torrent. *
Altered to convey another point. Naturally, it would be quite difficult to "embed" a feature-length movie into a single URL, but if one were to split the file into chunks, as torrent transfers do, or into a multi-part RAR, as newsgroups still do, each chunk becomes much more manageable.
I do agree with you, though. But I think a service like this, if changed to be user-friendly for file sharing rather than just document sharing, could get around a lot of the pitfalls (which aren't many) that a torrent tracker (for example) would face if its DNS lookup were blocked. That's because SOPA is written as if all IP addresses and DNS names are statically tied together and slow to alter, not accounting for the fact that I can have a new domain name resolving to my existing server within minutes. Even more so if the final URL encoding were nothing more than a common, known algorithm, like base64, that one could easily plug into a basic desktop app and get the same result.