Principles of Content Addressing (2014)

gojomo · on Aug 5, 2015

Some quibbles:

* URNs weren't just for ISBN-like identifiers from an assignment authority. Some of the early writing talks about that use case, but the key feature was "location-independent persistent names", for which hash-names have always been a good fit. Nothing in the relevant specs precludes hash-names as an URN scheme – it's not "breaking the spec" – and a number of projects have used hash-named URNs. While there's a policy for registration, in practice some URN namespaces have been 'de facto registered by use', much like common-law trademarks and a lot of URI/URL schemes. (Of course, the neat "URLs and URNs as distinct subtypes of URIs" view never fully took root, as W3C's 2001 "Clarification" note acknowledged.)

* Magnet-links were absolutely designed to promote P2P/CDN-network-agnostic content-addressing. But, they were also made flexible enough to squeeze in other descriptive metadata, aliases, or fallback locations as well. The JS-launching was an adaptive hack; the descriptive (and usually hash-based) content-names were the point. A key early use case was making a common, vendor-neutral hash link for competing Gnutella clients, but the loose stuff-anything-useful-in generality saw magnet-links adopted by other software (such as DC++ and BitTorrent) as well. The 'magnet' URI scheme was only ever 'common-law' registered.

* It's a bit odd to consider the algorithm the URI 'authority', though if I squint I can see a sort of funhouse-logic to it. Notably the similar URI-scheme that's made it through IETF standardization, the 'ni' scheme (RFC6920), usually leaves the 'authority' component blank, so three slashes appear in a row – but alludes to the optional declaration of an 'authority' that might be able to help accessing the referenced content.

btrask · on Aug 5, 2015

Great points, you've clearly read the RFCs. :)

- You're right that URNs were not just for ISBNs, but they shaped the formation of the standard and (IMHO) made it inapplicable for hashing. A content addressing system that can't resolve any of the standardized URNs wouldn't be very useful. FWIW one of my earliest prototypes used URNs, and I still use them for ISBN links!

- Magnet links work fine today, but if you look at the original proposal[1], they really were designed for all of the wrong reasons (including explicitly popping up a JS handler). In practice everyone who uses magnet links tunnels URNs through them, which serves no point for a general purpose system. A system that supports magnet links must also support URNs (meaning the arguments against using URNs apply, and magnet: doesn't add much value).

- I've considered eventually adding ni support to StrongLink, although it's not like anyone else uses it so it wouldn't be an interoperability win. I think my hash scheme proposal is much better, so I'm hoping we can just forget about ni entirely. (But to be clear, it's extremely easy for a system to support both.)

[1] http://magnet-uri.sourceforge.net/magnet-draft-overview.txt (warning: SourceForge link)

gojomo · on Aug 5, 2015

I wrote the original magnet-URI proposal, so trust me when I say the JS-stuff was a demo hack, and the content-based names the real point. (Essentially no one ever implemented the JS-handler-negotiation, which was a quasi-web-intents mechanism before that concept was named.)

Magnet-URI's immediate predecessor was the Hash/Urn Gnutella Extensions, 'HUGE' [1], and the reason that all the examples in the magnet-URI spec are hash-URNs, and that such hash-URNs are the main way magnet-URI has been used, is because that's what magnet-URI was for.

I respect that your design opinion is that URNs aren't good for this; it's just false for you to say hash-names are against the URN specs. Neither the language of the URN specs nor historical practice supports that idea. And, hashes are, as you clearly agree, a great way to generate "persistent, location-independent, resource identifiers" (the stated purpose of URNs).

A system (P2P, CDN, local content-addressed stores, etc.) can be plenty useful even if it chooses to support only some URNs, or only some magnet-URIs. All the magnet-using systems have essentially ignored standardized/assigned URNs, and instead used ad-hoc hash URNs, and in total they've been quite useful to a lot of people.

[1] http://rfc-gnutella.sourceforge.net/Proposals/HUGE/draft-gdf...

btrask · on Aug 5, 2015

Sorry, I didn't know Magnet was your work. I think it's an example of the inner-platform effect (tunneling URIs in URIs), but I agree it's served a lot of applications (especially BitTorrent) very well.

You're right that hashes aren't prohibited by the URN specification. My argument is that the URN spec doesn't prohibit anything, because it's too broad. In fact, I think that URNs boil down to URLs, because in order to resolve many schemes (including ISBNs), you need a dynamic lookup to a central authority. I've been considering an article called something like "Locations, names and hashes" in order to explain that locations and names are effectively the same, but hashes are fundamentally different. That is my opinion of the underlying reason why URNs failed to catch on (aside from BitTorrent).

Even the practical point of interoperability is moot, because BitTorrent namespaces its hashes. It's impossible for another system to support existing URN/magnet links without "emulating" torrent files (which introduces too much ambiguity anyway).

BTW, the http://memesteading.com/ link in your profile appears to be broken.

Edit: I see you've worked on a lot of things I've read about, e.g. WARC files. Do you currently work at the Internet Archive? I was planning on approaching them at some point with some ideas.

gojomo · on Aug 5, 2015

Yes, I think many now recognize that the original idea of a stark contrast between URLs and URNs doesn't fit the fuzzy reality. (RFC3986's "URI, URL, and URN" section, https://tools.ietf.org/html/rfc3986#section-1.1.3, acknowledges this point.) My interpretation is: there's quite a few de-facto URNs in use, just without the official label "urn:" or namespace registration, which in practice has turned out to be an unnecessary formality.

(Thanks for the note regarding memesteading.com; mapping updated to work now.)

I'm no longer regularly doing anything for the Internet Archive, but can definitely help make contact! If you're in the bay area, a good way to start learning more about its projects (or show off your own) is to attend the open-house lunches, held most Fridays. (You should just shoot them a note or call before showing up, so they know the expected attendance, or warn you if it's one of the occasional days that it's not held.)

btrask · on Aug 5, 2015

Rather than de facto URNs, I'd say de facto URLs, but we don't have to quibble over that.

I'm on the east coast, unfortunately (North Carolina).

HN is going to start capping the thread depth, but you're welcome to email me if you'd like to talk more (bentrask@comcast.net). I've been trying to come up with an archival web proxy or something sort of related to WARC and the tooling around it, possibly using content addressing (although converting existing web pages seems ugly and I haven't found an ideal way).

chmike · on Aug 5, 2015

The authority is optional in an URI, the path is not. A blanc authority is not the same as an absent authority. Look at the examples in RFC3968 in section 1.1.2 at page 6 in https://tools.ietf.org/html/rfc3986.

The URI for the mailto scheme has only a relative path without authority and subdirs.

gojomo · on Aug 5, 2015

The RFC3986 path may be empty – which is essentially the same as absent.

Also, when magnet-URI was composed, the URI spec was RFC 2396, which clearly describes <path> is one of the elements that "may be absent":

https://tools.ietf.org/html/rfc2396#section-3

Also both RFC2396 and RFC3986 allow for URI schemes where everything after the 'scheme:' is opaque, and need not be strictly interpretable as authority/path/etc. (RFC2396 mentions that it will still refer to this opaque-part as a 'path', because "they are mutually exclusive for any given URI and can be parsed as a single component".)

anoxic · on Aug 5, 2015

I like the idea of being able to address content based on it's actual content. Perhaps I don't have a very good imagination, but if all you have is a hash of the content, how to you know where to find it?

From within a single app it's easy, but what about in other apps or other machines. Would there be a (possibly distributed/voluntary) lookup service? Could comments or lookup "hints" be added to the spec?

btrask · on Aug 5, 2015

This is the secret sauce that makes every implementation unique. Camlistore, IPFS, StrongLink (my project), and others all have different answers. I think the important thing is that they all use hash URIs that can interoperate. Then you can find the content using whichever system you prefer or makes the most sense.

StrongLink doesn't use a distributed hash table, because one of my requirements is that it must work offline. In StrongLink, you pull from other repositories you're interested in, and then always resolve your requests against your own repo (locally).

adrusi · on Aug 5, 2015

This is usually achieved using distributed hash tables. Almost certainly the most prominent use of this is magnet links in bittorrent. Freenet has also been doing this for a long time, and ipfs is a relatively new player.

The internals of distributed hash tables are out of scope for a comment thread, but if you happen to know about the structure of the cassandra database, they're related concepts.

chmike · on Aug 5, 2015

Is there more information on the intended application domain (use cases) ? This was unclear for me after reading your referenced document.

Most of your arguments apply to hash referencing and make perfect sense to me. The compatibility with URI is also a good point and not difficult to achieve. I would add that compatibility with URL (http scheme) is also desirable and may be solved by use of a protocol bridge so that the information is accessible to web applications.

I'm also working on a distributed information system for many years now (not full time;). But I use a different referencing system. The reason is because one can't modify the information without invalidating the hash and thus the references. So using hashes as reference make sense for a system containing only immutable information. Your choice of reference is thus very application specific.

The system I'm working on allows to modify and move data without invalidating references and is distributed like the web. A reference is at most 32 byte long in its binary representation, a bit longer in its ASCII representation. Take that http! ;)

btrask · on Aug 5, 2015

I believe there is a fundamental distinction between URI schemes that are dynamic but centralized (meaning they require some form of coordination, to handle mutability), versus schemes that are static and decentralized (for example, they are defined by a hash algorithm that anyone can run independently). If you go the dynamic/centralized route you end up with something like the World Wide Web, which already works very well for that use case. Content addressing will never be useful for things like online shopping, where your requests are basically remote procedure calls. Instead I think it's good for publishing, sharing files, and things like that.

I do have a plan to build mutability on top of StrongLink eventually, using diffs like Git. I think it's important to have a clean separation, so that the sync protocol stays simple and impossible to get wrong.

The initial application for StrongLink is notetaking, which I find much easier to do with immutability. It's like writing in ink.

chmike · on Aug 5, 2015

To support mutability you would need redirections so that you can access to the most recent version with an old reference. The problem with that is that you can't check the validity of a redirection as you can do with a returned file.

Thus hash references work in a secure and trusted environent.

btrask · on Aug 4, 2015

I'm happy to answer any questions about this article or the implementation I've put together.

lisper · on Aug 5, 2015

You say you've been working on your CAS system for two years. What are the challenges you are facing that is making it take that long? Because a simple CAS is trivial: just hash the file, rename it to the hash, and serve it up via a normal HTTP server. What is making it hard?

btrask · on Aug 5, 2015

Here's what I've written up for a toy content addressing system I've been writing in Python:

"If it's so easy, why did it take us 3 years and 10,000 lines?" "Because we wrote it in C."

And because the real system:

- Tracks file meta-data (including tags and full-text indexing)

- Has a way to find files/hashes without file or hash (search)

- Uses a database instead of hardlinks everywhere

- Supports hash algorithm plugins (not technically done yet)

- Supports more hash formats (short, base-64, etc.)

- Is ACID (and coordinates transactions between DB and file system)

- Supports multiple repos per user (at arbitrary paths)

- Handles hash collisions (aside from the primary hash)

- Has a basic user interface

- Has a sync protocol

skybrian · on Aug 5, 2015

Do these links work on mobile phones? (Can an app register to handle them?)

Sophistifunk · on Aug 5, 2015

Should do, apps can register to be protocol handlers. You can only register at the "foo://" level though, not "foo://bar". What I mean is I don't believe an app can only handle some of the "foo:" URIs.

btrask · on Aug 5, 2015

For what it's worth, I consider this a feature. There have been proposals to make sha1: and other schemes, but every system will need hash agility (as better algorithms are invented and flaws are found in old ones). A single system should be able to resolve every relevant hash algorithm.

ClayFerguson · on Aug 5, 2015

Everything is content! The JCR (Java Content Repository) will eventually be the standard -- and sort of already is. Apache Jackrabbit Oak will take off as the open source technology of choice eventually. It's able to leverage both MongoDB and Lucene search, and has a standard for naming. I have an open source app located here (meta64.com), that's a full-stack modern standards implementation of a portal built on it. Any node is addressible on the url using the JCR standard.

falcolas · on Aug 5, 2015

It strikes me that this is one of those "a square is a rectangle, but a rectangle is not always a square" moments. JCR seems like it fits into the content identification family; it uniquely identifies content by a hash (in the case of JCR that hash is also a path), but that hash is system dependent. That is, I can't go to a system B and ask for the hash(path) '/foo/bar' and expect to get the same content as I would on system A.

Cryptographic hashes, on the other hand, make it possible to use the same hash on multiple systems and get the same content (if that content is available).

chmike · on Aug 5, 2015

You need an index to achieve that. The index will tell you where the information associated to the hash is stored. That index is in itself a distributed system.