Dat: A P2P hypermedia protocol with public-key-addressed file archives (datprotocol.com)
149 points by goranmoomin on April 18, 2020 | 56 comments



Kind of stagnant, but I liked the IDEA of what was being done with the Beaker Browser using the DAT protocol to run a self-hosted P2P web. I just wish someone would take the project to completion/usability. It would really disrupt things. https://beakerbrowser.com


We decided to take a year to go heads down and rework a lot of stuff, so we did get publicly stagnant but we're fulltime on the next version and should release a public beta soon. I tweet a lot about our progress if you can sort through my sillier tweets (@pfrazee).

The past year has seen a lot of improvements:

- Protocol moved to a hole-punching DHT for peer discovery (hyperswarm).

- Protocol now scales # of writes and # of files much better. We were able to put a Wikipedia export, which is millions of files in 2 flat dirs, into a single "drive" and get good read performance. This performance bump came from an indexing structure that's built into every log entry (hypertrie).

- Protocol now supports "mounting", which is a way to basically symlink drives onto each other. Good composition tool, esp useful for mounting deps in a /vendor directory (see the sketch after this list).

- Browser now has a builtin editor that splits the screen for live editing. Feels similar to hackmd.

- Browser added a bash-like terminal for working with the protocol's filespace. It's glued to the current page so you can drive around the web using `cd`.

- Browser added a basic identity system. Every user has a profile site that's created automatically and maintains an address book of other users.

- We built out application tooling a fair amount. It's fairly easy to build multi-user apps now, where previously it was a bit of rocket surgery.
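To make the mounting idea concrete, here's a minimal sketch using the hyperdrive module (hyperdrive 10.x-style callbacks; the storage path and the way the dependency key is passed in are just illustrative):

```js
// Minimal sketch of "mounting" one drive inside another with hyperdrive
// (hyperdrive 10.x-style callbacks; exact signatures may differ by version).
const hyperdrive = require('hyperdrive')

const app = hyperdrive('./storage/app')             // your application drive
const depKey = Buffer.from(process.argv[2], 'hex')  // hex key of the dependency drive, passed on the CLI

app.ready(err => {
  if (err) throw err
  // Mount the dependency drive under /vendor/dep, much like a symlink.
  app.mount('/vendor/dep', depKey, err => {
    if (err) throw err
    // Files of the mounted drive are now readable through the parent drive's namespace.
    app.readdir('/vendor/dep', (err, names) => {
      if (err) throw err
      console.log('dependency files:', names)
    })
  })
})
```

As far as I understand, the mounted drive stays a separate archive with its own author; the parent just exposes it under its own namespace.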

Some of the year was spent prototyping ideas and throwing them away as well. A bit inefficient, but helped us learn.


I've been reading about dat and doing some basic projects for about a year in my spare time, stagnating a bit lately as I'm finishing my degree. As I remember there was some talk about Python and Rust implementations at the time; are those still under development as well? I am not as enthusiastic about JS but really like the protocol and the idea behind it... Thanks


Yeah I do know there's a Rust impl that people are working on


I’ve been thinking of deploying dat for Nanopore DNA sequencing, but recall something about dat not being great at handling that kind of data.

Does the newest version of dat handle large files well (10gb)? Does it handle tons of files nested in a few directories well? Are there any issues I should know about there?

What is the command line support like for multi-writer?

Do you have any metrics for how much Dat is currently being used?

Thanks!


> Does the newest version of dat handle large files well (10gb)?

Large files work fine but currently any change to a file rewrites it in its entirety. That will mean history will be large until the GC kicks in, and any file that's modified has to be redownloaded in its entirety.

The team spent a fair amount of time looking at a solution for partial file updates that works like inodes. They ultimately decided it was too difficult to pull off for now.
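To illustrate what the full-rewrite behavior means in practice, here's a rough sketch (hyperdrive 10.x-style API; the file name and sizes are made up):

```js
// Sketch: overwriting a file appends a complete new copy, not a delta
// (hyperdrive 10.x-style API; file name and sizes are made up).
const hyperdrive = require('hyperdrive')
const drive = hyperdrive('./storage/big-data')

drive.writeFile('/big.bin', Buffer.alloc(10 * 1024 * 1024, 1), err => {
  if (err) throw err
  const before = drive.version
  // Change the content and write it back: the whole 10 MB is appended again,
  // so history grows by the full file size and peers re-download it in full.
  drive.writeFile('/big.bin', Buffer.alloc(10 * 1024 * 1024, 2), err => {
    if (err) throw err
    console.log('version before:', before, 'after:', drive.version)
  })
})
```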

> Does it handle tons of files nested in a few directories well?

Yep, no issues there

> What is the command line support like for multi-writer?

We're still deciding on how to handle multi-writer. It's a priority for us after the upcoming stable release.

> Do you have any metrics for how much Dat is currently being used?

Nothing concrete atm. If I had to guess, it'd be no more than 1k users.


> Does the newest version of dat handle large files well (10gb)?

I'd be curious if anyone wanted to try it ;) https://github.com/mafintosh/hyperdrive/blob/master/index.js...

> What is the command line support like for multi-writer?

There is an experimental multiwriter CLI using hyperdrive and kappa-db (github.com/kappa-db)

https://cobox.cloud


We've solved efficient partial file updates in Peergos, which is built on ipfs. Happy to talk you through our data structures if you're interested. The key ones are cryptree and Merkle CHAMPs.


Thanks. Has anyone by any chance benchmarked hypertrie against u/marknadal's [0] gun.js?

Speaking of which, the gun.js network hosts the Internet Archive (archive.org) [1], which is likely to be bigger than the English Wikipedia (?). Cloudflare runs an IPFS gateway. Are there organizations of similar size dat is looking to onboard or has onboarded?

---

A few questions:

1a. Re: hole-punching: Do you support clients on LTE (or, networks behind CG-NAT)? Is hole-punching just for the DHT connections (which seems to operate over UDP4) or for Feed (hypercore) connections, too (which default to TCP?).

1b. Are there plans to use WebRTC (like WebTorrent) instead of TCP or UDP sockets?

1c. In case of TURN relays, how long do you intend to support traffic through your servers? It might get cost-prohibitive at some point. Are (will) TURN relays (be) deployed worldwide to combat latencies?

2. Are hyperswarm and hyperdrive different example usages for hypercore or are they complementary? If so, how are those projects related, as in, is hyperdrive built on top of hyperswarm?

3. The project is spread across disparate GitHub accounts-- Some top-level packages in mafintosh's account and some in hyperswarm's, for instance. Is this intentional?

4. I saw references to merkle-trees, flat-trees, kademlia, HAMT, bitfields... is there documentation on why these data structures are used? I can guess kademlia is for the DHT and merkle-trees for some form of entropy management. Are such implementation details documented (looking for something similar to gun.js [2] or redis [3] docs)?

5a. Can you please point out a few major differences to gun.js and Protocol Labs' IPFS?

5b. Conversely, what are some use-cases for which dat really outshines other such protocols?

[0] https://news.ycombinator.com/item?id=15818856

[1] https://news.ycombinator.com/item?id=17685682

[2] https://gun.eco/docs/RAD

[3] https://news.ycombinator.com/item?id=15384396


Once a read key becomes public, it's always public, right? I appreciate that arguably that's always the case.


They are working on a new version, it's almost done. It's going to be a lot better! https://github.com/beakerbrowser/beaker/pull/1435


Skynet is a new protocol that has been working on something like this. You can see some of the webapps built on Skynet here: https://skynethub.io/_B19P18DEI4Y_37wg6yXtfulEq2U8AcIO_lWM7s...

Main project website: https://siasky.net


For me one of the main positives of dat is there isn’t an attached cryptocurrency.


Attaching a cryptocurrency to Sia allows us to draw a distinction between producers and consumers.

In typical p2p networks, you are expected to be both in roughly equal proportions, which means your network gets polluted by a lot of low-quality users who are just trying to do their part.

Sia has the consumers pay the producers, and has a marketplace mechanism that selects the highest performing producers to perform the jobs and receive the revenue.

This allows the network as a whole to be much, much more efficient.


Wait, consumers pay the producers? As in, “you need to pay to download files”?


Doesn't not paying make us end up rebuilding the internet again with ads and tracking? Free is not always an option.


Doesn't paying encourage useless sites highly optimized to grab user attention, low quality content, shock sites, cloned sites, site bloat (if payment is per KB), and so on?

That said, I think there's definitely a space for paid internet. I was just surprised because siasky homepage says "Build a Free Internet" - perhaps it should say "Build a pay per view internet" instead?


Free as in freedom, not free as in beer.

I would argue that the current model is more friendly to attention-grabbing content (clickbait, etc) because advertising has the limitation that all views are worth the same. In a pay-as-you-go Internet, users can whitelist high-quality content sources as being okay to charge 10x or 100x what you'd typically accept to view a webpage. This would incentivize content creators to build a brand and reputation that makes users comfortable putting them on the 'high quality' list, so that their content can see a massive revenue multiple relative to the number of eyeballs.


This may work, but the problem is that it'll encourage an internet made of content creators whose goal is to get revenue out of their websites.

There are certainly a lot of those, but they are not nearly the whole internet. A lot of interesting sites would not want to be on the "pay as you go model" at all!

For example, we are on news.ycombinator.com in the thread discussing datprotocol.com, and you are pointing me to siasky.net. Right now, the top 5 HN pages are adecentralizedworld.com, handsonscala.com, arxiv.org, a16z.net, and torproject.com. None of those websites make money from webpage ads. None of them are likely to move to pay-as-you-go internet -- because they care about their audience and not website revenue.

I suspect that Sia's decentralized pay-as-you-go world will be much worse (productivity-wise and information-wise) than the current internet -- all the interesting technical/science blogs and docs would be missing, while Buzzfeed clones would be plentiful. There might be the occasional high-quality journalism website, but most of those are getting paywalls anyway, so will they justify a whole new protocol?


It's perfectly possible to not ask for a fee on your websites. Plenty of websites will continue to exist for free even though they could ask for money.

Also, the internet is already something that users have to pay for. Users pay their ISP every month for access. We envision a world where this utility payment extends to cover the content creators in addition to the infrastructure providers.


As far as I understand Sia is functionally equivalent to S3: You're not getting content the "producer" wants, you're getting exactly the content you, as a consumer, want.


Is there any way to pin skynet files using storage rather than Sia, or a way to download skynet files using the command line (independent of a curl request from a portal)?


For your first question: technically yes, but it's not really the way it's meant to be done. If you want to pin something with storage, you can become a host on the Sia network and then make that content available through your host. But this also means the content is subject to going offline if you go offline (unless someone has re-pinned it to other hosts), and Skynet is really meant to be this high reliability platform where you don't need to worry about particular users going offline.

If you want to download skynet files using the command line, you can run 'siad' in portal mode as a daemon, and then use 'siac' to query the daemon. 'siac skynet download [skylink] [destination]' is the command to perform the download.

Doing it this way means you gain direct decentralized access to Skynet instead of needing to funnel through a portal.


Is there a way to be a host and then pin other people's content that I like? I'd love to be a host that pins public content without an explicit financial transaction.


I don't see a github repository or any link to the open source code there. Is it open source?



The Cliqz browser implements the Dat protocol. Here's a writeup on the implementation: https://0x65.dev/blog/2020-03-02/implementing-the-dat-protoc...

Disclosure: I work at Cliqz.


Amazing! Dat is probably one of the most interesting technologies out there; it's good to see more people step in, especially if it's a growing company.

Does that mean that Cliqz will start indexing dat sites? The way I see it, the biggest issue of decentralized anything is always discoverability. If Cliqz is interested in it, it could have leverage in making dat more widespread.

Dat even helps you with the indexing, because it's the sites themselves that tell you when they are updated and what has changed; it's an indexer's dream come true. And seeing how dat operates, Cliqz could even be a peer in the swarm of the sites it indexes to give them more chance of being reached (at the cost of Cliqz having a view on who visits what, which seems contrary to your DNA if I understand correctly).


You say you like the idea, do you have a problem with the implementation?


"How Dat Works" linked from the page is still a great visualization of all the bits that make up the protocol:

- https://datprotocol.github.io/how-dat-works/

- https://news.ycombinator.com/item?id=20363813


Also https://news.ycombinator.com/item?id=20811936, a related project from 2019.


Conceptually, this sounds very similar to Freenet; a DAT URL seems analogous to a Freenet SSK.

What I don't see is anything analogous to a USK -- there's no obvious way for an author to distribute an update to content they have published. It's also unclear how much (if any!) privacy this protocol provides to content publishers or requestors -- the use of discovery keys only provides protection for users requesting content which an eavesdropper has no knowledge of.


There is no equivalent to a USK because dat doesn't give direct access to old revisions. What you get is always the latest.

The old content is still there though, and you can access it, just not in an "easy" manner: https://docs.dat.foundation/docs/faq#does-dat-store-version-....
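For the curious, here's roughly what reaching an old revision looks like through hyperdrive's checkout() (a sketch assuming a hyperdrive 10.x-style API and a locally synced drive; the file name is made up):

```js
// Sketch: reading an older revision of a file via checkout()
// (hyperdrive 10.x-style API; assumes the drive's history is already local,
//  and the file name is made up).
const hyperdrive = require('hyperdrive')
const drive = hyperdrive('./storage/my-archive')

drive.ready(() => {
  const oldVersion = drive.version - 1          // the state before the last change
  const snapshot = drive.checkout(oldVersion)   // read-only view of that revision
  snapshot.readFile('/notes.txt', 'utf-8', (err, data) => {
    if (err) throw err
    console.log('previous contents:', data)
  })
})
```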

The dat url references a given private key, and that's about it in terms of privacy. Transfer is done between two endpoints on the normal internet, so neither peer is "hidden".

It's a shame, but the dat content is spread across many places and it's hard to get access to all the documentation. The most impressive and interesting part of the project right now is probably Beaker; you should have a look: https://beakerbrowser.com/


> There is no equivalent to a USK because dat doesn't give direct access to old revisions. What you get is always the latest.

How do you guarantee that you're getting the latest content, though? If a peer has a cached copy of an older revision, wouldn't you end up with that instead, since there's no way to distinguish between them?

> The dat url references a given private key, and that's about it in terms of privacy. Transfer is done between two endpoints on the normal internet, so neither peer is "hidden".

What concerns me here is that, from what I'm reading, it seems like any client on the local network could eavesdrop on mDNS requests to determine what content other clients are viewing. Worse, a client could announce itself with the discovery key for a well-known piece of content to be notified when any other client, anywhere, requests that content.

This is a worse privacy model than unencrypted HTTP. Are you aware of any plans to mitigate this?


> How do you guarantee that you're getting the latest content, though?

Only the original creator can update the content. You can never know you're at the latest version until you've connected to them and they've told you "this is the last I have".

> What concerns me here is that, from what I'm reading, it seems like any client on the local network could eavesdrop on mDNS requests to determine what content other clients are viewing. Worse, a client could announce itself with the discovery key for a well-known piece of content to be notified when any other client, anywhere, requests that content.

Disclaimer: I'm not an expert on the project, only following it because it's cool.

As far as I know, the only obfuscation is that keys are hashed so that you can't infer what content is being exchanged just by listening to the network. However when you want to watch a specific key, you can get on the swarm and see who's there.
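Concretely, the obfuscation is just that peers announce a keyed hash of the public key rather than the key itself. A sketch using the hypercore-crypto module (assuming its keyPair and discoveryKey helpers):

```js
// Sketch: deriving the discovery key that gets announced for peer discovery.
// Peers who only see this hash can't recover the read key for the content.
// (Assumes hypercore-crypto's keyPair() and discoveryKey() helpers.)
const crypto = require('hypercore-crypto')

const { publicKey } = crypto.keyPair()               // the archive's public (read) key
const discoveryKey = crypto.discoveryKey(publicKey)  // keyed hash announced to the swarm

console.log('read key:     ', publicKey.toString('hex'))
console.log('discovery key:', discoveryKey.toString('hex'))
```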

Note that dat doesn't attempt to solve the same problems as Freenet does. They have different goals, and as such can't be compared on something that only one of them explicitly focuses on.


You have the details right. I think it's fair to say that the protocol is very leaky with its metadata right now. In a local network, it would be wise to only exchange announcements with trusted devices. In the global network, it would be wise to introduce some kind of proxy (distributed or not).


I have failed time and again to grasp how exactly Dat works. IPFS is easy for me to grok; your content is hash-addressable, thus immutable. Can someone explain in a sentence or two how Dat handles content?


BitTorrent is immutable; Dat is the mutable version of BitTorrent.

You create an archive that is identified by a cryptographic pubkey. You add files to it, and dat stores some metadata alongside. You give the archive's id to a friend, who starts retrieving the metadata, can then see which files exist, and can download them as he pleases. Syncing can also be realtime, so he gets new content as soon as you add it.

Only you, the holder of the cryptographic privkey, can add content to the archive. Crypto is used to sign all content, so there's no doubt it was legitimately written by you. Since the id doesn't move, it is possible for multiple peers to interconnect and exchange data as needed in a swarm fashion, even if you're offline.
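If it helps, here's roughly what the writer side looks like with the hyperdrive and hyperswarm modules (a sketch only; API details shift between versions, and the storage path and file name are made up):

```js
// Sketch of the writer side: create an archive, add a file, announce it.
// (hyperdrive 10.x / hyperswarm-style API; storage path and file name are made up.)
const hyperdrive = require('hyperdrive')
const hyperswarm = require('hyperswarm')

const drive = hyperdrive('./storage/my-archive')   // keeps the private key locally

drive.ready(() => {
  console.log('share this id:', drive.key.toString('hex'))

  drive.writeFile('/hello.txt', 'hi friend', err => {
    if (err) throw err
  })

  // Announce under the discovery key so peers can find the archive and sync it.
  const swarm = hyperswarm()
  swarm.join(drive.discoveryKey, { announce: true, lookup: true })
  swarm.on('connection', (socket, info) => {
    socket.pipe(drive.replicate(info.client)).pipe(socket)
  })
})
```

A reader would open the drive with the shared key instead of creating a new one, and join the same discovery topic to start syncing.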


That's a great explanation, thank you. How does this handle versioning? If I add one file to a vast dataset, can people download just that file if they already have the previous version? IPFS hashes at the chunk level, so even if you append something to a large file, you can download just the new chunk and be up to date.


Yes, in fact dat was initially created for this use case: data analysts want to exchange their data, and that data evolves in time so it needs to be transported "efficiently", as in, you don't need to redownload a full .zip just for a single file change. The only metadata you'll receive will concern the new file and the old content is still valid. You can even seek inside any file if you don't want to download the whole file.

Changes _inside_ a file, though, are not handled. Today if a file is modified dat will consider any bytes of the old one to be garbage and will not reuse it.

dat is a sexy frontend on top of hyperdrive (https://github.com/mafintosh/hyperdrive); I personally think it's easier to see what dat can do by looking at what hyperdrive does.
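To make the partial-download point concrete, here's a sketch of the reader side with sparse replication, which is what lets a peer that already has an old copy fetch just the changed file (assuming hyperdrive 10.x and its 'sparse' option; the key argument and file name are illustrative):

```js
// Sketch of the reader side with sparse replication: only the metadata plus
// the blocks backing the requested file get downloaded, not the whole archive.
// (Assumes hyperdrive 10.x and its 'sparse' option; key and file name are illustrative.)
const hyperdrive = require('hyperdrive')
const hyperswarm = require('hyperswarm')

const key = Buffer.from(process.argv[2], 'hex')     // archive id from the writer
const drive = hyperdrive('./storage/clone', key, { sparse: true })

drive.ready(() => {
  const swarm = hyperswarm()
  swarm.join(drive.discoveryKey, { lookup: true, announce: false })
  swarm.on('connection', (socket, info) => {
    socket.pipe(drive.replicate(info.client)).pipe(socket)
  })

  // Fetches just this file's metadata and content blocks from connected peers.
  drive.readFile('/new-data.csv', 'utf-8', (err, data) => {
    if (err) throw err
    console.log(data.length, 'characters of just the new file')
  })
})
```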


Thanks, that's very informative, and the intro in the Hyperdrive README clarifies the goals very well. I have a much better idea now, thanks again.


I think the difference here is that the URL in dat is not a guarantee of immutability - it is just a public key and the author of the dat is free to update the content at any time without changing the key (there is a change history though).

I am not an expert on this - just a casual passing interest so might have it wrong.

This explains more: https://datprotocol.github.io/how-dat-works/


Oh hmm, thanks. I've seen that link, but I'm not interested in getting into the byte-level, I just want a high-level overview. A URL being a public key that you can update if you have the private key makes sense, thank you. Do you know how propagation/distribution happens? With IPFS, whoever accesses your content serves it for some time.


I believe the distribution is largely the same concept, i.e. you get a copy of the data from somewhere, then other nodes that are also looking for the same data might discover that you have it as well and so you might then serve it on to those other nodes too.


When I last looked at dat some time ago I seem to recall that there was no way to revoke a dat or set a TTL-type field.

Can dat forget?

E.g. If I created a dat, that was it and the data in the dat was then potentially out there forever and ever and ever in the distributed network. There was no way to tell a client "this dat is only good for 90 seconds/minutes/hours/days/years/decades/etc - after that please drop/delete".

I know there is this desire in certain circles for Blockchain-stylee "everything is stored for ever and we can cryptographically prove every single byte all the way back to the dinosaurs!" sentiment, but I am not sure that really jives with a distributed data/Web publishing protocol (at least in my mind) - I want to be able to reliably "delete" something. If I know that anything I ever publish will be irrevocably around forever, it has a chilling effect on what I choose to do and publish with dat (plus maybe legal challenges? IANAL but e.g. GDPR? Right to be forgotten?)

E.g would we actually gain anything if hypothetically Google supported dat and so then every single Google search result page ever generated was stored forever in a dat? Would future users benefit from storing decades of archived versions of Google search result pages for "Facebook" (because people search for "Facebook" then click the first result instead of just typing "facebook.com") or "weather" which then need to be endlessly duplicated around the network for the rest of eternity? Seems unlikely to be of any benefit to me - surely better to mark some of it as ephemeral and let the data naturally timeout and die?

Does anyone know if that has changed with dat and data can now die? Or have I just misunderstood?


Data in a dat, just like data in a torrent, only lives as long as there are peers interested in seeding that particular dat; it's not a blockchain-style "everybody has to replicate everything".

You can't unilaterally withdraw content somebody else is also seeding; the best you can do to my knowledge is publish a new version that replaces all the dat's content with a "please delete your history of this archive" message, but as with any distributed system you have no real way of knowing if that was respected.


Here are all the projects that are using components from the dat team over the years: https://dat.foundation/explore/projects/

and https://cobox.cloud/


Wouldn't a protocol like this be a killer feature in the next package manager to solve problems npm had? Then you've just got a discoverability issue without the whole infrastructure headache on top to maintain.


DAT is not new; I've been building stuff on top of it. Good to see more people are aware of it.


Nice. Aral Balkan of small-tech.org has plans for Dat with Tincan and Site.js:

https://small-tech.org/research-and-development/



Can Dat really be described as a "new" protocol 3 years in?


Yes, as it's still under heavy development and has not yet seen heavy usage.

This mentality that things either flip instantly to mass adoption or are "old" needs to die. All the easy stuff that can be done in 6-12 months has already been done. Anything worth doing today is going to take a minimum of 1-2 years of R&D unless it's nothing more than a packaging and polish/branding of something already in existence.


> This mentality that things either flip instantly to mass adoption or are "old" needs to die.

That's honestly not where I'm coming from, though. I think Dat is a cool project and I've been aware of it from, likely, close to the beginning. The scale between "new" and "old" is not binary, nor is it necessarily related to "mass adoption". There are people who still treat React as though it's "the new hotness", despite the fact that it's been out for 6 years and is widely adopted.

I'll grant that different people will have different thresholds for what they see as "new". I just personally think Dat has gone beyond "new" and is in a phase of maturation. It's even got a browser with deep support!


We can take 'new' out of the title above, though I'm reminded of a second-hand clothing store in my home town which used to be called "New to You".


Yeah, thanks! I'm totally on board with this link being here and would love for more people to learn about Dat.



