
Content Defined Chunking is one of my favorite algorithms because it has some of the "magic" of HyperLogLog, Bloom filters, etc. It is a good algorithm to explain to people to get them inspired by computer science. I usually explain the simplest variant, based on a rolling hash.
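
For anyone curious, here is a minimal sketch of that simplest variant in Python. The window size, mask, hash constants, and size bounds are arbitrary choices for illustration (not FastCDC or any production scheme): a chunk boundary is declared wherever the low bits of a rolling hash over the last few dozen bytes are all zero, so cut points depend only on local content, and an edit near the start of a file does not shift every later boundary.

    # Minimal content-defined chunking with a Rabin-Karp style rolling hash.
    # All constants are illustrative, not tuned for any real system.
    WINDOW = 48                      # bytes covered by the rolling hash
    MASK = (1 << 13) - 1             # expect a cut roughly every 8 KiB
    MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024
    BASE, MOD = 257, (1 << 61) - 1
    OUT_POW = pow(BASE, WINDOW - 1, MOD)   # factor of the byte leaving the window

    def chunks(data: bytes):
        """Yield (offset, length) of content-defined chunks of data."""
        start, h = 0, 0
        for i, b in enumerate(data):
            if i - start >= WINDOW:                          # window is full:
                h = (h - data[i - WINDOW] * OUT_POW) % MOD   # drop the oldest byte
            h = (h * BASE + b) % MOD                         # add the new byte
            size = i - start + 1
            # Cut when the low bits of the hash are zero (content-defined),
            # bounded by minimum and maximum chunk sizes.
            if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == 0):
                yield start, size
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data) - start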

It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage such as Amazon S3 or Google Drive (you need metadata storage about the chunks, and then the chunks themselves can be deduplicated).
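
As a very rough sketch of what that metadata layer could look like (purely illustrative; not how S3 or Google Drive actually work), each blob is recorded as an ordered list of chunk hashes, and the chunk bodies live in a shared content-addressed store, so identical chunks across users are stored once. It reuses chunks() from the sketch above; the class and method names are made up.

    import hashlib

    class DedupBlobStore:
        """Toy content-addressed store: blobs are ordered lists of chunk hashes."""
        def __init__(self):
            self.chunk_data = {}   # sha256 hex digest -> chunk bytes (stored once)
            self.blobs = {}        # blob name -> ordered list of digests (metadata)

        def put(self, name: str, data: bytes) -> None:
            digests = []
            for off, length in chunks(data):            # CDC from the sketch above
                piece = data[off:off + length]
                d = hashlib.sha256(piece).hexdigest()
                self.chunk_data.setdefault(d, piece)    # dedup across all blobs
                digests.append(d)
            self.blobs[name] = digests

        def get(self, name: str) -> bytes:
            return b"".join(self.chunk_data[d] for d in self.blobs[name])

        def savings(self) -> float:
            """Fraction of logical bytes saved by deduplication."""
            logical = sum(len(self.chunk_data[d])
                          for ds in self.blobs.values() for d in ds)
            physical = sum(map(len, self.chunk_data.values()))
            return 1 - physical / logical if logical else 0.0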

PS. I don't use this algorithm in ClickHouse, but it always remains tempting.




Implementing FastCDC (2016) is fun.

Do you have a suggestion on what to read on the topic since then?

I don't keep up with these things. A quick search came up with the following but I haven't read it yet.

Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of 2019 ACM Symposium on Cloud Computing (ACM SoCC'19), Santa Cruz, CA, November, 2019.

https://ranger.uta.edu/~sjiang/index.html


Microsoft has this for DFS-R and the spec is open: https://learn.microsoft.com/en-us/openspecs/windows_protocol... - it's pretty straightforward to implement.


> It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage such as Amazon S3 or Google Drive (you need metadata storage about the chunks, and then the chunks themselves can be deduplicated).

Yes, this is truly promising, but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can't just deliver chunks to whoever has a hash (unless you're decentralized and can move this problem to your users). Why? Because this is how BitTorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. You may be able to circumvent this and shift blame back onto the user in some cases. For instance, if you are re-syncing dangerous goods that you initially uploaded over Dropbox, then Dropbox can probably blame you, even though it is technically the one distributing. But that requires Dropbox to be reasonably confident that "you" (i.e., the same legal entity) had those chunks in the first place.


That's an interesting extension of the illegal numbers or coloured bits theories, but we don't really see it used that way in practise. When governments or media industry groups crack down on this stuff, they don't go after everyone who ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router between a buyer and a seller get confiscated as if it had somehow been tainted. Honestly, this doesn't seem like more than a dystopian mental exercise.

https://en.wikipedia.org/wiki/Illegal_number
https://ansuz.sooke.bc.ca/entry/23
https://shkspr.mobi/blog/2022/11/illegal-hashes/


I'm not suggesting the hashes themselves are illegal to possess, but that transferring the bytes corresponding to those hashes is problematic: if both sides are only weakly trusted, that puts you at risk as the host of that content. This is indeed an issue with IPFS, for instance, where I believe the solutions are "pinning" content that is already vetted by another party, or denylists of "bad bits". I assume it's similar to any other clearnet hosting. Btw, I make zero value judgments about all of that.

Off topic: I see downvotes on my parent comment; please let me know if I said something wrong, to help me improve.


Shared bytes could be construed in the opposite direction: if two or more of my users have the same chunk in their files, it is more likely to be some legal piece of data.

A file becomes piracy when there is evidence of intentional copyright infringement, for example when the chunk is part of a valid MPEG4 file and that file is titled "Wednesday_S2E4_FullHD_NetflixRip.MP4".


Re last para: probably because it's full of very certain, but also quite certainly wrong, statements along the lines of "Under current legal doctrine, blobs need some form of chain of custody." Citation needed.


I can see how that’s overly assuming. Thanks for being candid.


It's not the illegalness I'm challenging, it's the problematicness. Maybe it is illegal to even think about those bit patterns. But I'm not aware of cases where people get _actually_ thrown in jail or fined for possessing or transmitting them. In all of the cases I know about there is intent involved.


It is hard to tell if this is what you are saying, but a common misconception about IPFS seems to be that you may end up hosting random unwanted files. This is untrue; you only end up hosting files you want.


Isn't the main use of BitTorrent for ML and research data? Academic Torrents is a wonderful resource and what every developer should be using if they need to provide their neural network weights, training data, etc. How is there any legal problem with using BitTorrent? It's simply much better tailored for this problem than HTTP. It doesn't make any sense to talk about 'legal problems' for torrent protocols.


What planet have you been living on? BitTorrent is widely used to distribute copyrighted material - movies, TV shows, games, programs, porn... I'd imagine a large majority of BitTorrent traffic worldwide is pirated material, with a small portion being datasets as you describe, and other legally shared data like actual Linux distros, etc.


I suppose there could be many things happening on the internet that we are unaware of; however, torrents are very good and specifically tailored as a protocol for scientific data and ML.

It solves the link-rot issues that occur when researchers move institutions, it provides huge storage essentially for free (ever tried to store 9 TB of training data or CERN data on Dropbox?), and it scales beautifully.

It's really the absolute perfect solution for reproducible research in large data studies.


Torrents are no longer the main source of copyrighted material, at least for shows and movies. There are a bunch of illegal services that provide a Netflix-like experience on top of pirated content.


Don't these services usually use torrents under the hood, though? Thinking of Stremio and Popcorn Time.


Now I feel old, using BitTorrent and Soulseek.


If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.


I think for piracy/DMCA it’s enough to simply remove it. As for CSAM or more serious stuff, I don’t know if that’s enough? Does section 230 cover that? Is there a difference between being a company and an individual?


15 years ago, when I built a deduplication file storage system, rolling hashes were on the table during design, but there were some patents on them. We ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
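
For anyone wondering why fixed-size chunking works less well, here is a toy illustration (block size and data are arbitrary): a single inserted byte shifts every later block boundary, so almost nothing deduplicates against the original file, whereas content-defined boundaries would re-align shortly after the edit.

    import hashlib, os

    def fixed_chunks(data: bytes, size: int = 4096):
        """Fixed-size chunking: trivial, but insertions shift all later boundaries."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def digests(blocks):
        return {hashlib.sha256(b).hexdigest() for b in blocks}

    original = os.urandom(1_000_000)
    edited = original[:10] + b"X" + original[10:]   # one byte inserted near the front

    shared = digests(fixed_chunks(original)) & digests(fixed_chunks(edited))
    print(f"{len(shared)} of {len(fixed_chunks(original))} blocks still deduplicate")
    # Prints 0 (or close to it): every block after the insertion point has shifted.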


Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007!! I used fixed-size chunks as well. I called it "data block coalescion"; having never heard of anyone else doing it, we figured I had invented it, and we were granted the patent(!). I used it to cram disk images for, I think, 6 different fresh-install configurations onto a single DVD. :D

Later on I used it together with VMware to build a proof of concept for a future model of computing where data was never lost. (It would snapshot the VM and add it to the storage system daily or hourly. Look ma, infinite undo at a system level!)

The next version of the algorithm was going to use what I now know is essentially a rolling hash (I called it "self delimiting data blocks"), but then the company went under.

> but there were some patents on them.

The patent system is quite silly and internally inconsistent. I'm older now, and I suspect someone thought of saving disk space through coalescion of identical blocks before I did in '06/'07, but not according to the USPTO!


EMC had a disk-based deduplication storage product at the time. NetApp had a competing product. They had patents in the area; I believe that was in the early 2000s. One of the household-name big techs had an internal product with a similar design. ZFS has a similar design.

Mine was at the block device level. The advantage is that you can format it with whatever file system you choose, with read/write support, and deduplication just works.


> Mine was at the block device level.

Same! :) Originally I wrote it with an interface kinda similar to `tar` -- you add or extract huge blobs to/from what I called a coalesced archive. I could re-image a machine about 8x faster than Norton Ghost.

After $WORK went under, I kept the code and toyed around with it, making it speak NBD so that instead of extracting a huge blob from the archive to a destination block device, you could also access it directly. I feel like I never properly solved write support, though.

I'm curious, did you think of anything better than refcounting the data blocks and then keeping a list when the count goes to zero, then adding the next unique block to the zero list? That's all I could think of, and it adds at _least_ one additional layer of indirection which I didn't like bc it would have a performance impact.

> EMC had a disk-based deduplication storage product at the time. NetApp had a competing product. They had patents in the area.

I know this _NOW_ but certainly didn't know it back then. :) And it still doesn't take away the fact that, according to the USPTO, "compression via coalescion" is miiiiine. ;-)

Again, I interpret this NOT as evidence of "how clever I am", but as evidence of how silly and broken the patent system is.


Yes. Depending on how the claim language is phrased, patents on the same idea can be approved.

For reclaiming deleted blocks, I just had a garbage collection phase that ran from time to time. I considered refcounting, as you mentioned, but it amplified writes 2x~3x, and worse, they were random-access writes. Garbage collection was not so bad since it only went through the virtual file control blocks containing the content-address hashes.

The storage layout was: file block -> virtual file control block -> dedup-data-block. The virtual file control block contained the dedup block hash entries where one control block hosted N file blocks. GC only needed to scan the control blocks to find out which dedup-data-blocks were in use.

Freed dedup-data-blocks remained in place and were linked to the free list; the first couple bytes of the free block were co-opted to store the pointer to the next free block.

In the end, brand-new file write performance degraded about 10% compared to a normal file write, which I considered acceptable: M block writes -> M dedup block writes + M/N control block writes + M/K db index updates, where N is the number of hash entries hosted in a control block and K is the number of hashes stored in one db index page. Repeated file writes were much faster due to deduplication.
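
To make the layout concrete, a toy version in Python (made-up names, in-memory only; the real thing was on-disk structures, with the free-list pointer co-opted into the freed block itself):

    # Layout: file block -> virtual file control block -> dedup data block.
    # GC scans only the control blocks to find live dedup blocks;
    # everything unreferenced goes onto the free list.

    class ControlBlock:
        """Hosts the content-address hashes for N consecutive file blocks."""
        def __init__(self):
            self.hashes = []           # dedup block hashes referenced by this file

    class DedupVolume:
        def __init__(self):
            self.dedup = {}            # content hash -> dedup block data
            self.controls = []         # all virtual file control blocks
            self.free = []             # hashes of reclaimed dedup block slots

        def gc(self):
            """Mark-and-sweep over control blocks only; no per-write refcounts."""
            live = {h for cb in self.controls for h in cb.hashes}
            for h in list(self.dedup):
                if h not in live:
                    del self.dedup[h]
                    self.free.append(h)   # on disk: link into the free list in place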




