Content Defined Chunking is one of my favorite algorithms because it has some "magic" similar to HyperLogLogs, Bloom filters, etc... This algorithm is good to explain to people, to get them inspired by computer science. I usually explain the simplest variant with rolling hashes.
It would be interesting to see what the result would be (the average savings from deduplication) if it were applied globally to a large-scale blob storage, such as Amazon S3 or Google Drive (we would need metadata storage about the chunks, and the chunks could be deduplicated).
PS. I don't use this algorithm in ClickHouse, but it always remains tempting.
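To make the "metadata storage about chunks" part concrete, here is a minimal sketch of what a deduplicating chunk store could look like (my own illustration, nothing to do with S3/Drive internals): chunks are keyed by a content hash, shared chunks are stored once with a reference count, and each blob becomes an ordered list of chunk hashes.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical deduplicating chunk store: chunks are keyed by a strong hash of
// their content (computed elsewhere); identical chunks are stored only once.
struct ChunkStore {
  struct Entry {
    std::vector<uint8_t> bytes;  // in reality this would live in blob storage
    uint64_t refcount = 0;
  };
  std::map<std::string, Entry> chunks;  // content hash -> chunk

  // Store a chunk under its hash; only the first copy costs space.
  void Put(const std::string& hash, const std::vector<uint8_t>& chunk_bytes) {
    Entry& e = chunks[hash];
    if (e.refcount == 0) e.bytes = chunk_bytes;
    ++e.refcount;
  }
};

// Per-blob metadata is then just the ordered list of chunk hashes; the
// "average saving" question is total unique chunk bytes vs. total blob bytes.
using BlobManifest = std::vector<std::string>;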
Do you have a suggestion on what to read on the topic since then?
I don't keep up with these things. A quick search came up with the following but I haven't read it yet.
Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of 2019 ACM Symposium on Cloud Computing (ACM SoCC'19), Santa Cruz, CA, November, 2019.
> It would be interesting to see what the result would be (the average savings from deduplication) if it were applied globally to a large-scale blob storage, such as Amazon S3 or Google Drive (we would need metadata storage about the chunks, and the chunks could be deduplicated).
Yes this is truly promising but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can’t just deliver chunks to whomever has a hash (unless you’re decentralized, and you can move this problem to your users). Why? Because this is how bittorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. You may be able to circumvent this to shift blame back on the user, in some cases. For instance, say you are re-syncing dangerous goods that you initially uploaded over Dropbox, then Dropbox can probably blame you, even though they are technically distributing. But that requires Dropbox to be reasonably confident that “you” (ie the same legal entity) had those chunks in the first place.
That's an interesting extension of the illegal-numbers or coloured-bits theories, but we don't really see it used that way in practice. When governments or media industry groups crack down on this stuff, they don't go after everybody who ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router between a buyer and a seller get confiscated as if they'd been somehow tainted. Honestly this doesn't seem like more than a dystopian mental exercise.
I’m not suggesting the hashes themselves are illegal to possess, but that transferring the bytes corresponding to those hashes is problematic: if both sides are lowly trusted, that puts you at risk as a hoster of that content. This is indeed an issue with IPFS, for instance, where I believe the solutions are “pinning” content that is already vetted by another party, or denylists of “bad bits”. I assume it’s similar to any other clearnet hosting. Btw, I make zero value judgments about all of that.
Off topic: I see downvotes on my parent comment, please let me know if I said something bad to help me improve.
Shared bytes could be construed in the opposite direction: if two or more of my users have the same chunk in their files, it is more likely to be some legal piece of data.
Files become piracy when there is evidence of intentional copyright infringement, for example when the chunk is part of a valid MPEG4 file and the MPEG4 file is titled "Wednesday_S2E4_FullHD_NetflixRip.MP4"
Re last para: probably because it's full of very certain, but also quite certainly wrong, statements along the lines of "Under current legal doctrine, blobs need some form of chain of custody." Citation needed.
It's not the illegalness I'm challenging, it's the problematicness. Maybe it is illegal to even think about those bit patterns. But I'm not aware of cases where people get _actually_ thrown in jail or fined for possessing or transmitting them. In all of the cases I know about there is intent involved.
It is hard to tell if this is what you are saying, but a common misconception about IPFS seems to be that you may end up hosting random unwanted files. This is untrue; you only end up hosting files you want.
Isn't the main use of bittorrent for ML and research data? Academic torrents is a wonderful resource and what every developer should be using if they need to provide their neural network weights, training data, etc.
How is there any legal problem using bittorrent? It's simply much more tailored for this problem than http. It doesn't make any sense to talk about 'Legal problems' for torrent protocols.
What planet have you been living on? Bittorrent is widely used to distribute copyrighted material - movies, TV shows, games, programs, porn... I'd imagine a large majority of bittorrent traffic worldwide is pirated material, with a small portion being datasets as you describe, and other legally-shared data like actual Linux distros, etc.
I suppose there could be many things happening on the internet that we are unaware of; however, torrents are very good and specifically tailored as a protocol for scientific data and ML.
It solves the link-rot issues that occur due to moving institutions, it allows huge storage for essentially free (ever tried to store 9 TB of training data or CERN data on Dropbox?), and it scales extremely beautifully.
It's really the absolute perfect solution for reproducible research in large data studies.
Torrents are no longer the main distribution channel for pirated material, at least for shows and movies. There are a bunch of illegal services that provide a Netflix-like experience on top of pirated content.
If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.
I think for piracy/DMCA it’s enough to simply remove it. As for CSAM or more serious stuff, I don’t know if that’s enough? Does section 230 cover that? Is there a difference between being a company and an individual?
15 years ago, when I built a deduplication file storage system, rolling hashes were on the table during design, but there were some patents on them. We ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007!! I used fixed size chunks as well. I called it "data block coalescion", having never heard of anyone else doing so we figured I invented it and we were granted the patent(!). I used it to cram disk images for I think 6 different fresh install configurations onto a single DVD. :D
Later on I used it and vmware to build a proof of concept for a future model of computing where data was never lost. (it would snapshot the vm and add it to the storage system daily or hourly. look ma, infinite undo at a system level!)
The next version of the algorithm was going to use what I now know is essentially rolling hash (I called it "self delimiting data blocks"), but then the company went under.
> but there were some patents on it.
The patent system is quite silly and internally inconsistent. I'm older now, and suspect someone thought of saving disk space through coalescion of identical blocks before I did in 06/07 but not according to the USPTO!
EMC had a disk-based deduplication storage product at the time. NetApp (Network Appliance) had a competing product. They had patents in the area. I believe that was in the early 2000s. One of the household-name big techs had an internal product with a similar design. ZFS has a similar design.
Mine was at the block device level. The advantage is you can format it to whatever file system of your choice, with read/write support and deduplication just works.
Same! :) Originally I wrote it with an interface kinda similar to `tar` -- you add or extract huge blobs to/from what I called a coalesced archive. I could re-image a machine about 8x faster than Norton Ghost.
After $WORK went under, I kept the code and toyed around with it, making it speak NBD so instead of extracting a huge blob from the archive to a destination block device you could also access it directly. I feel like I never Properly solved write support though.
I'm curious, did you think of anything better than refcounting the data blocks and then keeping a list when the count goes to zero, then adding the next unique block to the zero list? That's all I could think of, and it adds at _least_ one additional layer of indirection which I didn't like bc it would have a performance impact.
> EMC had a disk based deduplication storage at the time. NetAppliance had a competing product. They had patents in the area.
I know this _NOW_ but certainly didn't know back then. :) And still doesn't take away the fact that according to the USPTO, "compression via coalescion" is miiiiine. ;-)
Again, I interpret this NOT as evidence of "how clever I am", but as evidence of how silly and broken the patent system is.
Yes. Depending on how the claim language is phrased, patents on the same idea can be approved.
For reclaiming deleted blocks, I just had a garbage collection phase that ran from time to time. Like you mentioned with refcounting, I considered it, but it amplified writes 2x~3x, and worse, they were random-access writes. Garbage collection was not so bad since it only goes through the virtual file control blocks containing the content-address hashes.
The storage layout was: file block -> virtual file control block -> dedup-data-block.
The virtual file control block contained the dedup block hash entries where one control block hosted N file blocks. GC only needed to scan the control blocks to find out which dedup-data-blocks were in use.
Freed dedup-data-blocks remained in place and were linked into the free list; the first couple of bytes of each free block were co-opted to store the pointer to the next free block.
In the end, brand-new file write performance degraded about 10% compared to a normal file write, which I considered acceptable: M block writes -> M dedup block writes + M/N control block writes + M/K db index updates, where N is the number of hash entries hosted in a control block and K is the number of hashes stored in one db index page. Repeated file writes were much faster due to deduplication.
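If I'm reading the layout right, it maps to something like the sketch below (my paraphrase of the description, with guessed field names and sizes):

#include <cstdint>

// "N" hash entries per virtual file control block (value is a guess).
constexpr int kEntriesPerControlBlock = 64;

// One entry per file block: the content-address hash plus the location of the
// shared dedup-data-block it resolves to.
struct DedupBlockRef {
  uint8_t content_hash[20];
  uint64_t dedup_block_offset;
};

// file block -> virtual file control block -> dedup-data-block
struct VirtualFileControlBlock {
  DedupBlockRef entries[kEntriesPerControlBlock];
};

// Freed dedup-data-blocks stay in place; their first bytes are co-opted to
// hold the offset of the next free block, forming an on-disk free list.
struct FreeBlockHeader {
  uint64_t next_free_offset;
};

// GC scans only the control blocks to find which dedup-data-blocks are still
// referenced, so normal writes never pay for refcount updates.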
I take heart when reading about recovered parts of expensive ashes.
But I firmly believe that no project is wasted, especially one that pushed the boundaries of what is commonly done. All the people working there learned something. They might apply their newfound knowledge to other fields or future careers. The money invested is not lost but converted into bigger brains.
I look at VC money in much the same way. It's not great when a startup fails. But a lot is learnt.
1. I thought this was going to be something about the Center for Disease Control
2. This is a really neat project, and super expensive ashes to rise from Stadia.
3. This is targeted at Windows to Linux, but given the speed advantages that it has over rsync couldn’t this be Linux to Linux? That said, I’d be afraid to use this over rsync just from the body of experience and knowledge that exists about rsync.
Slightly OT, but I like the schematic gifs used in the Readme.md (pretty amazing doc overall!) like this one [0]. Does anyone have suggestions what tools they might have used (or might be used in general) to create those?
We used an internal Chrome extension to capture the gifs. There also seem to be some externally available GIF capturing tools, like https://chrome.google.com/webstore/detail/chrome-capture-scr... (disclaimer: I haven't tried those). We then used https://ezgif.com/optimize to remove dupe/unwanted frames, tweak timings and compress the gif. The actual content was done in Google Slides.
Nope, luckily we still partially work from home and use Chrome Remote Desktop. The command prompt was running on my Windows machine, which I accessed remotely, so I could capture it with the extension. That wouldn't have worked if I had been in the office that day.
This might be a dumb question.. but when I read this
"scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression."
my thinking was: If you want to send diffs.. why not just use git?
It does compression and chunks and all that. Maybe the defaults perform poorly for binary files? But couldn't you fix that with a custom diff algo? (if it's somehow not appropriate, it'd still be nice to port whatever secret sauce they use here to git..)
They are separate problems and not really comparable, git is a higher level tool. The diffing in git is for humans to compare code, and as such it is actually computed on the fly and not stored anywhere (unless you use git patch). Similarly, git does not provide any mechanism for sending/receiving files, it re-uses existing solutions like HTTP and SSH.
This tool exists to send/receive files, and the diffing is an implementation detail used to achieve a high level of performance. It would make more sense for git to use this library under the hood as you mentioned.
There is an analogy here, as git does deduplication 'under the hood'.
It's kind of weird actually - at the architectural level we interact with, we talk in terms of diffs - both in terms of display and what we put in.
At the next level down (content addressable store) git is storing whole files, and the git tooling translates the diffs we communicate about down into whole files for each commit.
Then at the next level down, git puts files together in packfiles (when the repo is packed) which is a compression system to make use of the fact that most files are just tweaks of other files. So, once again it's diffs.
I'm sort of in the same boat, but with the sentence,
> To help this situation, we developed two tools, cdc_rsync
Why not use rsync?
(It does seem they ended up faster than rsync, so perhaps that's "why", but that seems more like a post-hoc justification.)
The "we use variable chunk windows" bit is intriguing, but the example GIF sort of just pre-supposes that the local & remote chunks match up. That could have happened in the rsync case/GIF, but that wasn't the case considered, so it's an oranges/apples comparison. (Or, how is it that the local manages to be clairvoyant enough to choose the same windows?)
>> Or, how is it that the local manages to be clairvoyant enough to choose the same windows?
Yes! In content defined chunking, the chunk boundaries depend on the content, in our case a 64 byte window. If the local and the remote files have the same 64 byte sequence anywhere, and that 64 byte sequence has some magic pattern of 0s and 1s, they will both have chunk boundaries there. A chunk is the range of data between two chunk boundaries, so if N consecutive chunk boundaries match, then N-1 consecutive chunks match.
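To make that last point concrete, here's a toy sketch (my own illustration, not the actual cdc_rsync protocol; the boundary rule below is a stand-in for the real 64-byte-window rolling hash): both sides chunk their data with the same content-defined rule, and the sender only needs to transfer chunks the receiver doesn't already have.

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Toy boundary rule: a chunk ends after any byte whose low 5 bits are zero
// (~32-byte average chunks). The only property that matters here is that
// boundaries depend on the content alone, so both sides agree on them.
static std::vector<size_t> ChunkEnds(const std::vector<uint8_t>& data) {
  std::vector<size_t> ends;
  for (size_t i = 0; i < data.size(); ++i)
    if ((data[i] & 0x1F) == 0) ends.push_back(i + 1);
  if (ends.empty() || ends.back() != data.size()) ends.push_back(data.size());
  return ends;
}

// The receiver advertises the chunks it already has (keyed here by their raw
// bytes; a real tool would use a strong hash). The sender then only needs to
// transfer the chunks missing on the other side.
std::vector<std::string> MissingChunks(
    const std::vector<uint8_t>& local,
    const std::unordered_set<std::string>& remote_chunks) {
  std::vector<std::string> missing;
  size_t prev = 0;
  for (size_t end : ChunkEnds(local)) {
    std::string chunk(local.begin() + prev, local.begin() + end);
    if (!remote_chunks.count(chunk)) missing.push_back(chunk);
    prev = end;
  }
  return missing;
}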
I was wondering the same thing. It isn't explicit in the write-up but it's because rsync does not have a native Windows implementation, only rsync under Cygwin, so they developed this to achieve the same thing except it turned out faster than rsync.
> However, this was impractical, especially with the shift to working from home during the pandemic with sub-par internet connections. scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I didn't appreciate the scope of this problem until a friend of mine visited from the valley. I live in semirural Canada, and they were floored by the speed and low latency of the fibre connection I have.
I sort of took it for granted that people would spend the extra twenty bucks or so to make work from home a painless experience, or at least ask their employer to fund a better connection.
It caused me to reach out, and I found many WFH peers with terrible connections, low-powered laptops, and hardly a second monitor among them. And proper desks? The exception.
My family is mostly in trades, and so spending a little cash to improve my tools felt like common sense. Apparently it isn't.
You're not wrong. It's amazing how poor access can be.
I live in a tiny town on Vancouver Island and I have gigabit symmetric, for a very reasonable price. When I lived in Vancouver that simply wasn't an option.
Yeah. In my particular case the utility was already paid to deploy fiber, but they did “fiber to the node” which runs fiber to the DSL station. It’s still copper from there to the house, and in my case (end of the line in a cul de sac) it’s degraded down to only 30Mbps max bandwidth.
Unless the local node is completely saturated, I would expect you should be able to get much in excess of that. If you're ever up at 2 am try running a speed test then.
I'd also try to rule out everything on your side, house wiring, routers, switches. Basically try to speed test it with the line they wired directly to the outside.
And upgrade any old network hardware.
Unless you're talking all they offer is 30Mbps. Then that's an ISP problem.
My opinion: Internet providers running cables through public land are a natural monopoly, but they aren't really regulated as such in most of the US, where it is believed that because DSL and a 5G cell phone plan exist, there is competition.
That's wild; my parents in a rural town of about 1,000 people in France get 2 Gb fiber. The next "average" town (60k inhabitants) is 30 km away, and the next big town (Toulouse, pop. 460k) is 120 km away.
It still tickles me when people wonder why Starlink is anything someone would want.
The telcos were all given deals with assurances that they would bring rural America to something resembling what their urban counterparts have, but they've remained stagnant for some time now.
Starlink has finally given those people hope of seeing reasonable connectivity options. I would add in that T-Mobile Home Internet, and the likes, are doing so too, but on not as grand of a scale.
I'm starting the process of shifting to Australia (from NZ), and this is going to be the hardest pill to swallow - I'm paying $83 AU for 900/400 currently!
It's because of Silicon Valley, I think. We all got broadband with the first generation of technology in the 90s. Since then, there have been so many government subsidies for rolling out internet infrastructure that no telco will invest in upgrades unless there is government money to pay for it. But most of those subsidies only apply to new deployments, not upgrades. We don't qualify because we already have "broadband" (which here just means faster than dialup). So nobody is willing to pony up the cash for fiber-to-the-home deployments.
It's surprising that it would still need government subsidies to start with. I'm not really into the details that much, but around here (northern Europe) we're basically getting fiber everywhere, first in cities but later also in very rural and (by European standards) low-population areas, and there are no subsidies involved. Private companies compete: they go to communities (in rural areas) or neighborhoods (in cities or city-like areas) and try to drum up interest, and if there's enough interest they start digging. In densely populated areas it's a no-brainer; they just do it. With competition (more than one supplier is all it takes), it becomes a matter of "got to start this area before the competitor does.."
The fiber is also used for TV etc., with several providers on the same fiber, so I imagine that the fiber provider gets income from the competing TV providers as well. Could be part of the why. As for myself, I only need and only pay for actual internet.
I've had 1Gb/1Gb fiber for many years now, and lately fiber arrives in the most unexpected areas (long distances, few residents). It (the deployment, not the monthly) used to be more costly, but it's not anymore. And no public money involved. I know that it used to be, as an experiment, a couple of decades or more ago, but only in certain areas.
This problem is by no means limited to the US. Germany has that problem as well in rural areas and there government is also providing subsidies to expand broadband access.
Sometimes this can even be a problem in cities when the available infrastructure is at its limit. It's quite possible to move to an area after having checked that good internet is available, only for the provider to say no when you try to sign up.
Yep! I lived all over the Bay Area for the last 20 years and I never had the option for fiber until I moved to Oakland two months ago. It’s only $40 a month for symmetric gigabit fiber. But before I simply had no option for it.
You don't appreciate the complete lack of incentive for companies to roll out truly high speed internet in the U.S. There is minimal gov't backing, little competition, most people don't know any better and at the end of the day those companies will still have to charge the same amount per month they do now.
I spend part of my time in a rural area. Only in the last couple years was fiber even available near my house, and then it's a matter of $300/mo on a 3-year contract for 250mbps symmetric, plus the cost of running last-mile from the road to my house.
Thankfully, Starlink came out, and I was able to get like 50/20 for $110/mo on a month-to-month + $500 fixed.
I'd like additional info about the Content Defined Chunking part of the specification.
How is the content defined?

Where I'd try to begin is with a single pass that looks for runs of 'null' ('\0') bytes, even one long, as potential boundary ends. During that pass, also look for 'magic signatures' of known stream types, like already-compressed content streams (all the more so to just not try compressing them anyway). The CDC might also be aware of some file structures (zip, 7z, tar, etc.) and have a dedicated segment creation algorithm for them.

At a low level, the two ends should exchange a list of segment offsets, lengths, checksums (partial?), and maybe some short fragments of bytes to check (e.g. 4-byte chunks at various power-of-2 offsets or major chunk starts). Where the two ends differ in which chunks exist, they might also expend some minor additional effort to investigate whether the chunks identified on the other side exist locally, in case the two versions are using different filters or happened to reach different conclusions.
uint64_t hash = 0;
uint64_t magic_pattern = 0b001000010000100001000...;

for (size_t n = 0; n < data.size(); ++n) {
  // Rolling hash: shift, then mix in a value from a table of 256 random
  // 64-bit integers, keyed by the current byte.
  hash = (hash << 1) + random_table[data[n]];
  // A chunk boundary is declared wherever the selected bits are all zero.
  if ((hash & magic_pattern) == 0) {
    SetChunkBoundaryAt(n);
  }
}
In practice, there are more bells and whistles, but that's the gist of it. By tweaking the number of 1s in magic_pattern you can influence the average chunk size (the distance between two boundaries); with every additional 1, the average chunk size doubles. There's no special handling of compressed file types. You'd probably want to do that at a much higher level, e.g. just check file extensions.
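For a rough sense of the numbers: assuming the hash bits are roughly uniform, a boundary fires at a given byte with probability 2^-k, where k is the count of 1 bits in magic_pattern, so the expected chunk size is about 2^k bytes. For example, k = 13 gives ~8 KiB average chunks, and one more 1 bit pushes that to ~16 KiB.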
This is great for syncing, but what about separating Courgette from Chromium so that we can finally have a decent delta diff program? I am tired of the Windows community relying on SmartVersion to create ISO diff files just because you need to be a master at compiling Chromium to use Courgette.
It would be interesting to also try this on more underpowered systems, e.g. aging armv5-based NAS boxes. Would we see similar performance improvements, or would any improvements be hampered by poor I/O performance and architectural inefficiencies?
I built this CDC tool for software updates of embedded (Linux) systems and have deployed it with good-enough performance on a couple of ARM CPUs: https://github.com/oll3/bita
Though the main goal has been keeping data usage low rather than speeding things up.
One of the biggest problems with extremely large files is how a simple insert or delete near the front of the file causes all bytes following it to be shifted and re-written to disk. Add 3 bytes to the beginning of a 50 GB file and you are writing 50 GB to disk.
I have been implementing a file system replacement project for several years. It is designed to handle hundreds of millions of files within a single container; put contextual meta-data tags on them; and enable lightning fast searches for things based off file type and/or tags. (https://www.youtube.com/watch?v=dWIo6sia_hw)
One of the ideas (not yet fully implemented) was to break up large files at the file system level. You might have a 50 GB file of data that looks exactly like a normal file to any application accessing it, but in reality it might be 10 separate chunks of 5 GB each. If you add or delete any bytes within any individual chunk, it only adjusts that specific chunk. For example, deleting 100 bytes at offset 6 GB causes the second chunk to shrink by 100 bytes. All the chunks following it are unaffected. The file still looks to be 100 bytes smaller to the application, but it doesn't realize that a chunk in the middle was just reduced in size instead of all the bytes after the change being shifted down.
This feature would also make it easier to copy large files from one system to another. Data could be transferred one chunk at a time. If the copy was interrupted, only missing chunks would need to be copied when the process was restarted.
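A sketch of the bookkeeping this could translate to (my own illustration of the idea, not the project's actual on-disk design): the file's logical contents are the concatenation of variable-length chunks, and an edit that stays inside one chunk only touches that chunk.

#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical chunk table: the file's logical bytes are the concatenation of
// variable-length chunks; only the touched chunk is rewritten on an edit.
struct ChunkedFile {
  std::vector<std::vector<uint8_t>> chunks;

  // Map a logical file offset to (chunk index, offset within that chunk).
  std::pair<size_t, size_t> Locate(uint64_t offset) const {
    for (size_t i = 0; i < chunks.size(); ++i) {
      if (offset < chunks[i].size()) return {i, static_cast<size_t>(offset)};
      offset -= chunks[i].size();
    }
    return {chunks.size(), 0};  // offset at or past end of file
  }

  // Delete `count` bytes at `offset`. If the range stays inside one chunk,
  // only that chunk shrinks; all later chunks are untouched on disk.
  void DeleteBytes(uint64_t offset, size_t count) {
    auto [ci, co] = Locate(offset);
    if (ci < chunks.size() && co + count <= chunks[ci].size()) {
      auto& c = chunks[ci];
      c.erase(c.begin() + co, c.begin() + co + count);
    }
    // Ranges spanning chunk boundaries would touch multiple chunks (omitted).
  }
};

Lookups become a chunk-table walk (or a tree for large files), which is the price paid for cheap middle-of-file edits.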
Every file system I know (even FAT32) supports file fragmentation and could do this (give or take block boundaries), but I don't know if there are any OS APIs to take advantage of that to actually let applications insert or remove data in the middle of a file. I'm assuming it's not in POSIX.
This is also how Restic works: changing a few bytes in a 50 GB file won't re-upload 50 GB of data, unlike most backup solutions that work primarily with whole files.
This change would not just be for backups or file transfers. It would fundamentally change how big files are stored by the file system. Much like the way fragmentation is handled solely by the file system, and an application accessing the file is completely unaware whether the file is stored in a single fragment or multiple fragments, the file system would manage the individual 'chunks'.
For example, a 6 GB file might be made up of 3 separate 2 GB chunks. An application might delete 20 bytes from the front of the file. This causes the first chunk to now be 2 GB - 20 bytes. The other 2 chunks are unchanged.
Current file systems do not allow this: a file can't have a block somewhere in its interior that is just a partial block.
To elaborate: rsync chunks at fixed sizes, so inserting or deleting a few bytes makes every chunk from that point onward different.
If instead you chunk based on local content (conceptually like chunking text into sentences at periods, but it's a binary criterion with an upper and lower size limit; I couldn't find the exact algorithm specification), then after an insertion or deletion of a small number of bytes you soon start getting the same chunks as before.
This drastically reduces the cost of identifying unmodified chunks.
Your description of content defined chunking is exactly right though. There are a number of techniques for doing it. FastCDC is one of them, although not the one used in rsync.
rsync does use fixed-size chunks, but the rolling hash allows them to be identified even at offsets that aren't multiples of the chunk size.
So a change partway through the file doesn't force rsync to actually re-transfer all of the subsequent unmodified chunks, but it does incur a computational cost to find them since it has to search through all possible offsets.
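As a rough illustration of how that per-byte search stays cheap, here is a sketch in the spirit of rsync's weak rolling checksum (not its exact formula or constants): the checksum can be slid forward one byte in O(1), so the sender can test every byte offset against the receiver's per-block checksums.

#include <cstddef>
#include <cstdint>

// Adler-style rolling checksum over a window of `len` bytes.
struct RollingSum {
  uint32_t a = 0, b = 0;
  size_t len = 0;

  // Compute the checksum of the initial window.
  void Init(const uint8_t* p, size_t n) {
    a = b = 0;
    len = n;
    for (size_t i = 0; i < n; ++i) {
      a += p[i];
      b += static_cast<uint32_t>(n - i) * p[i];
    }
  }

  // Slide the window one byte: drop `out`, append `in`. O(1) per byte, which
  // is what lets the sender probe every byte offset for a matching block.
  void Roll(uint8_t out, uint8_t in) {
    a += in - out;
    b += a - static_cast<uint32_t>(len) * out;
  }

  uint32_t Digest() const { return (b << 16) | (a & 0xFFFF); }
};

Only offsets whose weak checksum matches a block on the receiver get checked with a strong hash, so the per-byte cost stays a handful of additions.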
The gotcha of "inserting or deleting a few bytes" is not in detection, it's in replicating this discovery to the target copy.
Say we have a 1 GB file and we detect an extra byte at the head of our local copy. Great, what next? We can't replicate this on the receiving end without recopying the file, which is exactly what happens: rsync recreates the target file from pieces of its old copy and the differences received from the source. Every byte is copied; it's just that some of them are copied locally.
In that light, sync tools that operate with fixed-size blocks have one very big advantage - they allow updating target files in-place and limiting per-sync IO to writes of modified blocks only. This works exceptionally well for DBs, VMs, VHDs, file system containers, etc. It doesn't work well for archives (tars, zips), compressed images (jpgs, resource packs in games) and huge executables.
In other words - know your tools and know your data. Then match them appropriately.
Technically, if you update a zip on the remote machine it'll work fine (the data gets appended in an update, and the central directory record is always at the end of the zip).
I recall that tar has no end marker at all, so you can just append a new entry to it as well, and when unpacked it'll overwrite the file from earlier in the archive. So both would work fine with rsync, unless the tar is also compressed.
The tradeoff between zip and tar.{gz,xz,z} is that zip entries are compressed in the individual file context whereas in a compressed tar the entire archive is compressed in the same context. This may be a slight win for archives with many small files with similar structure.
For reference, here's the paper that describes the in-place update algo in rsync:
https://www.usenix.org/legacy/events/usenix03/tech/freenix03.... I haven't looked into it more deeply, but I think it's possible to apply the same idea to variable sized chunks.
Also, most modern compression tools have an "rsyncable" option that makes the archives play more nicely with rsync.
Still, with modern NVMe SSD speeds, usually the network will be the bottleneck. My system with a budget WD Blue gets a decent 1800MB/s sequential write (which somehow caused Win10 to freeze and my taskbar to disappear for a few seconds :/ ) and 2600MB/s sequential read, so even if everything else is unoptimized and the file has to be copied it will still take <1s for your hypothetical 1GB file. Copying the file over a 1Gb network link will take an order of magnitude longer.
But you could keep the CDC boundary choices in memory, and if there is an update to one chunk, just compute new boundaries from that chunk until the next boundary matches? A bit more code to write, but doable?
That's because most Stadia devs used Windows, but the cloud instances ran on Linux, so devs had to copy their games from Windows to Linux. We're currently adding support for Windows to Windows as well.
Short answer, we didn't need it. While the code is largely cross-platform, there is some work involved when it gets down to the details.
We are currently working on supporting Windows to Windows. Linux to Linux has lower priority as rsync already provides all functionality, it's just a bit slower on fast connections. On slow connections, rsync and cdc_rsync perform very similarly as the sync speed is dominated by the network.
We also ran the experiment with the native Linux rsync, i.e syncing Linux to Linux, to rule out issues with Cygwin. Linux rsync performed on average 35% worse than Cygwin rsync, which can be attributed to CPU differences.
"It employs an array of 256 random 64-bit integers to map the values of the byte contents in the sliding window"
I presume using something else would skew the distribution of selected chunks. I don't know if that would help or hinder.
In my very limited experience, a picture of Rick Astley as the constants gave a similar distribution™ of chunks over a mixture of documentation files and Linux ISOs.
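For what it's worth, the table mostly just needs to look uniform at the bit level, which is presumably why arbitrary constants still give a similar boundary distribution. A sketch of the obvious way to fill it (my guess, not the project's actual code):

#include <array>
#include <cstdint>
#include <random>

// 256 random 64-bit values, one per possible byte value in the sliding window.
std::array<uint64_t, 256> MakeRandomTable(uint64_t seed) {
  std::array<uint64_t, 256> table{};
  std::mt19937_64 rng(seed);
  for (uint64_t& v : table) v = rng();
  return table;
}
// Both sides must use the identical table, or the chunk boundaries won't line up.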
Comparing it with rsync running on Cygwin is a bit unfair, as Cygwin is known to be terribly inefficient. I don't doubt that their CDC based algorithm is faster, but probably not by the margin they claim if Cygwin is taken out of the equation
IIUC, rsync computes a relatively expensive Rabin-Karp rolling hash (https://librsync.github.io/rabinkarp_8c_source.html) and performs a hash map lookup for every byte. Hash map lookups might not be very cache friendly for larger data sets. In comparison, cdc_rsync only computes a cheap shift-and-add rolling hash per byte to find chunk boundaries; hash lookups happen once per chunk rather than once per byte.
I should also note that we used a fairly fast 100 MB/sec connection to upload the data, so the rsync diffing algorithm running at 50 MB/sec is actually a bottleneck. The difference would be smaller on a slower connection, where the network overhead would dominate the results.
Imagine an rsync that was A LOT smarter for backup use cases. Detecting files that have been renamed, files that have been compressed, similar files, files that have only been appended to, files that are the same across multiple systems (Oh, yeah, I probably already have a copy of this kernel file from this other host). Basically, deduplication as part of rsync.
Exactly what I've been looking for recently. I dismissed ZFS (not mature on Linux) and btrfs (seems to be too complex and buggy, as seen in a few horror stories).
BorgBackup has most of what you're looking for, though it doesn't implement CDC and doesn't replicate the files as-is in the backup location (instead using a compressed/deduped/chunked storage format)
For single-system backups I've switched to using restic and it's been pretty great. I don't trust Borg: a couple of years ago I tried doing a recovery with it and ran into some unicode issue in, I believe, a filename, and I couldn't track down exactly which file it was or get any files restored after that file in the archive. I ended up using another backup I had.
For my multiple backups to a backup host where I'm using rsync, restic really doesn't work (having 100+ systems backed up to the same destination).
Any idea how it compares to aspera?
I may give benchmarking a shot later when I get to my computer, but if anyone has any intuition that could be helpful.
I skimmed through the readme, which explains the concepts quite well, but am unclear on what needs to be installed on each machine (assuming Windows as the source and Linux as the destination). There’s a mention of copying the Linux build output, cdc_rsync_server, to the Windows machine. Why is this needed? And is there something on the Linux machine that needs to be (newly) added in the PATH?
Just uncompress the binaries on the Windows machine and run cdc_rsync. The Linux component, cdc_rsync_server, is deployed automatically on first run. It is scp'ed to ~/.cache/cdc-file-transfer/bin. So nothing has to be installed on the Linux machine.
> At Stadia, game developers had access to Linux cloud instances to run games. Most developers wrote their games on Windows, though. Therefore, they needed a way to make them available on the remote Linux instance.
Am I reading this right that onboarding your game to Stadia as a developer involved essentially rsyncing data directly to a Linux cloud instance?
Interesting! I recently wrote about this in the context of the Google and Apple ecosystems and did not know it had a name. [1] I called it offline-first vs. online-first design.
I imagine that one could develop this further so that both ends are cross-platform, but the readme seems to really strongly state that it is solely built to transfer files from Windows to Linux. If that's all you need for an internal tool, that's all you build (and test).
Can I use this to replace some rsync cronjobs I have? I'd love to convert them to 'streaming' rather than updating every few minutes when the cron fires.
Yes, it was a contentious topic when I was working on The Division and its sequel, since it runs on a game engine that was built in house.
However, even though the games did not ship on Linux proper, the work that went in to supporting stadia had a profound effect on the games ability to work under WINE.
People often overlook Stadia as putting significant pressure on devs to support Linux, and give a lot of credit to Valve, but the reality is that both are responsible.
Yes, this was a core architectural requirement of Stadia (running the games on Linux), as opposed to hosting virtualized Windows machines (which would've allowed serving up Windows-based games).
Whether this specific design choice alone killed Stadia, no one can say -- but it was probably one of the most important factors in its demise.
For me, the two most important aspects were: 1) it's made by Google so it will probably be shut down soon, and 2) if you already have any hardware which can play games, Stadia is strictly more expensive than buying and playing games locally. That's before I even got to the point of considering the size of its game library or the technical qualities of the service.
I play all my games on Linux, and in general, there are extremely few games I haven't gotten to run. Many just work out of the box through Proton, games not on Steam usually work out of the box with Lutris. This suggests to me that the Linux thing wasn't necessarily a big issue, if Google was willing to put in some work to get existing Windows games to work. AFAIK, they didn't do this and instead required games to run natively on Linux, which would significantly reduce the size of their games library (after all, why would a company invest time into porting their games to a Google product that's going to be shut down?).
Whether the US is even a market that's ready for game streaming is another question. Are there enough people with a high quality, low latency Internet connection that's close enough to a Stadia data center, who are interested in playing demanding games, but don't have the money for hardware yet can afford a monthly subscription on top of a high per-game price? Maybe it would be interesting if it worked like other subscription services and the subscription itself gave you access to games. I think Spotify would have flopped if you had to buy albums at their retail price in addition to the monthly subscription.
Likely it was caused by your computer and browser not playing nice with x265 hardware decoding, I used to run the gamelet on the same network as my desktop and would get the same on some configurations.
Frustratingly, it was not easy to know at a glance that this was what was happening, but there were times (many times, actually) where my chromecast ultra at home was performing much better with Stadia than my gaming PC
I got the same when I was running Stadia on a 2.4 GHz wifi network. Once I switched to a wired connection, it worked like a charm. 5 GHz supposedly works as well, but my robot vacuum cleaner needs a 2.4 GHz network facepalm.
I have a PV inverter which doesn't work if there's more than one AP with the same SSID and is 2.4GHz only... ended up buying an el-cheapo AP and giving it a unique name. PITA
Shouldn't require any tweaking, the sliding window should be able to work on sqlite files just as well as the game content files used in this article. If you want something smarter and very much more sqlite optimized, you probably want to look at litefs. https://github.com/superfly/litefs
I haven't tested but I would guess it's a whole lot faster to compute but will end up sending much more data, given that it only works via exact chunks. A big part of binary patching is finding the largest possible chunks, even if they mismatch slightly.
Speaking of Courgette, though, I suggest looking into Zucchini which is faster and often produces smaller patches: https://news.ycombinator.com/item?id=29028534 (sorry for linking to my own comment but I haven't found any good explanation or benchmarks from Google)
If only it could go the other way. Everyone seems to forget about Windows/Windows Server: 80%+ desktop market share, and 55%+ server (including 70%+ on-prem and 50% cloud).
Note that cdc_rsync runs on Windows and syncs to Linux. rsync is a Linux-only tool where you'd have to jump through some hoops to make it work on Windows.
Maybe I’m projecting (probably) but I detect some subtext in the README’s language. Like a “developers who are irate that Stadia was shut down, wanting to free anything they can for the community” type feeling.
Aha! I wasn’t at all aware of that. Okay, this reads more like that to me second time around. Just excitement to open source a cool piece of tech that came from Stadia.
Apache Wave is still one-click installable on Sandstorm and occasionally hilarious to share instances of with people. Won't do it here though, would kill my poor Intel NUC, Wave is not exactly performance-friendly.
I'm currently working on a required CDC (Center for Disease Control) reporting function for a COVID test. For a second I thought this article was going to be extremely helpful.
The first computer I used professionaly was a Control Data Corporation (CDC) 6600. For a second I thought I could use this to transfer some of my old files.
Still, "syncing" is supposed to be shorthand for "synchronizing", which has the "h" you feel comes out of nowhere. So I guess both makes sense, but I don't use that form myself nor have seen anyone else use it in the wild.
They don't claim to have invented CDC, or FastCDC, they just made and are sharing a useful implementation of it.
And if that Tarsnap presentation is from 2013, and FastCDC was published in 2016 [1] according to Wikipedia [2], then presumably Tarsnap didn't invent FastCDC either.
Another well-cited predecessor is "A low-bandwidth network file system." (https://dl.acm.org/doi/abs/10.1145/502034.502052), which was published in 2001. It uses Rabin fingerprinting to define chunk boundaries.
There are many older and more venerable file transfer protocols, like FTP or TFTP. Add encryption and then you have SCP and SFTP. SFTP is the default protocol used by the scp command, and TFTP is still often used to communicate with routers and access points and similar equipment.
And I'm not sure what that has to do with TCP not being as reliable. TCP has real problems on the modern Internet, but the first part does not imply the second part.
That's not a problem with TCP, that's more of a problem with IP itself. Connections are currently identified by the source/destination IP and port numbers, but that should not have been the case. Imagine we use a UUID for the connection and keep an updatable cache between UUID and IP.
Still nothing to do with TCP.
Also, this kind of partial file transfer protocol is still needed because what if a host crashes? Even if we completely solve the mobile IP problem we would still rely on both ends of the connection tracking state. That state is lost due to a crash or power failure.
These chunked transfers are used to efficiently sync only the changed parts of the file. The actual network traffic is still over TCP. You're conflating several layers and abstractions.
Always interesting to see the sort of tangential projects that get rolled into the budget of larger projects at companies like Google. Funding the development of yet another file transfer utility, interesting as this one is, definitely wasn't necessary.
For devs it's only inconvenient and annoying, but for the devs' company the inefficiency means money.
At our company, almost all of us were working remotely during the pandemic (and many still are), and anything that slowed a developer down or made them inefficient was looked at and improved where possible. Because it's about money (though not only about money).
The acronym CDC meaning Content-Defined Chunking wasn't invented by Google I don't think. Here's the 2016 paper introducing the FastCDC algorithm[1]. Disclosure: I work at Google, but not on anything related to this.