Content Defined Chunking is one of my favorite algorithms because it has some "magic" similar to HyperLogLogs, Bloom filters, etc... This algorithm is good to explain to people, to get them inspired by computer science. I usually explain the simplest variant with rolling hashes.
It would be interesting to see what the result would be (the average savings from deduplication) if it were applied globally to a large-scale blob storage, such as Amazon S3 or Google Drive (we would need metadata storage about the chunks, and the chunks could be deduplicated).
PS. I don't use this algorithm in ClickHouse, but it always remains tempting.
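To make the "metadata storage about chunks" part concrete, here is a minimal sketch of what a deduplicating chunk store could look like (my own illustration, nothing to do with S3/Drive internals): chunks are keyed by a content hash, shared chunks are stored once with a reference count, and each blob becomes an ordered list of chunk hashes.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical deduplicating chunk store: chunks are keyed by a strong hash of
// their content (computed elsewhere); identical chunks are stored only once.
struct ChunkStore {
  struct Entry {
    std::vector<uint8_t> bytes;  // in reality this would live in blob storage
    uint64_t refcount = 0;
  };
  std::map<std::string, Entry> chunks;  // content hash -> chunk

  // Store a chunk under its hash; only the first copy costs space.
  void Put(const std::string& hash, const std::vector<uint8_t>& chunk_bytes) {
    Entry& e = chunks[hash];
    if (e.refcount == 0) e.bytes = chunk_bytes;
    ++e.refcount;
  }
};

// Per-blob metadata is then just the ordered list of chunk hashes; the
// "average saving" question is total unique chunk bytes vs. total blob bytes.
using BlobManifest = std::vector<std::string>;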
Do you have a suggestion on what to read on the topic since then?
I don't keep up with these things. A quick search came up with the following but I haven't read it yet.
Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of 2019 ACM Symposium on Cloud Computing (ACM SoCC'19), Santa Cruz, CA, November, 2019.
> It would be interesting to see what the result would be (the average savings from deduplication) if it were applied globally to a large-scale blob storage, such as Amazon S3 or Google Drive (we would need metadata storage about the chunks, and the chunks could be deduplicated).
Yes this is truly promising but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can’t just deliver chunks to whomever has a hash (unless you’re decentralized, and you can move this problem to your users). Why? Because this is how bittorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. You may be able to circumvent this to shift blame back on the user, in some cases. For instance, say you are re-syncing dangerous goods that you initially uploaded over Dropbox, then Dropbox can probably blame you, even though they are technically distributing. But that requires Dropbox to be reasonably confident that “you” (ie the same legal entity) had those chunks in the first place.
That's an interesting extension of the illegal-numbers or coloured-bits theories, but we don't really see it used that way in practice. When governments or media industry groups crack down on this stuff, they don't go after everybody who ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router between a buyer and a seller get confiscated as if they'd been somehow tainted. Honestly this doesn't seem like more than a dystopian mental exercise.
I’m not suggesting the hashes themselves are illegal to possess, but that transferring the bytes corresponding to those hashes is problematic: if both sides are lowly trusted, that puts you at risk as a hoster of that content. This is indeed an issue with IPFS, for instance, where I believe the solutions are “pinning” content that is already vetted by another party, or denylists of “bad bits”. I assume it’s similar to any other clearnet hosting. Btw, I make zero value judgments about all of that.
Off topic: I see downvotes on my parent comment, please let me know if I said something bad to help me improve.
Shared bytes could be construed in the opposite direction: if two or more of my users have the same chunk in their files, it is more likely to be some legal piece of data.
Files become piracy when there is evidence of intentional copyright infringement, for example when the chunk is part of a valid MPEG4 file and the MPEG4 file is titled "Wednesday_S2E4_FullHD_NetflixRip.MP4"
Re last para: probably because it's full of very certain, but also quite certainly wrong, statements along the lines of "Under current legal doctrine, blobs need some form of chain of custody." Citation needed.
It's not the illegalness I'm challenging, it's the problematicness. Maybe it is illegal to even think about those bit patterns. But I'm not aware of cases where people get _actually_ thrown in jail or fined for possessing or transmitting them. In all of the cases I know about there is intent involved.
It is hard to tell if this is what you are saying, but a common misconception about IPFS seems to be that you may end up hosting random unwanted files. This is untrue; you only end up hosting files you want.
Isn't the main use of bittorrent for ML and research data? Academic torrents is a wonderful resource and what every developer should be using if they need to provide their neural network weights, training data, etc.
How is there any legal problem using bittorrent? It's simply much more tailored for this problem than http. It doesn't make any sense to talk about 'Legal problems' for torrent protocols.
What planet have you been living on? Bittorrent is widely used to distribute copyrighted material - movies, TV shows, games, programs, porn... I'd imagine a large majority of bittorrent traffic worldwide is pirated material, with a small portion being datasets as you describe, and other legally-shared data like actual Linux distros, etc.
I suppose there could be many things happening on the internet that we are unaware of; however, torrents are very good and specifically tailored as a protocol for scientific data and ML.
It solves the link-rot issues that occur due to moving institutions, it allows huge storage for essentially free (ever tried to store 9 TB of training data or CERN data on Dropbox?), and it scales extremely beautifully.
It's really the absolute perfect solution for reproducible research in large data studies.
Torrents are no longer the main distribution channel for pirated material, at least for shows and movies. There are a bunch of illegal services that provide a Netflix-like experience on top of pirated content.
If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.
I think for piracy/DMCA it’s enough to simply remove it. As for CSAM or more serious stuff, I don’t know if that’s enough? Does section 230 cover that? Is there a difference between being a company and an individual?
15 years ago, when I built a deduplication file storage system, rolling hashes were on the table during design, but there were some patents on them. We ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007!! I used fixed size chunks as well. I called it "data block coalescion", having never heard of anyone else doing so we figured I invented it and we were granted the patent(!). I used it to cram disk images for I think 6 different fresh install configurations onto a single DVD. :D
Later on I used it and vmware to build a proof of concept for a future model of computing where data was never lost. (it would snapshot the vm and add it to the storage system daily or hourly. look ma, infinite undo at a system level!)
The next version of the algorithm was going to use what I now know is essentially rolling hash (I called it "self delimiting data blocks"), but then the company went under.
> but there were some patents on it.
The patent system is quite silly and internally inconsistent. I'm older now, and suspect someone thought of saving disk space through coalescion of identical blocks before I did in 06/07 but not according to the USPTO!
EMC had a disk-based deduplication storage product at the time. NetApp (Network Appliance) had a competing product. They had patents in the area. I believe that was in the early 2000s. One of the household-name big techs had an internal product with a similar design. ZFS has a similar design.
Mine was at the block device level. The advantage is you can format it to whatever file system of your choice, with read/write support and deduplication just works.
Same! :) Originally I wrote it with an interface kinda similar to `tar` -- you add or extract huge blobs to/from what I called a coalesced archive. I could re-image a machine about 8x faster than Norton Ghost.
After $WORK went under, I kept the code and toyed around with it, making it speak NBD so instead of extracting a huge blob from the archive to a destination block device you could also access it directly. I feel like I never Properly solved write support though.
I'm curious, did you think of anything better than refcounting the data blocks and then keeping a list when the count goes to zero, then adding the next unique block to the zero list? That's all I could think of, and it adds at _least_ one additional layer of indirection which I didn't like bc it would have a performance impact.
> EMC had a disk based deduplication storage at the time. NetAppliance had a competing product. They had patents in the area.
I know this _NOW_ but certainly didn't know back then. :) And still doesn't take away the fact that according to the USPTO, "compression via coalescion" is miiiiine. ;-)
Again, I interpret this NOT as evidence of "how clever I am", but as evidence of how silly and broken the patent system is.
Yes. Depending on how the claim language is phrased, patents on the same idea can be approved.
For reclaiming deleted blocks, I just had a garbage collection phase that ran from time to time. Like you mentioned with refcounting, I considered it, but it amplified writes 2x~3x, and worse, they were random-access writes. Garbage collection was not so bad since it only goes through the virtual file control blocks containing the content-address hashes.
The storage layout was: file block -> virtual file control block -> dedup-data-block.
The virtual file control block contained the dedup block hash entries where one control block hosted N file blocks. GC only needed to scan the control blocks to find out which dedup-data-blocks were in use.
Freed dedup-data-blocks remained in place and were linked into the free list; the first couple of bytes of each free block were co-opted to store the pointer to the next free block.
In the end, brand-new file write performance degraded about 10% compared to a normal file write, which I considered acceptable: M block writes -> M dedup block writes + M/N control block writes + M/K db index updates, where N is the number of hash entries hosted in a control block and K is the number of hashes stored in one db index page. Repeated file writes were much faster due to deduplication.
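If I'm reading the layout right, it maps to something like the sketch below (my paraphrase of the description, with guessed field names and sizes):

#include <cstdint>

// "N" hash entries per virtual file control block (value is a guess).
constexpr int kEntriesPerControlBlock = 64;

// One entry per file block: the content-address hash plus the location of the
// shared dedup-data-block it resolves to.
struct DedupBlockRef {
  uint8_t content_hash[20];
  uint64_t dedup_block_offset;
};

// file block -> virtual file control block -> dedup-data-block
struct VirtualFileControlBlock {
  DedupBlockRef entries[kEntriesPerControlBlock];
};

// Freed dedup-data-blocks stay in place; their first bytes are co-opted to
// hold the offset of the next free block, forming an on-disk free list.
struct FreeBlockHeader {
  uint64_t next_free_offset;
};

// GC scans only the control blocks to find which dedup-data-blocks are still
// referenced, so normal writes never pay for refcount updates.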
I take heart when reading about recovered parts of expensive ashes.
But I firmly believe that no project is wasted, especially one that pushed the boundaries of what is commonly done. All the people working there learned something. They might apply their newfound knowledge to other fields or future careers. The money invested is not lost but converted into bigger brains.
I look at VC money in much the same way. It's not great when a startup fails. But a lot is learnt.
1. I thought this was going to be something about the Center for Disease Control
2. This is a really neat project, and super expensive ashes to rise from Stadia.
3. This is targeted at Windows to Linux, but given the speed advantages that it has over rsync couldn’t this be Linux to Linux? That said, I’d be afraid to use this over rsync just from the body of experience and knowledge that exists about rsync.
Slightly OT, but I like the schematic gifs used in the Readme.md (pretty amazing doc overall!) like this one [0]. Does anyone have suggestions what tools they might have used (or might be used in general) to create those?
We used an internal Chrome extension to capture the gifs. There also seem to be some externally available GIF capturing tools, like https://chrome.google.com/webstore/detail/chrome-capture-scr... (disclaimer: I haven't tried those). We then used https://ezgif.com/optimize to remove dupe/unwanted frames, tweak timings and compress the gif. The actual content was done in Google Slides.
Nope, luckily we still partially work from home and use Chrome Remote Desktop. The command prompt was running on my Windows machine, which I accessed remotely, so I could capture it with the extension. That wouldn't have worked if I had been in the office that day.
This might be a dumb question.. but when I read this
"scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression."
my thinking was: If you want to send diffs.. why not just use git?
It does compression and chunks and all that. Maybe the defaults perform poorly for binary files? But couldn't you fix that with a custom diff algo? (if it's somehow not appropriate, it'd still be nice to port whatever secret sauce they use here to git..)
They are separate problems and not really comparable, git is a higher level tool. The diffing in git is for humans to compare code, and as such it is actually computed on the fly and not stored anywhere (unless you use git patch). Similarly, git does not provide any mechanism for sending/receiving files, it re-uses existing solutions like HTTP and SSH.
This tool exists to send/receive files, and the diffing is an implementation detail used to achieve a high level of performance. It would make more sense for git to use this library under the hood as you mentioned.
There is an analogy here, as git does deduplication 'under the hood'.
It's kind of weird actually - at the architectural level we interact with, we talk in terms of diffs - both in terms of display and what we put in.
At the next level down (content addressable store) git is storing whole files, and the git tooling translates the diffs we communicate about down into whole files for each commit.
Then at the next level down, git puts files together in packfiles (when the repo is packed) which is a compression system to make use of the fact that most files are just tweaks of other files. So, once again it's diffs.
I'm sort of in the same boat, but with the sentence,
> To help this situation, we developed two tools, cdc_rsync
Why not use rsync?
(It does seem they ended up faster than rsync, so perhaps that's "why", but that seems more like a post-hoc justification.)
The "we use variable chunk windows" bit is intriguing, but the example GIF sort of just pre-supposes that the local & remote chunks match up. That could have happened in the rsync case/GIF, but that wasn't the case considered, so it's an oranges/apples comparison. (Or, how is it that the local manages to be clairvoyant enough to choose the same windows?)
>> Or, how is it that the local manages to be clairvoyant enough to choose the same windows?
Yes! In content defined chunking, the chunk boundaries depend on the content, in our case a 64 byte window. If the local and the remote files have the same 64 byte sequence anywhere, and that 64 byte sequence has some magic pattern of 0s and 1s, they will both have chunk boundaries there. A chunk is the range of data between two chunk boundaries, so if N consecutive chunk boundaries match, then N-1 consecutive chunks match.
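To make that last point concrete, here's a toy sketch (my own illustration, not the actual cdc_rsync protocol; the boundary rule below is a stand-in for the real 64-byte-window rolling hash): both sides chunk their data with the same content-defined rule, and the sender only needs to transfer chunks the receiver doesn't already have.

#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Toy boundary rule: a chunk ends after any byte whose low 5 bits are zero
// (~32-byte average chunks). The only property that matters here is that
// boundaries depend on the content alone, so both sides agree on them.
static std::vector<size_t> ChunkEnds(const std::vector<uint8_t>& data) {
  std::vector<size_t> ends;
  for (size_t i = 0; i < data.size(); ++i)
    if ((data[i] & 0x1F) == 0) ends.push_back(i + 1);
  if (ends.empty() || ends.back() != data.size()) ends.push_back(data.size());
  return ends;
}

// The receiver advertises the chunks it already has (keyed here by their raw
// bytes; a real tool would use a strong hash). The sender then only needs to
// transfer the chunks missing on the other side.
std::vector<std::string> MissingChunks(
    const std::vector<uint8_t>& local,
    const std::unordered_set<std::string>& remote_chunks) {
  std::vector<std::string> missing;
  size_t prev = 0;
  for (size_t end : ChunkEnds(local)) {
    std::string chunk(local.begin() + prev, local.begin() + end);
    if (!remote_chunks.count(chunk)) missing.push_back(chunk);
    prev = end;
  }
  return missing;
}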
I was wondering the same thing. It isn't explicit in the write-up but it's because rsync does not have a native Windows implementation, only rsync under Cygwin, so they developed this to achieve the same thing except it turned out faster than rsync.
> However, this was impractical, especially with the shift to working from home during the pandemic with sub-par internet connections. scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I didn't appreciate the scope of this problem until a friend of mine visited from the valley. I live in semirural Canada, and they were floored by the speed and low latency of the fibre connection I have.
I sort of took it for granted that people would spend the extra twenty bucks or so to make work from home a painless experience, or at least ask their employer to fund a better connection.
It caused me to reach out, and I found many WFH peers with terrible connections, low-powered laptops, and hardly a second monitor among them. And proper desks? The exception.
My family is mostly in trades, and so spending a little cash to improve my tools felt like common sense. Apparently it isn't.
You're not wrong. It's amazing how poor access can be.
I live in a tiny town on Vancouver Island and I have gigabit symmetric, for a very reasonable price. When I lived in Vancouver that simply wasn't an option.
Yeah. In my particular case the utility was already paid to deploy fiber, but they did “fiber to the node” which runs fiber to the DSL station. It’s still copper from there to the house, and in my case (end of the line in a cul de sac) it’s degraded down to only 30Mbps max bandwidth.
Unless the local node is completely saturated, I would expect you should be able to get much in excess of that. If you're ever up at 2 am try running a speed test then.
I'd also try to rule out everything on your side, house wiring, routers, switches. Basically try to speed test it with the line they wired directly to the outside.
And upgrade any old network hardware.
Unless you're talking all they offer is 30Mbps. Then that's an ISP problem.
My opinion: Internet providers running cables through public land are a natural monopoly, but they aren't really regulated as such in most of the US, where it is believed that because DSL and a 5G cell phone plan exist, there is competition.
That's wild; my parents in a rural town of about 1,000 people in France get 2 Gb fiber. The next "average" town (60k inhabitants) is 30 km away, and the next big town (Toulouse, pop. 460k) is 120 km away.
It still tickles me when people wonder why Starlink is anything someone would want.
The telcos were all given deals with assurances that they would bring rural America to something resembling what their urban counterparts have, but they've remained stagnant for some time now.
Starlink has finally given those people hope of seeing reasonable connectivity options. I would add in that T-Mobile Home Internet, and the likes, are doing so too, but on not as grand of a scale.
I'm starting the process of shifting to Australia (from NZ), and this is going to be the hardest pill to swallow - I'm paying $83 AU for 900/400 currently!
It's because of Silicon Valley, I think. We all got broadband with the first generation of technology in the 90s. Since then, there have been so many government subsidies for rolling out internet infrastructure that no telco will invest in upgrades unless there is government money to pay for it. But most of those subsidies only apply to new deployments, not upgrades. We don't qualify because we already have "broadband" (which here just means faster than dialup). So nobody is willing to pony up the cash for fiber-to-the-home deployments.
It's surprising that it would still need government subsidies to start with. I'm not really into the details that much, but around here (northern Europe) we're basically getting fiber everywhere, first in cities but later also in very rural and (by European standards) low-population areas, and there are no subsidies involved. Private companies compete: they go to communities (in rural areas) or neighborhoods (in cities or city-like areas) and try to drum up interest, and if there's enough interest they start digging. In densely populated areas it's a no-brainer; they just do it. With competition (more than one supplier is all it takes), it becomes a matter of "got to start this area before the competitor does.."
The fiber is also used for TV etc., with several providers on the same fiber, so I imagine that the fiber provider gets income from the competing TV providers as well. Could be part of the why. As for myself, I only need and only pay for actual internet.
I've had 1Gb/1Gb fiber for many years now, and lately fiber arrives in the most unexpected areas (long distances, few residents). It (the deployment, not the monthly) used to be more costly, but it's not anymore. And no public money involved. I know that it used to be, as an experiment, a couple of decades or more ago, but only in certain areas.
This problem is by no means limited to the US. Germany has that problem as well in rural areas and there government is also providing subsidies to expand broadband access.
Sometimes this can even be a problem in cities when the available infrastructure is at its limit. It's quite possible to move to an area after having checked that good internet is available, only for the provider to say no when you try to sign up.
Yep! I lived all over the Bay Area for the last 20 years and I never had the option for fiber until I moved to Oakland two months ago. It’s only $40 a month for symmetric gigabit fiber. But before I simply had no option for it.
You don't appreciate the complete lack of incentive for companies to roll out truly high speed internet in the U.S. There is minimal gov't backing, little competition, most people don't know any better and at the end of the day those companies will still have to charge the same amount per month they do now.
I spend part of my time in a rural area. Only in the last couple years was fiber even available near my house, and then it's a matter of $300/mo on a 3-year contract for 250mbps symmetric, plus the cost of running last-mile from the road to my house.
Thankfully, Starlink came out, and I was able to get like 50/20 for $110/mo on a month-to-month + $500 fixed.
I'd like additional info about the Content Defined Chunking part of the specification.
How is the content defined?

Where I'd try to begin is with a single pass that looks for runs of 'null' ('\0') bytes, even one long, as potential boundary ends. During that pass, also look for 'magic signatures' of known stream types, like already-compressed content streams (all the more so to just not try compressing them anyway). The CDC might also be aware of some file structures (zip, 7z, tar, etc.) and have a dedicated segment creation algorithm for them.

At a low level, the two ends should exchange a list of segment offsets, lengths, checksums (partial?), and maybe some short fragments of bytes to check (e.g. 4-byte chunks at various power-of-2 offsets or major chunk starts). Where the two ends differ in which chunks exist, they might also expend some minor additional effort to investigate whether the chunks identified on the other side exist locally, in case the two versions are using different filters or happened to reach different conclusions.
uint64_t hash = 0;
uint64_t magic_pattern = 0b001000010000100001000...;

for (size_t n = 0; n < data.size(); ++n) {
  // Rolling hash: shift, then mix in a value from a table of 256 random
  // 64-bit integers, keyed by the current byte.
  hash = (hash << 1) + random_table[data[n]];
  // A chunk boundary is declared wherever the selected bits are all zero.
  if ((hash & magic_pattern) == 0) {
    SetChunkBoundaryAt(n);
  }
}
In practice, there are more bells and whistles, but that's the gist of it. By tweaking the number of 1s in magic_pattern you can influence the average chunk size (the distance between two boundaries); with every additional 1, the average chunk size doubles. There's no special handling of compressed file types. You'd probably want to do that at a much higher level, e.g. just check file extensions.
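For a rough sense of the numbers: assuming the hash bits are roughly uniform, a boundary fires at a given byte with probability 2^-k, where k is the count of 1 bits in magic_pattern, so the expected chunk size is about 2^k bytes. For example, k = 13 gives ~8 KiB average chunks, and one more 1 bit pushes that to ~16 KiB.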
This is great for syncing, but what about separating Courgette from Chromium so that we can finally have a decent delta diff program? I am tired of the Windows community relying on SmartVersion to create ISO diff files just because you need to be a master at compiling Chromium to use Courgette.
It would be interesting to also try this on more underpowered systems, e.g. aging armv5-based NAS boxes. Would we see similar performance improvements, or would any improvements be hampered by poor I/O performance and architectural inefficiencies?
I built this CDC tool for software updates of embedded (Linux) systems and have deployed it with good-enough performance on a couple of ARM CPUs: https://github.com/oll3/bita
Though the main goal has been keeping data usage low rather than speeding things up.
One of the biggest problems with extremely large files is how a simple insert or delete near the front of the file causes all bytes following it to be shifted and re-written to disk. Add 3 bytes to the beginning of a 50 GB file and you are writing 50 GB to disk.
I have been implementing a file system replacement project for several years. It is designed to handle hundreds of millions of files within a single container; put contextual meta-data tags on them; and enable lightning fast searches for things based off file type and/or tags. (https://www.youtube.com/watch?v=dWIo6sia_hw)
One of the ideas (not yet fully implemented) was to break up large files at the file system level. You might have a 50 GB file of data that looks exactly like a normal file to any application accessing it, but in reality it might be 10 separate chunks of 5 GB each. If you add or delete any bytes within any individual chunk, it only adjusts that specific chunk. For example, deleting 100 bytes at offset 6 GB causes the second chunk to shrink by 100 bytes. All the chunks following it are unaffected. The file still looks to be 100 bytes smaller to the application, but it doesn't realize that a chunk in the middle was just reduced in size instead of all the bytes after the change being shifted down.
This feature would also make it easier to copy large files from one system to another. Data could be transferred one chunk at a time. If the copy was interrupted, only missing chunks would need to be copied when the process was restarted.
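A sketch of the bookkeeping this could translate to (my own illustration of the idea, not the project's actual on-disk design): the file's logical contents are the concatenation of variable-length chunks, and an edit that stays inside one chunk only touches that chunk.

#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical chunk table: the file's logical bytes are the concatenation of
// variable-length chunks; only the touched chunk is rewritten on an edit.
struct ChunkedFile {
  std::vector<std::vector<uint8_t>> chunks;

  // Map a logical file offset to (chunk index, offset within that chunk).
  std::pair<size_t, size_t> Locate(uint64_t offset) const {
    for (size_t i = 0; i < chunks.size(); ++i) {
      if (offset < chunks[i].size()) return {i, static_cast<size_t>(offset)};
      offset -= chunks[i].size();
    }
    return {chunks.size(), 0};  // offset at or past end of file
  }

  // Delete `count` bytes at `offset`. If the range stays inside one chunk,
  // only that chunk shrinks; all later chunks are untouched on disk.
  void DeleteBytes(uint64_t offset, size_t count) {
    auto [ci, co] = Locate(offset);
    if (ci < chunks.size() && co + count <= chunks[ci].size()) {
      auto& c = chunks[ci];
      c.erase(c.begin() + co, c.begin() + co + count);
    }
    // Ranges spanning chunk boundaries would touch multiple chunks (omitted).
  }
};

Lookups become a chunk-table walk (or a tree for large files), which is the price paid for cheap middle-of-file edits.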
Every file system I know (even FAT32) supports file fragmentation and could do this (give or take block boundaries), but I don't know if there are any OS APIs to take advantage of that to actually let applications insert or remove data in the middle of a file. I'm assuming it's not in POSIX.
This is also how Restic works: changing a few bytes in a 50 GB file won't re-upload 50 GB of data, unlike most backup solutions that work primarily with whole files.
This change would not just be for backups or file transfers. It would fundamentally change how big files are stored by the file system. Much like the way fragmentation is handled solely by the file system, and an application accessing the file is completely unaware whether the file is stored in a single fragment or multiple fragments, the file system would manage the individual 'chunks'.
For example, a 6 GB file might be made up of 3 separate 2 GB chunks. An application might delete 20 bytes from the front of the file. This causes the first chunk to now be 2 GB - 20 bytes. The other 2 chunks are unchanged.
Current file systems do not allow this: a file can't have a block somewhere in its interior that is just a partial block.
To elaborate: rsync chunks at fixed sizes, so inserting or deleting a few bytes makes every chunk from that point onward different.
If instead you chunk based on local content (conceptually like chunking text into sentences at periods, but it's a binary criterion with an upper and lower size limit; I couldn't find the exact algorithm specification), then after an insertion or deletion of a small number of bytes you soon start getting the same chunks as before.
This drastically reduces the cost of identifying unmodified chunks.
Your description of content defined chunking is exactly right though. There are a number of techniques for doing it. FastCDC is one of them, although not the one used in rsync.
rsync does use fixed-size chunks, but the rolling hash allows them to be identified even at offsets that aren't multiples of the chunk size.
So a change partway through the file doesn't force rsync to actually re-transfer all of the subsequent unmodified chunks, but it does incur a computational cost to find them since it has to search through all possible offsets.
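As a rough illustration of how that per-byte search stays cheap, here is a sketch in the spirit of rsync's weak rolling checksum (not its exact formula or constants): the checksum can be slid forward one byte in O(1), so the sender can test every byte offset against the receiver's per-block checksums.

#include <cstddef>
#include <cstdint>

// Adler-style rolling checksum over a window of `len` bytes.
struct RollingSum {
  uint32_t a = 0, b = 0;
  size_t len = 0;

  // Compute the checksum of the initial window.
  void Init(const uint8_t* p, size_t n) {
    a = b = 0;
    len = n;
    for (size_t i = 0; i < n; ++i) {
      a += p[i];
      b += static_cast<uint32_t>(n - i) * p[i];
    }
  }

  // Slide the window one byte: drop `out`, append `in`. O(1) per byte, which
  // is what lets the sender probe every byte offset for a matching block.
  void Roll(uint8_t out, uint8_t in) {
    a += in - out;
    b += a - static_cast<uint32_t>(len) * out;
  }

  uint32_t Digest() const { return (b << 16) | (a & 0xFFFF); }
};

Only offsets whose weak checksum matches a block on the receiver get checked with a strong hash, so the per-byte cost stays a handful of additions.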
The gotcha of "inserting or deleting a few bytes" is not in detection, it's in replicating this discovery to the target copy.
Say we have a 1 GB file and we detect an extra byte at the head of our local copy. Great, what next? We can't replicate this on the receiving end without recopying the file, which is exactly what happens: rsync recreates the target file from pieces of its old copy and the differences received from the source. Every byte is copied; it's just that some of them are copied locally.
In that light, sync tools that operate with fixed-size blocks have one very big advantage - they allow updating target files in-place and limiting per-sync IO to writes of modified blocks only. This works exceptionally well for DBs, VMs, VHDs, file system containers, etc. It doesn't work well for archives (tars, zips), compressed images (jpgs, resource packs in games) and huge executables.
In other words - know your tools and know your data. Then match them appropriately.
Technically, if you update a zip on the remote machine it'll work fine (the data gets appended in an update, and the central directory record is always at the end of the zip).
I recall that tar has no end marker at all, so you can just append a new entry to it as well, and when unpacked it'll overwrite the file from earlier in the archive. So both would work fine with rsync, unless the tar is also compressed.
The tradeoff between zip and tar.{gz,xz,z} is that zip entries are compressed in the individual file context whereas in a compressed tar the entire archive is compressed in the same context. This may be a slight win for archives with many small files with similar structure.
For reference, here's the paper that describes the in-place update algo in rsync:
https://www.usenix.org/legacy/events/usenix03/tech/freenix03.... I haven't looked into it more deeply, but I think it's possible to apply the same idea to variable sized chunks.
Also, most modern compression tools have an "rsyncable" option that makes the archives play more nicely with rsync.
Still, with modern NVMe SSD speeds, usually the network will be the bottleneck. My system with a budget WD Blue gets a decent 1800MB/s sequential write (which somehow caused Win10 to freeze and my taskbar to disappear for a few seconds :/ ) and 2600MB/s sequential read, so even if everything else is unoptimized and the file has to be copied it will still take <1s for your hypothetical 1GB file. Copying the file over a 1Gb network link will take an order of magnitude longer.
But you could keep the CDC boundary choices in memory, and if there is an update to one chunk, just compute new boundaries from that chunk until the next boundary matches? A bit more code to write, but doable?
That's because most Stadia devs used Windows, but the cloud instances ran on Linux, so devs had to copy their games from Windows to Linux. We're currently adding support for Windows to Windows as well.
Short answer, we didn't need it. While the code is largely cross-platform, there is some work involved when it gets down to the details.
We are currently working on supporting Windows to Windows. Linux to Linux has lower priority as rsync already provides all functionality, it's just a bit slower on fast connections. On slow connections, rsync and cdc_rsync perform very similarly as the sync speed is dominated by the network.
We also ran the experiment with the native Linux rsync, i.e syncing Linux to Linux, to rule out issues with Cygwin. Linux rsync performed on average 35% worse than Cygwin rsync, which can be attributed to CPU differences.
"It employs an array of 256 random 64-bit integers to map the values of the byte contents in the sliding window"
I presume using something else would skew the distribution of selected chunks. I don't know if that would help or hinder.
In my very limited experience, a picture of Rick Astley as the constants gave a similar distribution™ of chunks over a mixture of documentation files and Linux ISOs.
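For what it's worth, the table mostly just needs to look uniform at the bit level, which is presumably why arbitrary constants still give a similar boundary distribution. A sketch of the obvious way to fill it (my guess, not the project's actual code):

#include <array>
#include <cstdint>
#include <random>

// 256 random 64-bit values, one per possible byte value in the sliding window.
std::array<uint64_t, 256> MakeRandomTable(uint64_t seed) {
  std::array<uint64_t, 256> table{};
  std::mt19937_64 rng(seed);
  for (uint64_t& v : table) v = rng();
  return table;
}
// Both sides must use the identical table, or the chunk boundaries won't line up.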
Comparing it with rsync running on Cygwin is a bit unfair, as Cygwin is known to be terribly inefficient. I don't doubt that their CDC based algorithm is faster, but probably not by the margin they claim if Cygwin is taken out of the equation
IIUC, rsync computes a relatively expensive Rabin-Karp rolling hash (https://librsync.github.io/rabinkarp_8c_source.html) and performs a hash map lookup for every byte. Hash map lookups might not be very cache friendly for larger data sets. In comparison, cdc_rsync only computes a cheap shift-and-add rolling hash per byte to find chunk boundaries; hash lookups happen once per chunk rather than once per byte.
I should also note that we used a fairly fast 100 MB/sec connection to upload the data, so the rsync diffing algorithm running at 50 MB/sec is actually a bottleneck. The difference would be smaller on a slower connection, where the network overhead would dominate the results.
Imagine an rsync that was A LOT smarter for backup use cases. Detecting files that have been renamed, files that have been compressed, similar files, files that have only been appended to, files that are the same across multiple systems (Oh, yeah, I probably already have a copy of this kernel file from this other host). Basically, deduplication as part of rsync.
Exactly what I've been looking for recently. I dismissed ZFS (not mature on Linux) and btrfs (seems to be too complex and buggy, as seen in a few horror stories).
BorgBackup has most of what you're looking for, though it doesn't implement CDC and doesn't replicate the files as-is in the backup location (instead using a compressed/deduped/chunked storage format)
For single-system backups I've switched to using restic and it's been pretty great. I don't trust Borg: a couple of years ago I tried doing a recovery with it and ran into some unicode issue in, I believe, a filename, and I couldn't track down exactly which file it was or get any files restored after that file in the archive. I ended up using another backup I had.
For my multiple backups to a backup host where I'm using rsync, restic really doesn't work (having 100+ systems backed up to the same destination).
Any idea how it compares to aspera?
I may give benchmarking a shot later when I get to my computer, but if anyone has any intuition that could be helpful.
I skimmed through the readme, which explains the concepts quite well, but am unclear on what needs to be installed on each machine (assuming Windows as the source and Linux as the destination). There’s a mention of copying the Linux build output, cdc_rsync_server, to the Windows machine. Why is this needed? And is there something on the Linux machine that needs to be (newly) added in the PATH?
Just uncompress the binaries on the Windows machine and run cdc_rsync. The Linux component, cdc_rsync_server, is deployed automatically on first run. It is scp'ed to ~/.cache/cdc-file-transfer/bin. So nothing has to be installed on the Linux machine.
> At Stadia, game developers had access to Linux cloud instances to run games. Most developers wrote their games on Windows, though. Therefore, they needed a way to make them available on the remote Linux instance.
Am I reading this right that onboarding your game to Stadia as a developer involved essentially rsyncing data directly to a Linux cloud instance?
Interesting! I recently wrote about this in the context of the Google and Apple ecosystems and did not know it had a name. [1] I called it offline-first vs. online-first design.
I imagine that one could develop this further so that both ends are cross-platform, but the readme seems to really strongly state that it is solely built to transfer files from Windows to Linux. If that's all you need for an internal tool, that's all you build (and test).
Can I use this to replace some rsync cronjobs I have? I'd love to convert them to 'streaming' rather than updating every few minutes when the cron fires.
Yes, it was a contentious topic when I was working on The Division and its sequel, since it runs on a game engine that was built in house.
However, even though the games did not ship on Linux proper, the work that went in to supporting stadia had a profound effect on the games ability to work under WINE.
People often overlook Stadia as putting significant pressure on devs to support Linux, and give a lot of credit to Valve, but the reality is that both are responsible.
Yes, this was a core architectural requirement of Stadia (running the games on Linux), as opposed to hosting virtualized Windows machines (which would've allowed serving up Windows-based games).
Whether this specific design choice alone killed Stadia, no one can say -- but it was probably one of the most important factors in its demise.
For me, the two most important aspects were: 1) it's made by Google so it will probably be shut down soon, and 2) if you already have any hardware which can play games, Stadia is strictly more expensive than buying and playing games locally. That's before I even got to the point of considering the size of its game library or the technical qualities of the service.
I play all my games on Linux, and in general, there are extremely few games I haven't gotten to run. Many just work out of the box through Proton, games not on Steam usually work out of the box with Lutris. This suggests to me that the Linux thing wasn't necessarily a big issue, if Google was willing to put in some work to get existing Windows games to work. AFAIK, they didn't do this and instead required games to run natively on Linux, which would significantly reduce the size of their games library (after all, why would a company invest time into porting their games to a Google product that's going to be shut down?).
Whether the US is even a market that's ready for game streaming is another question. Are there enough people with a high quality, low latency Internet connection that's close enough to a Stadia data center, who are interested in playing demanding games, but don't have the money for hardware yet can afford a monthly subscription on top of a high per-game price? Maybe it would be interesting if it worked like other subscription services and the subscription itself gave you access to games. I think Spotify would have flopped if you had to buy albums at their retail price in addition to the monthly subscription.
Likely it was caused by your computer and browser not playing nice with x265 hardware decoding, I used to run the gamelet on the same network as my desktop and would get the same on some configurations.
Frustratingly, it was not easy to know at a glance that this was what was happening, but there were times (many times, actually) where my chromecast ultra at home was performing much better with Stadia than my gaming PC
I got the same when I was running Stadia on a 2.4 GHz wifi network. Once I switched to a wired connection, it worked like a charm. 5 GHz supposedly works as well, but my robot vacuum cleaner needs a 2.4 GHz network facepalm.
I have a PV inverter which doesn't work if there's more than one AP with the same SSID and is 2.4GHz only... ended up buying an el-cheapo AP and giving it a unique name. PITA
Shouldn't require any tweaking, the sliding window should be able to work on sqlite files just as well as the game content files used in this article. If you want something smarter and very much more sqlite optimized, you probably want to look at litefs. https://github.com/superfly/litefs
I haven't tested but I would guess it's a whole lot faster to compute but will end up sending much more data, given that it only works via exact chunks. A big part of binary patching is finding the largest possible chunks, even if they mismatch slightly.
Speaking of Courgette, though, I suggest looking into Zucchini which is faster and often produces smaller patches: https://news.ycombinator.com/item?id=29028534 (sorry for linking to my own comment but I haven't found any good explanation or benchmarks from Google)
If only it could go the other way. Everyone seems to forget about Windows/Windows Server: 80%+ desktop market share, and 55%+ server (including 70%+ on-prem and 50% cloud).
Note that cdc_rsync runs on Windows and syncs to Linux. rsync is a Linux-only tool where you'd have to jump through some hoops to make it work on Windows.
Maybe I’m projecting (probably) but I detect some subtext in the README’s language. Like a “developers who are irate that Stadia was shut down, wanting to free anything they can for the community” type feeling.
Aha! I wasn’t at all aware of that. Okay, this reads more like that to me second time around. Just excitement to open source a cool piece of tech that came from Stadia.
Apache Wave is still one-click installable on Sandstorm and occasionally hilarious to share instances of with people. Won't do it here though, would kill my poor Intel NUC, Wave is not exactly performance-friendly.
I'm currently working on a required CDC (Center for Disease Control) reporting function for a COVID test. For a second I thought this article was going to be extremely helpful.
The first computer I used professionaly was a Control Data Corporation (CDC) 6600. For a second I thought I could use this to transfer some of my old files.
Still, "syncing" is supposed to be shorthand for "synchronizing", which has the "h" you feel comes out of nowhere. So I guess both makes sense, but I don't use that form myself nor have seen anyone else use it in the wild.
They don't claim to have invented CDC, or FastCDC, they just made and are sharing a useful implementation of it.
And if that Tarsnap presentation is from 2013, and FastCDC was published in 2016 [1] according to Wikipedia [2], then presumably Tarsnap didn't invent FastCDC either.
Another well-cited predecessor is "A low-bandwidth network file system." (https://dl.acm.org/doi/abs/10.1145/502034.502052), which was published in 2001. It uses Rabin fingerprinting to define chunk boundaries.
There are many older and more venerable file transfer protocols, like FTP or TFTP. Add encryption and then you have SCP and SFTP. SFTP is the default protocol used by the scp command, and TFTP is still often used to communicate with routers and access points and similar equipment.
And I'm not sure what that has to do with TCP not being as reliable. TCP has real problems on the modern Internet, but the first part does not imply the second part.
That's not a problem with TCP, that's more of a problem with IP itself. Connections are currently identified by the source/destination IP and port numbers, but that should not have been the case. Imagine we use a UUID for the connection and keep an updatable cache between UUID and IP.
Still nothing to do with TCP.
Also, this kind of partial file transfer protocol is still needed because what if a host crashes? Even if we completely solve the mobile IP problem we would still rely on both ends of the connection tracking state. That state is lost due to a crash or power failure.
These chunked transfers are used to efficiently sync only the changed parts of the file. The actual network traffic is still over TCP. You're conflating several layers and abstractions.
Always interesting to see the sort of tangential projects that get rolled into the budget of larger projects at companies like Google. Funding the development of yet another file transfer utility, interesting as this one is, definitely wasn't necessary.
For devs it's only inconvenient and annoying, but for the devs' company the inefficiency means money.
At our company, almost all of us were working remotely during the pandemic (and many still are), and anything that slowed a developer down or made them inefficient was looked at and improved where possible. Because it's about money (though not only about money).
The acronym CDC meaning Content-Defined Chunking wasn't invented by Google I don't think. Here's the 2016 paper introducing the FastCDC algorithm[1]. Disclosure: I work at Google, but not on anything related to this.