Out-Tridging Tridge by improving rsync

chetanahuja · on Aug 28, 2013

ok this article had me intrigued (and the fix presented to speedup rsync for large files is good) but then, this gem:

"The thing is, modern networks aren’t like that at all, they’re high bandwidth and low latency"

No they're not. Unless you mean two machines sitting side by side in a data center or perhaps within the same metro area connected via wired connections.

Two machines sitting on either coast of US with the best wired connectivity at either end are at least 70-80 ms away from each other. That's not "low" latency.

Connected via anything less than a perfect connection on either end, now you're looking at 100+ ms latency.

One of the ends is on non-wired internet connection (LTE, Wimax etc.) and now you're looking at 150+ ms and with high standard deviation (in ping latencies).

Slow pace of latency improvements is as much a fact of life as the slow pace of battery life improvements. Perhaps because both are constrained by hard physical limits of nature.

mcpherrinm · on Aug 28, 2013

In the article, the author mentions sending "1.2 terabytes in a few hours" with 10ms latency. This sounds like a gigabit network with a few router hops in the middle. Maybe 100mbit if "a few hours" is interpreted longer than I would. So we're talking about connecting two machines, possibly in different buildings, but likely within a thousand kilometers.

It's honestly the use case I have the most often, and certainly one useful to have tools supporting the use case. This person seems to be worried about off-site backups, so I'm thinking "enterprise", not "cross-country home user".

bigiain · on Aug 28, 2013

And for some of us, this is a typical experience:

  25 packets transmitted, 25 packets received, 0.0% packet loss
  round-trip min/avg/max/stddev = 263.361/321.076/375.393/29.289 ms

(That's from Sydney Australia to a DigitalOcean Droplet in nyc1)

kbenson · on Aug 28, 2013

Although in some cases being able to assume low latency and use the appropriate changes is appropriate. Rsync is used quite a bit to sync data on servers within the same datacenter, or even cabinet/switch.

kbenson · on Aug 28, 2013

does copying/moving a file over another not trigger copy-on-write in btrfs? If not, it seems a much simpler (but much less cool and useful for all) solution would be to patch rsync with an option to allow writing the temp file over the original when done. While still non-atomic, you'll get the copy-on-write semantics you need. Unfortunately it will use much more IO. There are ways to mitigate the extra IO, such as creating a special diff formatted temp file, but that getting out of the territory of "simple".

Also, in case the FS is saving copies of the temp file and you don't like that, the --temp-dir option might help with that.

Then again, depending on how btrfs treats overwriting a file with a move, if the temp file had a timestamp in the name (patch probably needed) in some form before replacing the original, that might be good enough.

keeperofdakeys · on Aug 28, 2013

Most Unix file systems don't have the semantics of a "move". You unlink an inode from a filename, and link another inode (usually the inode for the tmp file). Then you unlink the original tmp filename. As far as btrfs is concerned, there is no relation between these inodes, and without copying the file (like you suggest), you can't improve this.

gmac · on Aug 28, 2013

Sounds like a worthwhile improvement to rsync, but I wonder why this setup is preferred to duplicity [1] or rdiff-backup [2], which both also use rsync (librsync) to perform incremental backups. I've had good experiences with duplicity in particular.

[1] http://duplicity.nongnu.org/

[2] http://www.nongnu.org/rdiff-backup/

fhars · on Aug 28, 2013

rdiff-bachup will give you copies of the changed files (so 1.2TiB for the database file mentioned in the article) every time you run a backup for each old version you keep. It will not transfer that much data over the network (since it uses the rsync algorithm), but it stores that much on disk.

On the other hand, if you use a filesystem with copy-on-write snapshots and in-place modification of the changed files, you will only use as much disk space as there are changed blocks in the file for each version you keep. (Of course you have no additional redundancy if you keep n older version, as each bit of data is only stored once physically. But you only ever store one version of an unchanged file in the rdiff-backup scenario either, so you should alternate between different backup disks anyway.)

kbuck · on Aug 29, 2013

This isn't entirely correct; rdiff-backup will give you a full copy of the latest version of the file as well as a set of binary diffs that can be applied in sequence to roll it back to an earlier version. rdiff-backup will actually end up being a little more space efficient for each incremental change since its diffs don't need to store entire filesystem blocks.

dylangs1030 · on Aug 28, 2013

I love rsync. It's also really useful for backing up local files incrementally. It's an easy CLI utility, but there's a GUI as well, called grsync.

You can use cron to automate them, which is how I backup my linux system nightly. Totally recommend it for server and local backups.

koenigdavidmj · on Aug 28, 2013

One of the useful modes for backups is to take a parameter (--link-dest) specifying a directory containing a previous backup. It will build hard links to the previous backup directory for files that did not change.

extra88 · on Aug 28, 2013

I use rsnapshot to automate this hard link creation and backup rotations hourly, daily, weekly, monthly. You still use cron or something else (I've used Launchd on OS X systems) to schedule the runs but rsnapshot takes care of the rest.

huwr · on Aug 29, 2013

Forgive me for not understanding, if btfs is cheap for copying files why do they want to use --in-place?