Looks really good. I am pleased more projects are adding Google Cloud Drive support now. What I really want to do is:
- create documents on my Mac which autosync to Cloud Drive in encrypted format (this should tick that box)
- be able to access said documents on any device including iOS, which transparently handles the encryption
The use case is I now scan all my documents into PDF format, but keeping them secure and accessing them on iOS seem to be almost mutually exclusive.
I looked at some other solutions for this which had their own iOS app and security mechanism (Boxcryptor mainly) and I didn't like it - I just didn't feel in control. And I got nervous about what happens if Boxcryptor goes under; I don't want to rely on them keeping their app up-to-date to read my documents.
I know Apple will never allow it, but wouldn't it be nice to be able to mount your own network drive which all apps could access?
Apple's iCloud Drive has a decent web interface for the non-Apple subset of "any device", and otherwise seems to offer what you want. Curious if you had other reasons it doesn't meet your needs.
Unfortunately, it appears that binary diffs are not supported.
This is a really important aspect for many workflows dealing with large files (like TrueCrypt containers). Contrary to what is stated by the rclone developer [1], at least Dropbox supports binary diffs [2].
For the curious, the rsync algorithm is, at its core, ridiculously small[0] and involves rolling checksums. I just keep wondering why nobody (but Dropbox) bothers to implement it.
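For anyone who wants to see the trick concretely: the "rolling" part means the weak checksum of a window can be slid one byte at a time in O(1), which is what lets the receiver find matching blocks at arbitrary offsets. Here's a toy Go sketch of just that weak checksum (the window size, modulus and names are my own illustrative choices, not librsync's or rclone's code):

    package main

    import "fmt"

    const M = 1 << 16

    // weakSum computes the rsync-style weak checksum of a block from scratch.
    func weakSum(block []byte) (a, b uint32) {
        n := uint32(len(block))
        for i, x := range block {
            a += uint32(x)
            b += (n - uint32(i)) * uint32(x)
        }
        return a % M, b % M
    }

    // roll slides the window one byte to the right in O(1): drop `out`
    // (the byte leaving the window) and add `in` (the byte entering it).
    func roll(a, b uint32, out, in byte, blockLen int) (uint32, uint32) {
        aNew := (a + M - uint32(out) + uint32(in)) % M
        bNew := (b + M - (uint32(blockLen)*uint32(out))%M + aNew) % M
        return aNew, bNew
    }

    func main() {
        data := []byte("the quick brown fox jumps over the lazy dog")
        const block = 8
        a, b := weakSum(data[:block])
        for i := block; i < len(data); i++ {
            a, b = roll(a, b, data[i-block], data[i], block)
            // a, b now equal weakSum(data[i-block+1 : i+1]), without rescanning.
        }
        fmt.Printf("weak sum of last window: %#08x\n", a|b<<16)
    }

The real algorithm pairs this cheap weak sum with a stronger hash to confirm candidate block matches before reusing data the receiver already has.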
That's a really nice resource, thanks for the pointer.
The ideas behind it are actually used in multiple places through the librsync project (https://github.com/librsync/librsync). rdiff-backup and duplicity are two well-known examples, and it surely appears in other projects as well.
It could be done in different ways, but as the FAQ says, rclone wants to keep a 1:1 file/object mapping, which is useful in lots of cases, but not if you want to use rclone as a backup tool.
Binary diffs would be useful if you're syncing big files that change often, but then you run into the limitations of the object storage implementations and (most of) their APIs. I'm not familiar with all the storage backends, but in the case of OpenStack Object Storage (swift), you can't upload part of an object.
You could split each file in chunks and then store those chunks, but the 1:1 mapping wouldn't be there (meaning that you need rclone to get your files back).
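To make that trade-off concrete, here's a purely illustrative Go sketch of chunked storage (the naming scheme and chunk size are made up, and rclone deliberately does not do this): once a file becomes N opaque objects, only a tool that knows the scheme can put it back together.

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    const chunkSize = 8 << 20 // 8 MiB per object -- an arbitrary choice

    // splitIntoChunks writes <path>.chunk-000, .chunk-001, ... locally; in a
    // real tool each chunk would become a separate remote object.
    func splitIntoChunks(path string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()

        buf := make([]byte, chunkSize)
        for i := 0; ; i++ {
            n, err := io.ReadFull(f, buf)
            if n > 0 {
                name := fmt.Sprintf("%s.chunk-%03d", path, i)
                if werr := os.WriteFile(name, buf[:n], 0o644); werr != nil {
                    return werr
                }
            }
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                return nil // last (possibly short) chunk written
            }
            if err != nil {
                return err
            }
        }
    }

    func main() {
        // "example.pdf" is just a placeholder path for the sketch.
        if err := splitIntoChunks("example.pdf"); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }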
And it seems to store each file as one object. This should lead to a lot of overhead and additional costs (e.g. for S3 API requests) with a lot of small files.
FWIW: tarsnap is also rsync for cloud storage, and Colin (the guy who founded and runs tarsnap) also won a Putnam award for his work in mathematics and crypto.
>> Colin (the guy who founded and runs tarsnap) also won a Putnam award for his work in mathematics and crypto.
Colin won the Putnam as an undergraduate student. The Putnam award is not a mathematics research award like the Fields Medal or the Abel Prize. It's a mathematics competition. As such, Colin didn't win the award for any particular work.
It's still quite impressive though. I would say Colin's work developing scrypt has more applicability to cryptography than his Putnam award.
I don't think they're comparable. Rclone is rsync for a variety of cloud storage providers, while tarsnap is rsync for the tarsnap service.
That's like saying the Google Drive client or the Dropbox client are "rsync for cloud storage" just because they incrementally add changes to a cloud service.
What happens if a single byte changes? Now the encrypted output looks 100% unique, and you have to treat it as an entirely new file. You lose the ability to do diffs on a file-by-file basis or proper deduplication across all files.
If you don't use CBC, whoever is on the cloud side will have two files that differ by only one block. Let's assume the file is a .txt file. For smaller cipher block sizes, it becomes very easy to guess your encryption key.
TL;DR: you want any attacker to lose any ability to diff your encrypted data.
That depends on how you're storing the files. I was really just trying to highlight that for deduplication across files you need to deduplicate before you encrypt.
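A minimal Go sketch of what "deduplicate before you encrypt" means in practice (all names here are illustrative, and this is not how rclone's crypt backend works): chunks are keyed by a hash of their plaintext, so identical chunks collapse to one stored object, whereas hashing the ciphertext of a randomized encryption would never produce a match.

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    type store struct {
        objects map[string][]byte // key: hash of plaintext, value: ciphertext
    }

    // put deduplicates on the plaintext hash, then encrypts for storage. If we
    // encrypted first (with a fresh random nonce each time), identical chunks
    // would produce different ciphertexts and the dedup check would never hit.
    // Note: exposing plaintext hashes tells the server which chunks are equal;
    // real systems weigh that leak against the dedup savings.
    func (s *store) put(plain []byte, encrypt func([]byte) []byte) string {
        sum := sha256.Sum256(plain)
        key := hex.EncodeToString(sum[:])
        if _, exists := s.objects[key]; !exists {
            s.objects[key] = encrypt(plain)
        }
        return key
    }

    func main() {
        s := &store{objects: map[string][]byte{}}
        fakeEncrypt := func(p []byte) []byte { return append([]byte("enc:"), p...) }
        k1 := s.put([]byte("same chunk"), fakeEncrypt)
        k2 := s.put([]byte("same chunk"), fakeEncrypt)
        fmt.Println(k1 == k2, len(s.objects)) // true 1: two puts, one stored object
    }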
It is not free as in freedom, and it's not even "open source". The license does not allow modification, nor use outside of tarsnap's backup service.
This fills a real need for me. It does nearly everything I want.
Aside from the program itself, your documentation is really good, and special +1 for documenting the crypto thoroughly (and another +1 for using NaCl's building blocks in a safe way).
As a related point, I recently bought a Chromebook (still unopened), which pushes you heavily towards storing your files in Google Drive. It makes me uneasy to store certain things unencrypted, so I'll investigate writing a compatible implementation for ChromeOS.
Some of the supported providers (e.g. Amazon Cloud Drive) have a reputation for days-long service outages. Some users of Amazon Cloud Drive have even reported files going missing on occasion.
But the great thing with git-annex is you can have your data on multiple clouds (in addition to being on your own equipment), so partial or complete loss of a cloud provider does not need to result in availability or durability issues.
I'm trying to find a recipe for Rclone to connect to rsync.net since it won't work over the usual path (rsync over SSH) ... but we do support git-annex so ...
Rclone + git-annex, over SSH, to rsync.net? Want a free account to try it?
It does seem to use S3 behind the scenes, based on URLs I've seen. The data loss incidents I've seen reported have tended to come after outages. Amazon Cloud Drive seems to keep an index mapping Amazon Cloud Drive filenames to S3 objects, and I suspect the index entry was corrupted/lost/rolled back. While the object likely still existed in S3 somewhere, that doesn't do much good if users don't have any way of accessing it.
s3s3mirror [0] is another tool for copying data between S3 buckets or the local filesystem. Full disclosure: I am the author.
At the time I wrote it, I only needed to work with AWS and needed something very fast to copy huge amounts of data. It works like a champ, but I do think about what it would take to make it cloud-independent; it wouldn't be easy to maintain the performance, that's for sure.
For S3 and the local filesystem, couldn't you also install aws-cli and do a sync between the two? I do that for my stuff. Any reason you wrote this instead of using aws-cli?
aws-cli only really got robust S3 support in the last 18 months or so. For a long time it couldn't handle multipart uploads and didn't support a bunch of corner cases (versioning, Glacier, cross-account, etc.).
"I started with "s3cmd sync" but found that with buckets containing many thousands of objects, it was incredibly slow to start and consumed massive amounts of memory. So I designed s3s3mirror to start copying immediately with an intelligently chosen "chunk size" and to operate in a highly-threaded, streaming fashion, so memory requirements are much lower.
Running with 100 threads, I found the gating factor to be how fast I could list items from the source bucket (!?!) Which makes me wonder if there is any way to do this faster. I'm sure there must be, but this is pretty damn fast."
This approach works well enough for relatively small numbers of objects. Once you start getting into the millions (and significantly higher), it begins to break down. Every "sync" operation has to start from scratch, comparing source and target (possibly through an index) on a file-by-file basis. There are definitely faster ways of doing it that scale to much larger object counts, but they have their own drawbacks.
It's a shame the S3 API doesn't let you order by modified date, or this would be trivial to do efficiently.
I'm curious if you can share how to synchronize N files without doing at least N comparisons.
The main innovations in s3s3mirror are (1) understanding this and going for massive parallelism to speed things up, and (2) where possible, comparing etag/metadata instead of all bytes.
So far it has scaled pretty well; I know of no faster tool to synchronize buckets with millions of objects.
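For readers following along, a rough Go sketch of that approach (this is my own simplification, not s3s3mirror's code; the obj type and the copy callback stand in for real S3 listing/copy calls): workers compare ETag and size in parallel and only copy what's missing or changed, while feeding the work queue is still bounded by how fast the source bucket can be listed.

    package main

    import (
        "fmt"
        "sync"
    )

    // obj stands in for the metadata an S3 listing returns for each key.
    type obj struct {
        Key, ETag string
        Size      int64
    }

    // mirror copies an object only when it is missing at the destination or
    // its ETag/size differ, fanning the checks out over `workers` goroutines.
    func mirror(src []obj, dest map[string]obj, copyObj func(obj), workers int) {
        keys := make(chan obj)
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for o := range keys {
                    d, ok := dest[o.Key]
                    if !ok || d.ETag != o.ETag || d.Size != o.Size {
                        copyObj(o) // changed or missing: copy it
                    }
                }
            }()
        }
        // Feeding this channel is bounded by how fast the source bucket can
        // be listed -- the serial bottleneck mentioned above.
        for _, o := range src {
            keys <- o
        }
        close(keys)
        wg.Wait()
    }

    func main() {
        src := []obj{{Key: "a", ETag: "etag-1", Size: 3}, {Key: "b", ETag: "etag-2", Size: 4}}
        dest := map[string]obj{"a": {Key: "a", ETag: "etag-1", Size: 3}} // "a" is unchanged
        mirror(src, dest, func(o obj) { fmt.Println("copy", o.Key) }, 4)
    }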
Sorry, I should have perhaps put a disclaimer in my original comment. I work for a company called StorReduce and built our replication feature* (an intelligent, continuous "sync" effectively). We currently have a patent pending for our method, so I'm not sure if I can offer any real insight unfortunately.
I haven't looked at your project, but based on what you've said I agree the way you're doing it is conceptually as fast as it can be (massively parallel and leveraging metadata) whilst being a general purpose tool that "just works" and has no external dependencies or constraints.
rclone operates like a limited rsync between different cloud providers (and the local filesystem too), so it can either copy files or mirror two directories; duplicity does incremental backups.
I read in duplicity's man page that "In order to determine which files have been deleted, and to calculate diffs for changed files, duplicity needs to process information about previous sessions. It stores this information in the form of tarfiles where each entry’s data contains the signature (as produced by rdiff) of the file instead of the file’s contents."
Where are these tarfiles stored? The cloud?
BTW rclone checks size/timestamp and/or checksum to determine what to upload, the same way rsync does. So you don't have incremental "snapshots" the way duplicity does.
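As a hedged illustration of that per-file check (this mirrors the idea described above, not rclone's actual implementation; the fileInfo type and field names are mine):

    package main

    import (
        "fmt"
        "time"
    )

    // fileInfo stands in for the metadata available locally and remotely.
    type fileInfo struct {
        Size    int64
        ModTime time.Time
        MD5     string // empty when the backend can't provide a checksum
    }

    // needsUpload reports whether the local file should be copied: missing or
    // different size means yes; otherwise compare checksums when both sides
    // have one, falling back to modification times.
    func needsUpload(local, remote *fileInfo, checkHash bool) bool {
        if remote == nil {
            return true
        }
        if local.Size != remote.Size {
            return true
        }
        if checkHash && local.MD5 != "" && remote.MD5 != "" {
            return local.MD5 != remote.MD5
        }
        return !local.ModTime.Equal(remote.ModTime)
    }

    func main() {
        now := time.Now()
        local := &fileInfo{Size: 10, ModTime: now, MD5: "abc"}
        remote := &fileInfo{Size: 10, ModTime: now, MD5: "abc"}
        fmt.Println(needsUpload(local, remote, true)) // false: nothing to re-upload
    }

Either way, the whole file gets re-uploaded when it differs; there is no diff-and-patch step like duplicity's.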
Yes, and these tarfiles are in a difficult format all of their own, have to be pruned, and are the source of many mysterious bugs. I've even had to write a "synthetic backup" script of my own to automatically turn a large list of "incremental" backups into a "full" backup every night. duplicity has been doing the thing I need, but I spent weeks writing tools around it. Even though duplicity is written in Python, it invents a lot of its own Python installation idioms and is not organized to provide a usable API, so scripting for it pretty much means you have to exec it.
"encrypted rsync to S3" is all I ever wanted in the first place so very much hoping this can replace it.
This looks awesome. I've made several attempts at something that could write encrypted files with obfuscated file names to several backends but never ended up with something I was happy with.
I'll definitely give this a try.
Edit: One feature I would like would be to split files into n chunks to obfuscate the length of files (assuming it wasn't obvious which chunks go together to make up a file), so instead of a 1:1 relationship there was a 1:n relationship for large files. I suspect this is a lot more work though...
Looks promising, but I'm not sure about the crypto part. Can someone give some notes about the security of NaCl Secretbox using Poly1305 as authenticator and XSalsa20 for encryption?
Is it justified to assume that this is adequate crypto as long as the nonces are chosen correctly (= as random as possible) and the key size is bigger than 128 bits (rclone uses a 256-bit key derived from the user's password)?
> Can someone give some notes about the security of NaCl Secretbox using Poly1305 as authenticator and XSalsa20 for encryption?
(Speaking as an unqualified outsider) Both Poly1305 and Salsa20 are creations of Daniel Bernstein / djb, who seems about as highly respected as you can be in the crypto community. And NaCl, the library they use that implements them (also by djb), is often highly recommended as a 'good' crypto library to use.
That said, it does go against the usual advice not to trust code from people who make their own encryption rather than using existing standards, but maybe this is the exception?
There was an article recently with some good commentary about how uneasy some people feel with how much of the modern crypto used in production comes from relatively few people, including djb, but I can't seem to find it now...
Great summary. The 'don't roll your own crypto' argument is mostly just shorthand for 'defer to the opinion of experts, use ready-made constructs when possible, and if not, exercise caution when hooking crypto primitives together in unproven ways'. djb is without a doubt a crypto expert, and his NaCl library provides sane defaults and good interfaces for implementing crypto in your application.
The other relevant tptacek post is 'Cryptographic Right Answers' [1], which suggests using the NaCl default for encrypting (i.e. Secretbox [2][3]), so the rclone author is deferring entirely to NaCl for crypto, as recommended.
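For the curious, here is roughly what that construction looks like in Go with the x/crypto packages. It's a minimal sketch, not rclone's code: rclone's crypto docs describe deriving the key from the password with scrypt, but the scrypt parameters, salt handling and nonce management here are simplified assumptions.

    package main

    import (
        "crypto/rand"
        "fmt"

        "golang.org/x/crypto/nacl/secretbox"
        "golang.org/x/crypto/scrypt"
    )

    func seal(passphrase, salt, plaintext []byte) ([]byte, error) {
        // Derive a 256-bit key from the passphrase (parameters are common
        // defaults, not necessarily rclone's).
        k, err := scrypt.Key(passphrase, salt, 32768, 8, 1, 32)
        if err != nil {
            return nil, err
        }
        var key [32]byte
        copy(key[:], k)

        // A fresh random 24-byte nonce per message; it must never repeat
        // under the same key.
        var nonce [24]byte
        if _, err := rand.Read(nonce[:]); err != nil {
            return nil, err
        }

        // Prepend the nonce so the receiver can open the box; secretbox is
        // XSalsa20 for encryption plus a Poly1305 authentication tag.
        return secretbox.Seal(nonce[:], plaintext, &nonce, &key), nil
    }

    func main() {
        box, err := seal([]byte("correct horse"), []byte("per-remote salt"), []byte("hello"))
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d bytes: 24 nonce + 5 plaintext + 16 Poly1305 tag\n", len(box))
    }

The important invariant is the one raised above: under a given key a nonce must never be reused, and XSalsa20's 24-byte nonce is large enough that picking it at random is considered safe.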
Neat. I wrote my own hacky little Python app to upload to dropbox, but they recently broke that with changes to the dropbox python library. I hadn't bothered to fix it :)
I'll check this out instead - thanks for sharing OP.
No, it doesn't. Only timestamps on files are preserved.
It could store that information as metadata, like it does with the timestamps, and then restore it, if supported by the destination filesystem, when copying/syncing files back.
I'm not sure that syncing to cloud is really the best for most personal users, at least not anymore. If you have multiple devices and use SyncThing / etc to sync between them, you're protected against device loss and damage without having to put your personal files on a server controlled by someone other than yourself.
I was about to install it anyway, but I saw that it doesn't have bi-directional sync. If three people at work shared a Google Drive folder and they all tried to sync to it, it sounds like whoever synced last would always win, and it could potentially delete/alter some files.
I'm considering using this Google service to back up an S3 bucket to Google Cloud.
Does anyone have experience for how fast this transfers and is there any info about how efficient the service is in terms of API calls? With a bucket with millions of objects, needless extra calls to List or Get can really add up to a ton of money.
Has anyone had success with Amazon Drive? 60 USD for unlimited storage, or just 12 USD for unlimited storage using steganography, is hard to beat. If it works better for backup than Backblaze's or Crashplan's terrible clients and horrid performance, it would be a good alternative.
When you sell "Securely store all of your photos, videos, files and documents" in a world where drives holding multiple TB of files have been available at low prices for years, then using hundreds of GB certainly isn't abuse in my book.
Are we talking temporary throttling after having transferred hundreds of GB in a short time span (hours? days? do you know how fast they allow you to upload?), or throttling more or less forever once you store just hundreds of GB?
> Are we talking temporary throttling after having transferred hundreds of GB in a short time span (hours? days? do you know how fast they allow you to upload?), or throttling more or less forever once you store just hundreds of GB?
The former, from the people I've heard who run into it. These people are generally uploading tens of terabytes however.
Have you tried getting your data back out? Getting my wife's files back out of Crashplan was so bad we eventually settled for only getting the stuff that was absolutely vital.
I had ~4TB on a dedicated Gbit connection (sitting in a DC, not Google Fiber or something) and was averaging ~400GB/day, which is ~40Mbit. This was through duplicati though, I'm trying it now with rclone to see if it's a bottleneck elsewhere.
I've seen others saturate 300Mbit connections so it might be on my end.
EDIT: Just ran with rclone on the same server and I get about 30Mbit/s with small files (even with --transfers=16). With a single large file, I'm getting ~250Mbit/s. I think my issues in the past have been due to file creation overhead.
EDIT2: Ran again with numerous large files and --transfers=16 and I'm getting ~900Mbit. It seems the bottleneck is the API calls rather than the upload bandwidth.
I had forgotten that with Prime you get free unlimited photo storage (a lookalike of the $12/year tier) until a sibling comment on the OP reminded me.
Since then I've done bursts of uploading with rclone from two different networks, first a slower one, and then a faster one. Amazon throttles you pretty aggressively, and rclone occasionally just appears to pause indefinitely (on Win64) as it's waiting for the pacer or a response. It's a bit annoying and confusing and I'm not sure how to proceed other than to kill the process.
Luckily it's nondestructive to just start it again with the same args. Though low-tech, a wrapper script around it could automate that away too.
I genuinely don't understand the use case for this since, for example, Dropbox already syncs just the changes you make to a file rather than the whole file, does it automatically, and does so bidirectionally, which this tool does not. So, if anyone can help state more clearly what this adds over and above the features that the various cloud storage vendors already provide, I would benefit from the explanation.
I use it to back up a Linux server to Amazon Drive via a cron job. I don't think Amazon Drive even supports Linux. Also, if I decide to use Backblaze or Dropbox in the future, it's the same command to sync the files.
Arguably the most distinguishing benefit is transparent en-/decryption of the contents of your sync set, such that the cloud copy is always encrypted with your key, which is available only at your endpoints, as opposed to being encrypted with a vendor-controlled key.
EDIT: Also, most 'official' cloud drive clients place a folder into your homedir, like ~/Google Drive or ~/OneDrive or ~/Box Sync. Only files placed into these folders get synced up and down. This client allows arbitrary local paths to be synced up or down.
I have used it with a cheap VPS (OVH; I will test it soon with Scaleway) and it worked fine transferring data between Google Drive and Amazon Drive. ;)
PS: I have not tested it with encrypted files, as that option was added only a few weeks ago.
PPS: see also the reddit DataHoarder board for examples. ;)
Does rclone support multi-threading/multi-processing in a similar way to how gsutil supports it via the -m option? As a clarification, I'm referring to an equivalent of "gsutil -m rsync ..."
I haven't been able to find anything in the documentation mentioning this.
I have used this in production for a year, installed on a Synology NAS to back up to OVH storage. Please get the GitHub version, as the download on the website is quite different.
Does anyone else work mainly with Linux but use Google Drive?
95% of stuff I work with is Linux but that last 5% is done in Windows for work. I use Google Drive but the lack of syncing is really annoying. I also have a NAS that runs Linux that I would love to use to sync my GDrive/Amazon Drive to.
I've been brainstorming ideas, including but not limited to seeing if I could use W10 IoT on the RPi and install Drive on there (pretty sure it's impossible).
It boggles my mind that there isn't an elegant solution to this that doesn't require me to pay for a service.
I came across https://owncloud.org/ yesterday. I'm going to sound like a shill for bringing it up in a couple of different threads, but I haven't used it so can't recommend it.
Instead of thinking of a solution that is cross-platform and works on your walled-garden devices, you want to bring a walled-garden solution to your other platforms?