
I̶ ̶d̶o̶n̶'̶t̶ ̶t̶h̶i̶n̶k̶ ̶r̶s̶y̶n̶c̶ ̶u̶s̶e̶s̶ ̶f̶i̶x̶e̶d̶ ̶s̶i̶z̶e̶d̶ ̶c̶h̶u̶n̶k̶s̶.̶ The algorithm is described here; it's a rolling hash.

https://rsync.samba.org/tech_report/node3.html

Your description of content-defined chunking is exactly right, though. There are a number of techniques for doing it. FastCDC is one of them, although it is not the one used in rsync.

https://en.wikipedia.org/wiki/Rolling_hash

EDIT: Corrected in the comments below. rsync uses fixed-size chunks, which it searches for at any offset using a rolling hash. The rsync algorithm description is here.

https://rsync.samba.org/tech_report/node2.html




rsync does use fixed-size chunks, but the rolling hash allows them to be identified even at offsets that are not a multiple of the chunk size.

So a change partway through the file doesn't force rsync to actually re-transfer all of the subsequent unmodified chunks, but it does incur a computational cost to find them since it has to search through all possible offsets.
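
To make that concrete, here is a rough Python sketch of the idea, not rsync's actual code. It assumes the weak checksum from the tech report (two 16-bit sums), uses SHA-256 as a stand-in for the strong hash, and the names weak_checksum, roll and find_matches are made up for illustration. The old file is cut into fixed-size blocks, and the new file is scanned one byte at a time with the rolling checksum.

```python
import hashlib

M = 1 << 16  # modulus for the two 16-bit sums, as in the tech report


def weak_checksum(block):
    """Compute the (a, b) weak checksum pair of a block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def find_matches(old, new, block_len):
    """Return (new_offset, old_offset) pairs for blocks of old found in new."""
    # Index the old data in fixed-size blocks, keyed by weak checksum.
    index = {}
    for start in range(0, len(old) - block_len + 1, block_len):
        block = old[start:start + block_len]
        strong = hashlib.sha256(block).digest()  # stand-in strong hash
        index.setdefault(weak_checksum(block), []).append((strong, start))

    matches = []
    if len(new) < block_len:
        return matches

    pos = 0
    a, b = weak_checksum(new[:block_len])
    while True:
        hit = None
        candidates = index.get((a, b))
        if candidates:
            # Confirm with the strong hash only when the weak checksum matches.
            strong = hashlib.sha256(new[pos:pos + block_len]).digest()
            for digest, old_start in candidates:
                if digest == strong:
                    hit = old_start
                    break
        if hit is not None:
            matches.append((pos, hit))
            pos += block_len  # skip past the matched block
            if pos + block_len > len(new):
                break
            a, b = weak_checksum(new[pos:pos + block_len])
        else:
            if pos + block_len >= len(new):
                break
            # No match here: advance by a single byte, not a whole block.
            a, b = roll(a, b, new[pos], new[pos + block_len], block_len)
            pos += 1
    return matches
```

With a 4-byte block size, inserting two bytes after the first block of a 16-byte input still lets the remaining three blocks be found, just shifted by two in the new data. The cost is the byte-by-byte scan whenever the window does not match.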



