Tracking Changes in Directories with Python

yummyfajitas · on Dec 13, 2013

I can't tell exactly what the goal is, but inotify might also be a simpler solution.

(Specifically, if the goal is to monitor changes as they happen and the service can be assumed to be continually running.)

arethuza · on Dec 13, 2013

And the command line inotifywait - which I've used from Python to monitor changes to directories. There is even a Windows port:

https://github.com/thekid/inotify-win

icebraining · on Dec 13, 2013

pyinotify is also pretty simple to use: https://github.com/seb-m/pyinotify

arethuza · on Dec 13, 2013

Thanks, I was after something that would work from Python on Linux and Windows and calling out to inotifywait seemed to work well for that - this is for my home grown RESTful Dropbox-lite application which is sitting at 95% complete...

rytis · on Dec 13, 2013

Or even use incrond [0], and get it call your script that does whatever needs to be done with the modified files.

[0] http://linux.die.net/man/8/incrond

EDIT: As a use case example I use this to detect when a new cert request is made to my Puppet master server. Once a CSR file is created, I check if the host is created by our provisioning system, and if so, then sign it.

EDIT2: This will eventually go into 3.4.0 - https://projects.puppetlabs.com/issues/7244

nodata · on Dec 13, 2013

How does the puppet master get the csr unless the host has been provisioned?

rytis · on Dec 18, 2013

The host is provisioned using some other automation tool, then when it comes up, it runs puppet agent. Puppet agent makes cert sign request (at which point csr file is created on puppet server).

Puppet server doesn't have this host in autosign.conf, so doesn't automatically sign the cert. And this is where incron kicks in and runs my script. The script queries the provisioning tool, sees that the host is in there, and runs puppet cert sign command.

tsileo · on Dec 13, 2013

The service is not continually running, I use this method to make incremental backups with archives stored on AWS Glacier and meta-data stored on S3 (the index is stored on S3, and I can't access files on Glacier to compute deltas).

cmsd2 · on Dec 13, 2013

I've been thinking about the best way to do this, and I too didn't want to rely on having a backup script running all the time.

Windows seems to have a low-level api for querying the filesystem/vfs for changes since you last looked: http://msdn.microsoft.com/en-us/library/windows/desktop/aa36...

And BTRFS has some ability to do this with find-new: http://www.tummy.com/blogs/2010/11/01/fun-with-btrfs-what-fi...

It's nice that btrfs has these new interesting features, also see send/receive, but it's not available in the vfs, and I suppose never will be.

xradionut · on Dec 13, 2013

The Windows solution would seem to only work if you have admin access to the specific server or the rights to access the journal.

aray · on Dec 13, 2013

I'm not sure what you mean by continually running, but inotifywait is basically just waiting on an event. As long as the process sticks around it doesn't have to do anything until it gets an inotify.

tsileo · on Dec 13, 2013

I don't want to have a running process just for this, I want to be able to take a "snapshot"/"index", store some data on Glacier, and the index on S3, and later, given the index, be able to compute deltas without accessing the full archive store on Glacier.

blueblob · on Dec 13, 2013

Is there a reason that you want it in python? Couldn't you do this with diff -q if you only want to know what data has changed? and then use diff to get the deltas?

johtso · on Dec 13, 2013

The watchdog library is great for this. It comes with an API and a command line tool: https://github.com/gorakhargosh/watchdog

Compatibility:

  Linux 2.6 (inotify)
  Mac OS X (FSEvents, kqueue)
  FreeBSD/BSD (kqueue)
  Windows (ReadDirectoryChangesW with I/O completion ports; ReadDirectoryChangesW worker threads)
  OS-independent (polling the disk for directory snapshots and comparing them periodically; slow and not recommended)

netnichols · on Dec 13, 2013

Since he asks for feedback at the end...

Using sha256 just to compute changes is probably overkill. Using md5 instead is almost certainly adequate and will be a good deal faster.

pudquick · on Dec 13, 2013

> good deal faster

You could always try it and see :)

Example test: openssl speed md5 sha256

As for why it was chosen, I think it's because there are known examples of MD5 hash collisions (though the likelihood of it on a filesystem is remote) and likely SHA-1 was skipped because it's considered 'likely' a collision could be created (though so far only with weakened versions of SHA-1).

But - all this to say: The chances of having two files with the same MD5 hash that are identical in size is vanishingly small. As such, for the known MD5 collision mechanisms, the different file size would be enough evidence something has changed.

... Why he didn't include file size in the metadata check, I can't tell you. Timestamps can be faked - but generating a hash collision with a file of equal size is a Hard problem.

dded · on Dec 13, 2013

>> good deal faster

>> You could always try it and see :)

>> Example test: openssl speed md5 sha256

When I was looking for a fast hash easily called from Python, I settled on adler32 (as the fastest) after some trial-and-error on my files. I don't now recall all the utilities/functions that I tried, but they certainly included md5, sha1, and crc32. I only needed to test for accidental corruption, and only computed the hashes if meta data matched.

tsileo · on Dec 13, 2013

Thanks for the feedback!

I haven't thought about the filesize/hash to reduce collision, but I chose to stick with the last-modified time in the article, because it can takes hours computing hashes for a big directory tree.

Tools like rsync relies on last-modified time by default, and since I want to use this to track my own files, I won't fake it, so I think it's not a big deal?

netnichols · on Dec 13, 2013

It's not just that it could be faked, it could be an accident that you modify a file but the file modification date is not changed. For example, say you edit a photo, but later you run a script that sets the file's modification date to the EXIF data in the photo.

So I guess the point is that also including the file size will be one more (fast) data point to help ensure 'accurate' change tracking, without adding the overhead of computing content hashes.

tsileo · on Dec 13, 2013

I think I will add the file size to the index, since it's really cheap.

Thanks!

olavgg · on Dec 13, 2013

Don't use MD5, user SipHash https://131002.net/siphash/