Hacker News

This post got me thinking: it'd be interesting if, instead of being given an explicit list of files to watch for changes, the watcher inferred the list with an LD_PRELOAD shim that intercepted the `open` / `stat` / etc. system calls the process makes.

For example, in the example on the blog post, `git ls-files` will almost certainly ignore autogenerated build files, but it's possible for one of those files to change without the output of `git ls-files` changing. Similarly for things like third party packages that are installed system-wide.

With an LD_PRELOAD, all you'd need to do is

    my-watcher ruby foo.rb
and the watcher would figure out which other Ruby files were opened, be they git-versioned Ruby files in the current folder, Ruby gems / Ruby VM dependencies in $HOME/.rbenv/, system packages in /usr/local, config files in /etc, ...

I guess I wouldn't actually be surprised to hear if someone has built this already.




Some build systems work that way; Tup is one, I think. They use strace to intercept file I/O and figure out what has been updated, and thus can work out an optimal way to rebuild.


I think Tup uses FUSE rather than strace (which is why it doesn't track external dependencies, and requires relative paths for all internal files), but I might be wrong.


I built a dependency checker that worked similarly to this once.

By hooking the filesystem calls, you can make a list of all files that a given process touches. When that process finishes, serialize a dependency file containing the hashes, timestamps, and sizes of all those files. Next time you run that same command line, read the dependency file from the last run and compare to the current filesystem state. If it's the same, and you know your command is idempotent, you can skip execution entirely.

Now, if you put that logic in a DLL, you can inject it into arbitrary third-party processes you don't have the source code for, and it will still work. Name the dependency file after the hash of the command line.


The ClearCase SCM does something similar - it has the notion of "derived objects" and "audited builds".

You can run an audited build command under a special ClearCase wrapper and it will look at the versions of the elements used in the build - if that build has already occurred with the same input elements before, even if in a different view, it can "wink in" the derived object result - that is, it can cause that previous build to be visible in your current view. This can save a lot of time when you're building a large codebase.


A less intrusive approach could use kernel queues[0] on most Unix-like systems. That way LD_PRELOAD isn't needed, and the process responding to the disk I/O of interest is independent of the processes performing it.

0 - https://www.freebsd.org/cgi/man.cgi?query=kqueue&apropos=0&s...


fabricate.py (https://github.com/brushtechnology/fabricate), based on the now-ancient memoize.py.

In my experience with C/C++, it is faster to combine Make and ccache: just have every C file depend on every header file, and let ccache decide whether anything actually needs to be rebuilt.
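The pattern described above might look like this (a sketch; file and target names are made up):

```make
# Every object depends on every header; ccache then skips recompiles
# whose preprocessed input hasn't actually changed.
CC      := ccache gcc
HEADERS := $(wildcard *.h)
OBJS    := $(patsubst %.c,%.o,$(wildcard *.c))

app: $(OBJS)
	$(CC) -o $@ $^

%.o: %.c $(HEADERS)
	$(CC) -c -o $@ $<
```

The Makefile stays tiny because no per-file dependency lists are maintained; the cost is that Make re-invokes the compiler for every object on any header change, and only ccache saves you.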


I really disagree there, especially as the project gets complex enough. ccache won't speed up anything related to linking, for example, but it will bump the mtime of all your object files as it writes them, even when they come from the cache. So at some point even archive creation dominates your "incremental" build time.

I suggest you don't wait until it's too complex to fix the mistake; do proper dependency tracking from the very beginning. It's not that hard in C/C++.
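For reference, "proper dependency tracking" in Make usually means letting the compiler emit per-file header dependencies (gcc/clang's `-MMD -MP`) and including them, so touching one header only rebuilds its actual users. A sketch with made-up file names:

```make
CC     := gcc
CFLAGS := -MMD -MP
SRCS   := $(wildcard *.c)
OBJS   := $(SRCS:.c=.o)

app: $(OBJS)
	$(CC) -o $@ $^

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

# Pull in the .d files the compiler generated on previous runs;
# they list each object's real header dependencies.
-include $(OBJS:.o=.d)
```

On the first build the `.d` files don't exist yet, which is fine: everything builds anyway, and subsequent builds use the generated dependency lists.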


My project is ~3.6 million lines of code with a 300 ms incremental null build and a 2-10s touch-a-header-file build. I only generate a few hundred exes, and most of the code lives in a single dylib.

I know that's not large, but I've got one Makefile that's only about 200 lines long. It's a pretty good trade-off.


Yes, it's not large at all, but I'm already surprised by the claim that you can link ~4 million lines of code in 300 ms. For comparison, around 10x that much C++ takes 2 minutes here with gold, on Xeon machines. Even writing out the main executable takes a good chunk of time (stripped, it already measures almost a quarter of a gigabyte).


Most of the code is tuned C code, tuned in the sense of being fast to compile, with a nice C++ wrapper to make it pleasant to use. I'm seeing compilation speeds of around 100 kloc/s on an older laptop.


Cool idea! One minor tweak: I'd rather have the watcher run separately, but I think that would still be easy enough if you resolved file paths against the project root.



