Awesome post! `LD_PRELOAD` is a powerful tool for program instrumentation.
It's worth noting, though, that using `LD_PRELOAD` to intercept syscalls doesn't actually intercept the syscalls themselves -- it intercepts the (g)libc wrappers for those calls. As such, an `LD_PRELOAD`ed function for `open(3)` may actually end up wrapping `openat(2)`. This can produce annoying-to-debug situations where one function in the target program calls a wrapped libc function and another doesn't, leaving us to dig through `strace` for who used `exit(2)` vs. `exit_group(2)` or `fork(2)` vs. `clone(2)` vs. `vfork(2)`.
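To make that concrete, a minimal `LD_PRELOAD` interposer for the `open(3)` wrapper looks roughly like this (a sketch using the usual `dlsym(RTLD_NEXT, ...)` pattern; it only sees callers that go through libc's `open`, not `openat(2)` or raw syscalls):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

/* Interposes the libc open() wrapper, not the open(2) syscall itself. */
int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {        /* open() only takes a mode when O_CREAT is set */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fprintf(stderr, "open(\"%s\", 0x%x)\n", path, flags);
    return real_open(path, flags, mode);
}
```

Build with something like `gcc -shared -fPIC -o log_open.so log_open.c -ldl` and run the target under `LD_PRELOAD=./log_open.so`; whether any given call shows up is exactly the wrapper-vs-syscall lottery described above.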
Similarly, there are myriad cases where `LD_PRELOAD` won't work: statically linked binaries aren't affected, and any program that uses `syscall(3)` or the `asm` compiler intrinsic to make direct syscalls will happily do so without any indication at the loader level. If these are cases that matter to you (and they might not be!), check out this recent blog post I did on intercepting all system calls from within a kernel module[1].

[1]: https://blog.trailofbits.com/2019/01/17/how-to-write-a-rootk...
There is a very recent development in this area -- there is now a way to do it without ptrace and instead entirely using seccomp[1]. It's somewhat more complicated (then again, ptrace is far from simple to get right), but it gives you the benefit of not needing ptrace (which means that debuggers and strace will still work).
It's going to see a lot of use in container runtimes like LXC for faking mounts and kernel module loading (and in tools like remainroot for rootless containers), but it will likely also replace lots of uses of LD_PRELOAD.
There's another way to intercept syscalls without going as far as a kernel module: the ptrace debugging API. There's a pretty neat article about how to implement custom syscalls using ptrace: https://nullprogram.com/blog/2018/06/23/
Yup! I discuss the pros and cons of using `ptrace` within that post.
It's all about the use case: if being constrained to inferior processes and adding 2-3x overhead per syscall doesn't matter, then `ptrace` is an excellent option. OTOH, if you want to instrument all processes and want to keep instrumentation overhead to a bare minimum, you more or less have to go into the kernel.
I've been looking for a while for a way to capture all file opens and network ops to profile unknown production workloads, similar to Process Explorer on Windows, which I believe is implemented using ETW. Unfortunately strace seems to be out of the question purely because of the performance impact. Is the performance impact due to strace or ptrace itself?
It's ptrace itself: every traced syscall requires at least one (but usually 3-4) ptrace(2) calls, plus scattered wait(2)/waitpid(2) calls depending on the operation.
If you want to capture events like file opens and network traffic, I'd take a look at eBPF or the Linux Audit Framework.
You can use [1] from Intel to successfully intercept all syscalls made from a given library, by default libc only. This library actually works by disassembling libc and replacing all `syscall` instructions with a jump to a global intercept function that you can write yourself. Incidentally, it's also an LD_PRELOADed library.
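That sounds like syscall_intercept (pmem/syscall_intercept). From memory -- so treat the exact names as an assumption -- registering a hook looks roughly like this; returning 0 tells the library you handled the syscall yourself, non-zero forwards it to the kernel:

```c
#include <libsyscall_intercept_hook_point.h>
#include <sys/syscall.h>
#include <errno.h>

static int hook(long syscall_number,
                long arg0, long arg1, long arg2,
                long arg3, long arg4, long arg5,
                long *result)
{
    (void)arg0; (void)arg1; (void)arg2; (void)arg3; (void)arg4; (void)arg5;
    if (syscall_number == SYS_getdents64) {
        *result = -ENOTSUP;   /* hide directory listings from the program */
        return 0;             /* handled here, don't run the real syscall */
    }
    return 1;                 /* anything else: let the kernel handle it */
}

/* Runs when the LD_PRELOADed library is loaded. */
static __attribute__((constructor)) void init(void)
{
    intercept_hook_point = hook;
}
```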
At a previous job, we wanted binary reproducibility - that is to say, building the same source code again should result in the same binary. The problem was, a lot of programs embed the build or configuration date, and filesystems (e.g. squashfs) have timestamps too.
Rather than patch a million different packages and create problems, we put together an LD_PRELOAD which overrode the result of time(). Eventually we faked the build user and host too.
End result: near perfect reproducibility with no source changes.
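The core of such a shim is tiny -- a minimal sketch of the idea (the SOURCE_DATE_EPOCH variable name is just illustrative here):

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <time.h>

/* Sketch: pin time() to a fixed epoch so builds that embed the build
 * date come out bit-for-bit identical. */
time_t time(time_t *tloc)
{
    const char *fake = getenv("SOURCE_DATE_EPOCH");
    time_t t = fake ? (time_t)strtoll(fake, NULL, 10) : 0;
    if (tloc)
        *tloc = t;
    return t;
}
```

A real shim would also want gettimeofday()/clock_gettime(), plus things like gethostname() and getpwuid() for the faked build host and user.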
I've also used it for reasons similar to the GM Onstar example in the article -- adding an "interposer" library to log what's going on.
I've pulled similar stunts with pydbg on a Windows XP virtual machine -- sniffing the traffic between applications and driver DLLs (even going as far as sticking a logger on the ASPI DLLs). That and the manufacturer's debug info got me enough information to write a new Linux driver for a long-unsupported SCSI device which only ever had Win9x/XP drivers.
Well if I could figure out the protocol of the Polaroid Digital Palette (specifically the HR-6000 but the ProPalette and CI-5000S use the same SCSI protocol)...
Look for any debug data you can turn on in the driver and correlate that against whatever you see going to the scanner. Try to save timestamps if you can, then merge the two logs.
I was a little surprised that while Polaroid had stripped the DLL symbols, they'd left a "PrintInternalState()" debug function which completely gave away the majority of the DP_STATE structure fields.
After that, I reverse-engineered and reimplemented the DLL (it's a small DLL), swapped the ASPI side for Linux and wrote a tool that loaded a PNG file and spat the pixels at the reimplemented library.
And then someone sent me a copy of the Palette Developer's Kit...
(Incidentally I'd really love to get hold of a copy of the "GENTEST" calibration tool, which was apparently included on the Service disk and the ID-4000 ID Card System disks)
I shoot 135 film and some medium format, I have tried Super 8 and would love to start shooting 16mm film - but having a film recorder and actually using it, that's something?!
:-D What can you do, what would you do?
If I was filthy rich I'd project 35mm movies in my living room. :)
I'll share my story. I used to work at a popular Linux website hosting control panel company. Back in the early 2000's "frontpage extensions" were a thing that people used to upload their websites.
Unfortunately, frontpage extensions required files to exist in people's Linux home directories, and people would often mess them up or delete them. People would need their frontpage extension files "reset" to fix the problem. Fortunately, Microsoft provided a Linux binary to reset a user's frontpage extension files.
Unfortunately, it required root access to run. Also unfortunately, I discovered that a user could set up symlinks in their home directory to trick the binary into overwriting files like /etc/passwd.
We ended up actually releasing a code change that would override getuid with LD_PRELOAD so that the Microsoft binary would think it was running as root, just to prevent it from being a security hazard.
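For the curious, the shim for that is basically a one-liner -- a sketch of the idea, not their actual code:

```c
#include <sys/types.h>
#include <unistd.h>

/* Lie to the binary: report uid 0 so its "am I root?" check passes,
 * while the process actually runs as an unprivileged user. */
uid_t getuid(void)  { return 0; }
uid_t geteuid(void) { return 0; }
```

This is essentially the same trick fakeroot uses.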
It was very much in keeping with the Microsoft of the era. Not out of maliciousness. Just a general lack of interest in or knowledge of any non-Windows platform, but a recognition that if Frontpage was going to be as dominant as they wanted, they at least needed to vaguely support it.
Think the worst case of "Well it works on my machine"
Here's my friend's LD_PRELOAD hack, it pushes the idea further: hooking gettimeofday() to make a program think that time goes faster or slower. Useful for testing.
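Something in that spirit (a sketch, not the friend's code): scale elapsed time by a factor read from an env variable, so the program sees time pass faster or slower.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/time.h>

/* Sketch: report start + TIME_FACTOR * elapsed instead of real time. */
int gettimeofday(struct timeval *tv, struct timezone *tz)
{
    static int (*real)(struct timeval *, struct timezone *) = NULL;
    static struct timeval start;
    static double factor = 1.0;

    if (!real) {
        real = (int (*)(struct timeval *, struct timezone *))
               dlsym(RTLD_NEXT, "gettimeofday");
        const char *f = getenv("TIME_FACTOR");
        if (f)
            factor = atof(f);
        real(&start, NULL);
    }

    struct timeval now;
    int rc = real(&now, tz);
    if (rc != 0 || !tv)
        return rc;

    double elapsed = (now.tv_sec - start.tv_sec)
                   + (now.tv_usec - start.tv_usec) / 1e6;
    double faked = elapsed * factor;

    tv->tv_sec  = start.tv_sec + (time_t)faked;
    tv->tv_usec = start.tv_usec
                + (suseconds_t)((faked - (double)(time_t)faked) * 1e6);
    if (tv->tv_usec >= 1000000) { tv->tv_sec++; tv->tv_usec -= 1000000; }
    return rc;
}
```

Programs that use clock_gettime() or time() need those wrapped too -- libfaketime does all of this properly.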
I've implemented some sort of "poor man's Docker" using LD_PRELOAD, back in 2011 when Docker wasn't a thing. It works by overriding getaddrinfo (IIRC) and capturing name lookups of "localhost", which are then answered with an IP address taken from an env variable.

The intended use is parallelizing the automated testing of a distributed system: by creating lots of loopback devices with individual IPs and assigning those to test processes (via the LD_PRELOAD hack), I could suddenly run as many instances of the software system next to each other as I wanted, on the same machine (the test machine is some beefy dual-socket server with lots of CPU cores and RAM).

Each instance consists of clients and several server processes, which are by default configured to bind to specific ports on localhost, as is common for dev and test setups. With the hack, each instance routes its traffic over its own loopback device, and I was spared from having to untangle the server ports of all the different services just to parallelize them on a single machine - and from the configuration hell that would have come with that.
It helped that processes by default inherit the env variables from their parents that spawned them - that made it a lot easier to propagate the preload path and the env variable containing the loopback IP to use. I just had to provide it to the top-most process, basically.
Today, one would use Docker for this exact purpose, putting each test run into its own container (or even multiple containers). But since the LD_PRELOAD hack worked so well, the project in which I implemented the above is still using it (although they're eyeing a switch to Docker, in part because it also makes it easier to separate non-IP-related resources such as files on the filesystem, but mostly because knowledge about Docker is more widespread than about such ancient tech as LD_PRELOAD and how to hack into name resolution of the OS).
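For reference, the getaddrinfo part of such a hack can be quite small. A rough sketch, assuming an env variable like TEST_LOOPBACK_IP carries the per-instance address (illustrative names, not the project's actual code):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <netdb.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: answer lookups of "localhost" with the loopback alias assigned
 * to this test instance (e.g. 127.0.0.2) instead of the real 127.0.0.1. */
int getaddrinfo(const char *node, const char *service,
                const struct addrinfo *hints, struct addrinfo **res)
{
    static int (*real)(const char *, const char *,
                       const struct addrinfo *, struct addrinfo **) = NULL;
    if (!real)
        real = (int (*)(const char *, const char *,
                        const struct addrinfo *, struct addrinfo **))
               dlsym(RTLD_NEXT, "getaddrinfo");

    const char *redirect = getenv("TEST_LOOPBACK_IP");
    if (redirect && node && strcmp(node, "localhost") == 0)
        node = redirect;   /* numeric addresses are parsed by the real call */

    return real(node, service, hints, res);
}
```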
Here's my LD_PRELOAD hack: rerouting /dev/random to /dev/urandom -- because I disagree with gpg's fears on entropy. Now generating a private key is as fast as with ssh-keygen or openssl.
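The idea, roughly (a sketch of that kind of interposer, not the actual hack):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>

/* Sketch: send readers of /dev/random to /dev/urandom instead. */
int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...) = NULL;
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    if (path && strcmp(path, "/dev/random") == 0)
        path = "/dev/urandom";

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}
```

In practice you'd likely need to cover open64() and openat() as well, depending on how the program was built.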
You can also just simply delete /dev/random and symlink it to urandom. Or delete it and create a character device at /dev/random that uses urandom's major/minor numbers.
My layman's understanding of the two is that /dev/urandom will happily output more bits than it has been seeded with, and so is unsuitable for use in cryptography, as it can output correlated values. Is my understanding here incorrect?
(edit: I see my parent post is being downvoted. How can this be? The commenter is just asking a question...)
It is incorrect. Both /dev/urandom and /dev/random are connected to a CSPRNG. Once a CSPRNG is initialized by SUFFICIENT unpredictable inputs, it's forever unpredictable for (practically) unlimited output (something like 2^128). If the CSPRNG algorithm is cryptographically secure, and the implementation doesn't leak its internal state, it would be safe to use it for almost all cryptographic purposes.
However, the original design in the Linux kernel was paranoid enough that it blocks /dev/random (even if a CSPRNG can output unlimited random bytes) if the kernel thinks the output has exceeded the estimated uncertainty from all the random events. Most cryptographers believe that if a broken CSPRNG is something you need to protect yourself from, you already have bigger problems, and that it's unnecessary from a cryptographic point of view to be paranoid about a properly-initialized CSPRNG. /dev/random found on other BSDs is (almost) equivalent to Linux's /dev/urandom.
However, /dev/urandom has its own issues on Linux. Unlike BSD's implementation, it doesn't block even if the CSPRNG is NOT initialized during early boot. If you automatically generate a key for, e.g., SSH at this point, you'll have serious trouble - predictable keys - so reading from /dev/random still has a point, although not for 90% of programs. I think it's a perfect example of being overly paranoid about unlikely dangers while overlooking straightforward problems that are likely to occur.
The current recommended practice is to call getrandom() system call (and arc4random()* on BSDs) when it's available, instead of reading from raw /dev/random or /dev/urandom. It blocks, until the CSPRNG is initialized, otherwise it always outputs something.
*and no, it's not RC4-based, but ChaCha20-based on new systems.
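For reference, the getrandom() path is about as simple as it gets (glibc >= 2.25 exposes the wrapper in <sys/random.h>):

```c
#include <stdio.h>
#include <sys/random.h>

int main(void)
{
    unsigned char key[32];

    /* Blocks only until the kernel CSPRNG has been seeded once,
     * then never again; flags = 0 draws from the urandom source. */
    if (getrandom(key, sizeof key, 0) != (ssize_t)sizeof key) {
        perror("getrandom");
        return 1;
    }

    for (size_t i = 0; i < sizeof key; i++)
        printf("%02x", key[i]);
    printf("\n");
    return 0;
}
```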
> /dev/random found on other BSDs is equivalent to Linux's /dev/urandom.
This isn't quite true. The BSDs random (and urandom) block until initially seeded, unlike Linux's urandom. Then they don't block. (Like the getrandom/getentropy behavior.)
> The current recommended practice is to call getrandom() system call (and arc4random() on BSDs) when it's available, instead of reading from raw /dev/random or /dev/urandom. It blocks when the CSPRNG is initialized, but otherwise it always outputs something.
+1 (I'd phrase that as "blocks until the CSPRNG is initialized," which for non-embedded systems will always be before userland programs can even run, and for embedded should not take long after system start either).
> it blocks /dev/random (even if a CSPRNG can output unlimited random bytes) if the kernel thinks the output has exceeded the estimated uncertainty from all the random events. Most cryptographers believe that if a broken CSPRNG is something you need to protect yourself from, you already have bigger problems,
Not just that, but if you have a threat model where you actually need information theoretic security (e.g. you're conjecturing a computationally unbounded attacker or at least a quantum computer)-- the /dev/random output is _still_ just a CSPRNG and simply rate limiting it doesn't actually make a strong guarantee about the information theoretic randomness of the output. To provide information theoretic security the function design would need to guarantee that at least some known fraction of the entropy going in actually made it to the output. Common CSPRNGs don't do this.
So you could debate whether information theoretic security is something someone ever actually needs -- but if you do need it, /dev/random doesn't give it to you regardless.
[And as you note, urandom doesn't block when not adequately seeded ... so the decision to make /dev/random block probably actually exposed a lot of parties to exploit and probably doesn't provide strong protection even against fantasy land attacks :(]
> simply rate limiting it doesn't actually make a strong guarantee about the information theoretic randomness of the output. To provide information theoretic security the function design would need to guarantee that at least some known fraction of the entropy going in actually made it to the output. Common CSPRNGs don't do this.
This is an interesting point I hadn't thought about before, so thanks for that. I suppose if you're generating a OTP or something like that, there might be some small advantage to using /dev/random, but the probability of it making a difference is pretty remote.
The one thing I haven't been able to figure out is why Linux hasn't "fixed" both /dev/random and /dev/urandom to block until they have sufficient entropy at boot and then never block again. That seems like obviously the optimal behavior.
Blocking could potentially result in the system getting stuck during boot and simply staying that way. Compatibility is a bear. The getentropy syscall does the reasonable thing.
It's important to note here that Linux's behaviour is broken, plain and simple: /dev/random blocks even if properly seeded, and /dev/urandom doesn't block even if improperly seeded.
The Real Solution™ is to make /dev/random and /dev/urandom the same thing, and make them both block until properly seeded. And replace the current ad-hoc CSPRNG with a decent one, e.g. Fortuna. There were patches almost 15 years ago implementing this (https://lwn.net/Articles/103653/), but they were rejected.
There's simply no good reason not to fix Linux's CSPRNG.
I think getrandom(2) is a fine choice, but if you are using the C library (as opposed to using asm directives to make syscalls), getentropy(3) is even better. No need to think about the third `flags` argument or read a long section about interruption by a signal handler.
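Concretely, that interface is about as small as it can be (also needs glibc >= 2.25, or a BSD libc):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned char key[32];

    /* No flags argument, no signal-handler fine print; the only limit
     * is that a single request must be <= 256 bytes. */
    if (getentropy(key, sizeof key) != 0) {
        perror("getentropy");
        return 1;
    }

    for (size_t i = 0; i < sizeof key; i++)
        printf("%02x", key[i]);
    printf("\n");
    return 0;
}
```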
Yes and no. Much of cryptography is based on pseudorandom number generators, which output more bits than they are seeded with. If these PRNGs are not secure, then almost any piece of cryptography you actually use would be insecure, independent of your choice to use random or urandom.
Unless all of your cryptography is information-theoretically secure, there is no problem using a PRNG.
If you happen to be using an information-theoretically secure algorithm, then you are theoretically weaker using a limited-entropy PRNG; but there are no practical implications of this.
The only information theoretically secure encryption algorithm is a one-time pad seeded with true randomness. In fact, you cannot achieve information theoretic security using a pseudorandom generator of any kind.
I am curious. How do you know if this is secure or not? Is there any publication or article available for this slightly time-saving but potentially dangerous choice?
The /dev/random interface is considered a legacy interface, and /dev/urandom is preferred and sufficient in all use cases, with the exception of applications which require randomness during early boot time; for these applications, getrandom(2) must be used instead, because it will block until the entropy pool is initialized.
Not that I disagree with you, but which are the official man pages for /dev/urandom? It's my recollection that the advice therein varies from OS to OS.
This page is part of release 4.16 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at https://www.kernel.org/doc/man-pages/.
> this slightly time-saving but potentially dangerous choice?
The one and only danger is during the machine's boot process, because while /dev/random and /dev/urandom use the same data:
* on linux /dev/random has a silly and unfounded entropy estimator and will block at arbitrary points (used to be a fad at some point, but cryptographers have sworn off it e.g. Yarrow had an entropy estimator but Fortuna dropped it)
* also on linux, /dev/urandom never blocks at all, which includes a cold start, which can be problematic as that's the one point where the device might not be seeded and return extremely poor data
In fact the second point is the sole difference between getrandom(2) and /dev/urandom.
If you're in a steady state scenario (not at the machine boot where the cold start entropy problem exists) "just use urandom" is the recommendation of pretty much everyone: tptacek, djb, etc…
> In fact the second point is the sole difference between getrandom(2) and /dev/urandom.
AFAIK, there's another important difference: getrandom(2) doesn't use a file descriptor (so it'll work even if you're out of file descriptors, or in other situations where having an open fd is inconvenient), and it doesn't need access to a /dev directory with the urandom device.
The catch is old programs that depend on the blocking behavior of /dev/random during early boot could be an issue. Unlikely to be a problem on a server, though...
It disables fsync, O_SYNC etc., making them no-ops, essentially making the program's writes unsafe. Very dangerous. But very useful when you're trying to bulk load data into a MySQL database, say as preparation of a new slave (followed by manual sync commands, and very careful checksumming of tables before trusting what happened).
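The shim for the fsync part is almost comically small -- essentially what libeatmydata does:

```c
#include <unistd.h>

/* Turn the durability calls into successful no-ops: great for one-off
 * bulk loads, catastrophic anywhere you actually care about the data. */
int fsync(int fd)     { (void)fd; return 0; }
int fdatasync(int fd) { (void)fd; return 0; }
void sync(void)       { }
```

Stripping O_SYNC/O_DSYNC additionally needs an open() wrapper that masks those flags out.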
I've been using LD_PRELOAD for fun and profit for a long, long time. Its simplicity is due to the simplicity of the C ABI. Its power is due to dynamic linking.
C is one programming language. C w/ ELF semantics and powerful link-editors and run-time linker-loaders is a rather different and much more powerful language.
I won't be sad to see Rust replace C, except for this: LD_PRELOAD is a fantastic code-injection tool for C that is so dependent on the C ABI being simple that I'm afraid we'll lose it completely.
You can easily write and call functions that abide the C ABI in Rust, but the set of types permitted in those signatures is much smaller (only #[repr(C)]-compatible types) than in ordinary Rust functions. The Rust ABI is more complicated and won't be stabilized anytime soon.
This is Jess’ personality - check out her Twitter account. I don’t mind it, but I’ve followed her for a while so I’m used to it. Honestly I find it to be a refreshing break from the typically stiff writing I see. She’s smart and doesn’t need to hide behind stodgy writing in order to make herself seem smarter.
I don't really mind it on twitter, as twitter is anything but serious, and you can't really have any coherent text there.
But such elements in a regular article simply harm its coherence and readability for anyone that does not spend much of their time in (rather noisy and immature, IMHO) communities which feature "meme image macros" heavily.
(Bonus negative points if some of the images are animated. That makes me think that the author actively hates the readers.)
Since I already knew what LD_PRELOAD did, I was rather amused that the author had a similar epiphany over it as I did many years ago, which made it a great read -- something to relate to.
I agree that the article is emotional, but the annoyance or not is so very subjective.
And the intro goes on for about 1/3 of the total article before we even know what it's about (yeah, I click on an article when I’m intrigued by the title, when it seems programming related).
I'll also join in and share my projects using LD_PRELOAD. These also work on macOS through its equivalent DYLD_INSERT_LIBRARIES.
https://github.com/d99kris/stackusage measures thread stack usage by intercepting calls to pthread_create and filling the thread stack with a dummy data pattern. It also registers a callback routine to be called upon thread termination.
https://github.com/d99kris/cpuusage can intercept calls to POSIX functions (incl. syscall wrappers) and provide profiling details on the time spent in each call.
It's a bit more unwieldy to use, because it doesn't just replace all matching symbols (it's not how symbol lookup works for DLLs in Win32) - the injected DLL has to be written specifically with Detours in mind, and has to explicitly override what it needs to override. But in the end, you can do all the same stuff with it.
I didn't phrase that unambiguously -- "injected DLL" in this case means "the DLL with new code that is injected", not "the DLL that the code is being injected into". With LD_PRELOAD, all you need to override a symbol is an .so that exports one with the same name. With Detours, you need to write additional code that actually registers the override as replacing such-and-such function from such-and-such DLL. But yes, the code you're overriding doesn't need to know about any of that.
Librespot uses LD_PRELOAD to find and patch the encryption/decryption functions used in Spotify's client so the protocol can be examined in Wireshark (and ultimately reverse engineered). I am not the original author; he wrote a macOS version using DYLD_INSERT_LIBRARIES to achieve something similar.
I once used LD_PRELOAD to utilize an OpenGL "shim" driver (for an automated test suite). The driver itself was generated automatically from the gl.h header file.
Since everyone is giving examples of LD_PRELOAD use — it has serious production use at scale in HPC, particularly for profiling and tracing. Runtimes such as MPI provide a layer designed for instrumentation to be interposed, typically with LD_PRELOAD (e.g. the standardized PMPI layer for MPI; see the sketch below). Another example is the entirely userspace parallel filesystem that OrangeFS (né PVFS2) provides via the "userint" layer interposing on Unix I/O routines. That sort of facility is a major reason for using dynamic linking, despite the overheads of dynamically loading libraries for parallel applications at scale. I'm not sure if a solution could be hooked in with LD_PRELOAD, but Spindle actually uses LD_AUDIT: https://computation.llnl.gov/projects/spindle
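For anyone who hasn't seen the MPI side: the PMPI convention means every MPI_Foo also exists as PMPI_Foo, so a profiling layer just defines MPI_Foo itself and forwards -- no dlsym needed. A trivial sketch:

```c
#include <mpi.h>
#include <stdio.h>

/* Profiling interposer: the MPI standard guarantees PMPI_Send is the same
 * routine as MPI_Send, so we override the latter and forward to the former. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    fprintf(stderr, "MPI_Send of %d elements to rank %d took %g s\n",
            count, dest, MPI_Wtime() - t0);
    return rc;
}
```

Built as a shared library, it can be dropped in with LD_PRELOAD without relinking the application.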
The authors have a small library that sets up some signal handlers for things like divide by zero and segmentation faults. They LD_PRELOAD this library when starting a buggy binary (they test things like Chromium and the GIMP), and when the program tries to divide by zero or read from a null pointer, their signal handlers step in and pretend that the operation resulted in a value of 0. The program can then carry on without crashing and usually does something meaningful. Tadaa, automatic runtime error repair!
My favorite: https://github.com/musec/libpreopen is a library for adapting existing applications that open() and whatnot from all over everywhere to the super strict capability based Capsicum sandbox on FreeBSD. I'm working on https://github.com/myfreeweb/capsicumizer which is a little wrapper for launching apps with preloaded access to a list of directories from an AppArmor-like "profile".
LD_PRELOAD is extremely helpful in troubleshooting libraries. Around 2007, qsort on RHEL was slower than on SUSE. I raised a case with Redhat along with a test case, but Redhat was not helpful, as it was not reproducible.
So, I copied glibc.so from a SUSE machine to that RHEL machine and ran the test case with LD_PRELOAD, compared with the RHEL glibc. I showed these results to Redhat. Eventually, a patch was applied to glibc on their side.
I personally just hate LD_PRELOAD because it's very difficult to turn it off and keep it off. I am glad others find uses for it and that's great, but I hate the privilege escalation attack surface it opens up. I get that it has uses, but there needs to be a simple way to disable it for hardened systems.
LD_PRELOAD only works for binaries that are dynamically linked (LD_PRELOAD is actually handled by the link loader, not the kernel[1]), and you can only use it to override dynamic symbols IIRC.
It definitely doesn't work with Go, and Rust might work but I'm not sure they use the glibc syscall wrappers.
I'm aware of that, I guess my point was that Rust probably doesn't use a lot of glibc (like most C programs would) so the utility of LD_PRELOAD is quite minimal.
I don't know enough about .rlib to know whether you could overwrite Rust library functions, but that's a different topic.
Right, but does that mean it's only used as a way of getting syscall numbers (without embedding it like Go does) or is it the case that you could actually LD_PRELOAD random things like nftw(3) and it would actually affect Rust programs? I'll be honest, I haven't tried it, but it was my impression that Rust only used glibc for syscall wrappers?
Musl supports dynamic linking. But it also supports static linking (which glibc doesn't really support because of NSS and similarly fun features) -- hence why Rust requires musl to statically link Rust binaries.
We use a sort of similar trick (not via LD_PRELOAD, though) to inject faults in M_NOWAIT malloc() calls in the FreeBSD kernel. FreeBSD kernel code tends to be a bit better than most userspace code I've seen as far as considering OOM conditions, though it is not perfect.
What security issues are those? Is there anything you can do with LD_PRELOAD that you cannot do in other ways such as modifying binaries before executing them?
As a regular ol' GNU/Linux user, you cannot modify binaries in /usr/bin (or /bin), but you can definitely influence their behavior by "LD_PRELOAD=blah /usr/bin/thing".
It depends on assumptions in the way a system is hardened. For example, a home directory mounted noexec. In theory, LD_PRELOAD will not mmap a file in a noexec area. But if you can find an installed library with functions that mirror some other application you have, and you can LD_PRELOAD that library before executing the target application, you might be able to force the library to call unexpected routines. (That's a stretch, granted)
Another would be possible RCE. Say you can get a server-side app to set environment variables, like via header injection. Then say you can upload a file. Can you make that server-side app set LD_PRELOAD to the file, and then wait for it to execute an arbitrary program?
I needed to calculate the potential output file size tar would produce, so what better way than using tar itself to calculate it? It just required hooking read, write, and close.
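Same interposer pattern as elsewhere in this thread -- a simplified sketch of the idea (only write() shown; it assumes tar is run with the archive going to stdout, e.g. `tar -cf - ... > /dev/null`):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static size_t total;   /* bytes tar has written to its archive fd */

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    if (fd == 1)                 /* fd 1 carries the archive in this setup */
        total += count;
    return real_write(fd, buf, count);
}

/* Report the would-be archive size when tar exits. */
static __attribute__((destructor)) void report(void)
{
    fprintf(stderr, "archive size: %zu bytes\n", total);
}
```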
Being able to override some library function such that running my text editor does $BADTHING isn't very interesting from a security perspective: if I have the capability to do that, I could also just run a program that does $BADTHING directly. Why bother with additional contortions to involve the text editor?
A malicious program without LD_PRELOAD can still copy the binary to another folder and quietly change the menu entry to point to the copy. Then modify the copy by binary patching to do whatever. Or run it via a modified qemu to do whatever. The main problem is the lack of a proper sandbox and that all programs in a user session generally have the same permissions.
If I were to provide LD_PRELOAD-based security cover (take any binary and secure it with LD_PRELOAD), would that be acceptable to corporates? Or does that increase the attack surface?
On FreeBSD, it's supported but kinda sucks. e.g., porting to a new CPU architecture is hell (I contributed to the FreeBSD/aarch64 go port, someone else picked it up now…)
The libc is the stable ABI on pretty much any OS that's not called Linux, just use it.
Ehh… does it kind of suck? The ABI of libc's syscall wrappers is basically "here's some ELF symbols to call with some arguments using the operating system's preferred calling convention". The only really "C" thing about it, other than the name, is struct layouts of various arguments.