Hidden dependencies in Linux binaries (thelittleengineerthatcould.blogspot.com)
118 points by thunderbong on April 14, 2024 | hide | past | favorite | 68 comments



> Meanwhile, when I use CUDA instead of Vulkan, I get serenity back. CUDA FTW!

Just because the complexity is hidden from you doesn't mean it's not there. You have no idea what is statically bundled into the CUDA libs.


I agree with you, hidden is worse.

But we do know what it cannot statically link to: any GPL library, which many of the indirect dependencies are.


I think you mean the LGPL? It allows you to "convey a combined work under terms of your choice" as long as the LGPL-covered part can be modified, which can be achieved either via dynamic linking or by providing the proprietary code as bare object files to relink statically. The GPL doesn't have this exception.


If static and dynamic libraries use the same interface, shouldn't they be detectable in both cases? Or is it removed at compile time?


First IANACC (I'm not a compiler programmer), but this is my understanding:

What do you mean by interface?

A dynamic library is handled very differently from a static one: it is loaded into the process's virtual memory address space, where there is a tree of loaded libraries. (I would guess this program walks that tree, but there may be better ways that I don't know of which this program utilizes.)

In the GNU/Linux world a static library is more or less a collection of object files. To the best of my knowledge, the linker will not treat the contents of a static library differently from your own code; LTO can take place. In the final ELF, the static library will be indistinguishable from your own code.

My experience with the symbol table in ELF files is limited, and I do not know whether it could help unwrap static library dependencies. (A debug symbol table would of course help.)


Exactly. The other use case is just more modular, even if the dependencies are sometimes tangled unnecessarily.


The tool is interesting, but doesn't account for the fact that some shared libraries are opened via dlopen lazily. So it might miss those if you haven't executed a code path that triggers them to load.

The flip side of not accidentally loading more into your process than you thought is breaking shared libraries down into ever smaller pieces. In the limit I imagine it would be akin to one function per shared library, which probably defeats the point a bit.


Lazy binding for dlopen is disabled when the LD_BIND_NOW environment variable is set to a nonempty value, cf. https://man7.org/linux/man-pages/man3/dlopen.3.html


That doesn't mean the program is going to execute all the dlopen() calls at startup.


That is true, but the root comment was specifically referring to dependencies of dlopen'ed libs not getting loaded. That one is 'fixable'.

(Btw, I'm pretty sure dlopen itself can't be lazy, due to needing to run constructors; the root comment is a bit vaguely worded... but ofc that only matters after dlopen is called.)


From the article it seems like that might be the point: to visualise what the binary actually uses, e.g. if you use different preferences.


If the code path that loads the library is never hit, does it really count as a dependency?


It does if someone is intentionally sending it down that code path for exactly that reason.


If your tests don't exercise a bug, is it still there?


Really the big finding here is that Xlib et al. get pulled into GPU compute tools, because access to GPU contexts has traditionally been mediated by the desktop subsystem, because the GPU was traditionally "owned" by the rendering layers of the device abstraction.

The bug here is much more a changing hardware paradigm than it is an issue with shared library dependencies that recapitulate it. Things moved and the software layers kludged along instead of reworking from scratch.

Obviously what's needed is a layer somewhere in the device stack that "owns" the GPU resources and doles them out to desktop rendering and compute clients as needed, without the two needing to know about each other. But that's a ton more work than just untangling some symbol dependencies!


I was debugging a crash in vlc today - actually in the Intel VDPAU driver - and debuginfod (which dynamically downloads the debuginfo for everything in a coredump) took a good 15 minutes to run. If you look at the 'ldd /usr/bin/vlc' output it's only about 10 libraries, but it loads dozens and dozens more dynamically using dlopen, and I think probably those libraries dlopen even more. This tool could be pretty useful to visualise that.


On windows, this is Dependency Walker versus ProcExp. Similar eye-goggling results.

https://www.dependencywalker.com/

https://learn.microsoft.com/en-us/sysinternals/downloads/pro...


Dependency Walker used to be part of the Windows SDK; I never got why it was removed, given its utility.


The glibc separation into multiple shared libraries is such a weird thing. Anyone happen to know how that happened? See musl for an example where they put it all in one lib and thus avoid a whole pile of failure modes.


POSIX requires it, cf. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/c...

However, that "requirement" doesn't prevent you from shipping an empty libm (or other libs listed there.)

(The actual reason is probably that glibc is old enough to have lived in a time where you cared about saving time and space by not linking the math functions when you didn't need them...)


Actually, now that I'm looking at the list… I certainly don't have a "libxnet" on my system ;D


A lot of them are stubs nowadays actually.


We're in this situation because we're using a model of dynamic linking that's decades out of date. Why aren't we using process-isolated sandboxed components talking over io_uring-based low-latency IPC to express most software dependencies? The vast majority of these dependencies absolutely do not need to be co-located with their users.

Consider liblzma: would liblzma-as-a-service really be that bad, especially if the service client and service could share memory pages for zero-copy data transfer, just as we already do for, e.g. video decode?

Or consider React Native: RN works by having an application thread send a GUI scene to a renderer thread, which then adjusts a native widget tree to match what the GUI thread wants. Why do these threads have to be in the same process? You're doing a thread switch anyway to jump from the GUI thread to the renderer thread: is switching address spaces at the same time going to kill you? Especially if the two threads live on different cores and nothing has to "switch"?

Both dynamic linking and static linking should be rare in modern software ecosystems. We need to instead reinvigorate the idea of agent-based component systems with strongly isolated components.


> is switching address spaces at the same time going to kill you?

The answer is "yes".

I won't stop you, if you want to make React even slower, be my guest. I want off this ride.


> The answer is "yes".

And if you have one per core anyway so nothing "switches"? Computers aren't single-core 80486es anymore. We have highly parallel machines nowadays and old intuition about what's expensive and what's cheap decays by the year.


I only have 16 cores. Linux, windows and macOS already load about 50+ processes at startup. If we moved shared libraries into their own processes, we’d be talking hundreds or thousands of processes running all the time. They don’t get a core each.

But, if you’re interested in this architecture, Smalltalk did something similar. Fire up a Smalltalk VM and play around!


You know that process A (the caller) is currently running. You assume that process B (the callee) is running on another core on the same NUMA node.


Sorry, not dedicating a whole CPU core for your shitty React app.


> Why aren't we using process-isolated sandboxed components talking over io_uring-based low-latency IPC to express most software dependencies?

To some extent we are, if what you do is work on backend RPC or web app frameworks.

But the better answer is because sometimes what you actually want is the ability to put a C function in a separate file that can be versioned and updated on its own, which is what a shared library captures. Trying to replace a function call of 2-3 instructions with your io_uring monstrosity is... suboptimal for a lot of applications.

And in any case, the protocol parsing you'd need to provide to enable all that RPC is going to need to live somewhere, right? What is that going to be, other than a shared library or equivalent?


I'm currently involved professionally in a software architecture based on pretty much raw shared-memory IPC, and it's still too slow compared to in-process. See also VST hosts that allow grouping plug-ins together in one process or separating them into distinct processes, like Bitwig: for just a few dozen plug-ins you can very easily see a 10+% CPU impact (and CPU is an extremely dire commodity when making pro audio; it's pretty much a constant fight against high CPU usage in larger music-making sessions).


> it's still too slow compared to in-process

Why? Relative to the in-process case, properly done multi-process data flow pipelines don't necessarily incur extra copies. Sure, switching to a different process is somewhat more expensive than switching to a different thread due to page table changes, but if you're doing bulk data processing, you amortize any process-separation-driven costs across lots of compute anyway --- and in a many-core world, you can run different parts of your system on different cores anyway and get away with not paying context-switch costs at all.

Also, 10% is actually a pretty modest price to pay for increased software robustness and modularity. We're paying more than that for speculative execution vulnerability anyway. Do you run your fancy low-level audio processing pipeline with "mitigations=off" in /proc/cmdline?


> Also, 10% is actually a pretty modest price to pay for increased software robustness and modularity. We're paying more than that for speculative execution vulnerability anyway.

it's a completely crazy price to pay in a field where people routinely spend thousands of $$$ for <5% improvement

> Do you run your fancy low-level audio processing pipeline with "mitigations=off" in /proc/cmdline?

obviously yes! Along with power-saving CPU C-states and anything throttling-related disabled, specific real-time IRQ and threading configuration (e.g. making sure that the sound card interrupts aren't going to happen on a core handling network interrupts), and two dozen other optimizations. These do make a difference: I regularly set up new machines from scratch for shows, art installations, etc. and always do this setup step-by-step to see if things are finally "good enough", and they always make a difference, in really a make-or-break sense.


Can you share your checklist for this show set-up? :)


not for free but we can do consulting for this at my job, feel free to mail at jmcelerier ම sat.qc.ca :)


Someone got a microservice hammer, so everything looks like a nail eh?

What is really needed, is sane memory model where you can easily call any function with buffers (pointer + size) and it is allowed to access only these buffers and nothing else(note). Not this mess coming from C where this is difficult by design.

(note)since HN likes to split hairs: except for its private storage and other well thought exceptions


This would be what's known as software-based fault isolation, right? Here's a paper from 1993: https://dl.acm.org/doi/abs/10.1145/168619.168635

I don't understand why this idea keeps failing to take hold even though it's constantly reintroduced in various forms. Surely now, 30 years after that paper was published, we can bear the "slightly increased execution time for distrusted modules" in return for (as the paper suggests) faster communication between isolated modules?


So basically like that one Rust hardware project that gets posted periodically?

You could probably do it pretty decently in C via `pkey_mprotect` (probably with `dlmopen`).

https://www.man7.org/linux/man-pages/man7/pkeys.7.html


It'd be nice if dlmopen weren't broken and made a linker namespace so separate that it couldn't even share pthreads with another.


That's because IPC is not low-latency.

No modern processor architecture has a proper message-passing mechanism. All of them expect you to use interrupts, with their inherent problems of losing cache, disrupting pipelines, and, well, interrupting your process flow.

All the modern architectures are also so close to having a proper message-passing mechanism that it's unsettling. You actually need this to have uniform memory in a multi-core CPU: they have all the mechanisms for zero-copy sharing of memory, enforcing coherence, atomicity, etc. AFAIK, they just lack a userspace mechanism to signal other processes.


Futexes allow low-latency signalling to other processes, and FUTEX_SWAP [1, 2] promises to decrease the cost even further.

[1]: https://lore.kernel.org/lkml/20200722234538.166697-1-posk@po...

[2]: https://www.phoronix.com/news/Google-User-Thread-Futex-Swap


This doesn't seem to touch on any of the points in my comment.

It does take a lot of unnecessary stuff out of the way. But it isn't enough to change the picture at all.


> This doesn't seem to touch on any of the points on my comment.

What points did your comment make? You didn't define "signalling" specifically enough to discuss. Can you elaborate on precisely what kind of "signaling" primitive processors or operating systems should provide?


If the goal is to put a security boundary between libraries within the process, there might be better ways to do it than process boundaries. One approach is to sandbox library code in wasm. Firefox apparently does this: compiling some libraries to wasm, then compiling the wasm back to C and linking it. They get all the benefits of wasm but without any need to JIT compile the code.

Another approach would be to leverage a language like Rust. I’d love it if Rust provided a way to deny sensitive access to part of my dependency tree. I want to pull in a library but deny it the ability to run unsafe code or make any syscall (or maybe, make syscalls but I’ll whitelist what it’s allowed to call). Restrictions should be transitive to all of that library’s dependencies (optionally with further restrictions).

Both of these approaches would stop the library from doing untoward things. Way more so than you’d get running the library in a separate process.


Think you would have loved Singularity OS[1].

It featured software-isolated processes that communicated via contract-based message passing, which allowed for zero-copy exchange of data.

As a research OS it never became a fully-fledged OS[2], but an interesting attempt IMHO.

[1]: https://www.microsoft.com/en-us/research/wp-content/uploads/...

[2]: https://en.wikipedia.org/wiki/Singularity_%28operating_syste...


A big issue with IPC is thread scheduling. Thread B needs to get scheduled to see the request from thread A, and thread A needs to get scheduled to see the response from thread B. I think there are WIP solutions to deal with this [1], but I'm not up to date.

[1] https://www.phoronix.com/news/Google-User-Thread-Futex-Swap


This was roughly the dream of DBus. However, outside of desktop-shaped niches it proved to be extremely difficult to secure, standardize, and debug.

Process-level/address-space-level dependency sharing remains both easier to think about and simpler to implement (and capabilities are taking bites out of the security risks entailed by this model as time goes on).


Because of hardware resources: 20 years ago, doing something like VSCode with tons of external processes per plugin would drag your computer to a crawl.

Emacs wasn't Eight Megabytes and Constantly Swapping only due to Elisp.


D-Bus is a thing. Bus1 was its stillborn spiritual kernel-space successor.


This is very interesting! Are there any efforts to move towards this?

Wouldn't it open up a new attack vector where processes could read each other's data?


Essentially path dependency. It would be almost impossible to change how it works now.


I don't think this stuff is as hard as you make it out to be. Consider that companies like Apple regularly change how things are done, successfully. It just requires a good plan, time, and budget. Wayland is one example of what that looks like, and it's not a good story. PulseAudio followed by PipeWire is another example of migrations happening. I suspect this would probably be slightly worse than Wayland unless some kind of transparent shim could be written for each boundary so that it can be slotted in transparently.


Apple can regularly change how things are done because they have absolute control over their platform and use an "our way or the highway" approach to breaking changes, where developers have to go along or lose access to a lucrative market. This approach really really would not work on Linux: consider that the rollout of systemd was one-tenth as dictatorial as is SOP for changes from Apple, and it caused legions of Linux users to scream for Poettering's head on a stick.


That's not a path dependency though. That's just a critique of the bazaar development model. And honestly I think if the big distros got together and agreed this would be a significant security improvement, they could drag the community kicking & screaming just like they did with systemd (people hated systemd so much at first that they tried to get other init systems to not suck but over time persistent effort wins out).


You would replace function calls with syscalls? Yeah, well, if you omit performance and complexity, why not... io_uring is nice. Yet my application has to call that lzma function and get the result now: you're adding cross-process synchronization (via the kernel, hence syscalls) as well as inserting the scheduler into the mix.


I am a perpetual newbie when it comes to things like this. What's the advantage of dlopen()-ing instead of dynamic linking?


Faster program start, and potentially making it a soft dependency instead of a hard one (or letting you fall back on something else if it's not there).


You can load libs at runtime with dlopen; for example, if you need a feature you can load the corresponding lib. With dynamic linking you'd load everything when the process is launched, slowing the launch.


A more obscure use would be loading multiple instances of a singleton library. This is especially helpful in something like a unit test suite, where you want each test case to start in a cleanly initialized state. If the code under test has a bunch of globally initialized variables, reloading the library at runtime is one of only a few possible ways of doing it.


Just wait until Lennart pushes his idea of doing linking entirely via dlopen() in systemd (see the story from a few days ago). The last bits of sane and efficient means to track dependencies will be gone forever after that. Good luck creating any lean Docker/k8s images without pulling in a systemd-based stack after that.


> see the story from a few days ago

I missed it, can you add a link or at least the post title I can search for?


I'm impatiently waiting for the systemd shell and editor /jk (and to be clear I hope this is a joke but I worry sometimes)


> Just wait until Lennart pushes his idea of doing linking entirely via dlopen() in systemd (see the story from a few days ago)

Could you paste a link to the story? I haven't been able to find it through search engines, and I'd love to read the rationale of such idea...



Wait until /lib is split into /lib.d with /lib.d/default and /lib.d/available and....


When I google for "libvulkan_virtio" I get zero results.

What does it do?


virtio-gpu with vulkan command passthrough to host, https://docs.mesa3d.org/drivers/venus.html


Thank you. So is this a Vulkan emulator that sends the commands not to a software renderer but rather to the host's GPU? What reserves the resources on the GPU, also this driver? Can one reserve resources explicitly through the API, or does this happen dynamically, as needed? Because if explicitly, then I'd wonder whether this is also part of the library, of the Vulkan spec, or some Mesa offering.


It basically allows you to use the Vulkan API on the virtualized guest, by writing Vulkan API commands into a ring buffer in memory that is visible to both guest and host. These memory regions are only alive and accessible as long as the allocation lives, which is controlled by virtio control commands (specifically create, map, unmap, and destroy BLOB, where a blob is the shared memory allocation).

This allows textures, shaders, and generally large amounts of data to skip being copied to and from the virtqueues, which is the usual method of virtio communication.

So to answer your question: if you use the Vulkan API on a guest to, for example, query the available Vulkan devices, and the correct Mesa library is installed and virtio-gpu Venus is available, you will be able to use resources on the host with the Vulkan API.

