What I found most interesting from working on some FUSE filesystems (and from the git pseudo-filesystem) is that removing a file via unlink is not a file operation at all, but an operation on the parent directory. The only way that the filesystem knows how to find a file (an inode) is by finding it in a directory listing (which is itself a filesystem object).
The name itself is the giveaway -- "unlink" because you're removing a hard link to a file.
Similarly, access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.
This is a place where POSIX and Win32 diverge significantly -- in Win32, permissions and access happen at the file level, which is why Windows is testy about letting you delete a file that is in use, while POSIX doesn't care -- the process accessing the file, once the file is open, maintains a link to the inode, and all the file data is intact, just not findable in the directory where it was initially located.
A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.
> access permissions are also properties of the embedding in the directory, rather than the bits of the file itself.
No, in POSIX they are properties of the file. You can use fchmod() and fchown() to change mode and ownership via an fd.
> A neat trick here is that you can effectively still access the file (even restore it) if a process has the file open, through the /proc filesystem.
Yes, you can access them; but I don't believe you can link them. (But I'd love to be proven wrong. A few years ago I actually needed to un-unlink a non-regular file that was still open.)
> Yes, you can access them; but I don't believe you can link them.
I think you can if you use debugfs. I wrote a post here about recovering a running binary after deleting the file on disk: http://lukechampine.com/recoverbin.html
> Yes, you can access them; but I don't believe you can link them.
Yes you can. I remember YouTube's flash viewer back in the day would put the downloaded flv video in /tmp and then delete it. I used to check the flash pid, go to /proc/{pid}/fd and see the symlink to the deleted file. Then a cp would give me the actual file.
I don't think this is the same as linking the file. You are not creating a new link to an existing file, you are creating a copy of it and creating a link to that.
If you modified the old file after the cp, you wouldn't see the changes in the new one.
That's interesting about the access permissions and ownership; I thought that access permissions in POSIX were path-dependent. Some quick experimentation indicates that ownership and access do in fact apply across hard links.
There's still some truth to the path-dependent notion, in that you may not be able to access a file through a hard link in a directory that you do not have access to, even if you have access to that same hard link through another path. But if you don't have access to the file itself then you're out of luck.
This does make sense from a security perspective, but I thought that the path-dependent checks in the kernel were strong enough to not require inode-associated ACLs.
You're right about restoring the file by re-linking to the hard link, but you can access the contents and cp it out of proc at least.
I don't think the cross-filesystem hardlink is the problem, since the link in proc is a symbolic link.
I can do this experimentally by creating a symbolic link in /tmp (a different filesystem) to a file in /home, and then creating a hard link (with ln -L) from the symbolic link to another file in /home, and the result is a valid hardlink to the same inode as the original file.
This doesn't work through /proc for an unlinked file, but only because the underlying link call requires a path, not an inode. You can create a hard link out of proc if the file has not been deleted, though, without any cross-filesystem problems.
You have to somehow increment the inode's reference count and write a reference to it into some directory.
symlink() does not increase the reference count of anything, and in fact its target does not have to be a meaningful filename at all (although in the practical non-POSIX sense there does not exist any string that is not a valid filename). One interesting ab-use of this is that you can use symlink()/readlink() as an ad-hoc key-value store with atomicity guarantees (that hold true even on NFS). For example emacs uses exactly this for its file locking mechanism.
IIRC the files in /proc/pid/fd are not true symlinks but something that behaves as both a file (you can do the same IO operations as on the original FD) and a symlink (i.e. you can readlink() them and get some string) at once.
The man 2 open section about O_TMPFILE seems to strongly imply that you can linkat() from /proc/<pid>/fd to a concrete file. Not sure if there are special cases for /proc/self/fd vs /proc/<pid>/fd, but that would seem a bit odd.
Not sure, but you don't have permission to modify that inode, hence no permission to link. The model is pretty straightforward.
also, being able to 'create' files 'owned' by another user in other locations (by linking them into place) could create quite a few bizarre and undefined corner cases, some of which might have implications for system stability and/or security.
But... traditionally, Unix systems do allow creating hardlinks of other users' files. And yes, this misfeature is a source of great number of security holes.
An option to disable this behavior (/proc/sys/fs/protected_hardlinks) was added only in Linux 3.6, and even then it's still disabled by default.
I suppose you can read the file if you can intercept an open fd, read it, and write it somewhere. It could possibly be the now vacant previous location.
This trick used to be the way I would download (or play in VLC) videos sent through Flash embeds in webpages. Way back in the day there was an actual temp file. But many companies didn't like that, so they started deleting the file immediately to preserve the consensual hallucination that is 'streaming' and keep the lawyers happy. The way around this was to use stat, proc and awk in a bashrc function.
With newer versions of Flash this too went away (around ~2014). But I still keep a browser profile around that uses an old version, just so I can access the downloaded file to play in VLC (much smoother).
On FAT the "permissions" are in fact part of the directory entry (IIRC some OSes in "multitasking DOS" family even have FAT extended by having what essentially is the unix mode, uid, gid tuple in the directory entry).
On NTFS permissions live in MFT, which essentially is same thing as inode.
Another neat side effect of this is that you can remove a file that you don't own, so long as it's in a directory that you do. rm prompts to confirm by default, but it's intuitively surprising that you can delete a root-owned file that you have no read or write permissions on.
This is also a cause of frequent confusion among less knowledgeable users when they try to clear up space on a filesystem that is reporting full, but there is a process keeping the file descriptor open. As far as they can tell the file is gone, but the usage hasn't gone down. There is a guy at my work that I point out lsof +L1 to about every three months or so. He can't seem to wrap his head around the concept.
Using hard links you can have a file that exists in multiple places with none of these being ‘the’ place. If you remove one of these links, nothing happens in the other places. All you see is the reference count going down by one.
Only when the reference count is zero will the file data be removed.
> Only when the reference count is zero will the file data be removed.
Not even necessarily then, right? (That is, there's no guarantee that a file's data is zeroed out just because its reference count drops to 0.) It's more just that it's only when the reference count is 0 that the actual space occupied on disk can be overwritten.
Flash (ab)used this to hide the cache files for streamed videos: create a temp file, open it, then delete it. You can however still grab it out of /proc...
O_TMPFILE sort of does that trick automatically now; it creates a file without a directory entry, so you don't need to jump through those (race-prone) hoops of unlinking the file.
What a cool way to learn. I've been doing this 20 years and it never occurred to me to just unravel some of the basic tools I use every day and see if it matches my expectations.
I particularly like using strace/dtrace/dtruss to show the communications between the user process and the kernel. You can learn some very interesting system calls like that, and see how experienced programmers use them.
There's another tool of that variety: ltrace shows calls to functions in dynamically-loaded libraries. It... works, mostly, but it doesn't know things like function type signatures, so it has to guess, and sometimes gets it wrong in weird ways. It also doesn't know defines and enums, so it can't turn numbers back into symbolic constants like strace and its close kin can.
Fun fact: according to unsubstantiated UNIX lore, "rm" is NOT short-hand for "remove" but rather, it stands for the initials of the developer that wrote the original implementation, Robert Morris.
Considering the naming scheme of other basic unix utilities, I'll chalk this one up to "fun coincidence" rather than actual truth.
Alternatively, we could make "Robert"/"Bob" a euphemism for nuking files ;) "Yeah I Bob'd the whole build directory" "Bombed?", "No, Bob'd. Like deleted... never mind"
("PK" is not only the beginning of ZIP files but also of, among other things, ODT, DOCX, and JAR files, which are all in turn implemented as ZIP files.)
int unlinkat(int dirfd, const char *pathname, int flags);
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink(2) and rmdir(2) for a relative pathname).
I would expect the reason for this is in case someone is moving directories around while the find is happening. Each time find enters a directory it doesn't actually chdir(); instead it opens that directory and uses the fd to anchor the removal.
> The first couple of lines in the trace seem to be pretty clearly related to setting up the sudo part of the command.
This is not correct. "sudo dtruss" makes sudo run dtruss, so it is not possible for dtruss to track what sudo is doing. You can see what "sudo" does (which is much more complicated) by running "sudo dtruss sudo". You need sudo twice because sudo is a setuid program, and obviously it would be insecure to allow anybody to trace a setuid program (for example, it has to read the shadow file, so if you could trace it then you could dump all the hashes and crack them on your own time).
On a separate note, it's probably a bad idea to go reading Linux man pages and expect those to give you accurate information about Mac system calls.
> The getpid command has a pretty self-explanatory name, but I wondered what the parameters that were passed to the function were.
Does anyone know what's going on here? It looks like all syscalls are shown to have 3 arguments, even when they wouldn't need that many... Except close() which is shown to have only one.
but BSD derivatives essentially convert userland libc 'syscalls' into a call to a single lower-level 'syscall' function, which passes data to/from the kernel using a macro / integer list to determine which actual functionality is desired..
You should really spell out the full term if the acronym is predominantly used somewhere else.
Unless I'm wrong and you're actually talking about the session initiation protocol (voice over ip)... But that makes little sense
I wonder why did Apple block tracing SIP protected programs? Like, wasn't SIP supposed to protect against their modification on disk, like NetBSD's veriexec? How would DTracing rm be dangerous?
Because then people could trace rm in a way that allowed them to run arbitrary code in that process. It's the same issue you'd have if you attached a debugger to the process or loaded a dynamic library in their address space.
That is not true, dtrace cannot modify data and is specifically designed not to do so.
It can however be used to leak information or read info out of other processes.
I was going to say that for this reason even on linux you can't ptrace processes that are not children of the current process (e.g. you can run something under strace, but not attach to an existing process unless you twiddle a flag or do it as root). Having said that, you CAN modify data with ptrace, unlike dtrace. So that's kind of an aside. In any case the idea is that one process can't hijack another, even from the same user, via ptrace.
Cool "anthropology" method. This triggered my interest to go look at the C source code for linux[1]. Turns out the command actually relies on a few different files, including a "remove.h"[2]. I was surprised I didn't immediately see the call to unlink in the source code, but obviously there's a lot of useful infrastructure to build on. Further investigation revealed that the unlinkat[3] syscall was actually being used. Minorly fascinating :)
I was somewhat surprised by the small number of process startup/libc initialization syscalls in the trace in original article.
On Linux (Debian unstable in my case) you will get an order of magnitude more, because rm is dynamically linked (although it seems only against libc and nothing else) and because of libc startup (did you know that linux has an amd64-specific syscall arch_prctl(ARCH_SET_FS), which does exactly what it sounds like?). And then coreutils rm cares about such things as whether stdin is a terminal (probably because -f/-i behavior changes depending on that) and does the actual unlink in a somewhat convoluted way that involves fstatat() (called twice, for some reason) and only then unlinkat(). Somewhat notably, the last thing rm does is try to lseek() stdin, only to get ESPIPE...
It's very educational to dig into stuff that everyone takes for granted or even try implement your own version that does similar things (at a smaller scale and with less features, performance and security of course, just for educational purposes, not to serve as a real replacement).
E.g. writing your own shell sounds hard but you can do it in an hour (of course it'd be very bare compared to even ash). I once did it during a very casual C class while trying to impress the teacher. He even joked it was self-hosting when he saw the end result (as in - I didn't need other shells anymore and could use vim and gcc from my shell, so I could use my shell to work on my shell). I wanted to put that into a tutorial at some point but there already exists one[0] (that blog seems quite tinker-y actually, it has implementing your own Linux syscall too, which is also surprisingly easy to do and I did it for a class at one point too[1]).
Going knee deep into this stuff also completely dispels the magic that language and system runtimes, filesystems, file formats, linkers, shells, standard commands (e.g. ls, I had to reimplement it once as a homework) or whatever have. Or even better - it's still magic to most people and you're the wizard now! It's also very satisfying to do something so unique in an hour or two (although to me, due to my C and C++ bias, at some point making a pastebin clone in an hour in PHP or Python became a unique experience).
Even JIT (which sounds scary due to V8, LuaJIT, etc. being so tightly made and complex) is easy to get into and understand at toy scale[2].
There is an old story floating on the net where a sysadmin did the fateful recursive "rm" on the root directory and managed to hit CTRL-C before it got too far.
He then had to jump through a bunch of hoops and use a bunch of strange commands to restore / because of all the missing binaries. I wish I could find it.
I had to do a sort of version of this on a machine of mine once. Its hard disk was failing, and it couldn't I/O half of the system-critical binaries, but it did still have some in disk cache. It doubled as my network's SSH gateway/proxy for when I'm coming from an IPv4-only network, as my home only gets a single IPv4 address.
> I had a handy program around (doesn't everybody?) for converting ASCII hex to binary, and the output of /usr/bin/sum tallied with our original binary.
I copied busybox's nc onto it (needed for the ssh jump); had to copy to /run as / was unwritable due to the disk failure. Now, scp no longer ran, so "copy" was a Python script to turn the binary into a printf command, which is a shell built-in and can write arbitrary binary.
(If you ever get into a jam like this, busybox's utilities are very useful.)
> But hang on---how do you set execute permission without /bin/chmod? A few seconds thought (which as usual, lasted a couple of minutes) suggested that we write the binary on top of an already existing binary, owned by me...problem solved.
I think umask'ing correctly prior to a printf should work nowadays, no? (IDK about in the author's time.) Thankfully in my case, chmod was in disk cache still.
I have a similar funny story of recovery after a root dir incident. Someone told me their "cygwin just broke". It was indeed broken and the bash was unusable.
Long story short: they accidentally ran find with an exec of chmod that takes away the x bit on /.
It sounds ouch-y but it's not that bad because it turned out that find went alphabetically (or so) so first it went into /bin and then quickly made chmod unexecutable, so only the stuff that came before chmod in /bin was affected and all it took was to make it executable again normally via Windows.
It's actually even good that bash happens to come before chmod in the sorting, which made cygwin completely "broken" at a glance; otherwise it'd Murphy's-law its way into "I don't have the executable bit on this exotic rarely used command or other" 5 months down the line, with the relevant lines of ~/.bash_history and everyone's short-term memory long gone.
One thing I've been wondering is whether there would be value in an (obviously non-POSIXy) userland that would mirror the underlying syscalls more closely. I'm not sure exactly what it would look like, and obviously there would still be a need for higher-level utilities too.
There's some precedent for that: POSIX has "link" and "unlink" utilities, which just call the corresponding functions. Most people use higher-level "ln" and "rm".
> Some investigation revealed that csops is a system call that is unique to the Apple operating system and can be used to check the signature that is written into a memory page by the operating system.
Close. That's one thing that csops can do, but in this case it's being used to extract the entitlements from the binary (CS_OPS_ENTITLEMENTS_BLOB == 7).
> I wasn’t sure what the memory addresses that were referenced in the mprotect call actually corresponded to or what the best way to figure it would be.
You could fire it up in LLDB and break on mprotect…