> For example, contrary to NT, syscalls can't call back to userspace.
There are some mechanisms that can call back into userspace during syscalls, such as seccomp filters, FUSE, ptrace, userfaultfd, fanotify, the syscall_user_dispatch feature used by Wine...
There's the core_pattern handler too.
Someone summarized it as: a mov instruction could be serviced by starting a Python process.
Genuine question: how do you know such things? I'd love to learn them, so I'm wondering how others learn them (other than remembering random comments from here).
`ptrace(2)` and `userfaultfd(2)` show up when doing program tracing/analysis, among other places. I don't know of a great resource for the latter, but Eli Bendersky has a terrific series on debuggers that covers `ptrace(2)`[1].
Those are man sections, a legacy from ye olden days when man pages were printed on paper and you'd use numbered sections for faster navigation (or to split them into separate books when there were too many pages).
For example, `man 2 select` (or `man select.2`) gives you information about the `select` syscall, and `man 3 select` has information on the libc wrapper around that syscall.
A somewhat better example might be `time`: `time.1` documents the command used to measure the running time of another command (as in `time make -j`), and `time.7` gives low-level information on how to interact with the kernel's time-related functionality. Since both are named `time`, if you just use `man time` without the number, you only get the first one.
Read `man man`; it has a lot more on this stuff.
The table below shows the section numbers of the manual followed by the types of pages they contain.
1 Executable programs or shell commands
2 System calls (functions provided by the kernel)
3 Library calls (functions within program libraries)
4 Special files (usually found in /dev)
5 File formats and conventions, e.g. /etc/passwd
6 Games
7 Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7), man-pages(7)
8 System administration commands (usually only for root)
9 Kernel routines [Non standard]
I don't think it's very portable, actually. It doesn't work on any of the BSDs or OpenSolaris forks IIRC. Yet another alternative spelling, `man 'select(2)'`, should work everywhere, but many shells require you to escape parentheses (or put them in quotes as above), so I don't find it very useful.
Here is an important distinction to make: contrary to NT, syscalls in a given thread do not call back to userspace in the same thread.
In NT, KeUserModeCallback implements a sort of coroutine pattern, where kernel code can call back into user code synchronously on the same thread and expect it to return to the call site. In Linux, there's no such thing as a user-mode callback: a given thread cannot have any user code running (or scheduled to be run) while simultaneously having the kernel call stack active.
These mechanisms either block the current thread while another thread (e.g. a debugger, fault handler, or filesystem server) completes the request, or fire off the request asynchronously.
> syscall_user_dispatch
This does not "call back" into userspace at all; it merely delivers SIGSYS. Signal delivery works more like NT's user APCs: there's no kernel-side context active in the thread while the signal handler (or the APC) is running.
It really isn't. "The" kernel doesn't block; a context blocks, and typically that context is associated with a user process (directly or indirectly).
With sleep(3), the kernel blocks for a time specified by userspace. futex(2) can cause the kernel to block until userspace wakes it. Similarly wait(2). Also read(2) from a pipe or localhost socket.
So whether or not it is a bad design depends highly on what gets blocked, in what situations, and what can be done to recover the situation if things misbehave.