
I've become quite partial to Go's implementation. It uses a context.Context that may or may not carry one of a few ways of communicating that whatever holds the context should stop processing (e.g. a timeout, a deadline, or an explicit cancellation).

That context then has a .Done() method that returns a channel; when that channel is closed, whatever functions are using that context are expected to stop themselves at the soonest point that makes sense to them.
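For reference, a minimal sketch of creating such a context (context.WithTimeout is the standard library helper; the five-second timeout is just an example):

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel() // releases the context's resources even if work finishes early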

Typically this is done inside a for loop in long-running processes. E.g. for something that copies data, it looks like:

    for {
        select {
        case <-ctx.Done():
            // stop and surface the timeout/cancellation to the caller
            return ctx.Err()
        default:
            // copy some number of bytes, or check if a network call is done
        }
    }
It does require all of the involved functions to implement support for this, though I think most things do at this point. I wouldn't call a library high quality unless it supports context.Context for long-running operations.

It gives library authors the ability to determine at what points their code can be interrupted, run cleanup code as part of the timeout, etc.
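A hedged sketch of what that looks like from the library side (the function and its cleanup behavior here are hypothetical, not from any particular library):

    // writeReport is a hypothetical library function. The author picks the
    // interruption points and runs cleanup when the context fires.
    func writeReport(ctx context.Context, path string, rows []string) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close() // runs on every exit path, including cancellation
        for _, row := range rows {
            select {
            case <-ctx.Done():
                os.Remove(path) // timed out: don't leave a half-written file
                return ctx.Err()
            default:
            }
            if _, err := f.WriteString(row + "\n"); err != nil {
                return err
            }
        }
        return nil
    }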

> The best way I've found of doing this so far involves processes: run the time-bound operation in an entirely separate process and terminate it early if necessary.

This doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files? I'm sure some of that can be handled very carefully by managing it in the main process, but that seems error prone.




> It does require all of the involved functions to implement support for this, though I think most things do at this point.

Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interfaces don't have it, and those are the most important places to have it.

Context also has the problem of waiting for cancellation to complete.

Once you call "cancel()", it asynchronously tells a lot of goroutines to tear down, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data races.

> [Terminating processes] doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files?

Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

This is necessary because a process may exit basically at any time for any number of reasons, such as the kernel OOM-killing it.


> Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

> Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interfaces don't have it, and those are the most important places to have it.

In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).
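A sketch of that shape (the 1KB buffer size and the name copyCtx are my own choices, not a stdlib API):

    // copyCtx copies src to dst in 1KB chunks, checking the context between
    // chunks. Note a single Read or Write can still block arbitrarily long.
    func copyCtx(ctx context.Context, dst io.Writer, src io.Reader) (int64, error) {
        buf := make([]byte, 1024)
        var written int64
        for {
            if err := ctx.Err(); err != nil {
                return written, err
            }
            n, err := src.Read(buf)
            if n > 0 {
                w, werr := dst.Write(buf[:n])
                written += int64(w)
                if werr != nil {
                    return written, werr
                }
            }
            if err == io.EOF {
                return written, nil
            }
            if err != nil {
                return written, err
            }
        }
    }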

There is an edge case where it hangs (e.g. on writing to a crappy NFS share) but to the best of my knowledge, that stems from the kernel not being able to interrupt already-queued IO and some knock-on effects related to PIDs owning FDs. E.g. `ls` can't be interrupted when trying to list an NFS dir that's unstable.

Would love to be told I'm wrong there if I am.

> Once you call "cancel()", it asynchronously tells a lot of goroutines to tear down, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data races.

I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean (depending on whether I need a single notification that it's dead, or a persistent way to check whether it's dead). It's not as simple as I'd like, but it's also not terribly hard.
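Roughly like this (closing the channel, rather than writing to it, is the variant that lets any number of waiters observe it):

    done := make(chan struct{})
    go func() {
        defer close(done) // signals "fully stopped" to every waiter
        // ... long-running work that watches ctx ...
    }()

    cancel()
    <-done // only now is it safe to retry or touch shared state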

> Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

I was under the impression that it takes time for Postgres to notice the connection is dead; am I incorrect there? I thought that if a process terminates unexpectedly, Postgres would wait for its own timeout before terminating the client and freeing any resources it used. I know it won't leak memory forever, but having a table locked for 30 extra seconds could be a big problem in some situations (e.g. a monolithic DB that practically the whole company uses).


> In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

So don't use 'io.ReadAll' or 'io.Copy', since they don't take a context and thus don't internally do what you're suggesting. I guess the stdlib authors don't know how to use context either.

Anyway, `reader.Read()`, even with just 1KB, can still take arbitrarily long. There are plenty of cases where you wait minutes or hours for data on a socket, and waiting that long to respect a context cancellation is of course unacceptable.

> Postgres .. connection timeout

Killing a process closes all its file descriptors, including sockets, and closing the TCP socket should cause the kernel to send a FIN to the server. Postgres should react to the client end of the socket closing pretty quickly.

This does rely on you using the Linux kernel TCP stack, not a userspace TCP stack (in which case all bets are off), but in practice that's pretty much always the case.


> In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

That can still block pretty much indefinitely. Imagine you're a client reading from a server, but the server isn't in any hurry to send anything, and keepalives are keeping the TCP connection open, and no network blips occur for months, so your goroutine is blocked on that read for months.

The much simpler and more robust thing is to propagate context cancellation to socket close. The close will abort any blocked reads or writes.

e.g.

    go func() {
      <-ctx.Done()
      _ = conn.Close()
    }()
You'll still observe and return an error in the read/write call, and close is idempotent, so this doesn't take anything away from your existing logic and really just acts as a way to propagate cancellation.

I don't know how well this works for other types of closeable reader/writer implementations. It may not even be thread-safe for some of them. But this worked great when I tried it for sockets.

> I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean

I try to just use `errgroup` whenever possible, even if no error can be returned. It's just the most fool-proof way I've found to make sure you return only once all nested goroutines have returned, and if you're consistent about it then this applies recursively too. It's a way to fake "structured concurrency" with quite readable code and very few downsides.
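A sketch of that pattern (errgroup is golang.org/x/sync/errgroup; the worker body is a placeholder):

    g, ctx := errgroup.WithContext(context.Background())
    for i := 0; i < 4; i++ {
        g.Go(func() error {
            // ... real work goes here; return promptly once ctx is cancelled ...
            <-ctx.Done()
            return ctx.Err()
        })
    }
    if err := g.Wait(); err != nil {
        // Wait returns only after every g.Go goroutine has returned,
        // so there is no window where a stray worker is still running.
    }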


Sockets and pipes generally have SetReadDeadline() and SetWriteDeadline(). With io.Reader and io.Writer in general you have to resort to a separate goroutine and a channel; otherwise they would have to conform to more restrictive interfaces, say ReadDeadliner/WriteDeadliner, which is not always possible.
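One way to bridge the two worlds, assuming you hold something with deadline setters (a net.Conn here):

    // Propagate a context's deadline onto a net.Conn. Reads and writes past
    // the deadline fail with a timeout error instead of blocking forever.
    if deadline, ok := ctx.Deadline(); ok {
        _ = conn.SetReadDeadline(deadline)
        _ = conn.SetWriteDeadline(deadline)
    }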


At least two correctness risks remain with Go's approach:

Goroutines observe this cancellation asynchronously. You can cancel an operation from your point of view and begin another one (a retry of the first, or a different operation altogether), but the original one is still running, creating side effects that get you into unintended states. If one can still be running, potentially any number can be. You have to make sure to actually join on all past operations completing before beginning any new ones, and not all libraries give you a way to synchronously join on asynchronous operations. If you write your own, it's very possible; it just takes a lot of care.

When you select { } over multiple non-default arms like this, and more than one of them is "ready", which one gets selected is random. This avoids starvation and is the right way to implement select { }, but most code that checks for cancellation incorrectly pretends this is not the case and that it will observe cancellation at the earliest possible time. In reality, each pass has only some probability of picking the cancellation arm, so there is a geometrically decaying chance of observing cancellation one, two, or more iterations late, compounding with the above issue. If the work done between selects is long (e.g. CPU or IO work), this compounds even further. The correct solution is to select for cancellation again in a separate select with just one non-default arm, but that is not "idiomatic", so nobody does it.
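For the record, that non-idiomatic but deterministic version looks something like this (work and handle are placeholders for the real job):

    for {
        // Check cancellation alone first, so a ready work channel can never
        // win the random tie-break against an already-cancelled context.
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case v := <-work: // placeholder channel
            handle(v) // placeholder work
        }
    }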

All of this is manageable with due care. Some libraries make it impossible because they kindly encapsulate not just what you don't need to know, but also what you actually do need to know if you want correct, deterministic behavior. In my experience, very few developers, even of popular libraries, actually understand the semantics Go promises and how to build correct mechanisms out of them.


The context done channels are clearly the way when dealing with all native Go code.

Although, to the grandparent's point, when you're dealing with executables or libraries outside of your control, the only true way I know of to get a "handle" on them is to create a process, with its pid becoming your handle.

In situations like image processing, conversion, video transcoding, document conversion, etc., you're often dealing with non-Go-native libraries (although this problem transcends language), and there's no way to time-bound those calls from the inside. That is to say, you often need to consider the Halting Problem and put time bounds and preemption around execution. What I've had good success with is adding a process manager around those external processes and, when a timeout or deadline is missed, killing the pid. You can also give users controls to kill processes.
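For the simple kill-on-timeout case, the stdlib's os/exec already wires a context to the child (by default it SIGKILLs the child on cancellation); the ffmpeg command line below is just an illustrative example:

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    // exec.CommandContext kills the child process when ctx expires.
    cmd := exec.CommandContext(ctx, "ffmpeg", "-i", "in.mp4", "out.webm")
    if err := cmd.Run(); err != nil {
        // on timeout this is a "signal: killed" style error
    }

A fuller process manager adds its own bookkeeping and user-facing kill controls on top of this.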

Obviously there are considerations with resource cleanup and all sorts of negative consequences to this, depending on your use case, but it does provide options for time bounding and preempting things that are otherwise non-preemptable.


Ahh, I hadn't considered operating across languages. That does make it awkward if you can't inject some Go (or other) controls in the middle by having Go manage the loop and only calling incremental processing in the other library.

That is awkward. My first thought is "just don't use the library" but that's obviously a non-starter for a lot of things, and my second thought was "transpile it" which sounds worse.

I suppose the signals do allow the binary/library to do its own cleanup if it's well-behaved, so it's really a binary/library quality issue at the end of the day, just as it is for something Go/Python/whatever native. There isn't a massive semantic difference between ctx.Done() and a SIGHUP handler; a SIGHUP handler can also defer killing the process until a sane point after cleanup.
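Go even makes that equivalence concrete; since 1.16 you can turn a signal directly into a context:

    // signal.NotifyContext cancels ctx when SIGHUP arrives, so the same
    // ctx.Done() plumbing handles both timeouts and signals.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGHUP)
    defer stop()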


Exactly!


All processes can crash at any time due to out-of-memory, bugs, hardware failures, etc., so this should not introduce additional inter-process failure modes. It may reveal existing failure modes, of course!



