Don't trust default timeouts (robertovitillo.com)
139 points by kiyanwang on Aug 29, 2020 | 115 comments



No! The danger in forcing programmers to pick a timeout is that they will pick the wrong value, most often one that is too short, because they have been testing their software on a super-fast internal network and haven't considered the poor users in the real world.

Case in point: Google's Waze. If I have a slow mobile connection (e.g. EDGE or even 3G), Waze will repeatedly fail to load a driving route. It will think for a few seconds at most, then time out and tell me there was a problem. If only it would wait a few more seconds to load, the app would be useful. Instead, due to their crappy choice of timeouts, the app becomes useless.


I strongly second this "No!" for both of the JS examples he makes.

There was no way to set a timeout in Fetch because the browser, acting as the user's agent, has a sane default (~75 seconds on average, but it varies by browser and platform).

Developers often pick TERRIBLE values for timeouts when left to their own devices.

Hell, the author of this exact blog post has picked 10 seconds in all of his examples. That's a FUCKING BAD timeout. It's far, FAR too short for many use cases.


> Hell, the author of this exact blog post has picked 10 seconds in all of his examples. That's a FUCKING BAD timeout. It's far, FAR too short for many use cases.

It isn't necessarily. It all depends on the use-case. If most of the operations are finishing in 5ms then the probability of something finishing after 10s is rather low - and timing out and retrying early is probably the way to go.

Someone else in this thread recommended setting the timeout to around the P99 time that operations take. I think that's a reasonable starting point, though I might move it towards P99.9.

I worked (and am still working) on adjusting timeouts for systems doing billions of requests/s. One takeaway from that is that the actual value of timeouts is often not too important if you look at one system in isolation. The latency distribution will be rather logarithmic. Most requests might e.g. finish in 20ms. Then you get a P99 at maybe 3 digit ms, and a P99.9 at 10s (example numbers). From there on it will make a minor difference in availability if you now set your actual timeout to 5s or to 120s - it might just be noise along your other error sources.
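
To make the percentile idea concrete: a P99 is just an empirical quantile of observed latencies (sort and index), not something derived from a standard deviation. A rough Go sketch with made-up sample numbers:

  package main

  import (
    "fmt"
    "sort"
    "time"
  )

  // percentile returns the latency below which roughly p (0..1) of the
  // observed samples fall. Good enough for picking a timeout; not a
  // full statistics library.
  func percentile(samples []time.Duration, p float64) time.Duration {
    s := append([]time.Duration(nil), samples...)
    sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
    return s[int(p*float64(len(s)-1))]
  }

  func main() {
    // Hypothetical latencies recorded from production traffic.
    latencies := []time.Duration{
      20 * time.Millisecond, 22 * time.Millisecond, 25 * time.Millisecond,
      30 * time.Millisecond, 180 * time.Millisecond, 9 * time.Second,
    }
    fmt.Println("p99 ~", percentile(latencies, 0.99))
  }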

However it makes sense to align the absolute timeout with timeouts of dependencies. E.g. if you have a chain of

client -> service A -> service B

and service A always times out first, then the client will get an error that service A is broken - but nobody can easily diagnose whether it was service A's or service B's fault. If service B times out first, the client can get an error message that indicates that. Therefore it makes sense if upstream service timeouts are shorter (even if only by a second).

In the same model, if the client times out first, the services only observe the client dropping the connection. They don't know whether the client timed out or cancelled the operation for other reasons, and therefore they might not record that something in the service is actually not ideal. For that reason I would recommend setting client timeouts higher than service timeouts (if you are aware of them).

However there is yet another exception to this, which is TCP connection timeouts. If you can configure them separately, it makes sense to set those rather low and perform multiple retries. That can improve overall latency, since dropped SYN packets will only be retried by the OS after 1s.
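
A minimal Go sketch of that last point (the address and numbers are placeholders): keep the connect timeout short and retry, rather than waiting on the kernel's SYN retransmission schedule.

  package example

  import (
    "net"
    "time"
  )

  // dialWithRetry uses a short TCP connect timeout and a few quick retries.
  // Read/write deadlines are a separate concern and are set on the
  // returned connection by the caller.
  func dialWithRetry(addr string) (net.Conn, error) {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
      d := net.Dialer{Timeout: 500 * time.Millisecond} // connect timeout only
      conn, err := d.Dial("tcp", addr)
      if err == nil {
        return conn, nil
      }
      lastErr = err
    }
    return nil, lastErr
  }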


TCP takes 3-4 seconds to detect a dropped packet and retransmit. That puts an absolute minimum baseline of 5 seconds on timeout just to be able to send or receive a packet.

If you're doing anything across datacenters, you have to take a minimum baseline of 10 - 15 seconds to account for extra latency on top.

If you do billions of requests/s I bet you don't care that requests fail? You probably can't even see that requests are failing because you'd have no logging, too expensive at this scale.

I do financial systems, most load typically doesn't go above 1k/s, but every request matters because a dropped request is a dropped payment, possibly tens of millions of dollars lost! There are a ton of issues caused by timeouts set too low by developers (anything below 30 seconds). I had to reconfigure a ton of systems and libraries to have higher timeouts and ignore configuration passed by developers.


> If you do billions of requests/s I bet you don't care that requests fail? You probably can't even see that requests are failing because you'd have no logging, too expensive at this scale.

That's an assumption. Our customers care a lot. And we have sufficient monitoring in place. All the recommendations I provided above were about maximizing availability, and are based on the experience of improving things for lots of users.

> I do financial systems, most load typically doesn't go above 1k/s, but every request matters because a dropped request is a dropped payment, possibly tens of millions of dollars lost!

Without knowing much more about your system: If you are losing that amount of money for failed requests (which can e.g. happen due to random network blips) you are doing something wrong. You should invest in different strategies than increasing timeouts.


It's inherent to payment systems to "lose" money on a failed request. A transaction involves two sides, either both sides agree that the transaction is completed or money is lost/duplicated.

If the client decides to drop (time out) and consider the transaction cancelled while the server is still processing it and will consider it done, that's a catastrophic issue that needs to be addressed. It is one of the most common bugs I've seen in the wild (root cause: timeouts that are too short).

How to make highly critical systems reliable enough in the face of hardware and software issues is a complex topic. At this level this involves a holistic approach to get every component to cooperate together (timeout is a minor example). A HUGE amount of work is to detect errors, and more importantly to propagate errors across diverse stacks (software should be aware of database errors, services should detect other services failing).


This doesn't make sense to me. If there is a real risk that a timeout can happen (and there always is) then the payment system should be implementing a two phase commit.


Doesn’t that just create a Byzantine generals problem?

My understanding is that real payment systems solve the problem by just taking a day to finalize transactions…


I don't know what the Byzantine generals problem is.

Two stage commit is important because it has:

1) A predefined transaction id prior to final submission that allows you to validate the status, so if your request to commit gets 503'ed or you get a timeout you can reliably query to know if it was processed or not.

2) Unlimited resubmissions of the final commit. It doesn't matter if I perform the final commit api request 1 time or 100 times, it will never cause a duplicated transaction to occur. So if I get a timeout or a 503 I can resubmit, knowing that if my original commit request went in my new submit will be a no-op, and if my last commit request didn't get processed then this time it hopefully will be processed.

This pattern isn't just a payments thing either. This is heavily used in distributed systems where failures can occur. UPS' API used to use this as well so you could be sure that you don't pay for duplicate shipping labels or cause duplicate shipments.
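
A rough sketch of that two-stage shape, with the types and method names invented for illustration (not any real payment or shipping API):

  package example

  // Stage 1 (CreatePending) is not idempotent, but harmless to repeat:
  // duplicates are just inactive records that never get committed.
  // Stage 2 (Commit) is idempotent: committing the same id 1 or 100 times
  // has exactly one effect, so it is safe to resubmit after a timeout or 503.
  type PaymentClient interface {
    CreatePending(amountCents int64) (txID string, err error)
    Commit(txID string) error
  }

  func pay(c PaymentClient, amountCents int64) error {
    txID, err := c.CreatePending(amountCents)
    if err != nil {
      return err // nothing committed yet; safe to give up or start over
    }
    var lastErr error
    for attempt := 0; attempt < 5; attempt++ { // retry the idempotent commit
      if lastErr = c.Commit(txID); lastErr == nil {
        return nil
      }
    }
    return lastErr
  }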


The Byzantine generals problem is the field of research dealing with consensus/consistency issues like what we discuss here. The baseline is that there are two generals on a battlefield trying to coordinate an attack; they send messengers to communicate, but any message might be lost or intercepted. The problem is proven to be unsolvable, so let's not go in head-first assuming you can be sure of any outcome ;)

https://en.wikipedia.org/wiki/Two_Generals%27_Problem

https://en.wikipedia.org/wiki/Byzantine_fault


Taking longer does not solve the Byzantine generals problem. The difference here is that the roles are asymmetrical: once the bank receives your order for a transaction it does not need to check that you know whether the order was correctly received; the bank can simply perform the transaction and then best-effort let you know what happened.


I wasn’t suggesting that banks solve an unsolvable problem. They just ignore it by doing something else.


Isn't it better to make it idempotent? The risk is that the client might accidentally make the same transaction twice if the first attempt looks like it failed.

Make the client include the id of its last known transaction and only apply the transaction if it's up to date, otherwise tell the client to refresh and try again.


The second stage is idempotent (which is why it works), but the purpose of the first stage is to make sure both sides have an agreed upon idea of the uniqueness of the transaction that's about to take place.

For instance, if I want to generate a shipping label that goes from my house to your house and I do two attempts, how does the receiving service know if I made two distinct attempts (I want to ship 2 similarly sized items) or if a transient error occurred in between, making me attempt a resubmission?

You solve this by creating an inactive request with the criteria (shipping label from my house to your house). This step is not idempotent but that's OK, because if I resubmit I just create a 2nd inactive request that may never actually be finished.

The second step is to say "this request is good and I want to proceed with it". That step is idempotent and moves the existing request from inactive to active.

A shopping cart flow is a user managed 2 stage commit (review your cart, submit the cart order). No matter how many times I submit my order it won't cause duplicate orders because I'm submitting a specific shopping cart.

UPS, Paypal, and others just use a computer/api-managed 2 stage commit

You can't always rely on a client generated ID, because you would have to know that the client id is unique enough. The server is the only one who can really generate a transaction id that it knows is globally unique and efficiently queryable in its backend.


It's not mutually exclusive. You can do two stage commits with the second stage being idempotent.

The practical risk is that this puts a ton of complexity on the client, to keep track of states and perform some follow up actions. The added complexity means more bugs and each additional step can fail hence compounding the problem rather than solving it.


This doesn’t address much of your comment, but I work on systems with millions of req/s, but even with billions you can still sample to do logging and monitoring. But you’re right, we don’t care that every request works, just that 99.9% do.


> recommended setting the timeout to around the P99 time that operations take. I think that's a reasonable starting point, even I might move it towards P99.9.

Why would you intentionally drop 1% or even 0.1% of requests?


Depending on the distribution of processing times [1], dropping slow requests and retrying can improve latency.

[1] https://en.wikipedia.org/wiki/Fat-tailed_distribution


What's the process of finding P99? Is it taking a bunch of samples, getting the standard deviation, and then calculating the value at which 99% of all possible samples would fall?

This also assumes O(1) I assume...

(I'm thinking of how to apply reasonable timeouts to background celery tasks)


I straight up couldn't use terraform on DSL while visiting family over Christmas because of this. It would be chugging along at what appeared to me a reasonable speed, but either one of Google's services or terraform itself would decide I was taking too long and stop. I was unable to work that week.


Maybe it is a good idea for next time to provision a vm/machine closer to where you are deploying? It would also prevent loss of progress when the connection got dropped or something.


Is the argument that a call that could get stuck for a few minutes is better than a “wrong” value of a few dozen seconds? Even as a user I feel it’s a waste of precious resources (including my time). It’s like waiting at the register until the shop closes down because the employee had to go somewhere, instead of giving up and trying the next open register.

I’d think infinity is not a valid state.

For Waze’s case, I suppose their priority is not on salvaging the 1% longest requests (though critical to you), but on preserving server resources for the 99% faster clients. That’s not a “wrong” value on their side, and it has probably been carefully tailored to get the right tradeoff.


A too short timeout is more problematic than no timeout because it breaks the application.

Let's say 10 seconds, a typical intuitive but bad timeout. This will cause requests to fail for no reason other than users being in Asia or Africa, with high latency. This will break the application when it's used or deployed across datacenters, because of high latency. This will cause requests to fail when the server is a bit busy (a couple seconds more to process requests). Worse, it will cause chain reactions under load, creating more retries and even more load, causing other services/servers to time out too.

Better go for a long timeout. A long timeout doesn't break the application.


> A long timeout doesn't break the application.

I'm pretty sure an infinite timeout also breaks the application, in a way where people rarely realize that it is because of the timeout. People would rather think "it just didn't work, don't know why" instead of being very clever and realizing "it must be low timeouts!!!"


Not all requests are the same and they need to be treated differently. Some requests are rather optional, and it's probably better to time out if they don't respond in a timely manner, so as not to use up more resources than needed. Other requests, like payments, you probably want to give the best chance to succeed. So, no timeout is likely a good idea. If the actual TCP connection times out or is closed by the server, we can hope it was good enough to realise something didn't go well and roll back. So, we are probably safer to assume it didn't go through in that case.

When it comes down to UI there are even more options. Since you have a human on the other side, you can transfer to them the responsibility of deciding when to time out. The UI certainly shouldn't become completely unresponsive while a request is being made.


Yes, go ahead and set a 10 minute timeout rather than infinite or 10 seconds. That will make it much easier to realize that things are frozen, because they will raise exceptions and logs all over the place.

To be pedantic though, infinite timeouts don't break applications except in some rare cases of resource exhaustion. If an application is completely unresponsive, it is dead for good, not because of the timeout; you need to fix the root cause (often resource exhaustion like swapping, or it's waiting on another IO or service that's frozen).


Failing because of a too short timeout feels silly, but a stupidly large timeout leads to frustration and hazardous user actions like killing the app with the task manager.

You don't need a timeout, you need a "cancel" button.


Funny you mention that, this reminds me of Windows task management. Windows automatically gives a popup to terminate an application when it detects an application is unresponsive.

This happens regularly when I open large files in some app: they take a fair bit of time to load, and Windows offers a popup to kill the app after a few seconds. I have to carefully wait and not click anything.


Or a retry button. The point is that the application does not know if your network has a ridiculous latency that messes up tight timeouts


> For Waze’s case, I suppose their priority is not on salvaging the 1% longest requests (though critical to you), but on preserving server resources for the 99% faster clients.

What resources? Buffering a response takes a minuscule amount, and if even a tiny fraction of people try again it will waste far more.

And even if it did take more in total, it would not be by much. This justification for saying it's not a wrong value is very weak.


I've seen many cases on mobile web pages where it tries to connect, hangs on a progress bar, my phone loses and regains signal during the process, it still doesn't finish loading, but a pull-to-refresh makes it load instantly.

That's an example where a timeout and retry would have fixed the problem. If it had been an API call behind an app, it would have hung indefinitely.

Some libraries sadly have their default timeouts set to infinite.


When I implement this, I typically use separate thresholds for the entire request and for time since last progress (or some rolling average transfer rate). Letting a slow transfer complete is useful both for what you mentioned and for reducing server congestion, but you do need to detect failures where the remote end goes silent without tearing down the connection (server failure, network roaming, etc.).
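
A sketch of that split in Go (not the poster's actual code): one overall deadline for the whole transfer plus an idle deadline that is pushed forward whenever bytes arrive, so a slow but progressing download survives while a silent peer does not.

  package example

  import (
    "io"
    "net"
    "time"
  )

  // readAll reads until EOF with two limits: total caps the whole transfer,
  // idle caps the gap between successive chunks of data.
  func readAll(conn net.Conn, total, idle time.Duration) ([]byte, error) {
    overall := time.Now().Add(total)
    buf := make([]byte, 32*1024)
    var out []byte
    for {
      deadline := time.Now().Add(idle)
      if deadline.After(overall) {
        deadline = overall // whichever limit comes first applies
      }
      if err := conn.SetReadDeadline(deadline); err != nil {
        return out, err
      }
      n, err := conn.Read(buf)
      out = append(out, buf[:n]...)
      if err == io.EOF {
        return out, nil
      }
      if err != nil {
        return out, err // includes i/o timeout when the peer goes silent
      }
    }
  }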


This is probably it. Production timeouts vs developer timeouts. What a shame about Waze. I wanted to try it out but then realized it's just owned by Google now, so it's kinda pointless for me to bother.


I think the opposite. Use infinite timeouts for outbound calls. If you're an interactive application, display progress/activity to the user. Allow the user to manually cancel. If you're a server application or system, maintain an operation wide timeout and, if you do time out, propagate cancellation.

Systems I work with that have a default timeout are a pita. You end up having to make pointless retries when you'd have been happy to wait.

There are exceptions. If cancelling and retrying has a decent chance of routing around the original problem, then a timeout makes total sense. The other case is if you have a workload where operations tie up a mixed set of resources (e.g. threads + blocked backend calls) and only some of your incoming ops are dependent on the blocked resource. In that case, timeouts make sense in that they at least allow you to make forward progress on the unblocked requests. Although tbh separate queues and thread pools are the safer way to handle this, because your caller with the timed out calls is gonna keep retrying and eventually these retries will crowd out the requests that can make progress in your incoming request mix.


I think the problem is partly that many programming languages make it difficult to propagate cancellation correctly and idiomatically. So people end up adding timeouts to individual network requests instead of to the operation as a whole.


Go, Rust, and Zig make it impossible to interrupt most blocking syscalls, because they automatically retry on EINTR. [1]

Go supports timeouts on network & file read/write ops, which can be used to interrupt them.

[1] https://github.com/golang/go/issues/41054


>Go, Rust, and Zig make it impossible to interrupt most blocking syscalls, because they automatically retry on EINTR.

This is not true of Rust. Some of the convenience wrappers (std::io::Read::read_exact, etc) on top of the basic primitives (std::io::Read::read, etc) do retry for you (and explicitly document it), but not "Rust" as a whole. The primitives map one-to-one to calls of read/write/sendto/recvfrom and bubble up ErrorKind::Interrupted to the caller just fine.


The Rust filesystem API works (or once did) as I described. This can render an application unusable when trying a network filesystem or storage device that's unavailable.

https://github.com/rust-lang/rust/issues/11214

To support user intervention when a task takes longer than expected, all blocking syscalls should be interruptible.


Linking to discussions from pre-1.0 does not help your assertion.

std::fs::File's impl of std::io::Read:

https://github.com/rust-lang/rust/blob/7fc048f0712ba515ca11f...

-> https://github.com/rust-lang/rust/blob/7fc048f0712ba515ca11f...

-> https://github.com/rust-lang/rust/blob/7fc048f0712ba515ca11f...

Again, as I said, it corresponds one-to-one with a call to the underlying read API. The retries for ErrorKind::Interrupted are done by higher abstractions like std::io::Read::read_exact, and they explicitly document that they do this.

https://github.com/rust-lang/rust/blob/7fc048f0712ba515ca11f...


Apologies if I got this wrong, but you've linked read(), and I'm referring to mkdir() et al.

Have they taken out the EINTR retries which were added for the issue I linked?



Go only recently started handling -EINTR correctly (retrying the operation) though I do agree that this should've only been done in the higher-level wrappers. The "os" package shouldn't be doing retries IMHO.

Pre-1.14 -EINTRs were quite rare in "normal" Go programs so the stdlib basically ignored them, but 1.14 introduced preemption which resulted in many more -EINTRs and quite a few Go programs were broken as a result. So in many ways this behaviour was necessary to un-break backwards compatibility. If Go had made the interruption semantics -- which had existed for at least a decade before Go came about -- clearer from the outset then maybe this whole business could've been avoided.

This is symptomatic of the reasons why container runtimes (at least, those written in Go) have historically been very wary of Go updates. Several years ago, each Go release would change some minor semantics of the Go runtime and cause breakages...


I prefer the opposite, because of the reality that people are lazy. Increasing the timeout often leads to a programming habit where you just turn a blind eye to potential bottlenecks and/or cross your fingers and hope that the network will always be available.

Enforce short timeouts, preferably less than 3 seconds, and definitely no longer than you expect the user to have patience for, if the operation is part of an interactive workflow. Any task that has a legitimate reason to take longer gets pushed to a background process or cron job. This makes timeouts on the frontend a regular occurrence, so you'll be forced to handle them just like any other error condition. Result: a more robust program.

But this probably depends on what kind of program you're building. Most of the stuff I build and support are consumer-facing, so anything that isn't instantaneous is cause for concern. Other types of applications, though, might have more patient users.


The problem is this will make your site unusable for people on slow connections. Some people have internet connections that take seconds to connect, and nothing the site does can make that faster.


Those users don’t get past our single sign on app in the first place. The “ticket granting ticket” used to give out sessions in individual apps expires after a few seconds, presumably to discourage various kinds of attacks.

Not every app is forced to support terrible connections. We definitely had issues with hung operations until we had a timeout, though there was some disagreement about failing “fast” (10 seconds is “fast”???) vs hanging inexplicably for much longer, but not forever, periods of time.

Edit: when REST calls occasionally failed after 10 seconds, rather than hanging the UI, the support calls stopped. 2 or 3 people a day had to hit submit a second time. They got over it, as opposed to reloading the app/page. This was the most cost effective way for us to handle this.


Most services have some form of resource contention at play. Even the simplest CRUD app likely has a DB connection pool to deal with. Imagine some external service isn't responding for some non-trivial amount of time and your requests are holding open DB connections as a result; you could very easily cause a total availability loss for that service.

Timing out would at least let you, for instance, flip a circuit breaker off or fail fast and have the resulting monitoring very specifically tied to the actual problem in the system, not to mention avoiding resource contention issues like I mentioned.


> Use infinite timeouts for outbound calls.

Don't use infinite timeouts if you don't have another way to cancel the operation.


Counterpoint: always have another way to cancel the operation.


Not retrying is implicitly a way to cancel. Also, it is a codepath that's trivially tested vs the effort to test an extra cancel path.


I'm not sure how testing timeouts is trivial compared to cancellation. They both take about the same amount of code to write a test for, IME. (Not much.)

Not retrying+timeouts has similar effects to cancellation. The operation ceases to go forward. But it is not the same. It's a lot more expensive than imperative cancellation (need to rebuild, resend, reparse the request) and it has a lot of production risks that waiting with cancellation doesn't. For example, naive retries can expose backends to thundering herds, and less naive retries can have strange issues caused by exponential backoff where you'll have requests sitting around doing nothing for half their own timeout, before giving up because the next retry did not hit before the end of the parent request's timeout.


All good points. By trivial I meant 2 tests (works/fails) vs 3+ (works/fails/cancel with the latter possibility having its own works/fails cases). A timeout is just a status code on failure.


Yet, if you are making an interactive application, that easy to test codepath is a great way to put bugs into your requirements.


In Go, it is a single code path. Contexts can be canceled and they also come with propagating timeouts. The timeouts simply trigger a cancellation, so the only code path is handling cancellation.

There's nothing complicated about it, so there's no reason your code can't implement timeouts and cancellation the same way: timeouts are a cancellation triggered autonomously after some time passes.
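
A minimal example of that single code path, using nothing but the standard context package:

  package main

  import (
    "context"
    "fmt"
    "time"
  )

  // doWork only cares that the context is done; it cannot tell (and does not
  // need to tell) whether that was a manual cancel or an expired timeout.
  func doWork(ctx context.Context) error {
    select {
    case <-time.After(5 * time.Second): // stand-in for real work
      return nil
    case <-ctx.Done():
      return ctx.Err() // context.Canceled or context.DeadlineExceeded
    }
  }

  func main() {
    // A timeout is just a cancellation that fires on its own after 1s.
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    fmt.Println(doWork(ctx)) // context deadline exceeded
  }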


Not disagreeing, but having trouble seeing it. Could you elaborate?


By adding that timeout you just created a user-visible behavior that nobody asked for and people will only notice in production while dealing with the most complicated use-cases.


Nobody asks for it but some choice must be made. As a user, I have often cursed things that hang indefinitely. And I don't trust application state after touching a cancel button. That stuff is seldom tested well.


Well, not always possible. For example the latest systemd has a bug where it sometimes deadlocks in a PAM module, so it blocks all remote access to a machine over ssh (openssh uses PAM, optionally). If openssh had a timeout on the PAM child process, it would simply retry after timeout, instead the whole machine is lost and needs to be restarted with physical access.

There's no way to cancel the operation remotely, because you're not authenticated yet. And you may not have any other access.

Timeouts are also a good defense strategy against bugs.


This requires the API you're using to support this. If the API doesn't, then using infinite timeout is a bad idea.


Of course. I did not mean to convey that timeouts should be avoided in all cases. In fact I listed several such cases where they should be used. An API that has no way to cancel would be another example. Although I would argue that such an API is fundamentally flawed.


Yeah I just wanted to highlight it, as I've seen far too much code passing INFINITE to WaitForSingleObject or similar.

And yeah, not having another way of cancelling is not nice, but sadly not entirely uncommon.


Right I think the suggestion in that case would be to upgrade to an API that does support cancellation wherever possible. E.g. wait for multiple objects with the original argument and an additional cancel event.


As a user, I prefer short timeouts that pop up an error message with a 'retry' button.

You need the retry button anyway, in case the server is throwing errors. And there's often no good place for a cancel button, without putting up a big 'Loading' animation.


Retrying doesn’t help on a slow connection... if the best case connection speed of your user is slower than your timeout, you can retry an infinite number of times and it won't help.

Maybe increase the timeout on retry?


Exponential backoff but for the timeout instead of the interval, and/or a Wait Longer button, sound useful.
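
A quick sketch of "backoff, but for the timeout" in Go; the operation itself is a placeholder:

  package example

  import (
    "context"
    "time"
  )

  // callWithGrowingTimeout retries op with a per-attempt deadline that doubles
  // on every retry, so a slow but working connection eventually gets enough time.
  func callWithGrowingTimeout(ctx context.Context, op func(context.Context) error) error {
    timeout := 2 * time.Second
    var err error
    for attempt := 0; attempt < 5; attempt++ {
      attemptCtx, cancel := context.WithTimeout(ctx, timeout)
      err = op(attemptCtx)
      cancel()
      if err == nil {
        return nil
      }
      timeout *= 2 // 2s, 4s, 8s, 16s, 32s
    }
    return err
  }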


As a user, I hate it when a program tells me I need to press “retry”! You’re the computer, you retry, and keep trying until I ask you to stop.


Until it sits there and retries till hell freezes over and you don't realize anything is wrong.

And that's the problem with timeouts, everyone has different expectations.


It's hard to go wrong by telling people what's happening and why. "it's taking longer than usual to X" is a fine thing to expose in a UI.


I should have stated my implicit assumption, that the program will tell me that it’s retrying. If I decide it’s not worthwhile I can stop it, sure.

For any user-initiated action, automatic retry seems strictly better than failing on a single timeout.


Related to HTTP timeouts, I’ve run into database clients without default timeouts. This meant even though the HTTP request was cut off after the timeout, the database request kept on working causing tons of slow database requests to be running on the server.

With Postgres you can use roles to set timeouts, maybe you want a longer timeout for crons, shorter for HTTP endpoints.

Sadly we were using mongo which doesn’t have equivalent functionality. Ended up monkey patching the client library to define a reasonable default timeout.


That isn't due to a missing timeout, that is due to not properly communicating aborted requests down the stack which, admittedly, isn't always easy and some clients/languages/etc. are very bad at. A hardcoded timeout, while a fine workaround in some applications, is not a good default and not the proper fix for that.

Default timeouts in the database layers are hidden time bombs: they turn operations that just legitimately take a bit longer than some value the library author set (and that you didn't even know existed) into failures that get retried over and over, causing even more load than just doing the thing once. Don't get me wrong, there are lots of uses for setting strict timeouts and being able to do so is very important, but as a default, no thanks.


You sometimes won't know a TCP connection has been closed unless you try to write to it (possibly there's a select/epoll/etc way to test), so if you are using blocking I/O, you won't know that the HTTP client went away long ago.


I highly advise turning on TCP keepalive to detect dropped connections.


Sure. But the point of the parent poster was that you still won't observe the error unless you are interacting with the socket again. If you have a blocking thread-per-request model and your thread is blocked on the database IO, then it won't look at the original request (and its source socket) for that timeframe.

There is no great OS solution for handling this. You kind of need to run async IO on the lowest layer, and at least still be able to receive the read readiness and associated close/reset notification that you can somehow forward to the application stack (maybe in the form of a `CancellationToken`).


Requests should be kept alive. For any system, if the requester goes away, eventually the system should stop doing work on their behalf. That seems like the root of the problem in the situation you're describing.


Yes, I'd agree, but a fair number of database's wire protocols have no means of saying "this request is cancelled"!

As for "if the requester goes away", remember that the requester might be a few hops away. E.g., the HTTP connection from the mobile client drops; the web server and its connection to the DB is still alive and well. I can forcefully shut that connection, but that's somewhat of a drag (I'd rather keep it open, since it is perfectly good).

Beyond closing the connection, and being able to issue some form of "cancel this query" request: HTTP/1 lacks it entirely, PostgreSQL requires opening a separate connection, Redis lacks it entirely, and I think both Mongo and MySQL lack it entirely.

Even support for "time this request out" is spotty.


> PostgreSQL requires opening a separate connection, Redis lacks it entirely, and I think both Mongo and MySQL lack it entirely.

MySQL has a kill command, which does need to be done on another connection (might also need more permissions, it's been a while). It's been a while since I used a lot of MySQL, but this was definitely a pain point when things went sideways.


MongoDB has killOp which will allow a user to kill any operation they have an operation ID for:

https://docs.mongodb.com/manual/reference/method/db.killOp/


This (and the content of the article) has been one of the (dull) recurring themes in my career, I feel. Finding and adding timeouts and trying to prevent databases from chasing their own tail.

A co-worker of mine actually added support for timeouts to a database we were using. (It is a smaller, less-well known DB.) I added it to the Python side.

Good cancellation support in the language is really critical here, I found. In Python, it was a breeze to add timeouts and get rid of long running requests, if, say, the network connection dropped: you cancel the future, and that cancellation propagates to all the sub-futures. It is even hookable so that one can — if the network protocol supports it — propagate that across the wire to other services.

The DB in our case was written in Go, however, so that was tougher. Golang's best method (that we learned of at the time) is to thread a "Context" object through your code paths. We were working with existing code, of course, and it lacked this, and it's harder to add in hindsight.

Of course, once we got the server to stop hanging on queries of doom and return a more appropriate "that's a query of doom, and would hang the server" error, the complaint was that the server wasn't executing those queries anymore…


The asyncio cancel is convenient but can be really deceiving. If you want to reliably propagate the cancel across the wire, you likely have to invoke new await-calls inside your CancelledError-block. But doing this requires you to re-await the cancelled function again!

    import asyncio
    from asyncio import CancelledError, TimeoutError

    async def process():
        try:
            await db.slow_operation()
        except CancelledError:
            synchronous_functions_work()
            await db.cancel()   # This future will not complete on timeout

    p = asyncio.ensure_future(process())   # wrap in a Task so it can be awaited again below
    try:
        await asyncio.wait_for(p, timeout=1)
    except TimeoutError:
        await p   # Required for db.cancel() to run!!!
Now fire-and-forget on a timeout is perhaps the most reasonable approach, otherwise you'd get timeouts on timeouts, so a better implementation would be to restart p without awaiting it, or to put it on a background cancel-list. But it can be really confusing when you are not aware of this behavior.

Edit: Seems they actually fixed/changed this in 3.7: https://bugs.python.org/issue32751. So instead you have to write robust except-blocks that must never time out.


"Don't trust timeouts" is a better title and a better approach. The fundamental problem with distributed systems is you can't tell the difference between slow and non-responsive/crashed services. Simple timeouts are rarely the answer. Here are the obvious ones depending on what part of the problem you are trying to optimize.

* Keepalive - Have the server ping back on a short timeout while it's working. Use a very long timeout for the server response.

* Asynchronous queues - Use queues for requests and discard traffic/error out when the queue becomes full.

* Idempotence - Send another request if the first one does not return in a reasonable amount of time.

* Broadcast - Don't fetch the information, have it sent to you through UDP. Great for cumulative metrics. If you miss one, no problem, the next packet has the same data.

* Cancellation - Cancel the request if you don't get an answer.

* Multiple requests - Send requests to multiple services and return the one that gets back first.
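
As an illustration of that last pattern, a rough Go sketch: send the same logical request to several backends, take the first success, and cancel the stragglers (the fetch function is a placeholder).

  package example

  import (
    "context"
    "errors"
  )

  // firstOf fans the request out and returns the first successful answer.
  func firstOf(ctx context.Context, backends []string,
    fetch func(context.Context, string) (string, error)) (string, error) {

    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // stop the losers once we have a winner

    type result struct {
      val string
      err error
    }
    ch := make(chan result, len(backends))
    for _, b := range backends {
      go func(b string) {
        v, err := fetch(ctx, b)
        ch <- result{v, err}
      }(b)
    }

    lastErr := errors.New("no backends")
    for range backends {
      r := <-ch
      if r.err == nil {
        return r.val, nil
      }
      lastErr = r.err
    }
    return "", lastErr
  }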

Forcing clients to pick timeouts amounts to punting a hard problem over to somebody who has even less idea how to solve it than you do.

Edit: clarity


Great list.

That last suggestion (Multiple requests) can be tough to implement correctly, I think it's usually called Happy Eyeballs: https://youtu.be/oLkfnc_UMcE?t=290


Man! As a Chinese person, I really wish some companies would move their developers who work on network-related components to China for a few weeks, because that would improve the stability of their products quite a bit.

You know we have a firewall that randomly disconnects connections and blocks traffic. A lot of apps just get confused when that happens. And it happens a lot, maybe once every few minutes or even more often.

When I worked on my proxy, I had to define a new strategy to detect dead connections, such as using separate timeouts for Dial and Read. The Dial timeout is a shorter value, defaulting to 20 seconds; the Read timeout is a rather normal one, usually defaulting to 120 seconds.

I found a lot of software just doesn't use any strategy. It just sends the request and waits, assuming everything will be fine, while it's actually hanging forever on the user's end until the OS kicks it out. Many download systems don't even have a retry/resume mechanism: you downloaded 99% of a 600MB package (and it took about 48 hours), then the connection EOF'ed, and the software says "yeah, you better download all of it again, hehe".

An example of a good strategy can be found in `apt`. The software detects slow networks and timed out connections, and automatically retries downloads (not sure if it can resume downloads; it would be great if it did). And all of that gives me software that I can trust: I know that when I run the command, it will try its best to get things done. And usually it does, causing far fewer issues than `npm`, `snap`, `git`, etc.

I suggest everybody give this mindset a try: when your software downloads data and puts it on the user's computer, that copy of the data is now owned by the user. If you remove the data, you're robbing the user of what they've got. It's as if you're making dinner for your user: everything already made (downloaded) is on the table, and one failed dish (packet) should not cause you to flip the table. Instead, you retry and retry until it can't be done (for example, the source has changed or the wait is really too long).


This is a surprising blind spot for most large tech companies. I’m sure Apple, Google, Microsoft, Facebook, etc. spend a lot of money on their corporate networks but you’d think they have at least one engineer who encounters unreliable wireless on a regular basis. Only Netflix appears to test for this - I’ve never had to toggle networking to restore functionality after a dropped packet.


Youtube also handles it pretty well, but the Twitch player, for example, gives up forever if the network is too flaky. Who exactly wants that behavior?


Youtube Music, however, has the opposite problem. It always tries to load the online version, regardless of whether I have downloaded the file already. Even if I specifically clicked a file in the downloaded list, it will still try the online version when the next song begins.

When I have flaky connections, I get so many pauses (of songs I have downloaded) that I just deleted the app altogether.


With mobile operating systems like iOS one problem you frequently encounter is the user moving from one physical connection layer to another, ie WiFi to Cellular and back. iOS will send requests on both, but won’t move your existing request from one to another.

So for one app I write the user (say a company executive who always complained his photo uploads failed but was too busy to give a detailed bug report or even a specific time it happened) would take pictures at home or office, which starts uploads on WiFi, then put the phone in their pocket and get into their car and drive away. So I ended up using relatively short timeouts (30 seconds) and automated retries.


Facebook has a pretty advanced tool for testing bad internet connections.

https://github.com/facebookarchive/augmented-traffic-control

Google, Apple and Microsoft also have ways to simulate bad connections but Facebook seems to be the best at it. I heard they even encourage their employees to switch to 2G from time to time.


Oh, I know they have people who know this is a problem but every time that I spend time somewhere with an unreliable connection I encounter problems in apps and websites made by those companies. That’s the difference between knowing it’s a concern in theory and making it a regular part of your testing.

This is especially noticeable in their web apps because they rarely reimplement the functionality which is built in to the browser. When I’m on the subway, mbasic.facebook.com has things like a working reload but the app and Facebook.com will both fail to do elementary error handling and will often discard whatever you entered.


One of the main complaints about NFS, one of the original distributed systems - is that client machines hang when the server is unavailable. The problem is that the (Unix) filesystem layer assumes that disks are reliable (spoiler alert: they're not), and NFS stretches disk access across a network.

The concept of a "soft mount" with a timeout was introduced to NFS but it's almost never recommended. This is because client programs have no idea how to handle a timeout from the filesystem. This article shows how every HTTP client has to be configured to handle failures. Imagine if every program that accesses a file, from /bin/cat all the way up, had to have error handling code to deal with timeouts and retries. A sane choice is to wait infinitely if there's nothing more intelligent that you can do.


"NFS server xyz not responding , still trying" is still in my head, despite not using nfs for probably a decade.


Most software expects an error of some sort (e.g. file not found, permissions) and does nothing with it except print it. This would be sane behavior with an NFS timeout. Let the user retry rather than hanging forever.

Consider a startup script that hangs forever. Better it fail than hang. Or ls hanging forever when instead the filesystem could fail the operation after 30s.


I recently ran into this problem with the popular Python "requests" library, which doesn't have a default timeout set. That's especially annoying as their slogan is "HTTP for Humans" and that doesn't feel very human friendly.

There is a longstanding issue [1] to add a default timeout, but so far that hasn't happened yet.

[1]: https://github.com/psf/requests/issues/3070


In the AWS builders library they suggest setting the timeout to be at the p99 for the expected latency of the operation (or choose a different percentile if you want to be more or less tolerant of false positives). That methodology seems pretty solid, provided it's something that's continually re-evaluated and tested under load.

It's also important to consider what the client is advised to do in the case of a timeout. Retries, for instance, should likely have backoff and jitter attached, or a retry budget.
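
For instance, a retry helper with exponential backoff and full jitter might look roughly like this in Go (the operation is a placeholder):

  package example

  import (
    "context"
    "math/rand"
    "time"
  )

  // retry runs op up to 5 times, sleeping a random duration in
  // [0, base*2^attempt) between attempts so callers don't all retry in
  // lockstep and stampede the backend.
  func retry(ctx context.Context, op func(context.Context) error) error {
    base := 100 * time.Millisecond
    var err error
    for attempt := 0; attempt < 5; attempt++ {
      if err = op(ctx); err == nil {
        return nil
      }
      sleep := time.Duration(rand.Int63n(int64(base << attempt)))
      select {
      case <-time.After(sleep):
      case <-ctx.Done():
        return ctx.Err()
      }
    }
    return err
  }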


That number sounds like really bad advice to me. Should be more like 99.99% in my experience.

Internal services have extremely low response time during normal operation (p99 around a second), but then the database will start a snapshot or a large analytics query hits on the weekend (high IO) and the latency is through the roof for a short while. Too bad if services have short timeouts: they're all failing all requests now for no reason.

p99 is normal operation. Services shouldn't be configured to systematically fail for 1% of operations.


fair enough, that's why they call out that you need to load test it and actually determine that the value you set meets expectations. Agreed that blindly setting a value is problematic


We kept running into this issue at my job. A lot of our original database queries for our Go service called db.QueryRow, not db.QueryRowContext. The former doesn't respect timeouts, the latter does. So I ended up writing my own wrapper around Go's database/sql package. It basically just reexports all the functions that accept Context and hides the ones that don't. Very helpful. Timeouts are important.
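
For anyone who hasn't hit this, the difference looks roughly like the sketch below (query and schema invented for illustration): QueryRowContext respects the context's deadline, plain QueryRow does not.

  package example

  import (
    "context"
    "database/sql"
    "time"
  )

  // lookupName gives the query 2 seconds; QueryRowContext will return an
  // error once the deadline passes, where QueryRow would just keep waiting.
  func lookupName(db *sql.DB, id int64) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    var name string
    err := db.QueryRowContext(ctx, "SELECT name FROM users WHERE id = $1", id).Scan(&name)
    return name, err
  }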


> Network requests without timeouts are the top silent killer of distributed systems.

YES 1000% (I mean maybe not 'top' but it's up there)

languages must move connection pooling, timeouts, and retry semantics into the stdlib

API client libraries have to do a better job of documenting what happens when a request fails

systems need to do a better job of centralizing how timeouts are configured; this can't be left to chance


May I introduce you to erlang/BEAM? These issues were solved 30 years ago, and the ecosystem has had plenty of time to solidify best practices around exactly what you're asking for.


How does BEAM actually protect you against resource consumption? If you set no timeout, all your processes will be up and idle.


The default BEAM timeout is usually 5s (probably too long in some cases); if you miss it, the default is an unhandled exception, which crashes the process that made the call (and only that process, no others). The VM will then recover all of the resources (file descriptors, sockets, data to be GC'd) associated with that process. All in zero lines of code.

Also you can have millions of processes per core with minimal performance regression, so you're likely to notice it in monitoring before it becomes a problem.


hmm -- this is a compelling sales pitch. What should I read to learn about how erlang API clients manage pooling + timeout logic?


To be honest I use elixir. Docs are better anyways. For timeouts that information is in the standard library, for example

https://hexdocs.pm/elixir/GenServer.html#call/3

Note that call is even more sophisticated, it incorporates a liveness check on its counterparty to quit out before timeout if there's been a catastrophe. Note this is effectively a simple form of backpressure management that degrades availability gracefully under stress in favor of ensuring the integrity of existing connections. Because fallibility is so baked into the runtime, every good library incorporates timeouts where it's sensible.

For connection pools, it's not explicitly part of the standard library but basically everyone (maybe not whatsapp) uses poolboy:

https://elixirschool.com/en/lessons/libraries/poolboy/

I have never used it myself, yet, but I trust more experienced devs have incorporated it successfully in ecto (rdbms) and Phoenix (web framework).


I like the idea of increasing the timeout on successive retries. Sort of like backoff exponentially, increase timeout exponentially. Of course, this is only if the reason for retrying could be helped by a larger timeout.


I work in the payments industry and this issue has struck our systems several times. One extra piece of advice is to also consider the compound timeout when there are multiple calls to the same service. I still remember having our system completely hang because RabbitMQ was unresponsive. We had a 50ms timeout with RabbitMQ, but that didn’t protect us since we would hit the service 50 times per request.
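
One way to guard against that compounding is to give the whole request a single deadline that every downstream call inherits, instead of a fresh per-call timeout. A Go-flavoured sketch, with the publish call left as a placeholder:

  package example

  import (
    "context"
    "time"
  )

  // handle gives the whole request one budget. Each downstream call inherits
  // whatever is left of it, so 50 calls cannot add up to 50 * 50ms of waiting.
  func handle(publish func(context.Context, []byte) error, msgs [][]byte) error {
    ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
    defer cancel()

    for _, m := range msgs {
      if err := publish(ctx, m); err != nil {
        return err // includes context.DeadlineExceeded once the budget is spent
      }
    }
    return nil
  }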


Different timeouts exist for different purpose. Sometimes infinite is the only sane default. Sometimes the default must be dynamic. And sometimes you just pick something that sort of makes sense with all the other system components in mind.

There are a dozen or more timeouts just for a TCP connection. There's the initialization timeout, the 3-way handshake timeout, the half-closed timeout, the time-wait timeout, the unverified reset timeout, the established connection timeout, the retransmission timeout, the timed wait delay, the delayed ack timer, the arp cache timeout, the arp cache minimum reference timeout, the keep-alive timeout, and more.

Every single person in the world depends upon default timeouts, so of course they matter. When they are picked intelligently, they improve the default behavior of the majority of system interactions. So we can trust default timeouts, when they are useful. But if we're building a system, it makes sense for us to determine what the appropriate timeout is for our system.


> Javascript’s XMLHttpRequest is THE web API to retrieve data from a server asynchronously.

Uh what? Has he never heard of Fetch:

https://developer.mozilla.org/Web/API/Fetch_API

It's been around for at least 5 years, and it returns a Promise.


He covers fetch() in the section after that, and rightly complains that unlike XHR it doesn't support timeouts at all.


Many applications could stand to put more thought into timeouts and cancellation, and how it should all work in the face of APIs that might be unresponsive or slow sometimes. But I don't think that putting some arbitrary timeout as the default everywhere is really a good idea.

Many of these things are used for one-off scripts, where it isn't worth thinking about. For many APIs, it isn't worth the trouble - if one of your dependent services is unresponsive, there isn't really any meaningful thing your application can do anyways. It doesn't become an issue until there are so many timeouts that it's impacting other resources. Best to leave it off until you know what you want to do with it.


Timeouts start to get really funny once you start to create lots of UDP Connections while using NAT somewhere in between. Since UDP is connectionless the NAT has no idea whether it will receive any packets anymore and therefore has to keep the port mapping for a certain amount of time. At some point UDP packets will be dropped since these mapping tables can‘t be of unlimited size.


conntrack timeouts don't just apply to UDP:

  net.netfilter.nf_conntrack_dccp_timeout_timewait = 240
  net.netfilter.nf_conntrack_frag6_timeout = 60
  net.netfilter.nf_conntrack_generic_timeout = 600
  net.netfilter.nf_conntrack_gre_timeout = 30
  net.netfilter.nf_conntrack_gre_timeout_stream = 180
  net.netfilter.nf_conntrack_icmp_timeout = 30
  net.netfilter.nf_conntrack_icmpv6_timeout = 30
and even for TCP, there is a timeout after the connection is closed. The fact that UDP has no state and therefore no 'connection' doesn't mean that conntrack only tracks TCP while a connection is open, just because TCP does have state. Besides, you could sever a cable and TCP wouldn't know that anything happened. So you do need timeouts for anything in a NAT table.


Any mention of the Go http package and timeouts ought to also mention "never use http.DefaultClient":

  package http
  var DefaultClient = &Client{}
It is a convenient global variable that uses whatever last settings were set upon it from any bit of code executed in any dependency.
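
The usual fix is to construct a client of your own with an explicit timeout and share that instead, e.g.:

  package example

  import (
    "net/http"
    "time"
  )

  // A client owned by this package, with an explicit overall timeout, rather
  // than the shared, mutable http.DefaultClient. The value is just an example;
  // pick it for your use case.
  var apiClient = &http.Client{Timeout: 30 * time.Second}

  func fetch(url string) (*http.Response, error) {
    return apiClient.Get(url)
  }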


Also: set a timeout in your database to stop out-of-control queries from taking a whole system down. Postgres's "statement_timeout" comes to mind; if a statement exceeds the timeout, Postgres can effectively roll back the system to its previous state.
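
A sketch of setting it per session from application code (Go's database/sql here; with a connection pool you would more likely set it per role or in the connection options, since sessions get reused):

  package example

  import (
    "database/sql"
    "fmt"
    "time"
  )

  // applyStatementTimeout asks Postgres to abort any statement that runs longer
  // than d, server-side, regardless of what the client does. Note that with
  // database/sql this SET lands on whichever pooled connection Exec grabs,
  // which is why per-role or connection-string settings are usually preferred.
  func applyStatementTimeout(db *sql.DB, d time.Duration) error {
    _, err := db.Exec(fmt.Sprintf("SET statement_timeout = %d", d.Milliseconds()))
    return err
  }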


> never use “infinity” as a default timeout.

Never say never. What can be tuned in the system (obviously relevant only for server software) is better tuned there unless you really like (re-)negotiating tuning options with ops.


After being bitten twice by default timeout values I have the maxim "the defaults will always be wrong" engraved in my heart.


XHR does have a timeout right? It’s just arbitrarily defined by the browser.



