eyelidlessness · on Dec 11, 2021

> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.

This is also a general problem with (presumed stateless) concurrent/distributed systems. It irked me while I was working on such a system, and I still haven't found meaningful resources on it that aren't extremely platform/stack/implementation-specific:

A concurrent system hits some global, network-wide, or partition-wide error or backoff condition. If the workers in that system are actually stateless and receive pushed work, then communicating that condition to them means one of two things: either you push state management back to a less concurrent orchestrator that reprioritizes the work (introducing a huge bottleneck and a single or fragile point of failure), or you accept that a lot of failing work will be processed in pathological ways.
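
To make the trade-off concrete, here is a toy simulation, not taken from any real system. Everything in it is an assumption made up for illustration: the names `SharedBackoff`, `downstream_ok`, and `run`, the one-job-per-tick push model, and the timings. It counts how many calls hit a failing downstream when each job retries on its own versus when the jobs consult one shared backoff flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedBackoff:
    """Hypothetical shared flag. In a real deployment this would live in an
    external store or orchestrator, which is exactly the shared state the
    'stateless' workers were supposed to avoid."""
    blocked_until: float = 0.0

    def blocked(self, now: float) -> bool:
        return now < self.blocked_until

    def trip(self, now: float, cooldown: float) -> None:
        # Extend the block window; never shorten it.
        self.blocked_until = max(self.blocked_until, now + cooldown)


def downstream_ok(now: float, outage_end: float) -> bool:
    """The dependency fails for every call made before outage_end."""
    return now >= outage_end


def run(jobs: int, max_retries: int, outage_end: float,
        shared: Optional[SharedBackoff], cooldown: float = 5.0) -> int:
    """Return how many calls were spent against the failing downstream."""
    wasted = 0
    for job in range(jobs):
        now = float(job)  # assumption: one job is pushed per tick
        for attempt in range(max_retries):
            if shared is not None and shared.blocked(now):
                break  # defer the job instead of burning an attempt
            if downstream_ok(now, outage_end):
                break  # call succeeded
            wasted += 1
            if shared is not None:
                shared.trip(now, cooldown)
            now += 0.1 * 2 ** attempt  # per-job exponential backoff
    return wasted


independent = run(jobs=10, max_retries=3, outage_end=10.0, shared=None)
coordinated = run(jobs=10, max_retries=3, outage_end=10.0, shared=SharedBackoff())
print(independent, coordinated)
```

With these made-up numbers the independent workers spend every retry of every job on the outage, while the coordinated ones spend only a couple of probe calls. The catch is the one described above: the shared flag is state that something has to own, and in this sketch the deferred jobs are simply dropped, so a real system would still need somewhere to requeue them.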