> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.
This is also a general problem with (presumed stateless) concurrent/distributed systems. It irked me when I worked on such a system, and I still haven't found meaningful resources on it that aren't extremely platform-, stack-, or implementation-specific:
A concurrent system hits some error or backoff condition that is global, network-wide, or limited to a partitioned subset. If the workers in that system are actually stateless and receive pushed work, communicating that condition to them means one of two things: either you push the state management back to a less concurrent orchestrator that reprioritizes the work (introducing a huge bottleneck and a single or fragile point of failure), or you accept that a lot of failed work will be processed in pathological ways.
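To put rough numbers on that trade-off, here is a toy sketch in Python. Everything in it is made up for illustration: `Downstream`, `run_stateless`, `run_with_shared_flag`, and the worker/job/retry counts are hypothetical, and the "orchestrator" is reduced to a single in-process boolean. It only counts how many calls get rejected while the dependency is throttled; it does not model timing, real backoff delays, or a real network.

```python
from dataclasses import dataclass


@dataclass
class Downstream:
    """Hypothetical dependency that rejects every call while throttled."""
    throttled: bool = True
    rejected: int = 0

    def call(self) -> bool:
        if self.throttled:
            self.rejected += 1
            return False
        return True


def run_stateless(workers: int, jobs_per_worker: int, retries: int) -> int:
    """Every job retries on its own; nothing learned carries over to the next job."""
    dep = Downstream()
    for _ in range(workers):
        for _ in range(jobs_per_worker):
            for _ in range(1 + retries):
                if dep.call():
                    break
    return dep.rejected


def run_with_shared_flag(workers: int, jobs_per_worker: int, retries: int) -> int:
    """Every job checks one shared flag first (a stand-in for the orchestrator)."""
    dep = Downstream()
    backing_off = False  # the single piece of shared state, i.e. the bottleneck
    for _ in range(workers):
        for _ in range(jobs_per_worker):
            if backing_off:
                continue  # job is deferred instead of attempted
            for _ in range(1 + retries):
                if dep.call():
                    break
            else:
                # All attempts failed: record the condition for everyone else.
                backing_off = True
    return dep.rejected
```

With 50 workers, 20 jobs each, and 3 retries per job, `run_stateless(50, 20, 3)` returns 4000 rejected calls, while `run_with_shared_flag(50, 20, 3)` returns 4. The second number looks great only because the flag is free to read here; in a real system that read is exactly the coordination point that becomes the bottleneck or point of failure.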
I just find it amusing that they describe their back-off behaviors as "well tested" and, in the same sentence, say they didn't back off adequately.