It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.
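
To make the stacking problem concrete: if a caller gives a subprocess 1 second while the subprocess is willing to keep retrying for 10, the caller times out, retries on its own, and the load multiplies. One common mitigation is to pass a deadline down so the inner retry loop gives up before the caller does. Below is a minimal Python sketch of that idea. The names (`retry_within_deadline`, `TransientError`, `DeadlineExceeded`) and the delay values are hypothetical, chosen for illustration, and not taken from the system being discussed.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class DeadlineExceeded(Exception):
    """Raised when the caller's deadline leaves no room for another attempt."""


def retry_within_deadline(op, deadline, base_delay=0.1, max_delay=2.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Call op(remaining_seconds) with exponential backoff, never past `deadline`.

    `deadline` is an absolute time on the same scale as `clock()`. A nested
    layer can be handed the same deadline, so stacked retries share one budget
    instead of each layer waiting for its own full timeout.
    """
    attempt = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"no time left after {attempt} attempts")
        try:
            # The operation is told how long it may take, so it can set its
            # own timeout or pass the budget further down.
            return op(remaining)
        except TransientError:
            attempt += 1
        # Exponential backoff, capped. Real code would usually add jitter;
        # it is left out here to keep the sketch deterministic.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        if delay >= deadline - clock():
            # Sleeping would outlive the caller, so fail now rather than
            # retrying for someone who has already given up.
            raise DeadlineExceeded(f"gave up after {attempt} attempts")
        sleep(delay)
```

The injectable `clock` and `sleep` are only there so the loop can be exercised without real waiting.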

I just find it amusing that they describe their back-off behaviors as "well tested" and, in the same sentence, say they didn't back off adequately.




> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.

This is also a general problem with (presumably stateless) concurrent/distributed systems. It irked me when I worked on such a system, and I still haven't found meaningful resources on it that aren't extremely platform/stack/implementation specific:

A concurrent system has some global, network-wide, or partition-wide error or backoff condition. If the workers in that system are actually stateless and receive pushed work, communicating that condition to them means either pushing the state management back to a less concurrent orchestrator that reprioritizes the work (introducing a huge bottleneck and a single or fragile point of failure), or accepting that a lot of failed work will be processed in pathological ways.
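
For what it's worth, one partial middle ground I'm aware of is letting each worker infer the condition from its own recent failures, so nothing has to be communicated at all. The price is that every worker still burns a few attempts rediscovering the problem, and the worker is no longer strictly stateless, since it holds a little in-memory state. A rough Python sketch, where `LocalBreaker`, `handle`, the threshold, and the cooldown are all hypothetical names and values picked for illustration:

```python
import time


class LocalBreaker:
    """Per-worker circuit breaker: in-memory only, nothing shared or central."""

    def __init__(self, failure_threshold=3, cooldown=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.consecutive_failures = 0
        self.open_until = 0.0  # work is refused until the clock reaches this

    def allow(self):
        return self.clock() >= self.open_until

    def record(self, ok):
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Enough failures in a row: stop processing for a while.
            self.open_until = self.clock() + self.cooldown
            self.consecutive_failures = 0


def handle(item, breaker, process, requeue):
    """Process one pushed item, deferring it cheaply while the breaker is open."""
    if not breaker.allow():
        requeue(item)  # hand the item back without attempting it
        return "deferred"
    try:
        process(item)
    except Exception:
        breaker.record(False)
        requeue(item)
        return "failed"
    breaker.record(True)
    return "done"
```

This does not solve the problem described above. It only bounds how much failed work each worker attempts before it starts handing items back, and it assumes the queue tolerates items being requeued.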



