Most of the suggestions here are about ways of restarting services when they go down, which is a good start, but that doesn't actually solve the issue I hit last night...
My system integrates with an external system, and what happened is that this external system started sending me unexpected data which my system wasn't able to handle, because I didn't expect it and so never thought to test for it -- the issue was that I was trying to insert IDs into a uuid database field, but this new data had non-uuid IDs. Because the original IDs were always generated by me, I could guarantee that the data was correct, but this new data was not generated by me. Of course, sufficient defensive programming would have avoided this, as this database error shouldn't have prevented other stuff from working, but my point is that mistakes get made (we're human after all) and things do get overlooked.
The problem is, restarting my service doesn't prevent this external data from being received again, so it would simply break again as soon as more is sent, and the system would be stuck in an endless reboot loop until a human fixes the root cause.
That's a problem that I worry about, no matter how hard I try to make my system auto-healing and resilient (I don't know of any way to fix it other than putting great care into programming defensively), but again, we're human, so something will always slip through eventually...
Some people are suggesting outsourcing an on-call person. That seems to me like the only way around this particular case. (The other suggestions can still be used to reduce the number of times this person gets paged, though.)
Always treat third-party systems like they're full of nitroglycerin. Double check all response codes, expect the unexpected, degrade gracefully when it hits the fan. You're always better off serving up a nice 500 error page than spinning forever or returning a false positive to users. And make sure you have a clear SLA with them and can escalate/mitigate/compensate when they don't fulfill it.
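To make that concrete, here's a minimal sketch in Python of what "expect the unexpected" can look like for a third-party call (the requests library and the example URL are just stand-ins for whatever integration you actually have):

```python
import logging

import requests

log = logging.getLogger(__name__)


def fetch_partner_status():
    """Call a third-party API while assuming it can fail in every possible way."""
    try:
        # Never call out without a timeout -- a hung connection is a silent outage.
        resp = requests.get("https://thirdparty.example.com/status", timeout=5)
    except requests.RequestException as exc:
        log.warning("partner API unreachable: %s", exc)
        return None  # caller degrades gracefully instead of spinning forever

    # Don't trust anything but an explicit 200, and even then don't trust the body.
    if resp.status_code != 200:
        log.warning("partner API returned HTTP %s", resp.status_code)
        return None

    try:
        return resp.json()
    except ValueError:
        log.warning("partner API returned a non-JSON body")
        return None
```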
This. Write guards like the external integration is an active malicious adversary -- because when they change their API, go down, or have their own issues, they may as well be attacking your integration.
This is exactly the reason why restarts should never be considered a fix, as I've elaborated on a bit more in my other comment in this thread. They fix nothing, but give an illusion of it.
1 - Your product might actually be too complex for a single-person business. You could rotate being on call for situations like this. This means that you'd have to make sure that sales are big enough to support an additional partner or two.
2 - Perhaps you need to simplify your product? Think more critically about error handling? I don't know the details about this part of your service, but if I assume that these bad UUIDs came from HTTP POSTs, why does a series of wonky HTTP posts bring down your entire service? Typically, something like this would trigger some kind of unhandled error that's caught higher up in your web framework and returns some kind of 5xx error.
This paragraph is very C# centric, but it should translate to other languages as well: Typically, I layer my error handling. Each operation is wrapped in a general exception handler that catches EVERYTHING and has some very basic logging. (ASP.Net does this and returns a 5xx error if your code has an unhandled exception.) Furthermore, as I get closer to actual operations that can fail, I catch exceptions that I can anticipate. Finally, I have basic sanity checks for things like making sure a string is really a UUID.
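For illustration, a rough Python translation of that layering (all the names here are made up; the point is the three layers, not the specifics):

```python
import logging
import uuid

log = logging.getLogger(__name__)


def save_to_database(record_id, raw):
    # placeholder for the real insert; assume it can raise on database errors
    pass


def handle_update(raw):
    """Innermost layer: basic sanity checks before touching the database."""
    record_id = raw.get("id", "")
    try:
        uuid.UUID(record_id)  # sanity check: is this string really a UUID?
    except (ValueError, TypeError):
        raise ValueError(f"not a valid uuid: {record_id!r}")
    save_to_database(record_id, raw)


def process_update(raw):
    """Middle layer: catch the failures we can anticipate and turn them into clean errors."""
    try:
        handle_update(raw)
    except ValueError as exc:
        log.warning("rejected bad update: %s", exc)
        return 400
    return 200


def top_level(raw):
    """Outermost layer: catch EVERYTHING so one unhandled bug can't take the service down.
    This is roughly what ASP.Net's unhandled-exception 5xx behaviour gives you for free."""
    try:
        return process_update(raw)
    except Exception:
        log.exception("unhandled error")
        return 500
```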
Without knowing much of your service's architecture, it just sounds like you need some high-level error handling. You probably have 100s of other little weird bugs, so high level error handling needs to do the equivalent of returning a 5xx error and logging, so you can fix it when you're able to.
My point was less about the specific issue I hit and more that 1) external circumstances that a restart won't resolve can cause failures, because 2) we're human and no matter how hard we try, even with a large team, things do slip through.
The difference with having a large team is less that all possible failure cases will get protected against (although more eyes and code review does help), but more that someone can always be available to fix it when something unexpected happens.
In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system. The error was actually localised to one particular type of update, but that type stopped working because I didn't protect defensively enough against errors in that one particular case (I do have my database queries protected against errors, but this one slipped through). This caused other systems to not get these updates, so things that relied on them stopped working. It's not that they crashed, they just never received the updates they were waiting for.
Of course the fix is to trap all exceptions, log/notify, then ignore and continue, so that one piece of bad data doesn't affect other updates, but again, my main point was that we're human, so we can't possibly protect against everything that might cause a non-recoverable (without human intervention) error.
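As a rough sketch of what that fix looks like, assuming a generic stream of updates (notify_me is a hypothetical email/Slack hook):

```python
import logging

log = logging.getLogger(__name__)


def notify_me(message):
    # hypothetical hook: send yourself an email or Slack message
    log.error(message)


def consume(updates, apply_update):
    """Process a stream of updates; a failure on one must never stop the rest."""
    for update in updates:
        try:
            apply_update(update)
        except Exception:
            # log the full traceback, tell a human, then keep going
            log.exception("failed to apply update %r", update)
            notify_me(f"update failed and needs a human: {update!r}")
            continue  # deliberately carry on -- the rest of the stream still flows
```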
> Finally, I have basic sanity checks for things like making sure a string is really a UUID
Yes, I did add this too after I hit this issue and it's a good point: validate EVERYTHING, even if you generate it and think you can assume it will be good.
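For reference, the check itself is cheap -- something along these lines with Python's standard uuid module:

```python
import uuid


def is_valid_uuid(value):
    """Return True only if value parses as a UUID -- validate even 'trusted' IDs."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


assert is_valid_uuid("123e4567-e89b-12d3-a456-426614174000")
assert not is_valid_uuid("order-42")  # the kind of external ID that bit me
```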
> Don't work alone!
That's the real solution, but sometimes it's not possible.
Thanks for your detailed response, though, it's appreciated.
> In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system.
Is your product too complicated for a single-person business?
As a solo programmer, I can write and develop extremely complicated systems. These systems can be so complicated that I don't have time to run them, find customers, support customers, etc.
That, ultimately, is why I don't see myself running a single-person business anytime soon. I really enjoy complicated programming, and if I have to also handle ops, support, sales, etc, then what I program needs to be too simple to remain interesting.
Defensive programming (or, as I call it, fail-safe programming) is a must for any type of service / daemon.
I employ a healthy dose of exception trapping and logging, and I get an email whenever it happens.
People aren't perfect, but you can anticipate a lot of failures. It usually involves bad data as in your case. Each time you get bitten, change your code so it fails gracefully.
Agreed. Be as defensive as possible, trap all exceptions (this allowed me to identify the problem and fix it very quickly, but I still had to step in and fix it) and validate absolutely everything no matter how unlikely it seems. Also go over every system and ask "what if an unexpected error happens, will it take the system down? will it prevent other requests/tasks/users from working?"
For situations like this, it's extremely useful to have a human-review queue that the automated system can drop jobs into. It keeps the rest of the system running, and lets the support staff (you) fix the problem during normal business hours.
This is how we handled errors in our video archiving system at Justin.tv, and to my knowledge we never lost a single frame that made it to the broadcast servers. The raw bits were streamed to disk as they came in, and only got removed once the VOD servers had the final version— any errors would retry a few times and then get flagged for manual processing. We did have a few close calls where something broke the whole archiving system, and the broadcast servers came dangerously close to shutting down due to full disks, though.
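For anyone wanting a starting point, a bare-bones version of that retry-then-flag pattern might look like this (the sqlite table here just stands in for whatever you actually use as the review queue):

```python
import sqlite3
import time

# The "human review queue" can be as simple as one table the automated path writes to.
db = sqlite3.connect("jobs.db")
db.execute("""CREATE TABLE IF NOT EXISTS review_queue
              (job_id TEXT, payload TEXT, error TEXT, flagged_at REAL)""")


def process_with_review(job_id, payload, handler, retries=3):
    """Try a job a few times; if it keeps failing, park it for a human instead of crashing."""
    last_error = "unknown"
    for attempt in range(retries):
        try:
            handler(payload)
            return True
        except Exception as exc:
            last_error = str(exc)
            time.sleep(2 ** attempt)  # simple backoff before the next attempt
    # Give up gracefully: keep the raw data, flag it, and move on to the next job.
    db.execute("INSERT INTO review_queue VALUES (?, ?, ?, ?)",
               (job_id, payload, last_error, time.time()))
    db.commit()
    return False
```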
I used to run a service solo that processed data from multiple external sources and, as you said, you need to program defensively when dealing with external input.
I handled it with a pipeline that did the following: 1. validate, 2. transform the data if needed, 3. load the data. If validation failed, the data would get "quarantined" and I would get an email notification, or a Slack notification if urgent.
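Roughly, the shape of it was something like this (the quarantine directory and the notification hook here are simplified placeholders):

```python
import json
import logging
import pathlib

log = logging.getLogger(__name__)
QUARANTINE_DIR = pathlib.Path("quarantine")  # placeholder location for bad records


def run_pipeline(record, validate, transform, load):
    """1. validate, 2. transform if needed, 3. load; invalid data is quarantined, not dropped."""
    errors = validate(record)
    if errors:
        QUARANTINE_DIR.mkdir(exist_ok=True)
        path = QUARANTINE_DIR / f"{record.get('id', 'unknown')}.json"
        path.write_text(json.dumps({"record": record, "errors": errors}))
        log.error("quarantined record %s: %s", record.get("id"), errors)  # email/Slack hook fires here
        return False
    load(transform(record))
    return True
```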
I don't generally write a lot of unit tests, but all my validation and transformation logic would have 100% coverage, because things always change and you need to make sure future updates never break the system.