Most of the suggestions here are about ways of restarting services when they go down, which is a good start, but that doesn't actually solve the issue I hit last night...
My system integrates with an external system, and what happened is that this external system started sending me unexpected data which my system wasn't able to handle, because I didn't expect it and so never thought to test for it -- the issue was that I was trying to insert IDs into a uuid database field, but this new data had non-uuid IDs. Because the original IDs were always generated by me, I could guarantee that the data was correct, but this new data was not generated by me. Of course, sufficient defensive programming would have avoided this, as this database error shouldn't have prevented other stuff from working, but my point is that mistakes get made (we're human after all) and things do get overlooked.
The problem is, restarting my service doesn't prevent this external data from being received again, so it would simply break again as soon as more is sent, and the system would be stuck in an endless reboot loop until a human fixes the root cause.
That's a problem that I worry about, no matter how hard I try to make my system auto-healing and resilient (I don't know of any way to fix it other than putting great care into programming defensively), but again, we're human, so something will always slip through eventually...
Some people are suggesting outsourcing an on-call person. That seems to me like the only way around this particular case. (The other suggestions can still be used to reduce the number of times this person gets paged, though.)
Always treat third-party systems like they're full of nitroglycerin. Double check all response codes, expect the unexpected, degrade gracefully when it hits the fan. You're always better off serving up a nice 500 error page than spinning forever or returning a false positive to users. And make sure you have a clear SLA with them and can escalate/mitigate/compensate when they don't fulfill it.
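To make that concrete, here's a minimal sketch in Python of what "expect the unexpected" can look like for a third-party call (the requests library and the example URL are just stand-ins for whatever integration you actually have):

```python
import logging

import requests

log = logging.getLogger(__name__)


def fetch_partner_status():
    """Call a third-party API while assuming it can fail in every possible way."""
    try:
        # Never call out without a timeout -- a hung connection is a silent outage.
        resp = requests.get("https://thirdparty.example.com/status", timeout=5)
    except requests.RequestException as exc:
        log.warning("partner API unreachable: %s", exc)
        return None  # caller degrades gracefully instead of spinning forever

    # Don't trust anything but an explicit 200, and even then don't trust the body.
    if resp.status_code != 200:
        log.warning("partner API returned HTTP %s", resp.status_code)
        return None

    try:
        return resp.json()
    except ValueError:
        log.warning("partner API returned a non-JSON body")
        return None
```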
This. Write guards like the external integration is an active malicious adversary -- because when they change their API, go down, or have their own issues, they may as well be attacking your integration.
This is exactly the reason why restarts should never be considered a fix, as I've elaborated on a bit more in my other comment in this thread. They fix nothing, but give an illusion of it.
1 - Your product might actually be too complex for a single-person business. You could rotate being on call for situations like this. This means that you'd have to make sure that sales are big enough to support an additional partner or two.
2 - Perhaps you need to simplify your product? Think more critically about error handling? I don't know the details about this part of your service, but if I assume that these bad UUIDs came from HTTP POSTs, why does a series of wonky HTTP posts bring down your entire service? Typically, something like this would trigger some kind of unhandled error that's caught higher up in your web framework and returns some kind of 5xx error.
This paragraph is very C# centric, but it should translate to other languages as well: Typically, I layer my error handling. Each operation is wrapped in a general exception handler that catches EVERYTHING and has some very basic logging. (ASP.Net does this and returns a 5xx error if your code has an unhandled exception.) Furthermore, as I get closer to actual operations that can fail, I catch exceptions that I can anticipate. Finally, I have basic sanity checks for things like making sure a string is really a UUID.
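For illustration, a rough Python translation of that layering (all the names here are made up; the point is the three layers, not the specifics):

```python
import logging
import uuid

log = logging.getLogger(__name__)


def save_to_database(record_id, raw):
    # placeholder for the real insert; assume it can raise on database errors
    pass


def handle_update(raw):
    """Innermost layer: basic sanity checks before touching the database."""
    record_id = raw.get("id", "")
    try:
        uuid.UUID(record_id)  # sanity check: is this string really a UUID?
    except (ValueError, TypeError):
        raise ValueError(f"not a valid uuid: {record_id!r}")
    save_to_database(record_id, raw)


def process_update(raw):
    """Middle layer: catch the failures we can anticipate and turn them into clean errors."""
    try:
        handle_update(raw)
    except ValueError as exc:
        log.warning("rejected bad update: %s", exc)
        return 400
    return 200


def top_level(raw):
    """Outermost layer: catch EVERYTHING so one unhandled bug can't take the service down.
    This is roughly what ASP.Net's unhandled-exception 5xx behaviour gives you for free."""
    try:
        return process_update(raw)
    except Exception:
        log.exception("unhandled error")
        return 500
```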
Without knowing much of your service's architecture, it just sounds like you need some high-level error handling. You probably have 100s of other little weird bugs, so high level error handling needs to do the equivalent of returning a 5xx error and logging, so you can fix it when you're able to.
My point was less about the specific issue I hit and more that 1) external circumstances that a restart won't resolve can cause failures, because 2) we're human and no matter how hard we try, even with a large team, things do slip through.
The difference with having a large team is less that all possible failure cases will get protected against (although more eyes and code review does help), but more that someone can always be available to fix it when something unexpected happens.
In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system. The error was actually localised to one particular type of update, but that type stopped working because I didn't protect defensively enough against errors in that one particular case (I do have my database queries protected against errors, but this one slipped through). This caused other systems to not get these updates, so things that relied on them stopped working. It's not that they crashed, they just never received the updates they were waiting for.
Of course the fix is to trap all exceptions, log/notify, then ignore and continue, so that one piece of bad data doesn't affect other updates, but again, my main point was that we're human, so we can't possibly protect against everything that might cause a non-recoverable (without human intervention) error.
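As a rough sketch of what that fix looks like, assuming a generic stream of updates (notify_me is a hypothetical email/Slack hook):

```python
import logging

log = logging.getLogger(__name__)


def notify_me(message):
    # hypothetical hook: send yourself an email or Slack message
    log.error(message)


def consume(updates, apply_update):
    """Process a stream of updates; a failure on one must never stop the rest."""
    for update in updates:
        try:
            apply_update(update)
        except Exception:
            # log the full traceback, tell a human, then keep going
            log.exception("failed to apply update %r", update)
            notify_me(f"update failed and needs a human: {update!r}")
            continue  # deliberately carry on -- the rest of the stream still flows
```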
> Finally, I have basic sanity checks for things like making sure a string is really a UUID
Yes, I did add this too after I hit this issue and it's a good point: validate EVERYTHING, even if you generate it and think you can assume it will be good.
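For reference, the check itself is cheap -- something along these lines with Python's standard uuid module:

```python
import uuid


def is_valid_uuid(value):
    """Return True only if value parses as a UUID -- validate even 'trusted' IDs."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


assert is_valid_uuid("123e4567-e89b-12d3-a456-426614174000")
assert not is_valid_uuid("order-42")  # the kind of external ID that bit me
```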
> Don't work alone!
That's the real solution, but sometimes it's not possible.
Thanks for your detailed response, though, it's appreciated.
> In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system.
Is your product too complicated for a single-person business?
As a solo programmer, I can write and develop extremely complicated systems. These systems can be so complicated that I don't have time to run them, find customers, support customers, etc.
That, ultimately, is why I don't see myself running a single-person business anytime soon. I really enjoy complicated programming, and if I have to also handle ops, support, sales, etc, then what I program needs to be too simple to remain interesting.
Defensive programming (or, as I call it, fail-safe programming) is a must for any type of service / daemon.
I employ a healthy dose of exception trapping and logging, and I get an email whenever it happens.
People aren't perfect, but you can anticipate a lot of failures. It usually involves bad data as in your case. Each time you get bitten, change your code so it fails gracefully.
Agreed. Be as defensive as possible, trap all exceptions (this allowed me to identify the problem and fix it very quickly, but I still had to step in and fix it) and validate absolutely everything no matter how unlikely it seems. Also go over every system and ask "what if an unexpected error happens, will it take the system down? will it prevent other requests/tasks/users from working?"
For situations like this, it's extremely useful to have a human-review queue that the automated system can drop jobs into. It keeps the rest of the system running, and lets the support staff (you) fix the problem during normal business hours.
This is how we handled errors in our video archiving system at Justin.tv, and to my knowledge we never lost a single frame that made it to the broadcast servers. The raw bits were streamed to disk as they came in, and only got removed once the VOD servers had the final version— any errors would retry a few times and then get flagged for manual processing. We did have a few close calls where something broke the whole archiving system, and the broadcast servers came dangerously close to shutting down due to full disks, though.
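For anyone wanting a starting point, a bare-bones version of that retry-then-flag pattern might look like this (the sqlite table here just stands in for whatever you actually use as the review queue):

```python
import sqlite3
import time

# The "human review queue" can be as simple as one table the automated path writes to.
db = sqlite3.connect("jobs.db")
db.execute("""CREATE TABLE IF NOT EXISTS review_queue
              (job_id TEXT, payload TEXT, error TEXT, flagged_at REAL)""")


def process_with_review(job_id, payload, handler, retries=3):
    """Try a job a few times; if it keeps failing, park it for a human instead of crashing."""
    last_error = "unknown"
    for attempt in range(retries):
        try:
            handler(payload)
            return True
        except Exception as exc:
            last_error = str(exc)
            time.sleep(2 ** attempt)  # simple backoff before the next attempt
    # Give up gracefully: keep the raw data, flag it, and move on to the next job.
    db.execute("INSERT INTO review_queue VALUES (?, ?, ?, ?)",
               (job_id, payload, last_error, time.time()))
    db.commit()
    return False
```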
I used to run a service solo that processed data from multiple external sources and, as you said, you need to program defensively when dealing with external input.
I handled it with a pipeline that did the following: 1. validate, 2. transform the data if needed, 3. load the data. If validation failed, the data would get "quarantined" and I would get an email notification, or a Slack notification if urgent.
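Roughly, the shape of it was something like this (the quarantine directory and the notification hook here are simplified placeholders):

```python
import json
import logging
import pathlib

log = logging.getLogger(__name__)
QUARANTINE_DIR = pathlib.Path("quarantine")  # placeholder location for bad records


def run_pipeline(record, validate, transform, load):
    """1. validate, 2. transform if needed, 3. load; invalid data is quarantined, not dropped."""
    errors = validate(record)
    if errors:
        QUARANTINE_DIR.mkdir(exist_ok=True)
        path = QUARANTINE_DIR / f"{record.get('id', 'unknown')}.json"
        path.write_text(json.dumps({"record": record, "errors": errors}))
        log.error("quarantined record %s: %s", record.get("id"), errors)  # email/Slack hook fires here
        return False
    load(transform(record))
    return True
```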
I don't generally write a lot of unit tests, but all my validation and transformation logic would have 100% coverage, because things always change and you need to make sure future updates never break the system.