Ask HN: How do you make sure your servers are up as a single founder?
448 points by thr0waway998877 on Nov 6, 2019 | 223 comments
I'm running a small business on AWS as a solo founder. It's just me. Yesterday I had a service interruption while I was in the London subway. Luckily, I was able to sign in to the AWS console and resolve the issue.

But it does (again) raise the question I'd rather not think about. What if something happens to me and there's another outage that I can't fix?

So - how do you make sure that your servers are up as a one-person founder? Can I pay someone to monitor my AWS deploy and make sure it's healthy?




I am a solo founder of a website monitoring SaaS [0]. Theoretically, my uptime should be higher than my customers'. Here are a few things that I found helpful in the course of running my business:

* Redundancy. If you process background jobs, have multiple workers listening on the same queues (preferably in different regions or availability zones). Run multiple web servers and put them behind a load balancer. If you use AWS RDS or Heroku Postgres, use Multi-AZ deployment. Be mindful of your costs though, because they can skyrocket fast.

* Minimize moving parts (e.g. databases, servers, etc.). If possible, separate your marketing site from your web app. Prefer static sites over dynamic ones.

* Don't deploy within 2 hours of going to sleep (or leaving your desk). 2 hours is usually enough to spot a botched deploy.

* Try to use managed services as much as possible. As a solo founder, you probably have better things to focus on. As I mentioned before, keep an eye on your costs.

* Write unit/integration/system tests. Have good coverage, but don't beat yourself up for not having 100%.

* Monitor your infrastructure and set up alerts. Whenever my logs match a predefined regex pattern (e.g. "fatal" OR "exception" OR "error"), I get notified immediately. To be sure that alerts reach you, route them to multiple channels (e.g. email, SMS, Slack, etc.). Obviously, I'm biased here.
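A minimal Python sketch of the match-and-fan-out idea (illustrative only; the notifier callables are placeholders, not part of any real setup):

    import re

    ALERT_RE = re.compile(r"fatal|exception|error", re.IGNORECASE)

    def maybe_alert(log_line, notifiers):
        # notifiers: e.g. [send_email, send_sms, post_to_slack] -- fan the alert out
        # so one dead channel doesn't mean sleeping through an incident.
        if ALERT_RE.search(log_line):
            for notify in notifiers:
                notify(f"log alert: {log_line.strip()}")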

I'm not gonna lie, these things make me anxious, even to this day (it used to be worse). I take my laptop everywhere I go and make sure that my phone is always charged.

[0] https://tryhexadecimal.com


> Monitor your infrastructure and set up alerts [..] "fatal" OR "exception" OR "error"

I almost have the regex "fatal|invalid|unknown|error|except|critical|cannot" in muscle memory many years after having last had to type it - must have typed it thousands of times tailing and grepping logs :- )


Instead of having a regex that searches for error|critical|except, it's better to have a log level in your logging infrastructure, so that you can query for, say, log level=2 and get all the bad things.

It takes a bit of work to get this into the code and infrastructure everywhere, but it's worth it.
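For example, with Python's standard logging module the level travels with every record, so downstream filtering doesn't depend on keyword matching (a minimal sketch; the logger name and messages are made up):

    import logging

    # Attach a level to every record instead of grepping for keywords later.
    logging.basicConfig(
        format="%(asctime)s level=%(levelname)s %(name)s %(message)s",
        level=logging.INFO,
    )
    log = logging.getLogger("billing")

    log.info("charge succeeded customer_id=42")
    log.error("charge failed customer_id=42 reason=card_declined")
    # Downstream you query/alert on level=ERROR (or worse) rather than on a regex.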


In Ruby:

    # true if the message contains any of the listed keywords (case-sensitive)
    %w[fatal invalid unknown error except critical cannot].any? {|word| 'My critical error message'.include? word}
... may be better than a regex if it's just a list of literal strings. JavaScript also has something similar.


watch out for someone writing [FATAL] in a form somewhere and it getting logged :)


That's a smell. If you have to type a mouthful like that more than twice, put it in a script.


This covered most of my thoughts (also a solo founder). I also have a fallback server running that is an entire duplicate of the app running on my local machine (an in-house PC). When an error occurs, it fails over to my local machine.

At fail-over my local machine will also switch from the dev database to prod (connecting to the RDS instance). It'll only have read-only access to relevant data, but it'll keep the site 90% functional.

Is it perfect? Hell no. Pretty hacky, but it's kept my application running twice after a failure in the past two years.


This is scary.

This is exactly how I run my (also monitoring) SaaS and it shows that the OP has learned how to minimize risk and prepare for the worst.

You are married to your mobile & laptop.

It also shows there is no free lunch: you need to invest in redundancy.


From time to time I ponder all the terrible things that could happen: what can go wrong probably will go wrong, so it helps to be (kinda) prepared.


As a single owner (also of a monitoring SaaS), I am currently putting out a fire where my primary datacenter died. I have suffered data loss even with the precautions I had in place; one server lost all its filesystems, which took out my git repos.

I have backups, I have clones. I've still been in partial outage for 4 days and will be fully up tomorrow when I literally drive my servers to a new DC. Surprisingly I have slept 8 hours every night and I'm not worried. I've been in contact with my customers and provided solutions to keep them alive. If they leave, they're going to leave; nothing I can do. I am looking to make sure everything is built uniformly (the server that died was the last of an old build process) and to invest in scaling to the cloud in a more efficient and orderly manner.


Yes, however you are extremely and acutely aware of this. Your business depends on it.

This is not a long term solution though. You need to add people and take a break at some stage.


Offtopic: Really love how clean and readable the website is! Great work!


Indeedly!

Very clean and simple, but not trivial or bare. Really good job, jmstfv.


From this link [0], it appears that the site uses Netlify. I'd love for the OP to point us in the direction of the theme and/or setup details as I am fairly new to Netlify

[0]https://tryhexadecimal.com/costs


It is just static HTML and CSS files, with sprinkles of JS when necessary. I don't like using third-party frameworks or themes because they tend to get in my way. I designed it all from scratch.

I've been experimenting with functional (atomic) CSS lately and built a small framework that I ended up using in Hexadecimal (if you peek at the source code, you can see weird classes like m-0, p-0, etc.). I liked it, but one should be careful not to overuse it; otherwise, it becomes an unmaintainable mess, yet again.

Below, I linked my Netlify config [0]. My static files live in the src/ folder, so you should probably change that if you have them in a different folder. If you don't know what Content-Security-Policy (CSP) or Strict-Transport-Security (HSTS) headers do, leave them out. It is easy to shoot yourself in the foot if you don't know what you're doing.

[0] https://privatebin.net/?c794ba90771f6e10#GSnJJ3dnd5r3ZDGgi9R...


Your website looks awesome, I really like the clean design.

Have you considered a trial extending to their first outage notification, rather than 14 days?

I would imagine if they don't see an outage in 14 days they're not seeing the value.

However if their trial expired following their first 'site is down' reminder - that's where your value comes in.

Just my $2c


That's really interesting, it never occurred to me before (because I've never seen such a thing), but it makes sense. A lot of work happens in the background, and folks don't see the value until their website is down or their certificate is about to expire.

I'll investigate that. Appreciate the suggestion.


I really wish you all the best, jmstfv! The website is obviously great, but I love the fact that you have shared such great tips at https://tryhexadecimal.com/costs for solo founders / small business owners.


Honestly I've worked on teams with over 30 engineers that I wished abided by half of this (particularly the prefer-static-over-dynamic bit). This is just straight up good advice.


With regards to AWS RDS & EC2 multi-AZ deploys and cost... buy reserved instances! It is much cheaper if you can afford to pay up front for 1 or 3 years at a time.


I do consider that but I'm still early in this game. My AWS spending is $1.38/day on EC2 and $1.05/day on RDS, all covered by AWS credit: https://tryhexadecimal.com/costs


For EC2 you really should be using spot fleets and making sure your architecture can support instances coming and going.


I need to keep exactly two instances running (in separate AZs) and will never need to scale up beyond that...is it still worth investigating spot fleets in this case?


Nice website jmstfv. I have a question about how customers trust your monitoring SaaS. Say I have a website/service; my deployment config would probably be similar to what you have. If AWS goes down, both my site and your monitoring SaaS will go down. Can you please share some details about your architecture that make it more reliable? I am curious about what sort of customers you are targeting.


Background workers are geographically spread out across multiple AWS regions: https://tryhexadecimal.com/docs/workers#locations

It is highly unlikely that all of them will be knocked down at the same time.

If workers suspect that there is something wrong with your website, they will check it from all locations at once. The website is only marked as down if more than half of them agree. That alone knocks out most of the false positives: https://tryhexadecimal.com/docs/uptime#sequence-and-frequenc...
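That majority-vote rule is simple enough to sketch (illustrative only, not the actual implementation):

    def site_is_down(results):
        # results: one boolean per checking location, True = "looked down from here".
        # Require a strict majority so a single region's network blip doesn't page anyone.
        return sum(results) > len(results) / 2

    site_is_down([True, True, True, False, False])    # True: 3 of 5 locations agree
    site_is_down([True, False, False, False, False])  # False: probably a local blip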


This is a great post, thanks. On your last point, do you mind sharing what you use to monitor the logs and send out alerts?


I send my logs to Papertrail and they have a feature that will notify you if your logs match a predefined pattern. Some of the patterns that I use:

* Account deletion (DELETE FROM \"accounts\")

* New successful sign up (INSERT INTO "accounts")

* New signup attempt (Started POST "/signup") [0]

* Fatal Exception Error ("fatal" OR "exception" OR "error")

* Warning ("warn")

I do share my costs and tools publicly, in case you're interested: https://tryhexadecimal.com/costs

[0] I got hit by a botnet attack a couple of weeks ago, so I keep tabs on their activity (https://news.ycombinator.com/item?id=21327416)


Just curious; what do you use Redis for? Thanks for your post!


Jobs (e.g. uptime checks, payment processing, sending emails) are added to the job queue (Redis), and background workers pick them up from there and churn through them asynchronously. Most of the work happens in the background, so in my case, Redis is a critical piece of infrastructure.

https://devcenter.heroku.com/articles/background-jobs-queuei...
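Whatever the actual stack, the producer/worker shape is the same. A sketch with Python's rq, purely for illustration (the tasks module is hypothetical):

    # Web process: push work onto Redis instead of doing it inline.
    from redis import Redis
    from rq import Queue

    q = Queue("checks", connection=Redis())

    def schedule_uptime_check(url):
        # A separate worker process (`rq worker checks`) must be able to import
        # tasks.check_uptime; it picks the job up and runs it asynchronously.
        q.enqueue("tasks.check_uptime", url)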


Ah, of course! I saw background jobs and Redis but didn't make the connection at the time.


> I was able to sign in to the AWS console and resolve the issue

Kids these days.

I had a RAM stick fry in one of the physical machines sitting in a colo an hour's drive away. Not die, but just start flipping bits here and there, triggering the most bizarre alerts you can imagine. On the night of December 24th. Now, that was fun.

--- To add ---

If you are a single founder - expect downtime and expect it to be stressful. Inhale, exhale, fix it, explain, apologize and then make changes to try and prevent it from happening again. Little by little, weak points will get fortified or eliminated and the risk of "incidents" will go down. There's no silver bullet, but with experience things become easier and less scary.


Effective apologizing is the #1 business skill of the one-person tech company.


That reminds me of the time we had a DIMM actually melt on the 22nd December http://fanf2.user.srcf.net/hermes/doc/misc/orange-fire/


Thank you for sharing, that's a real nightmare before Christmas story!


I hope you kept it ;)


Isn't this exactly the case where you could have avoided this hassle entirely had you shelled out some cash for ECC memory?


Live and learn is what I think the takeaway of this story is all about... I had a server fail Dec 25 mid-morning. It caused failures in a way I hadn't thought about before, because instead of appearing completely dead it was alive enough to not let go of any TCP connections. For the critical component in question, I didn't have the correct timeouts in place... so as the single operator I was fortunate that my wife was also my co-founder and so was a bit more understanding.


And for everyone else who doesn't feel like doing that, Amazon Lambda and a server in your closet are two other options.

If your workload doesn't fit into such a setup, then you simply need to take investment, optimize your code, or move closer to your colo.


Reminds me of an int32 <> int64 mismatch that overflowed precisely also on the night of Dec 24th!


Kids these days.

Insert punchcard anecdote...

Transistors! We had to replace valves in my day.


geez... why do these things always need to happen right around the holidays. I mean, the disaster couldn't wait another week or two?


You might also want to consider some additional risks that are often overlooked:

Billing issues. What happens if the credit card you use to pay for everything gets hijacked, and you're trapped with a blocked card trying to clean it up but your bank is taking their sweet time and won't give you another card until it's sorted? ALWAYS have a backup credit card.

DNS Registrar. There's a hard SPOF in the DNS, where your registrar essentially holds your domain name hostage. If your DNS gets hijacked, but your registrar is taking a few days to sort out who actually owns it, you're down hard. There's no mitigation for this one, except paying for a registrar with proper security processes. If you do 3FA anywhere, make it here.

AppStore. If your app gets banned, or a critical update blocked, what do you do? Building in a fallback URL (using a different domain name, with a different registrar) can help work around any backend issues. There's not much you can do for the frontend functionality, except using a webapp.

It can be worthwhile looking at risks and possible mitigations beyond just server and database issues, especially when it's just you.


> What happens if the credit card you use to pay for everything gets hijacked, and you're trapped with a blocked card trying to clean it up but your bank is taking their sweet time and won't give you another card until it's sorted?

This happened to me this week. Luckily the bank got me the new card within 2 business days, but it was still a bit stressful and I burned a day getting my payment info updated everywhere.


> What happens if the credit card you use to pay for everything gets hijacked, and you're trapped with a blocked card trying to clean it up but your bank is taking their sweet time and won't give you another card until it's sorted?

It doesn't even have to get hijacked; our bank reissued cards because of some bug in the old chip. The card numbers stayed the same and the expiration dates changed, but Google Cloud blocked the paying accounts. Security is the top priority, sure, but I don't want some proprietary algorithm deciding whether or not our server will be up. The fix was to move away from Google, because bouncing around their support took too long.


I know this is going to be down-voted to nonexistence since everyone nowadays wants to go serverless, AWS and what not. Personally I've always used either hosting.com or inmotionhosting.com. Yes, they are more expensive than AWS and what not, but the thing is, they both have a support staff 24/7/365. I call whenever I need and have someone remote into my server and fix whatever is wrong. Furthermore, I can even have the server alert emails routed not only to me, but to them as well! So they know about the problem and are on it, and I don't have to do a thing.


I don't see why this would be considered something to downvote. You're essentially hiring an outsourced IT staff by going with a service like this, which is probably something that can be leveraged a very long way and will remain cheaper than bringing on full-time staff or engaging contracted IT consulting.


This is really what I'd suggest to the OP, not necessarily THESE services but SOME form of out-sourced support. You can add technical redundancy until the cows come home and some bug is still going to stop things, probably with the worst possible timing. Until they've figured out enough AI to solve this issue (which would be expensive anyway, I'm sure), I suggest adding human redundancy.


Sounds like a very sensible solution.


The only way to achieve high availability is to have redundancy of all things.

Random things will go wrong that you can't predict. Boxes will die suddenly and without reason, even after months of working fine without changes, and always at the worst possible moment. Your system needs to be built to withstand that.

I'll take the opposite approach of everyone here and recommend against serverless, kubernetes, and Heroku/PAAS.

You are a solo founder. You should understand your infra from the ground up (note: not understand an API, or a config syntax, but how the underlying systems actually work in great detail). It needs to be simple conceptually for you to do that. If anything goes wrong, you need to be able to identify the cause and fix it quickly.

I've gone through this first-hand and know all the trade-offs. If you'd like, I'm happy to discuss architecture decisions on a call. Email is in my profile.


No you don’t need to understand your infra from the ground up - especially as a solo founder. You should offload as much of the grunt work as you can afford to so you can concentrate on your business domain.

If something “goes wrong” or you don’t understand how to implement something with managed services, support is just a ticket and a live chat/phone call away. I can speak from personal experience that AWS business support is great even when there isn’t a problem and you just want an “easy button” for someone to tell you what’s wrong with your configuration.


It depends on the service I think. I've had ECS errors that took AWS support days to figure out (turns out some permission quota thing was overriding some ECS thing).

All in all, I think maybe I might have to find some other batch processing system.


If you are using regular ECS with EC2 (as opposed to Fargate), it just provisions a regular EC2 instance with an agent already installed. You can ssh/rdp into the instance and troubleshoot it.

But yeah, I did have a doozy of an issue with ECS, but it was completely my fault. I created a cross-account policy for ECR but left out the account that actually contained the registry. Then my containers were in a private subnet without any access to the internet (by design, they were behind a load balancer), so they couldn't get to the ECR endpoint. I just had to either assign a public IP address or use a private link.

Support helped me with both.


As a solo bootstrapping technical founder, I am in LOVE with Heroku.

My product is single-tenant, which can be tough infrastructure-wise since each app/customer needs a cluster of servers and services (Postgres and RabbitMQ). Heroku pipelines enable me to have a testing app and a staging app, and when I want to push staging to production, I can push the code to all my single-tenant production apps with the push of a button.

In theory I could do this with Bitbucket or Gitlab CI/CD pipelines, but this enabled me to focus on app development instead of devops.

That's just my preference, of course.


> The only way to achieve high availability is to have redundancy of all things.

It depends on your availability requirements. For me, it's more important (and simpler) to be able to redeploy (with data) in under an hour than to deal with HA.


IMO a single founder doesn't need "high availability" as the term is usually used (five-nines stuff with high operational complexity). If you can get it cheaply by using e.g. Heroku, it's a nice bonus.


I build my stuff on top of a stack that hardly ever goes down.

All my SaaS products run on a Windows server, with SQL Server as a database and ASP.NET on IIS running the public sites. You can probably come up with a lot of uncharitable things to say about those technologies, but "flimsy" and "fragile" likely aren't in the list.

As a result, when things go seriously wrong, the application pool will recycle itself and the site will spring back to life a few seconds later. Actual "downtime", of the sort that I learn about before it has fixed itself, might happen maybe once every couple of years. At least, I seem to remember it having happened at least once or twice in the last 15 years of running this way.

There's a Staging box in the cage, spun up and ready to go at a moment's notice, in case that ever changes. But thus far it has led a very lonely life.


I agree. I've spent close to 20 years hosting on IIS and have never had downtime attributable to it - especially since the app pool will recycle. Most of the issues I've seen were caused by people who didn't know how to use EF correctly or by badly tuned databases/queries.

Even now, the only “Linux deployments” I do are either with Lambda or Fargate (Serverless Docker). I just don’t like managing servers. These days even my EC2 instances are disposable.


Code Red was 18 years ago, Blaster 16, the Witty worm 15. All ate IIS servers for breakfast in highly automated fashion. 20 years of IIS with no downtime indeed.


I know about the Unicode hack - where you could basically encode DOS commands in the URL and they would run on the server. We got lucky. It renamed all pages using the default names. We didn't have any.


Webpages on the server should be read-only... just in case!


I’m curious, how much did you have to pay for your licenses? If you are a single founder with a new startup, it seems like you’d need a bunch of money to be able to use Microsoft in production, unless you can use something like BizSpark (if that’s still a thing?).


If using Microsoft, be very wary of the cost of understanding the licensing - especially for the "free" options.

Many years ago we used BizSpark, and we wasted far too many days and dollars trying to understand and use the licensing (and using the abysmal web interface that was mandatory). We couldn't manage to jump through the right hoops to use products in production, even though we were the target audience for the products when we started our business. The licences had very broad exceptions that gave Microsoft the right to pull the plug at any time.

I'm sure things are different now, but I'm also sure they are the same.


BizSpark is not still a thing. The replacement program, Microsoft for Startups, requires being associated with a partnered "startup-enabling organization."


A standard-sized Azure App Service runs about $75 a month. That will let you host about as many ASP.NET sites on it as you want - and there's no VM that you have to manage. Scaling up or out is pretty painless if you need to.

A SQL database on Azure runs about $5-$15 a month, depending on how you configure it.

Once you throw in SSL certs and DNS management, you're probably looking at about $100 a month.


For SQL Database, the lowest I could find right now costs around $600. Can you please share a link for the $5-15 price range?

https://azure.microsoft.com/en-us/pricing/details/sql-databa...


You'll want to switch to the Single Database and change the pricing model from vCore to DTU. Then you can pick Basic or Standard and adjust the size and compute resources.

Basic will do quite a lot; I use one to load test our application and haven't hit limits.


Thank you


I use Leaseweb and they include a line item for the license price. It is all very affordable. I pay about $65 a month for a Windows 2016 server that I run IIS and SQL Server on for all of my projects I have going on. Very fast, affordable, etc. My SQL Server came from MS Bizspark for free (MSDN).


A Contabo VM with Windows Server 2016 Datacenter (includes SQL Server 2016) costs me 16€/month.


Last I knew, Visual Studio Professional was about $500 for a single user. SQL Server can get expensive depending on what you're hosting, but IIS and ASP.NET don't cost anything. All you'd need is a Windows box for those.


SQL Server on RDS includes the SQL Server license and is very cost effective for what you get. (Backups, point in time restore, etc). They also offer EC2 with SQL Server but I’d rather pay the slight premium for the additional management.


Not much. Maybe a few grand every few years when the hardware needs upgrading. Not really enough to notice in the big picture.


FWIW - if you just want to make sure your services are up - consider:

1) pagerduty.com or uptimerobot.com for remote monitoring to make sure your site(s) are up (and get alerts when they're not).

2) Datadog or New Relic if you want deeper monitoring (application performance, database performance, diagnostics/debugging).

3) Rollbar.com (site doesn't seem to respond) for site performance/errors.

4) Roll your own with Prometheus (https://prometheus.io/), or Nagios (https://www.nagios.org/)/Icinga. Or... strangely - I still use MRTG for a few perf monitoring things: https://oss.oetiker.ch/mrtg/

5) If you want to monitor the status of deploys/builds - I love integrating CI/CD systems with Slack - very helpful.

Hope that helps - I've spent a lot of my career monitoring things, and have this mantra that I need to know about services being down before customers call to tell me.

(a lot of these have free tiers)


> Nagios, MRTG

I am used to rolling my own and use Nagios to make sure my servers and web sites are up and that URLs and scripts are functioning.

I used to use MRTG and RRDTool in days past when I was responsible for monitoring many servers, switches and routers.


Right there with you :)


StatusCake and PingMonit are two more good ones in the same vein as UptimeRobot. All have free tiers for testing and small scale/homelab use too.


There are plenty of options and plenty free pricing tiers. I use multiple ones for redundancy and comparison. One of them ended up producing false alerts, surprisingly.


FYI - if you would like some help setting up any of these - drop me a note (email in profile) - happy to advise, most of this is pretty easy.


Back when I was working on everything myself, I deployed everything through AWS Lambda and API Gateway, with all my static assets on S3 and CloudFront. I had exactly zero infrastructure issues over the course of two years and never dealt with security patches, SSH'ing, etc. If I were doing billions of requests, it may not have been the most cost effective, but it helped me scale without worrying about typical devops issues. Updates, testing, rollbacks, etc were also extremely easy.


How do you keep things fast?

Lambda functions can have cold starts that introduce latency. How do you manage that?

(From my small amount of experience - please prove me wrong.)


If you're doing a SPA, you can paper over the cold starts a bit, since the app itself will render, and it'll be background requests to load data that are impacted.

That still sucks, so then you can (hopefully) cache some things so that _some_ data begins to stream in. Or you can make it so your very high priority stuff has minimal dependencies - you can get a Lambda cold start in < 1s if your app only uses the standard library.

But still, in my experience, cold starts are a thing. If you have a high-traffic app or use Lambda warming, you decrease the # of people who experience a cold start, but at the end of the day, your p99 is going to be worse than a vanilla VM solution, because _some_ people will get cold starts. For some apps, that's OK - think line of business app where the first few pages can be served from static materials or cached materials, and you trigger the requests in the background.


First, caching. Can be done via API Gateway or Cloudfront.

Second, minimizing the function's code size, both by making small functions and then optimizing them (there are plugins for the serverless framework to do that).

Third, using a language with a minimized cold start. Some can have cold start latencies lower than 1 second [^1]

[1]: https://mikhail.io/serverless/coldstarts/aws/


If latency is a big deal, don’t use lambda. I went from not knowing anything about Docker to having a Fargate (Serverless Docker) NodeJS app running in less than a day using a walkthrough I found.

But then again, I did already know the other fiddly bits of AWS.


I am going to add that Lambda cold starts are pretty overblown unless you really do have an application that is super sensitive to latency (i.e. you are doing stock trading or something else pretty latency crazy). Most applications can use Lambda and pretty much never worry about cold starts. Consistent traffic keeps Lambdas warm, and I've seen services at their busiest suffer less than 1% cold starts.


It will only cold start the first time. You also need to make sure that your Lambda reuses the existing DB connection when receiving multiple requests, to save time.
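The usual pattern is to create the connection outside the handler so warm invocations reuse it. A sketch assuming a MySQL-backed function using pymysql (an assumption; any client with a reusable connection works the same way):

    import os
    import pymysql  # assumption: MySQL + pymysql; swap in whatever client you use

    _conn = None  # lives for the life of the Lambda container, not a single invocation

    def get_conn():
        global _conn
        if _conn is None or not _conn.open:
            _conn = pymysql.connect(
                host=os.environ["DB_HOST"],
                user=os.environ["DB_USER"],
                password=os.environ["DB_PASS"],
                database=os.environ["DB_NAME"],
            )
        return _conn

    def handler(event, context):
        # The cold start pays the connection cost once; warm invocations skip it.
        with get_conn().cursor() as cur:
            cur.execute("SELECT 1")
        return {"statusCode": 200}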


I set up a CloudWatch event rule to ping the function every 10 minutes, which keeps the function warm without costing much at all.


Did you ever have any problems with the function concurrency limit?


It’s a soft limit. You can request a larger limit and get it increased within 30 minutes.


Awesome thanks! Sorry to bug but this is very interesting to me. Did you use a framework like Serverless?


No. I’m of the belief that when you choose a platform, go all in. Given a choice between the Serverless platform and AWS’s Serverless Application Model, I would choose SAM.

I also work for a company that pays for the business support plan, they will help you to an extent with any weirdness in their own platform - but not with third party utilities.

Another advantage with using SAM, is that you can create and configure your lambda/API Gateway from the web console and then export your SAM/CloudFormation template.

All that being said, for APIs, I don't use either. I would recommend using standard frameworks like C#/WebAPI, JS/Node, or Python/Flask/Django instead and using the proxy integration libraries that AWS provides. It lets you develop/debug locally like you are accustomed to, and it gives you the optionality to move your APIs to Fargate (Serverless Docker) or EC2 instances without any code changes:

C#/Web API

https://aws.amazon.com/blogs/developer/deploy-an-existing-as...

Node:

https://github.com/awslabs/aws-serverless-express

Python/Flask

https://dev.to/apcelent/deploying-flask-on-aws-lambda-4k42

I’ve been told that there are similar frameworks for other languages but those are the three that I use.

If you have any other questions, feel free to email me. My address is in my profile. I’m not a consultant trying to sell anything....


Thanks so much! This is really interesting. I think acloud.guru has a similar set up and they pay basically nothing for their millions of QPS. Going to research this more for my next project.


I've run a one-person business on my own servers for about eight years.

Honestly, the answer is learning how to manage anxiety and stress, particularly doing potentially destructive things under pressure. I think the psychological aspects of this are much more difficult than the technical ones.

If it helps, people are generally very understanding if you explain that you are a solo founder, and take reasonable steps to fix issues in a timely way. Most customers assume every company is a faceless organization; their attitude is much more forgiving when they learn they're dealing with a fellow person.

You cannot be on call 24/7 forever. You will burn out. If you can't hire someone you trust to take over part of this burden, then you have to accept the risk of sometimes not being able to log in for N hours if there is an outage (because you're camping with your spouse, etc.)

For very high-stress situations (database crash, recovery from backup) working from a checklist that you have tested is very valuable.

Good luck to you, and I hope you found useful answers in this thread!


Totally agreed on the point that you just won't be available all the time. You can set up all the alerting you want, and high availability across different zones and whatnot, but when sh*t hits the fan, you will still be the only person in charge of putting everything back together.

I concur that hiring someone, if you can afford it, even part time, would be a great idea.


Heroku. I just pay for the privilege of not thinking (almost) about such issues.


Heroku looks really expensive if you need e.g. a private network per client (compared to AWS VPC). Between $12k-$36k if I'm reading this correctly: https://elements.heroku.com/addons/heroku-private-spaces


Yes, if you need that.

Many many businesses can survive on a $25 a month 1x dyno and the free database tier (and free sendgrid, free newrelic, free whatever other addon) and can be pretty sure their site will basically never go down.

It's a fantastic security blanket, but... as your business grows you'll start to pay. The question then becomes "If I do all this myself on AWS (or Azure or GCP), how much time am I going to sacrifice from building my business to deal with random infrastructure crap?" Or "at what point does it make sense to hire someone to focus on all this infrastructure, and how much would that cost vs. just paying for Heroku?"


Also services like platform.sh and others.


I agree in general with the responses encouraging better usage of managed platforms. I've run a SaaS app for a couple of years using a combination of AWS Elasticbeanstalk (Flask and Django) and AWS Lambda. Server resource related downtime has been minimal and recovery is quick/automated. Even hosting on Lambda you can run into issues without layers of redundancy (Lambda may be fine but a Route 53 outage would prevent you from hitting that endpoint if you're using that for DNS).

Before thinking about handing over management of the deployment, I would encourage you to think about what the root cause of the outage is and whether something in the app will create that situation again. I invested in setting up DataDog monitoring for all hosts with alerts on key resource metrics that were causing issues (CPU was biggest issue for me).

The other thing that's worked well for me is just keeping things simple. As a solo founder, time spent with customers is more valuable than time spent on infrastructure (assuming all is running well). It's a little dated, but I still think this is a good path to follow as you're building your customer base. A simple stack will let you spend more time learning how your product can help your customers best.

http://highscalability.com/blog/2016/1/11/a-beginners-guide-...


We are considering Datadog, and nothing else seems to compare to them, but they seem extremely expensive. As a small startup/solo founder, did your implementation justify costs?


Datadog is the best all around tool with APM and log monitoring built in. I highly recommend it as it will be substantially cheaper than paying for separate monitoring, apm, logs, etc. The support is also a lot more receptive towards small business owners unlike other platforms such as New Relic.

Also if you send custom stats to datadog such as login activity, you can use their Anomaly detection to find suspicious behavior.

Disclaimer: we use datadog at my company and have tried all the other popular options. Hands down datadog is the most feature rich and user friendly.


Yes, it's been an incredibly valuable tool for me. I get the most out of the host monitoring and the logging / alerting. The APM is nice, but I'm not using it nearly as much as the other pieces.

Just getting logging centralized alone has saved me tons of time, which is in turn more time spent on the product. I've been able to use the log parsing to setup metrics that tell me when an outside integration is acting up and isolate which paths. Take a day to really learn how their logs work and you'll be able to generate metrics / advanced event alerting in no time.

I was hesitant to pay the premium, but the peace of mind has been worth it. You can piece the same thing together with open source tooling. But then you've got another thing to manage.


Plenty of people are happy using the industry-standard open-source tools:

- metrics system (Prometheus, InfluxDB, Graphite, etc.)

- dashboards (Grafana)

- alerts (both the metrics system and Grafana can handle this)

If you're not a big-data company, you can self-host instances of these products reliably.

Disclosure: My startup, https://HostedMetrics.com, provides turnkey hosted versions of these software packages.

I'd be happy to explain the options that are out there and provide any advice you need. Get in touch: contact info is in my profile.


Your product doesn’t even come close to what datadog offers....


Datadog has ~9000 customers. I imagine the respective open-source tools have many more customers than that, which says something about the perceived value of the two approaches. Also, Datadog wraps many tools into one whereas the open-source solutions have singular purposes. It's an apples to oranges comparison.


$15 per server per month isn't that extreme. I've used Datadog in production for 3+ years (starting from a headcount of 10) and can honestly say they've been worth every penny. Observability is irreplaceable.


Most of the suggestions here are about ways of restarting services when they go down, which is a good start, but that doesn't actually solve the issue I hit last night...

My system integrates with an external system and what happened is this external system started sending me unexpected data, which my system wasn't able to handle, because I didn't expect it so never thought to test for it -- the issue was that I was trying to insert IDs into a uuid database field, but this new data had non-uuid IDs. Because the original IDs were always generated by me, I was able to guarantee that the data was correct, but this new data was not generated by me. Of course, sufficient defensive programming would have avoided this as this database error shouldn't have prevented other stuff from working, but my point is that mistakes get made (we're humans after all) and things do get overlooked.

The problem is, restarting my service doesn't prevent this external data from getting received again, so it would simply break again as soon as more is sent and the system would be in this endless reboot loop until a human fixes the root cause.

That's a problem that I worry about, no matter how hard I try to make my system auto-healing and resilient (I don't know of any way to fix it other than putting great care into programming defensively), but again, we're human, so something will always slip through eventually...

Some people are suggesting to out-source an on-call person. That seems to me like the only way around this particular case. (The other suggestions can still be used to reduce the amount of times this person gets paged, though)
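For what it's worth, the guard itself is tiny once you decide to distrust the input. A hedged Python sketch (assuming the IDs arrive as strings in a payload; not the code that actually broke):

    import uuid

    def parse_external_id(raw):
        # Validate an ID from a third-party payload before it gets near the database.
        try:
            return uuid.UUID(str(raw))
        except ValueError:
            # Quarantine/log the record instead of letting one bad row take down
            # the whole update stream.
            return None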


Always treat third-party systems like they're full of nitroglycerin. Double check all response codes, expect the unexpected, degrade gracefully when it hits the fan. You're always better off serving up a nice 500 error page than spinning forever or returning a false positive to users. And make sure you have a clear SLA with them and can escalate/mitigate/compensate when they don't fulfill it.


This. Write guards like the external integration is an active malicious adversary -- because when they change api, go down, have their own issues they may as well be attacking your integration.


This is exactly the reason why restarts shouldn't ever be considered a fix, as I've elaborated on a bit more in my other comment in this thread. They fix nothing, but give the illusion of it.


Glib answer: Don't work alone!

There's two ways to think about this:

1 - Your product might actually be too complex for a single-person business. You could rotate being on call for situations like this. This means that you'd have to make sure that sales are big enough to support an additional partner or two.

2 - Perhaps you need to simplify your product? Think more critically about error handling? I don't know the details about this part of your service, but if I assume that these bad UUIDs came from HTTP POSTs, why does a series of wonky HTTP posts bring down your entire service? Typically, something like this would trigger some kind of unhandled error that's caught higher up in your web framework and returns some kind of 5xx error.

This paragraph is very C# centric, but it should translate to other languages as well: Typically, I layer my error handling. Each operation is wrapped in a general exception handler that catches EVERYTHING and has some very basic logging. (ASP.Net does this and returns a 5xx error if your code has an unhandled exception.) Furthermore, as I get closer to actual operations that can fail, I catch exceptions that I can anticipate. Finally, I have basic sanity checks for things like making sure a string is really a UUID.

Without knowing much of your service's architecture, it just sounds like you need some high-level error handling. You probably have 100s of other little weird bugs, so high level error handling needs to do the equivalent of returning a 5xx error and logging, so you can fix it when you're able to.


My point was less about the specific issue I hit and more that 1) external circumstances that a restart won't resolve can cause failures, because 2) we're human and no matter how hard we try, even with a large team, things do slip through.

The difference with having a large team is less that all possible failure cases will get protected against (although more eyes and code review does help), but more that someone can always be available to fix it when something unexpected happens.

In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system. The error actually was localised to one particular type of update, but that type stopped working because I didn't protect defensively enough against errors in that one particular case (I do have my database queries protected against errors, but this one slipped through). This caused other systems to not get these updates, so things that relied on them stopped working. It's not that they crashed; they just never received the updates they were waiting for.

Of course the fix is to trap all exceptions, log/notify, ignore and continue, so that one piece of bad data doesn't affect other updates, but again, my main point was that we're human, so we can't possibly protect against everything that might cause a non-recoverable (without human intervention) error.

> Finally, I have basic sanity checks for things like making sure a string is really a UUID

Yes, I did add this too after I hit this issue, and it's a good point: validate EVERYTHING even if you generate it and think you can assume it will be good.

> Don't work alone!

That's the real solution, but sometimes its not possible.

Thanks for your detailed response, though, it's appreciated.


> In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system.

Is your product too complicated for a single-person business?

As a solo programmer, I can write and develop extremely complicated systems. These systems can be so complicated that I don't have time to run them, find customers, support customers, etc.

That, ultimately, is why I don't see myself running a single-person business anytime soon. I really enjoy complicated programming, and if I have to also handle ops, support, sales, etc., then what I program needs to be too simple to remain interesting.


Defensive programming (or I call it fail-safe programming) is a must for any type of service / daemon.

I employ a healthy dose of exception trapping and logging, and I get an email whenever it happens.

People aren't perfect, but you can anticipate a lot of failures. It usually involves bad data as in your case. Each time you get bitten, change your code so it fails gracefully.


Agreed. Be as defensive as possible, trap all exceptions (this allowed me to identify the problem and fix it very quickly, but I still had to step in and fix it) and validate absolutely everything no matter how unlikely it seems. Also go over every system and ask "what if an unexpected error happens, will it take the system down? will it prevent other requests/tasks/users from working?"


Also, if you are running on linux, check out supervisor. I use it to keep my WebAPI's up. If it crashes, it starts it up again.


For situations like this, it's extremely useful to have a human-review queue that the automated system can drop jobs into. It keeps the rest of the system running, and lets the support staff (you) fix the problem during normal business hours.

This is how we handled errors in our video archiving system at Justin.tv, and to my knowledge we never lost a single frame that made it to the broadcast servers. The raw bits were streamed to disk as they came in, and only got removed once the VOD servers had the final version— any errors would retry a few times and then get flagged for manual processing. We did have a few close calls where something broke the whole archiving system, and the broadcast servers came dangerously close to shutting down due to full disks, though.
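A rough sketch of that retry-then-flag-for-review shape (the handler and queue objects here are stand-ins, not the actual Justin.tv code):

    MAX_ATTEMPTS = 3

    def process(job, handle, work_queue, review_queue):
        # handle: callable doing the real work; queues: anything with a .push(job) method.
        try:
            handle(job)
        except Exception as exc:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < MAX_ATTEMPTS:
                work_queue.push(job)      # transient failure: retry later
            else:
                job["last_error"] = repr(exc)
                review_queue.push(job)    # give up; a human reviews it during business hours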


I used to run a service solo that processed data from multiple external sources and as you said you need to program defensively when dealing with input.

I handled it with a pipeline that did the following: 1. validate, 2. transform the data if needed, 3. load the data. If validation failed, the data would get "quarantined" and I would get an email notification, or a Slack notification if urgent.

I don't generally write a lot of unit tests, but all my validation and transformation logic would have 100% coverage because things always change and you need to make sure future updates never break the system


If you're monitoring runtime exceptions in production, you should be able to set alerts with your exception logging system. Here's how to do it for Sentry. https://sentry.io/_/resources/customer-success/alert-rules/


If you’re using AWS I want to assume you didn’t go for a cheaper solution (e.g. VPS from a reputable company) because you like the managed solutions that they provide, among other reasons.

I assume also you want a simple way to increase reliability while keeping costs within reasonable limits.

Well, AWS can give you all that if you don’t want to go super fancy. Check Beanstalk to get something simple and reliable. Monitor using CloudWatch. Make sure to leverage redundancy options (multi az, multi region if worth it, etc). These are some general tips but with the information that you provide that’s all I can say.

You can also pay a consultant to get a review of your setup and get some recommendations. It won’t be cheap but it depends how much you value your time and your product.


Some of the comments are suggesting totally different technologies. Don’t do that. You can stay on AWS and achieve the reliability you need. This isn’t the sort of problem that should lead you to rebuild your whole stack.

The question you should be asking is, how can I make my service automatically recover from this problem. It depends why exactly it crashed. If a simple restart fixes the problem, there are different ways you can automate this process, like Kubernetes or just writing scripts.

I’m happy to give more detailed advice if you would like, my email is in my profile.


Your main concern is of course limited time/resources, so you'll have to make compromises.

The question is not whether your system will fail, the question is when.

Have proper monitoring and alerting in place.

But don't over engineer it, sometimes everything seems technically fine, but your support inbox will start getting user complaints.

Resolve the issue, figure out the root cause, make sure this or similar stuff won't happen, apologise to the affected users if necessary, and move on.

You'll learn waaay more failure modes of your application running in the wild, than just thinking about "what could go wrong".

It's a long game of becoming a better developer/devops guy, and not repeating the same mistakes in the future.


+1 -- most of the comments are about minimizing downtime, which you should of course do to the extent practical, but at some point, whether you're a one-employee company or not, if you have no internet access and your servers are down you have to keep calm and accept that there will be some downtime and it's not the end of the world. You may even be surprised how few customers notice anything went wrong (depending on the kind of service you're running).


I would say that as a one-person founder, know that you cannot ever get 100% uptime and live with it. In the most simplistic sense, you need to sleep 8 hours a day; you cannot live your life constantly stressed about uptime. Just generally have internet access, and accept that sometimes your service will go down.

On the set up, try your best to solve issues and use tried and true hardware, but things go down sometimes, even big sites like Google, Facebook go down, there is no silver bullet, you can only improve on your past mistakes.

Last, try to find some remote help, on a contract basis, it's not that expensive and it can help alleviate a lot of your stress.


I use uptime robot http://uptimerobot.com for monitoring, they have a free plan or paid if you want faster checks.

If it's truly critical to have no down time then you probably need to build that resilience in to your architecture.


+1 for UptimeRobot. Learned about it right here on HN:

https://news.ycombinator.com/item?id=6576250


You need monitoring. But it will not keep your servers from dying.

Using serverless, PaaS like Heroku or similar will help.


I currently run a batch of trading servers solo. The trading system is a C++ process with an asynchronous logger that prints log levels and times. One of the issues with trading is that you're dependent on your datafeed and exchange connections working which is out of your control.

I use a python monitoring script that tails logs watching for ALERT level log lines and constant order activity combined with a cron watchjob to ensure the process is alive during trading hours. The exception handler in the monitoring script sends alerts if the script itself dies.

If there are any issues I use twilio to text me the exception text/log line. I also use AWS SES to email myself but getting gmail to permanently not block SES is a pain in the ass. By design Twilio + AWS SES are the only external dependencies I have for the monitoring system (too bad SES sucks).
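The shape of that watcher, sketched in Python (not the actual script; the log path, phone numbers and credentials are placeholders):

    import time
    from twilio.rest import Client

    client = Client("ACCOUNT_SID", "AUTH_TOKEN")  # placeholder credentials

    def text_me(body):
        client.messages.create(to="+15550001111", from_="+15550002222", body=body[:1600])

    def watch(path):
        with open(path) as f:
            f.seek(0, 2)                  # start at end of file, like tail -f
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                if "ALERT" in line:
                    text_me(line.strip())

    try:
        watch("/var/log/trader.log")
    except Exception as exc:
        text_me(f"monitor script died: {exc!r}")  # the watcher failing is itself an alert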

On my phone I have Termius SSH setup so I can log in and check/fix things. I have a bunch of short aliases in my .profile on the trading server to do the most common stuff so that I can type them easily from my phone.

I also do all my work through a compressed SSH tmux including editing and compiling code. So if things get hairy I can pair my phone with my laptop, attach to the tmux right where I left off, and fix things over even a 3G connection.

This compressed SSH trick is a huge quality of life improvement compared to previous finance jobs I've worked where they use Windows + Citrix/RDP just to launch a Putty session into a Linux machine. It's almost like finance IT has never actually had to fix anything while away from work.


If it helps at all, I found that buying a dedicated IP for SES helped our deliverability enormously.


I basically don't manage any servers. Everything runs on AWS Lambda & co (DynamoDB, S3,...)

It doesn't prevent an app-level outage (corrupted data in the database, bad architecture,...) but at least I don't have to worry about servers going down anymore.

As for the rest, unit & extensive integration tests along with continuous integration and linting. Oh, and a typed language. Moving from Javascript to Typescript was a blessing. But I still miss Swift.


We are a very small team at https://codeinterview.io. We recently achieved a respectable level of reliability with a tiny team. Some things you should do:

- At least have a pool of 2 instances (ideally per service) running under an auto-scaler or a managed K8s (GKE is best) with an LB in front. You may also want to explore EBS and Google Cloud Run. If you can use them, use them!

- Uptime alerts. pingdom (or newrelic alerts) with pagerduty added.

- Health checks! The trick is to recover the failed container/pod/service before you get that PagerDuty call. Ideally, if you have 2 of each service running, #2 will handle the requests until #1 is recreated. (A minimal health-endpoint sketch follows this list.)

- Sentry + New Relic APM + infra: You should monitor all error stack traces, request throughput, and average response time. For infra, you mainly need to watch memory and CPU usage. Also, for each downtime, you should have greater visibility into what caused it. You should set alerts on higher-than-normal memory usage so you can prevent the crash.

- Logs: your server logs should be stored somewhere (Stackdriver on GCloud or CloudWatch on AWS).
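The health-endpoint sketch mentioned above (assuming Flask; the dependency checks are placeholders for whatever your service actually needs):

    from flask import Flask, jsonify

    app = Flask(__name__)

    def can_reach_db():
        return True   # placeholder: e.g. run SELECT 1 with a short timeout

    def can_reach_queue():
        return True   # placeholder: e.g. Redis PING with a short timeout

    @app.route("/healthz")
    def healthz():
        checks = {"db": can_reach_db(), "queue": can_reach_queue()}
        status = 200 if all(checks.values()) else 503
        # A non-200 response is what lets the LB / orchestrator pull and replace
        # this instance before anyone gets paged.
        return jsonify(checks), status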

These might sound overwhelming for a single person but these are one time efforts after which they are mostly automatic.


One thing that has helped me a lot with monitoring is custom application-level metrics.

If you have a good idea of the usage patterns of your service, create metrics backed by the patterns. This can help you find things that CPU/Memory will hide.
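A sketch of what that can look like, using prometheus_client as one example (metric names are made up; the same idea works with StatsD or Datadog custom metrics):

    from prometheus_client import Counter, Histogram, start_http_server

    signups = Counter("app_signups_total", "Successful signups")
    checkout_seconds = Histogram("app_checkout_seconds", "Checkout duration in seconds")

    def handle_signup():
        ...  # real signup logic goes here
        signups.inc()

    # Expose /metrics for scraping, then alert on things like "signups flat for two
    # hours during peak" -- a failure mode CPU/memory graphs will never show you.
    start_http_server(9100)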


1. Stay on AWS only.

2. Pay for a Business Support plan. https://aws.amazon.com/premiumsupport/pricing/

3. Call business support about something simple ("how do I restart my server?") - so you know how to file a ticket and get a feel for how quick the response is and how it works.

Do not overthink this, e.g. with Terraform templates.


re: terraform, that's only part of the picture right? What do you recommend for provisioning? I assume Terraform + Packer? I looked into these a while back and they seemed good.

My only concern was that my target was very low-cost setups, and I wanted something like Packer but that let me provision multiple images onto a single machine. E.g., if I just used Packer, as far as I could tell I would have to have 1 machine per image. It sounds odd, but I didn't want to pay $5*Services, especially when the number of users and load was very small. Being able to deploy in a PaaS fashion to something like Docker on the machine seemed best.

But then I was looking at Terraform + Packer + DockerSomething, and things went back to feeling less simple.


Don't do Terraform if you are already on AWS; use CloudFormation. You are paying for the business support plan, and they aren't going to support your Terraform deployment.

I haven’t used Packer, but could you use CodeBuild/CodeDeploy/CodePipeline? Again if you’re on AWS you might as well take advantage of their support.

You can deploy Docker images using CloudFormation, either to Fargate (and not have to worry about servers) or to EC2 instances (I haven't tried the latter).


For your case I recommend you use Poor Man’s High Availability method, an auto scaling group of size 1.


Interesting, I wouldn't have expected this answer. Can you elaborate?


Autoscaling with a min/max of 1 will cause the instance to terminate after a number of failed health checks and start a new instance.
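Roughly what that looks like with boto3 (all names, subnets and the target group ARN below are placeholders):

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="solo-app",
        MinSize=1, MaxSize=1, DesiredCapacity=1,
        LaunchTemplate={"LaunchTemplateName": "solo-app-template", "Version": "$Latest"},
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # two AZs, so a replacement can land elsewhere
        TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/solo-app/abc123"],
        HealthCheckType="ELB",          # replace on failed LB health checks, not just EC2 status
        HealthCheckGracePeriod=120,
    )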


can you say more about this? pointers?


Yes, there are plenty of start-ups doing this. You can also use AWS's built-in functionality to achieve this. You can write a Lambda function which checks your server status - or, even better, one which calls your endpoint for a health check if you want more detailed monitoring.


I solo ran a web hosting service way back in 2000-2003, well before cloud, when it was mostly LAMP and cPanel. It was super mission-critical stuff for 20,000 sites and I was totally winging it. As it grew I got totally paranoid about uptime. Long story short, at some point there's no substitute for getting a human to help back you up. I had a company that I paid $250 a month that helped monitor and would jump into my servers to troubleshoot if I was unavailable. They were rarely needed, and when they were it was usually just an Apache restart or similar. Best money I ever spent.


Another option is just to not tackle systems that require 24/7 uptime IF you are just one person. Instead, make an installable product or do a service that's not interactive or real-time.

I've been in the game for a while and every time I run across an idea for a service, there's always a question of whether I'd be OK with sleeping with a pager, remoting to the servers at 4 am on Saturday and generally be slaved to the business. The answer, upon some reflection, is inevitably No. This is the domain of teams.


I wrote up a "technical continuity plan" that describes how to keep my web sites and APIs in maintenance mode in the event of my untimely demise. It has a list of bare-minimum things to do in the following week, month, and year, and describes the various third-party relationships and how to go about hiring a replacement administrator. I shared the doc with a few close friends. I hope it's not needed in the future, but just writing the doc was a useful exercise for me in the present.


You have identified a single point of failure (yourself), you either need to accept the risk or hire a person on retainer.

I'm in the same boat with my solo founder projects (links in profile).


I have a few things in production — two SaaS, one customer-facing subscription site. I run these all myself with no staff or contractors.

The short answer: I'm married to my phone/laptop.

My test coverage is good. I use managed services when possible so I don't need to play sysadmin. I don't deploy before I leave for something (dinner, shower), and I have some pretty good redundancy across all my services. If one node goes down, I'm safe. If four go down (incredibly unlikely), well, fuck, at least my database was backed up and verified an hour ago.

I invested a large amount of time into admin-y stuff. My admin-y stuff is solid and I can tweak/configure/CRUD anything on the fly. I credit being able to relax to that admin-y stuff. Obviously, if shit really hits the fan with hardware or an OS bug, I need to get to my laptop. But over the last six years, I haven't had to do that yet, and hopefully I won't have to.

I've explored adding staff — mainly for day-to-day operations — but I like interfacing with my customers, and I credit growing things to where they are to being in the trenches with them. Things haven't always gone smoothly, and my customers always let me know, but any issues are normally swiftly resolved.

The scale of one of my products is non-trivial and it has a ton of moving parts — some of which I have no control over and which could change at any time and break _everything_. It sounds terrifying, and it is, but I've made a habit of checking things before peak hours. If something's amiss, a quick fix is usually all it takes.


I'm not a solo founder, but I run a number of servers that are heavily used - all with different software with varying amounts of reliability. I also allow other people to deploy code without checking with me first, just to keep things fun.

I have a few pieces of advice:

1. Make sure your service can safely fail and be restarted. What I mean is, if somebody is POST'ing data or making database changes, make sure you handle this safely and attempt some recovery. Something not being fully processed is okay as long as you are able to handle it.

2. Self-monitoring. I run all my systems inside a simple bash loop that restarts them and pops me an email (i.e. "X restarted at Y", and then "X is failing to start" if it keeps happening); a minimal sketch is at the end of this comment.

3. External monitoring via a machine at home that rolls the server back to a previous binary (also on the server). It also pulls various logs from the server, as well as the binaries, so they can be analyzed. Okay, it has some reduced functionality, but it's stable and will keep things going until the problem is fixed.

4. Make sure your service fails gracefully - i.e. returns a `{"status":"bad"}` string or something, or defaults to an "Under maintenance, please come back soon" page. Your service going down is one thing; becoming completely unresponsive is quite another.

One thing I can't prepare for (and it happens more often than you'd think) is the server itself crashing, which, as you say, means I'm randomly logging into a VPS console and rebooting. I use a bunch of different VPS providers and every one of them has a slightly different console.
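The supervision loop mentioned in point 2 can be as small as this (just a sketch; the service name, the address, and the assumption of a working local `mail` setup are all placeholders):

    # Keep ./myservice running; email on every restart.
    while true; do
        ./myservice
        echo "myservice exited at $(date), restarting" \
            | mail -s "myservice restarted" ops@example.com
        sleep 5
    done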


Just to add to the voices that are saying "by not having any". If you can get away with edge, lambda, or heroku ... even if it's in the short term, do.


Other people are suggesting alternative platforms when you could simply have an AWS autoscaling group. If a server goes down it simply relaunches a saved image.


It's a good solution, but it's a bitch-and-a-half to set up.


It boils down to sending some sort of notification so first responders know about the issue ASAP.

You can do it at the OS level. On Windows, for example, you can use Event Viewer to attach a task to a specific type of log captured by the OS; that task can then invoke a small app that sends an email whenever an error is logged, or something like that.

    Application-specific issues:
        You can manually capture exceptions raised within the app and send
        notifications. There are many clever ways to do this that neither hinder
        performance nor pollute your code base with exception handling, e.g.
        spawning "fire and forget" threads that send the notifications.
        Let me know if you need more ideas here.

    Integration tests:
        Given that you've built a strong suite of integration tests covering all
        the functionality of your app, you can have them run every 15 minutes or
        so and send notifications if any test fails (see the crontab sketch below).

You can also use monitoring tools; I know Azure offers ways to help with this. Reach out if you want more ideas or more specific solutions.
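For the integration-tests-on-a-schedule idea, the crontab entry could be as simple as this (the paths, script name, and address are placeholders, and it assumes a local MTA for `mail`):

    # Run the suite every 15 minutes; email only when it fails.
    */15 * * * * cd /srv/myapp && ./run_integration_tests.sh || echo "integration tests failed at $(date)" | mail -s "ALERT: integration tests failing" ops@example.com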


Yes, you can pay managed hosting providers different amounts of money for different levels of support.

Managed support will usually only monitor and fix basic infrastructure and respond to support requests from you. They often won't monitor or fix your applications/services; for that you can set up your own application monitoring and tests. New Relic is a good all-in-one choice, but there are plenty more out there. To get called during an incident, you'd also want something like PagerDuty.

To avoid service outages in general, you want to hook up some kind of monitor to something that restarts your services and infrastructure. This only fixes crashes; it won't fix issues like disks filling up, network outages, application bugs, too many hits to your service, etc.

You should be able to find small businesses that specialize in selling support contracts at all levels of support. By signing a contract and onboarding a 24/7 support technician, you can get them to fix basically whatever you need when it goes down. I don't have suggestions for these, but maybe someone else does (it used to be common for SMBs in the 2000s).


It really depends on the failure mode and the cost of failure. As mentioned by others you can encounter issues in external services which you have no control over and the best you can do in that case is fail gracefully until you're able to deal with the issue. If it's easy to detect failure, and a restart fixes the problem, it can be quite straightforward to set up some monitoring scripts that take care of this for you, and even if it's more complicated than a restart some monitoring can at least notify you by email or SMS. Keeping your tech simple and/or having high test coverage or formal verification can reduce your error rate. Similarly you can introduce fault tolerance into the system with something like Erlang's OTP or monitored containers in an orchestrator (K8s, Docker Swarm, some cloud solution). If failures are expensive you might want to take on staff to deal with them, if the cost is low you might just want to accept occasional downtime (though you'll want to think about how you report that to your users).


+1 for Erlang. Learning how to write OTP apps in Erlang taught me so much about building reliable systems.


Operations guy here; I'm probably biased.

If I were you, I would use a free monitoring service like UptimeRobot. There are other options available; typically these services provide some basic functionality for free, which would be enough for a small business.

On AWS it is quite easy to create your own external probes for a reasonable price. However, it would require some basic programming skills.


Are you using a Node.js backend? This is a little script that I set up with cron on a second instance, which logs in and restarts the Node server if it's down (the bracketed placeholders are yours to fill in):

    #!/bin/bash
    # Fetch the home page and check that it still contains the expected <title>.
    thisHtml=$(curl -s "[your site's web address]")

    if [[ $thisHtml != *"<title>[your site's title]</title>"* ]]; then
        # echo "Server is down"

        # Kill any stuck Node processes, then start the app again with forever.
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "killall -9 node"'
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "export PATH=/root/.nvm/versions/node/v8.11.2/bin:/usr/lib/node:/usr/local/bin/node:/usr/bin/node:$PATH && forever start /var/www/html/[...]/index.js"'

        # Record when the restart happened.
        rebootDate=$(TZ=":[your time zone]" date '+%Y-%m-%d %H:%M:%S')
        echo "$rebootDate" >> "/home/ec2-user/serverMonitoring/devRestarts.txt"
    fi
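For reference, the matching crontab entry on the second instance could look like this (the script path and name are hypothetical):

    # Check every 5 minutes.
    */5 * * * * /home/ec2-user/serverMonitoring/checkServer.sh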


> Can I pay someone to monitor my AWS deploy and make sure it's healthy?

Yes. There are consulting shops that will do this, as will many of the monitoring tools listed in this thread (though those tools will not fix the problem for you). Broadly speaking, there is a cost associated with this, as well as a cost associated with your downtime. If the cost of your downtime (reputational risk, SLA credits, etc.) outweighs the cost of hiring someone to cut your MTTR to 5 minutes (assuming you can playbook out all of the relevant scenarios), plus some value for stress reduction, then you should do this. If you've been doing this a while, you can math it out. In my experience, though, an outside person is unlikely to be able to fix an "unknown unknown"; they just won't know your environment as well as you do.

All that said, one hour of service interruption a year is still better than most.


Redundancy. Failure should always be an option. Specific answers will depend on your stack. Nobody will be able to monitor and react like you will because all IT solutions are their own species of butterfly with their own intricacies. If uptime is really that important you might be at the stage where you need to take on an employee.


I highly recommend Kubernetes as infrastructure. It has a reputation for being too complex to use on your own or with simple projects but that reputation is undeserved. Self-healing container orchestration has been eye-opening for me. Many people groan at the prospect of learning something new but it is remarkably easy to use, the only barrier to entry being the high cost of cloud solutions and the unwillingness of many engineers to work with hardware (which would nullify the cost of cloud services). You can easily develop and test on local hardware and deploy to the cloud with the exact same configuration.

The idea that your server does not perform regular health checks or spin itself back up when it fails just seems weird to me now. I like being spoiled.


Having run K8s a bit on my own... please don't run your own cluster. K8s is great and has a lot of wonderful features that help immensely with automation (e.g. you tell it "I want to run 3 copies of this software and here's how to check its health", and if it sees unhealthy containers or fewer than 3 copies running, it fixes the problem). However, K8s is a bear to run in its own right. As a single-person company you don't have time for that. Use GKE or similar.


You are misinformed. MicroK8s and K3s can get a full-fledged Kubernetes setup running in less than a minute. I run a five-node cluster at home and it is a breeze.

Edit: I forgot about HypriotOS, which is another viable option. MicroK8s has Canonical's backing, so it has quickly become the most well-supported option. It advocates single-node clusters for development but supports multi-node as well.


I am leaning toward learning k8s seriously and am curious about your take: is the overhead of learning and maintaining a k8s cluster actually better than using AWS features like autoscaling coupled with health checks?


I find my home cluster laughably easy to set up, and I am still new to home clusters. This is an area of rapid development. I could wipe the hard drives and have it back up in less than 30 minutes, not counting the time to rewrite the OS images. I just have some machines running Ubuntu Server, with one running Ubuntu Desktop (because I like it). Each node needs a single shell command and it's good to go. In the past I have done things like sharing a hard drive connected to the master with every node, and using interface forwarding so the master acts as a DHCP server for the workers; if I ever need either of those again, I can set them up quickly.

To maintain portability you need to pay attention to CPU architecture and the underlying OS. Just make sure that your images are built for the same architecture you will use in the cloud and that you have the right OS and version. x86_64 is ubiquitous (though I think AWS offers an ARM server option), so images built on a Raspberry Pi cluster might be a headache to deploy.

For production you will want to use a cloud provider (or not, it’s not the craziest idea to run a home server if your use case permits) and you can do so for development if you can afford it.

The real advantages are service discovery, monitoring options, stateful support and resource allocation. I used Kubernetes Up and Running 2nd Edition, and if you don’t know Docker yet I would work through the Docker Book first because you will want to write your own images. Just make sure that everything lives in the same repository: a single filesystem as the single source of truth.


Don't maintain the cluster. Have someone else run it for you. Unless you want K8s cluster to be the only thing you do.

The advantage of K8s is that it abstracts so much away from you that you should (in theory) be able to take the same YAML config file from AWS EKS to GCP GKE to Azure AKS... and it runs the same everywhere. Things like load balancing and HTTP ingress rules that would normally require manual configuration on each platform become part of the K8s config.


> Don't maintain the cluster. Have someone else run it for you. Unless you want K8s cluster to be the only thing you do.

This can be as simple as just using Google's GKE service.


I've been thinking about this quite a bit lately. I've run DevOps for a few organizations and learned quite a bit through that.

Ultimately you can engineer your systems, even if they are quite complex, to be manageable by a single person. It's not one thing, though; it's years of experience and gut feel, and it's largely independent of the specific technology.

Some things that come to mind:

- use queues for background tasks that may need to be retried. If things go down and you have liberal retry policies, things should recover.

- use boring databases. Just stay away from mongo and use something like rds which is proven and reliable.

- be careful in your code about what an error is. Log only things at the error level you need to look at.

- test driven development. Saves a ton of time.


You start by identifying the reasons why your application/service may fail, then design and implement infrastructure that can withstand those failures at a cost you can bear. If the failure of a piece of infrastructure costs you £1 per day, you might be OK with paying £5/day for infrastructure that handles such a failure. But would you be OK with paying £50 for the same thing?

It's all the matter of defining requirements, then solutions and tradeoffs of those solutions and then implementing it with best practices in mind (automation, testing, monitoring, backups, etc.).

Hit me up if you want to discuss it over a pint! :)


Very simple: you need to buy a higher level of service. Instead of paying for servers, pay for uptime. That's what a PaaS, a managed service, or serverless does: manage the servers for you, at scale. To keep something online you need:

- servers

- VM/OS management

- a scalability system

- monitoring (hardware, OS, application and functional)

- action on monitoring and escalation management

- updates every week

- observability

That's what we provide at Clever Cloud BTW https://www.clever-cloud.com/


Use Heroku as long as it's cost-effective. Every time I've moved from Heroku to another platform for "cost savings", I've ended up spending much more time than I'd planned just maintaining it.


Does your service actually need to have an incredible uptime? What would be the worst that would happen if the service was down let say 24 hours?

I feel like we over-engineer that part. Sure, there are plenty of services where you don't want any downtime and it makes sense to over-engineer (like any monitoring service), but for many SaaS products, the worst that will happen is a few emails.

Maybe write a simple SLA, something with an 8-hour response time for these kinds of outages. If some clients require more, then sell them a better SLA at a higher cost. That should let you invest in better response times.


I rent dedicated servers at Hetzner.

No cloud machines, no hosted cloud services for production beyond DNS.

* 3 machines in separate data centers (equivalent of AWS AZs) for >= 30 EUR/month each. ECC RAM.

* These machines are /very/ reliable. Uptimes of > 300 days are common; reboots happen only for the relevant kernel updates.

* Triple-redundancy Postgres synchronous replication with automatic failover (using Stolon), CephFS as distributed file system. I claim this is the only state you need for most businesses at the beginning. Anything that's not state is easy to make redundant.

* Failure of 1 node can be tolerated, failure of 2 nodes means I go read-only.

* Almost all server code is in Haskell. 0 crash bugs in 4 years.

* DNS based failover using multi-A-response Route53 health checks. If a machine stops serving HTTP, it gets removed from DNS within 10 seconds.

* External monitoring: StatusCake that triggers Slack (vibrates my phone), and after short delay PagerDuty if something is down from the perspective of site visitors.

* Internal monitoring: Consul health checks with consul-alerts that monitor every internal service (each of the 3 Postgres, CephFS, web servers) and ping on Slack if one is down. This is to notice when the system falls into 2-redundancy which is not visible to site visitors.

* I regularly test that both forms of monitoring work and send alerts.

* Everything is configured declaratively with NixOS and deployed with NixOps. Config changes and rollbacks deploy within 5 seconds.

* In case of total disaster at Hetzner, the entire production infrastructure can be deployed to AWS within 15 minutes, using the same NixOps setup but with a different backend. All state is backed up regularly into 2 other countries.

* DB, CephFS and web servers are plain processes supervised by systemd. No Docker or other containers, which allows for easier debugging using strace etc. All systemd services are overridden to restart without systemd's default restart limit, so they come back reliably after network failures or out-of-memory situations (see the sketch after this list).

* No proprietary software or hosted services that I cannot debug.

* I set up PagerDuty on Android to override any phone silencing. If it triggers at night, I had to wake up. This motivated me to bring the system to zero alerts very quickly. In the beginning it was tough but I think it paid off given that now I get alerts only every couple months at worst.

* I investigate any downtime or surprising behaviour until a reason is found. "Tire kicking" restarts that magically fix things are not accepted. In the beginning that takes time but after a while you end up with very reliable systems without surprises.
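(On the systemd point above, a drop-in override along these lines is one way to lift the restart limit; the unit name is a placeholder and this is a sketch, not the poster's actual config. Run `systemctl daemon-reload` after creating it.)

    # /etc/systemd/system/myservice.service.d/override.conf
    [Unit]
    StartLimitIntervalSec=0

    [Service]
    Restart=always
    RestartSec=5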

Result: Zero observable downtimes in the last years that were not caused by me deploying wrong configurations.

The total cost of this can be around 100 EUR/month, or 400 EUR/month if you want really beefy servers with fast SSDs, large HDDs, and GPUs.

There are a few ways I'd like to improve this setup in the future, but it's enough for the current needs.

I still take my laptop everywhere to be safe, but didn't have to make use of that for a while.


Very well-thought-out infra and nice metrics. What kind of application are you running, if I may ask?


Computer vision, specifically reconstruction of 3D models from 2D photos, as a service.


I use Heroku for https://keygen.sh. Sometimes it pisses me off how big the bill is (~$1.5k/mo atm), but the net time savings are still worth it to me. I usually spend a total of 0 hours a month on managing servers/infra, and less than an hour a day on support. I'm thinking I'll move to AWS eventually to maximize margins, but right now this really works for me.


Sure, the person you pay is AWS.

You enable them to do it for you by creating HA infrastructure. Start by creating an autoscaling group that enforces a certain number of working application endpoints. You probably need an ALB too. An app endpoint that fails its health check causes the ASG to spin up another instance and auto-register it with the ALB. (You can snapshot your configured and working app endpoint as the base image; see the sketch below.)
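Baking that base image can be as simple as the following (a hedged sketch; the instance ID and AMI name are placeholders):

    # Snapshot the known-good instance into an AMI that the ASG's launch
    # template or launch configuration can point at.
    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name "my-app-base-$(date +%Y%m%d)" \
        --no-reboot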


I'd love to recommend pingdom, or a service like it. I'm in no way affiliated with them, just a very happy customer and one of those products where I'm jelly I didn't come up with the idea. It integrates very nicely with pagerduty and slack/sms, etc.

It's just extra redundancy in case something like cloudwatch (which you should use -- with ELBs) also goes down.


I used AWS Cloudwatch and some simple server side checks (ianheggie/health_check for Rails is great) for a very long time.

It's not perfect, but it's (1) cheap, (2) easy, (3) quick (the mythical trifecta). It misses some issues caused by high load (where the service is still technically available), but it works perfectly when things actually crash (like queue workers deciding to turn off).


I've struggled with this for years. AWS is not foolproof, and with environments for web, Android, and iOS, availability gremlins have killed much of my spirit, despite users proclaiming that they've been looking for a service like mine for years.

Docker, Elastic Beanstalk, SNS, and the hidden world of AWS instance performance are all a PITA. Oh yeah, certs too...

I'd welcome help as well.


I've used runscope.com and I love it. I don't know about their pricing so can't tell if it's suitable for someone in your situation but I'm sure there are tons of similar services. You could also build your own with Lambda and hope AWS is reliable enough to keep Lambda running. (Who monitors the monitoring tools? :) )


Services like heroku try to solve this problem.


I'm working on FormAPI [1] as a solo founder. I started on Heroku, but it was a bit unreliable and I had some random outages that I couldn't predict or control (even while using dynos in their professional tier).

I also had a lot of free AWS credits, so I migrated to AWS. I didn't want to write all my terraform templates from scratch, so I spent a lot of time looking for something that already existed, and I found Convox [2].

Convox provides an open source PaaS [3] that you can install into your own AWS account, and it works amazingly well. It uses a lot of AWS services instead of reinventing the wheel (CloudFormation, ECS, Fargate, EC2, S3). It also helps you provision any resources you need (S3 buckets, RDS, ElastiCache), and everything is set up with production-ready defaults.

I've been able to achieve 100% uptime for over 12 months, and I barely need to think about my infrastructure. There have even been a few failed deployments where I needed to go into CloudFormation manually and roll something back (totally my fault), but ECS kept the old version running without any downtime. Convox is also rolling out support for EKS, so I'm planning to switch from ECS to Kubernetes in the near future (and Convox should make that completely painless, since they handle everything behind the scenes).

[1] https://formapi.io

[2] https://convox.com

[3] https://github.com/convox/rack


Convox looks nice! My only issue is that it seems tied to AWS?

It would be nice to have a stateless tool abstract Terraform a bit, to let you use more providers as a basic PaaS.


I made my own monitor (it checks every minute): http://monitor.rupy.se

It also warns me if the CPU load goes over 80%.

For the first two years after going live I had this hardwired to my Pebble via real-time mail, but now I know my platform is robust, so I can choose to worry about other things.


Sounds like there is a business opportunity here: a kind of DevOps-as-a-service, though to make it scale you'd probably need customers to architect their systems in a certain way.

(though this is essentially a single-line comment, it's earnest, not intended to be sarcastic)


I build my projects on Google App Engine and it has been stable and reliable without much administration. The platform is not without its challenges, especially with the Gen 2 rollout, but no issues related to administration/interruption. PaaS could be a good place to explore...


Seconding this: if you don't need lots of resources it makes sense. I pair it with Amazon CloudFront and so far, almost one year in, zero problems. By far the biggest win for me is the peace of mind.


My applications, which are built with Laravel, are deployed through Laravel Forge. There is definitely an extra charge for it, but having Forge simplify deployment really saves me time, especially when there's an issue.

For monitoring, I am using Stackdriver, which has easy-to-use health checks.


Have you thought about hiring someone remote in the same or different timezone to be on-call for outages? I'm sure there are many people around that would be able to help with this. You could hire someone on a retainer who can be on-call via PagerDuty or something.


There are a lot of whatever-as-a-service offerings that can relieve you of updating, patching, and restarting. But if your troubles originate as bugs in software, poorly formatted data, or something along those lines, then human supervision is probably the only solution.


I've been using Linode's managed service, about $100 a month per server. If something goes wrong they have access and can triage, or let me know if they can't fix it. It's been very helpful, especially since they have (excellent) phone support.


TICK stack. Literally everything handling or supporting production traffic should be monitored.


I think another consideration that might not be an obvious risk is your use of two factor auth.

It’s important for critical services, yet if you lose your 2FA device, like a phone, you will be locked out for a while. Like many things, it will happen at a bad time.


AWS is not "your servers", it's "your services". How you monitor, manage, and set up redundancy/recovery is going to be very very different between running real servers or just paying AWS for semi-managed services.


For a single-server setup, write bash scripts that check whether the server is down and bring it back up if it is.

Also, send errors through a chat platform like Telegram so you get notified of problems and can keep an eye on the servers. A rough sketch of both ideas follows.
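Something like this, run periodically (e.g. from cron) on the server; the health URL, the service name, and the BOT_TOKEN/CHAT_ID variables are placeholders, and the Telegram sendMessage call assumes you've already created a bot:

    # Hypothetical watchdog: restart the web server if the health URL stops
    # answering, and report it via Telegram.
    if ! curl -fsS --max-time 10 "https://example.com/health" > /dev/null; then
        systemctl restart nginx
        curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
            -d chat_id="${CHAT_ID}" \
            --data-urlencode text="example.com health check failed, nginx restarted at $(date)"
    fi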


Use UptimeRobot to monitor your site. Have a scheduled job that pings healthchecks.io every 5 minutes. Configure both to email you if anything goes down.

These are both reactive, but at least you'll know when things break.
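The healthchecks.io side is just a cron entry that pings your check's URL (the UUID below is a placeholder); if the pings stop arriving, it alerts you:

    # Dead man's switch: healthchecks.io alerts you when this ping goes quiet.
    */5 * * * * curl -fsS --retry 3 "https://hc-ping.com/your-check-uuid" > /dev/null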


I guess it really depends on what AWS services you use. There are companies that can manage your AWS 24/7 for a fee.

Other options include using another service that offers 24/7 uptime. Obviously you pay more for that.


I have been using Convox for the last 3 years and it has been super reliable. Convox is essentially bring your own infra Heroku built on top of AWS ECS (Docker). I believe one of the founders is from Heroku.


Yes, you can, by outsourcing it to someone or scaling up your team of one as others have mentioned. I am currently in the process of rolling out a service that does this. Hit me up if you are still keen.


1) Set up a status page for what you want to monitor (e.g. /health/queuelength). 2) Point StatusCake at that URL. 3) Connect StatusCake to PagerDuty.

This approach is easy to implement and scales well.


Your bus factor is 1. Use managed services.

https://en.wikipedia.org/wiki/Bus_factor


Recommend doing the AWS certification training. AWS has redundancy built in with health monitoring and auto scaling groups etc. The training covers all this.


By not managing servers: use a PaaS such as Heroku. It significantly reduces devops work, allowing you more time to focus on what matters, i.e. product-market fit.


You can try AWS Elastic Beanstalk. It can recover from failures automatically by spawning new nodes behind the load balancer.


I outsource to Tummy.com. They've been terrific, and I don't have to worry about anything I don't want to.


I run a company that helps with this single founder scenario. We monitor your infrastructure, and resolve issues 24x7, along with other proactive items.

https://www.mnxsolutions.com/services/linux-server-managemen...

I’d be happy to chat with anyone, even if to provide some feedback or a quick audit to help you avoid the next outage.

- nick at mnxsolutions com


1) Have monitoring in place. 2) Never miss alerts; use something with multi-channel escalation like amixr.io.


Pieter Levels (founder of remoteok.io) hired a guy to monitor his servers for $2k a month.


The direct question was: can I pay someone to monitor my AWS? And the answer is yes. You want redundancy at every layer, including the human one. For 24x7 coverage you'll eventually need a team, but for now two people will do.

Funny enough I was just talking to someone who passed all his AWS certifications and was looking for some AWS work.


Don't sell things that require high availability as a solo founder.


I used to have a Telegram bot sending me events from supervisord: http://supervisord.org/events.html


After reading another recent post... Tinder for Founders, anyone?


Short answer: promising five nines of uptime is not a thing for startups. Downtime is going to happen, and you are going to be asleep, drunk, or otherwise unfit for emergency ops. It's not the end of the world. Happens to the best of us.

So given that, just do the right things to prevent things going down and get to a reasonable level of comfort.

I recently shut down the infrastructure for my (failed) startup. Some parts of that had been up and running for close to four years. We had some incidents over the years of course but nothing that impacted our business.

Simple things you can do:

- CI & CD + deployment automation. This is an investment, but having a reliable CI & CD pipeline means your deployments are automated and predictable. Easier if you do it from day 1.

- Have good tests. Sounds obvious, but you can't do CD without good tests. Writing good tests is a good skill to have. Many startups just wing it here, and if you don't get the funding to rewrite your software it may kill your startup.

- Have redundancy, i.e. two app servers instead of 1. Use availability zones. Have a sane DB that can survive a master outage.

- Have backups (verified ones) and a well-tested procedure & plan for restoring them.

- Pick your favorite cloud provider and go for hosted solutions for the infrastructure you need rather than saving a few pennies hosting shit yourself on some cheap rack server. I.e. use Amazon RDS or equivalent and don't reinvent the wheels of configuring, deploying, monitoring, operating, and backing that up. Your time (even if you had some, which you don't) is worth more than the cost of several years of using that, even if you only spend a few days on this. There's more to this stuff than apt-get install whatever and walking away.

- Make conservative/boring choices for infrastructure. I.e. use PostgreSQL instead of some relatively obscure NoSQL thingy. They both might work. PostgreSQL is a lot less likely to not work, and when that happens it's probably because of something you did. If you take risks with some parts, make a point of not taking risks with other parts. I.e. balance the risks.

- When stuff goes wrong, learn from it and don't let it happen again.

- Manage expectations for your users and customers. Don't promise them anything you can't deliver, like 5 nines. When shit goes wrong, be honest and open about it.

- Have a battle plan for when the worst happens. What do you do if some hacker gets into your system or your data center gets taken out by a comet or some other freak accident? Who do you call? What do you do? How would you find out? Hope for the best but definitely plan for the worst. When your servers are down, improvising is likely to cause more problems.


monit, keepalived, StatusCake; with more money, Datadog and New Relic help as well, along with PagerDuty.


hetrixtools.com is a great option for monitoring and real-time notifications


Heroku


IMO you will have to get outsourced on-call if your downtime tolerance is very, very low.

Otherwise I'd suggest religiously documenting your outage root causes and contemplating hard what could've avoided that outcome.

Then lastly for monitoring on the cheap:

Sentry.io - alerts.

Opsgenie - on-call management.

Heroku+new relic - heartbeat & performance.

tl;dr: keep your stack small and nimble and try to learn from past outages.


Yes, you can. Or try to automate as much as possible:

- add health check mechanisms

- if health check is broken => restart service

- if restarting the service doesn't help after X retries => redeploy the previous state (if one is available)

Try to use Kubernetes or Docker Swarm if possible, combined with Terraform


Restarting the service and redeploying it should be the absolute last resort and isn't really sound advice, mainly because you lose the invaluable crashed state of the system, which may be vital (sometimes logs are not enough) for discovering _why_ the system crashed in the first place and then delivering a fix for that particular issue. Once that's done, you incorporate the fix into your infrastructure automation (having which goes without saying), be it Ansible, Terraform, Kubernetes or whatever else.

Otherwise you allow the problem to persist and pile up with other issues (also "fixed" by restarts, I assume), and implementing automated restarts in that manner reduces not only your uptime, in an uncontrollable way, but also your code/infrastructure quality, increasing your tech debt beyond the point of recovery.

Friends don't let friends fix things by restarting them ;)


> Restarting the service and redeploying it should be the absolute last resort and isn't really sound advice [..]

I'm speaking in terms of AWS; translate to your chosen infrastructure.

At the bare minimum you should have two redundant servers behind an autoscaling group with a min/max of two and health checks.

When you need to get something back up now but want to keep the crash state, configure the crashed instance to be taken out of the autoscaling group without being terminated while a new instance starts up. You can then troubleshoot it (see the sketch below).
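One way to do that with the CLI (the group name and instance ID are placeholders) is to move the sick instance into Standby, which keeps it around for debugging while the ASG launches a replacement:

    aws autoscaling enter-standby \
        --auto-scaling-group-name my-app-asg \
        --instance-ids i-0123456789abcdef0 \
        --no-should-decrement-desired-capacity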


[deleted]


If you're a solo developer, why would you run something as complicated as K8s? I'm referring to dead-simple VMs.


> Restarting the service and redeploying it should be absolutely the last resort and aren't really sound advice, mainly, because you are losing the invaluable crashed state of the system, that may be vital

I assume you have a separate logging mechanism, where all logs are collected independently of the restarted service. Don't forget to log as much of the system's state as possible for post-mortem analysis.


One word: serverless. It's a bit more pricey, but the peace of mind is worth it.


GKE


I know the situation. I haven't got to the production stage yet, but I totally get it. Besides using Kubernetes, Nomad, or some other scheduler, you will always have to invest your own time to resolve issues manually. You could have triggers that invoke Ansible playbooks if you don't want to deal with any of the aforementioned, but in the end this type of business simply requires maintenance; there is no way around that. A real human being has to keep an eye on the entire architecture and make sure it is running as it is supposed to.



