Ask HN: Startup Devs - What's your biggest pain while managing cloud deployments?
59 points by tj1516 9 months ago | 92 comments
When you're a tiny team of developers working on something beyond MVP level and beyond Heroku, managing your CI, deployments/rollbacks, DBs etc. looks like a nightmare to me without DevOps expertise.

I want to understand what kind of challenges you all are facing with regard to this. And what tools and practices are you using to reduce this pain? E.g., how do you deploy resources? How do you define architecture? How do you manage your environments, observability, etc.?




Killer one that has shafted us is tying everything into popular deployment and CI tooling. Your product should be buildable, deployable and tested on a standalone workstation at any time. We have lost literally days fucking around with broken products which get their tentacles into that shit.

I'd use AWS + ECR + ECS personally. Stay away from kubernetes until you have a fairly large deployment as it requires huge administrative overhead to keep it up to date and well managed and a lot of knowledge overhead. Keep your containers entirely portable. Make sure you understand persistent storage and backups for any state.

Also stay the hell away from anything which is not portable between cloud vendors.

As for observability, carefully price up buy vs build because buy gets extremely expensive. Last thing you want is a $500k bill for logging one day.

And importantly every time you do something in any major public cloud, someone is making money out of you and it affects your bottom line. You need to develop an accounting mentality. Nothing is free. Build around that assumption. Favour simple, scalable architectures, not complex sprawling microservices.


> Killer one that has shafted us is tying everything into popular deployment and CI tooling. Your product should be buildable, deployable and tested on a standalone workstation at any time

I'm currently a VP Eng at a "growth stage" startup, and the state of things when I joined was that about 75% of a dev's time was spent fighting with the external services and data. Someone would spend an entire sprint on a feature, but most of the time was actually fighting with our IdP.

A huge effort and focus (that I had to beat into everyone's head) was that being able to a) run everything locally and/or b) have reasonable fakes for external dependencies means that we can spend far more time building actual product stuff and far less time debugging infra issues.
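
A minimal sketch of the "reasonable fakes" idea, in Python. The `IdentityProvider` interface, `FakeIdentityProvider` and `greet_user` are hypothetical names made up for illustration, not anything from a real codebase. The only point is that application code depends on a small interface, so an in-memory fake can stand in for the real IdP during local runs and tests.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class IdentityProvider(Protocol):
    """Hypothetical interface: the only IdP surface the app depends on."""

    def verify_token(self, token: str) -> Optional[str]:
        """Return the user id for a valid token, or None."""
        ...


@dataclass
class FakeIdentityProvider:
    """In-memory fake for local development and tests; no network involved."""

    tokens: Dict[str, str] = field(default_factory=dict)

    def issue_token(self, user_id: str) -> str:
        # Deterministic tokens keep test failures easy to read.
        token = f"fake-token-{user_id}"
        self.tokens[token] = user_id
        return token

    def verify_token(self, token: str) -> Optional[str]:
        return self.tokens.get(token)


def greet_user(idp: IdentityProvider, token: str) -> str:
    """Application code: it only knows about the interface, not the vendor."""
    user_id = idp.verify_token(token)
    if user_id is None:
        return "401 Unauthorized"
    return f"Hello, {user_id}"


local_idp = FakeIdentityProvider()
print(greet_user(local_idp, local_idp.issue_token("alice")))
```

A production adapter for the real IdP would implement the same `verify_token` method, so swapping it in is a wiring change rather than a rewrite.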


> Also stay the hell away from anything which is not portable between cloud vendors.

Doesn't ECS violate this? This is something I've preferred in theory about kubernetes. At least in theory, it's supported everywhere. But I have also wondered whether it would be just as "easy" in practice to move from ECS to another cloud, as it is with EKS. (That is, neither would actually be easy.)


The metadata and configuration does violate this but the containers are portable if you care enough to make sure that they don't depend on anything AWS specific.


Gotcha!


ECS is the way imo. I dislike the complexity in auth around cloud services but ECS is a really nice way to build scalable systems cheaply. I am a big fan of using the underlying Fargate spot machines within ECS.


I inherited a project on fly.io but before that I’ve used ECS as my primary platform. While ECS is kubernetes lite, fly is ECS but a little more dev friendly in the realm of not dealing with the complexities of IAM etc. It has other downsides, but it’s the sweet spot for my early stage products right now... although I do expect I’ll end up on ECS if this is going well a year from now.


If the rest of your infra is on AWS (e.g., S3/RDS/Dynamo), why not just start with ECS? Egress fees would seem to take a toll after a bit using Fly.io.


Mostly just because it’s there already, and after getting familiar with it I really like how fast it is to get a project started. ECS is great, but fly is 10x easier to deploy to.


I love ECS. But I'd also tip my cap to Azure App Service. They're both fully flexible models that utilize the core services of public cloud but limit your exposure to the bare necessities.

CDN->Load Balancer->autoscale docker->database. That serves 90% of use cases but you'll have access to the myriad other services they offer for anything else that comes up.


On GCP, Cloud Run is also a nice solution with minimal complexity. It's very similar to Fargate / ECS.


> I'd use AWS + ECR + ECS personally. Stay away from kubernetes until you have a fairly large deployment as it requires huge administrative overhead to keep it up to date and well managed and a lot of knowledge overhead.

> ...

> Also stay the hell away from anything which is not portable between cloud vendors.

But if your team happens to have Kubernetes expertise, I would say go for Kubernetes.

Anything after building container images will be somewhat cloud-specific regardless. And arguably it can be easier to hire infra/platform folks with Kubernetes skillsets later when you need to.


ECS is terribly underrated, I agree.

Most people don’t need kubernetes, and ECS + Terraform is very approachable and easy to get started with, especially with Fargate for compute.


There's a lot of noise when trying to learn the advanced features of each cloud provider's "way of doing XYZ." I think it helps to focus on the things worth protecting: secrets, credentials, code.

Who has access? How do we audit / rotate? How do we secure?

You can use this approach for each step along the way, how to secure secrets in your cloud? code? IaC? container deployments? CI/CD?

If we assume infra / app is code, the tooling matters a lot less. How do you provision certificates via IaC? How do you grant IAM to resources and how do you revoke?

There are examples like https://github.com/terraform-google-modules/terraform-exampl... of more advanced IaC architectures, but you can start as small or as complex as you want and evolve if done properly.

Personally, I love me some Kubernetes + ArgoCD (GitOps) + Google Workload Identity + Google Secret Manager, but I am 100% biased.


100% and going one step beyond: how do the answers to those Qs evolve as you grow from 1 engineer to 5 to 20 etc.


> There are examples like https://github.com/terraform-google-modules/terraform-exampl... of more advanced IaC architectures, but you can start as small or as complex as you want and evolve if done properly.

Is there something like this for AWS?


https://github.com/terraform-aws-modules

Check out the ECS repository for more complete examples.


- Keep it simple. Cloud providers offer hundreds of services - just because you can use them doesn't mean you should. A simple Docker container or a small VM might already be enough for you, and it's a lot less moving parts.

- Document everything. Hacking together some cloud infra is easy, maintaining it is a completely different story. You want to have clear and up-to-date documentation on what exactly you have running and why. If you don't have good docs, it'll be impossible to scale, refactor, or solve outages.

- Make sure people are trained. Cloud infra management gets complicated real fast. You want people to actually know what they are doing (no "hey, why is the bucket public" oopsies), and you want to avoid running into Bus Factor issues. Not everyone needs to be an expert, but you should always have one go-to person for questions who's actually capable, and one backup in case something happens to them.

- Watch your bills. Every individual part isn't too expensive, but it all adds up quickly. Make sure you do a sanity check every once in a while (no, $5k / month for a DB storing megabytes is not normal). Don't use scaling as a "magic fix-everything button" - sometimes your code just sucks.


That these days there are only two serious choices: deploy on a single-node Docker/podman machine, or Kubernetes.

I can do both, but for a bootstrapped solo business, Kubernetes is overkill and overengineered. What I would really love is a multi-node podman infrastructure, where I can scale out without having to deal with k8s and its infernal circus of YAML, Helm, etcd, kustomize, certificate rotation, etc.

Recently I had to set up a zero-downtime system for my app, I spent a week seriously considering a move to k3s, but the entire kubernetes ecosystem of churn frustrated me so much I simply wrote a custom script based on Caddy, regular container health checks and container cloning. Easier to understand, 20 lines of code and I don't have to sell my soul to the k8s devil just yet.
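
A rough sketch of what a clone, health-check, switch routine like that can look like, in Python. This is not the commenter's script. The four callables are hypothetical hooks that you would wire to your own container runtime and reverse proxy, which keeps the control flow readable and testable without Docker or Caddy installed.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], attempts: int = 10,
                       delay_seconds: float = 1.0) -> bool:
    """Poll a health check until it passes or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay_seconds)
    return False


def zero_downtime_swap(
    start_clone: Callable[[], str],
    is_healthy: Callable[[str], bool],
    point_proxy_at: Callable[[str], None],
    stop_container: Callable[[str], None],
    old_container: str,
    attempts: int = 10,
    delay_seconds: float = 1.0,
) -> str:
    """Return the name of the container serving traffic afterwards."""
    new_container = start_clone()
    if not wait_until_healthy(lambda: is_healthy(new_container),
                              attempts, delay_seconds):
        # The clone never became healthy: discard it. The old container
        # kept serving the whole time, so nothing user-visible happened.
        stop_container(new_container)
        return old_container
    # Only switch traffic once the clone is known to be healthy.
    point_proxy_at(new_container)
    stop_container(old_container)
    return new_container
```

The ordering is the whole trick: traffic moves only after the clone passes its health check, and the old container is stopped only after traffic has moved.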

Sadly, I don't think a startup can help make this better. I want a bona fide FOSS solution to this problem, not another tool to get handcuffed to. I seem to remember Red Hat was working on a prototype of a systemd-podman orchestration system to make it easy to deploy a single systemd unit onto multiple hosts, but I am unable to remember what it is called any more.

---

Also, I seem to be an outlier, judging from the rest of the comments, by running on dedicated servers. These days everybody is using one of the clouds and terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning as their competitor, as the cloud vendor itself is always trying to upsell its customers and reduce friction to as close to zero as possible.


> Recently I had to set up a zero-downtime system for my app, I spent a week seriously considering a move to k3s, but the entire kubernetes ecosystem of churn frustrated me so much I simply wrote a custom script based on Caddy, regular container health checks and container cloning. Easier to understand, 20 lines of code and I don't have to sell my soul to the k8s devil just yet.

I'd say it depends where you're coming from. For me, setting up a Kubernetes cluster (no matter which flavor) with external-dns and cert-manager will most likely take 30m-1h and that is the basic stuff that you need for running an app with the topics you mentioned. To navigate through k8s just use k9s and you're golden.

I never get where all the "k8s is the devil" comments come from. There is nothing really complex about it. It's a well defined API with some controllers, that's it. As soon as I need to have more than one server running my workloads I would always default to k8s.


> I never get where all the "k8s is the devil" comments come from. There is nothing really complex about it. It's a well defined API with some controllers, that's it.

And Linux is just a kernel with some tools, which are all well defined, that's it! But if you need to debug a complex interaction, "that's it" and "it's well defined" aren't enough.

Kubernetes is quite complex, with a lot of interactions between different components. Upgrades are a pain because all those interactions need to be verified to be compatible with one another, and the versioned APIs, as cool as a concept they are on paper, mean that there are constantly moving targets that need constant supervision. You can't just jump a version, you need to check all your admission controllers, CNI and CSI drivers, Ingress controller, cert-manager and all other things are compatible with the new version and with each other. This is not trivial at any scale, which is why many orgs adopt the approach of just deploying a new cluster, redeploying everything to it and switching over, which is indicative of exactly how much of a pain it is.
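
To make the "check everything before you jump a version" step concrete, here is a small Python sketch. The component names and version ranges in `example_matrix` are made up for illustration; real numbers have to come from each project's own compatibility notes, and `upgrade_blockers` is a hypothetical helper, not part of any Kubernetes tooling.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (1, 29)
SupportMatrix = Dict[str, Tuple[Version, Version]]


def upgrade_blockers(target: Version, supported: SupportMatrix) -> List[str]:
    """Return components whose supported range excludes the target version."""
    return sorted(
        name
        for name, (lowest, highest) in supported.items()
        if not (lowest <= target <= highest)
    )


# Made-up ranges: component -> (lowest, highest) cluster version it supports.
example_matrix: SupportMatrix = {
    "ingress-controller": ((1, 26), (1, 29)),
    "cert-manager": ((1, 25), (1, 28)),
    "cni-plugin": ((1, 27), (1, 30)),
}

print(upgrade_blockers((1, 29), example_matrix))  # ['cert-manager']
```

Even this toy version ignores the harder half of the problem, which is whether the components are compatible with each other and not only with the cluster version.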

Even Google, who created it, admits it's complex and has 3 managed services with different levels of abstraction to make it less complicated to use and maintain.


k8s is fine. It's the ecosystem around it that I dread.

I understand how it works, it makes sense, but then you're faced with Helm, kustomize, jsonnet and a lot of bullshit just to have minimal and reproducible templating around YAML. Or maybe you should use Ansible to set it up. Maybe instead try ArgoCD. Everybody and their dog is an AWS or GCP evangelist and keeps trying to discourage you from running it outside of the blessed clouds. If anything breaks you're told you're stupid, that's what you get, and you should've paid someone else to manage it.

It feels like everybody is selling you something to just manage the complexity they have created. This is what keeps me away. It's insane.


There’s also Docker swarm for containerized workloads.

Personally I like VMs, and I’ve been toying around with the idea of having a LXD cluster on dedicated servers where each LXD container hosts some of my workloads.


This works. I had that setup ages ago, and you can nest Docker inside LXD. I’m using Proxmox now to manage the VMs, LXCs and storage underneath Portainer, and it is surprisingly trivial to snapshot and move an entire Portainer node around in Proxmox. One click to back up to shared storage, another to restore in another physical node.


I set up Docker Swarm at my previous company, but it was a dead and stale project 5 years ago when I last used it. I honestly cannot recommend it in 2024.


I don't think there has been a ton of iteration on it, but did you run into any specific problems or bugs, or is this lack of recommendation based on caution against adopting something that is not being iterated on? Just asking because, while I haven't used it in years, it's been my go-to for small projects in the past and it seemed to do what it advertised very well. I hope that someone picks up the swarm torch; I really liked the abstractions and workflow it enabled. K8s was always too heavy for me and introduced too much complexity I was uncomfortable with.


Not GP, but I recently attempted to migrate a single-node Docker Compose setup to Docker Swarm and ran into the following issues:

- No ability to use secrets as environment variables, and no plans to change this

- Cannot use `network_mode` to specify a service to use for network connections a la Docker Compose

There were a few other minor issues which resulted in ditching Docker Swarm completely and moving to a Nomad + Consul stack instead.


While the way secrets work in Swarm seems weird when compared to Kubernetes, this is usually pretty easily solved by a quick overriding entrypoint in the docker stack file that does essentially this:

    export SOME_VAR=$(cat /run/secrets/some_secret)
    exec /original/entrypoint.sh

Can you explain the second one? I don't get the usecase.


I just set up my first docker swarm cluster. It's not in production yet, just a stage environment, but it is working very well so far, and I like it very much. From my experience so far I can very much recommend it, and I hope it will get more attention again. Because it does fill the gap which is described in the original post.


> That these days there are only two serious choices: deploy on a single-node Docker/podman machine, or Kubernetes.

I disagree - there are cloud PaaSes like Fargate or Cloud Run, or when self-managing, a third option which is much less known but quite a bit easier while also being more flexible - HashiCorp Nomad. Disclaimer time: I work at HashiCorp, but I've had this opinion for years before joining - https://atodorov.me/2021/02/27/why-you-should-take-a-look-at... (it's out of date, but the principles still apply, just to a different extent, Nomad having gotten easier). All opinions are my own etc.

Not FOSS anymore, but free and source available, composable, and does a big portion of what Kubernetes does at a fraction of the complexity. I ran it in production for a few years and everything was a breeze, unlike the Kubernetes clusters I was maintaining at the same time. You get all the "basics" of HA, failover, health checks, easy ingress, advanced deployment types, basic secrets storage, etc. without having to write thousands of lines of YAML, and with simple upgrades and close to no maintenance.

> Also, I seem to be an outlier, judging from the rest of the comments, by running on dedicated servers. These days everybody is using one of the clouds and terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning as their competitor, as the cloud vendor itself is always trying to upsell its customers and reduce friction to as close to zero as possible.

Because running on dedicated servers means you spend time on managing them which could be more productively spent working on whatever your product is. Don't get me wrong, I love that stuff, but it's like when I started my blog - I spent weeks on CI/CD, a nice theme, a fancy static site generator, all sorts of optimisations, choosing a good hosting provider etc. etc. etc.... instead of actually doing what I was supposed to, writing content. When you manage your own servers your costs might even be higher compared to a F/C/P aaS from a cloud provider, especially with free trial year/free tier/startup credits. As long as you keep an eye on not locking yourself in, you could easily migrate most products to self-managed dedicated servers if performance/costs require it.


This is silly but I will not use any Hashicorp product ever again, for two reasons: one is Terraform (it is a bad product with a bad DSL), the second is that I bought one service from them that advertised a clear 30-day money back guarantee, found a bug in one hour of running, a multiple-year old Github issue they had been ignoring, and to make it worse, they completely ghosted me when I went to ask for a refund. Both on Twitter and via email.

So, I don't really care to try out Nomad for my business, honestly.


Nah, it's not silly - you had bad experiences with a company, it's normal not to want to interact with them further. I'm personally not a fan of Microsoft because I spent my formative tech years installing Windows on family friends' PCs and had so many inconsistencies and issues that I have a very negative association with Windows and Microsoft.

I disagree with the DSL part, but it's in big part subjective so people have all sorts of different experiences and opinions.


Terraform may seem bad, but cloud-specific solutions like CloudFormation are even worse...


Lambda? Digital ocean app service? Azure containers?

I guess it is proprietary tech though which might be off putting.


My main challenge is people, especially "experts," who come, preach whatever popular cloud stuff is today, and then leave. This leaves me with a lot of shi* to deal with.

We should keep things simple, KISS. My few key points for my future self:

1. Be cloud agnostic. Everyone operates on margins and a slight increase in a cloud provider's fees could be a death sentence for your business. Remember, cloud providers are not your friends; they are in the business of taking your money and putting it in their pockets.

2. Consider using bare metal if it's feasible for your operations. The price/performance ratio of bare metal is unbeatable, and it encourages a simpler infrastructure. It also presents an opportunity to learn about devops, making it harder for others to sell you junk as a premium service. This approach also discourages the proliferation of multiple databases/tech/tools for the sake of CV updates by your colleagues, keeping your infrastructure streamlined.

3. Opt for versatile tools like Ansible that can handle a variety of tasks. Don't be swayed by what's popular among the "cool kids". Your focus should be on making your business succeed, not on experimenting with every new tool in the market. Master it well.

4. Make sure you can replicate your whole production stack on your box in a few seconds, a minute max. If you can't, well, back to the drawing board.

5. Use old and tried tech. Choose your tech wisely. Docker is no longer cool, and Podman is all the rage on HN, but there are hundreds of man-hours of documentation online for every Docker issue you can think of. And Docker will stay for a while. The same for Java/Rails/PHP...

6. Keep everything reproducible in your repository: code, documentation, deployment scripts, and infra diagrams. I've seen people use one service for infra diagrams and another to describe database schema. It's madness.

7. (addon) Stay away from monorepos. They are cool, they are "googly," but you are not Google or Microsoft. They are notoriously hard to scale and secure without custom tooling, no matter what people say. If you have problems with the code sharing between repos, back to the drawing board.


This is all sage advice except #7 "avoid monorepos" doesn't resonate with me. I worked at a seed level startup for 2 years and we used a monorepo for everything. I felt like it saved me a bunch of time vs my current employer with 400+ repos.

I could potentially* get behind a logic of splitting unique web services into their own repo, but for libraries I find separate repos to be a huge unnecessary overhead.

*potentially, as in I still prefer a monorepo for everything but I can see valid arguments for this one case


I really liked this comment and we follow all points except 7. We use a monorepo and it's working well for us. What's hard to scale about them? In what way are they hard to secure, and why?


The "experts" will happily lead you down the path of cloud lock-in, building solutions around proprietary tech like DynamoDB and Lambda. You might even wind up in a state where you can't develop locally. It's totally non-productive and I've witnessed it first hand.


Really like this advice. Especially 4.

I only think monorepos for a Javascript / Typescript stack are heaven sent (pnpm, turborepo), because reusing NPM packages otherwise is a huge pain.


A year ago I left my job at a 500-employee SaaS business working on the team that maintains the devops infrastructure, to found a startup. For me the biggest pain point is going from nothing to a sufficiently flexible devops setup.

There are a lot of great tools out there, but making them play well together is an exercise for the reader. There are also a lot of preference-based choices you need to make in how you want your setup to look, and what you choose will affect what tools make sense to you.

Do you go monorepo or polyrepo? If you go monorepo, how do you decide what to build and deploy on each merge? If you go polyrepo, how do you keep stuff in sync between any code you want to share?

Once a build is complete, how do you trigger a deployment? How does your CI system integrate with your deployment system, or is the answer "with some shell scripts you have to write"?

> How do you deploy resources?

For us, we have a monorepo setup with bazel. I wrote some fairly primitive scripts to scan git changes to decide what to build. We use Buildkite for CI, which triggers rollouts to kubernetes with ArgoCD. I had to do a non-trivial amount of work to tie all this together, but it's been fairly robust and has only needed a minimal amount of care and feeding.
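
For a sense of what a "primitive" change-scanning script can look like, here is a Python sketch. It is not the script described above: the directory layout, target names and the "shared code rebuilds everything" rule are all assumptions made for illustration, and a Bazel-based setup would more likely ask the build graph which targets are affected.

```python
from typing import Dict, Iterable, Set

# Hypothetical layout: directory prefix -> deployable target.
TARGETS_BY_DIR: Dict[str, str] = {
    "services/api": "api",
    "services/worker": "worker",
}
# Hypothetical shared code: any change here rebuilds every target.
SHARED_PREFIXES = ("libs/",)


def targets_to_build(changed_files: Iterable[str]) -> Set[str]:
    """Decide which targets need a build, given the changed file paths."""
    targets: Set[str] = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            return set(TARGETS_BY_DIR.values())
        for directory, target in TARGETS_BY_DIR.items():
            if path.startswith(directory + "/"):
                targets.add(target)
    return targets


# The list of paths would typically come from something like
# `git diff --name-only <base>...HEAD` run by the CI job.
print(sorted(targets_to_build(["services/api/main.py", "README.md"])))  # ['api']
```

Files that match no prefix (docs, the README) trigger nothing, which is what keeps merges that only touch documentation from redeploying services.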

> How do you define architecture?

Kubernetes charts for our services are in git, but there's some amount of extra stuff deployed (ingress controller, for example) that is documented in a text file

> How do you manage your environments

We don't need to deploy environments super often, so just do it manually and update documentation in the process if any variations are needed.

> observability

Datadog and sumologic.

Overall our setup doesn't come close to the setup I worked on at my last employer, but I have to balance time spent on devops infra with time spent on the product, and that setup took ~5 full time engineers to maintain.


> but there's some amount of extra stuff deployed (ingress controller, for example) that is documented in a text file

Out of curiosity, why just the "readmeware" for those components? I can't think of a single thing that requires clickops in a modern k8s setup, so much so that in the beginning we used to bring up the full stack from nothing based on a single CFN template - roles, load balancer, auto-scaling group, control plane, csi driver (this was back when EKS was a raging tire fire), and then lay the actual business apps on it. The whole process took about 8 minutes from go

If nothing else, one will want to be cautious about readmeware components in disaster recovery situations. If no one has run those steps in 6 months, and then there's some kind of "all hands on deck," the stress will likely make that institutional knowledge leak out of their ears


> Out of curiosity, why just the "readmeware" for those components?

Because there are so few of them. Our setup has an ingress controller and a certificate manager, and then some bookkeeping like copying the container registry credentials into every namespace

> I can't think of a single thing that requires clickops in a modern k8s setup

Absolutely agree.

> The whole process took about 8 minutes from go

How long to do the development and testing of the template, and what size is your team?

Don't get me wrong, I'm not happy about this situation. As well as the DR concern you raise, we can't quickly spin up short lived clones of our infra for testing complex changes, so we test them in our staging environment and have to block prod deploys until we're either happy with the change or decide to roll it back. At a larger org this would be a major headache but at our current size it does not matter.


> We don't need to deploy environments super often, so just do it manually and update documentation in the process if any variations are needed.

So you don't need to deploy multiple times or you don't do it because the system is stable when you deploy less often? I mean is it by choice or because of some tool or expertise limitation?

> Also, for architecture stored in text files - does that cause any problems for you?


We deploy our services on every merge (multiple times a day) from CI assuming the tests pass. But we don’t re-create our Kubernetes infra when we do that. I set up 3 Kubernetes clusters (staging, prod, internal apps) when I went full-time and have barely touched them other than to apply updates.

> does that cause any problems for you?

Nope. Our Kubernetes setup is just about as simple as it is possible to have a Kubernetes setup. I entertained the idea of going with something else, because we definitely don’t need all Kubernetes has to offer. But I settled on it because it’s what I know best and the overhead+risk of something new would have exceeded the cost of the unnecessary-for-us baggage that Kubernetes brings.

If our requirements were different or if I was making regular changes, we would be in a very different spot. But as it stands today it is just not a priority.


If it looks like a nightmare to you it's because clouds like GCP and AWS require training just to operate properly. There are so many components that are all intertwined, with weird names and even stranger APIs, and you have no idea which of them you need unless you have significant experience or formal training.


There are a lot of ways of doing things, and it's really easy to make mistakes.

My advice would be:

Separate your build from your infra. Whilst it's nice to have your cloud be spun up with CI, it's really not a great use case, and it means your CI has loads of power that can be abused.

GitLab with local runners is a good place to start for CI. It's relatively simple and your personal runners can be shared between projects (this is great for keeping costs down but speeds up, as you can share a massive instance).

Avoid raw Kubernetes until you really really have to. It's not worth the time unless you have someone to manage it and your use case requires it. Push back hard if anyone asserts that it's a solution to x. Most of the time it's because "it's cool". K8s only really becomes useful if you are trying to have multiple nodes from different clouds or a hybrid local/cloud deployment. For almost everything else, it's just not worth it.

You are unlikely to change cloud providers, so choose one and stick to it. Use their managed features. Assuming you are using AWS, Lambdas are really good for starting out. But make sure you start deploying them with CloudFormation/Terraform (Terraform is faster, but not always better).

Use ECS to manage services, use RDS to manage data. Yes, it is more expensive, but backups and duplication come for free (i.e. you can spin up a test deployment with actual data). Take the time to make sure that you are not using hand-rolled stuff made in the web console; really put the effort into making sure everything is stored in Terraform/CF and in a git repo somewhere.

Limit the access you grant to people, services and things. Take the time to learn IAM/equivalent. Make sure that you have bespoke roles for each service/thing.
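
As an illustration of a "bespoke role per service", here is a Python sketch that builds a narrowly scoped AWS policy document. The bucket name, prefix and the `bucket_read_policy` helper are hypothetical; only the overall JSON shape (`Version`, `Statement`, `Effect`, `Action`, `Resource`) is the standard IAM policy format.

```python
import json
from typing import Any, Dict


def bucket_read_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Policy document allowing read-only access to one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # One action, one resource pattern: no wildcards on either
                # the service or the bucket.
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


# Hypothetical service that only ever reads invoices.
print(json.dumps(bucket_read_policy("example-reports", "invoices"), indent=2))
```

Generating policies from a small function like this, and keeping the output in Terraform/CF, makes it harder for a broad `"Action": "*"` to slip in unnoticed.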

Rotate keys weekly, use the managed key/secrets storage to do that. Automate it where you can.
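
A small Python sketch of the bookkeeping side of that rule: given when each key was created, list the ones that are due. The key names and the `keys_due_for_rotation` helper are hypothetical, and the actual rotation would be done by whatever managed secrets service you use.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_KEY_AGE = timedelta(days=7)  # "weekly", per the advice above


def keys_due_for_rotation(created_at: Dict[str, datetime], now: datetime,
                          max_age: timedelta = MAX_KEY_AGE) -> List[str]:
    """Return ids of keys at or past max_age, oldest first."""
    due = [
        (created, key_id)
        for key_id, created in created_at.items()
        if now - created >= max_age
    ]
    return [key_id for _, key_id in sorted(due)]


# Hypothetical inventory of keys and their creation times.
inventory = {
    "ci-deploy": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "api-signing": datetime(2024, 1, 12, tzinfo=timezone.utc),
}
print(keys_due_for_rotation(inventory, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```

Running a check like this on a schedule and alerting on a non-empty result is a cheap first step before the rotation itself is automated.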


> You are unlikely to change cloud providers, so choose one and stick to it. Use their managed features.

I am curious about this because I see opposing views expressed by different people. I have never personally been in a position where the decision has been relevant.

I work at a cloud provider and I'm told that a big slice of our revenue comes from customers who are already load-balancing across multiple clouds, so if we degrade perf/dollar they just turn a dial to shift load to our competitors.

This has always seemed very smart to me and I would love to get more perspectives on how easy it is to get to that position where your infrastructure is so commoditised that you can migrate between providers and back at the push of a button.

Is this something people achieve late in their lifecycle after massive cost-optimisation push? Or is this actually something you can build towards from day 1?


> I work at a cloud provider and I'm told that a big slice of our revenue comes from customers who are already load-balancing across multiple clouds, so if we degrade perf/dollar they just turn a dial to shift load to our competitors.

It's only anecdata but I'd highly question that. All companies I worked in (startup, midsize, megacorps) went with one cloud provider and stuck to it. That is not only my experience but also that of friends who work in the same field. There might be a slight difference with megacorps, where I saw them using multiple cloud providers, but more like: Team A is using AWS, Team B is using GCP. But never: Team A is using AWS and has a copy running, ready to go, on GCP.

I tend to agree with the GP, to a degree. Choose a cloud provider and stick with it; the probability that the company changes the provider is very low (in the end all cloud providers offer the same with similar prices, no need to switch), but don't fully buy in. For example, if you're on AWS, of course use RDS for DB and S3 for object storage, but don't use CodePipeline to build. So don't go "fully" proprietary.


I think it’s likely to be a certain class of customer: rendering or high-power batch processing. These are relatively simple workloads: pull a task off a stack, crunch for 90 mins, then dump the results in a bucket. They are large-volume customers that will represent a disproportionate bottom-line value to a cloud provider. Probably not the customers you are talking about above.


> I am curious about this because I see opposing views expressed by different people.

My sample size is me and my immediate friends. I suspect that if you have a resilient multi-cloud deployment, then you'll use it to hunt for a metric you want to hit (speed/price/latency)

However, the engineering cost to get there is pretty high, and almost negates the point of having the cloud (unless you have a scaling requirement where you need 10x at short notice.)

I worked at a large news company, and it was decided that it was cheaper to just pay for the hosted services than pay for the people to run them (think RDS vs home-grown DB). RDS is, what, 2x the cost of a normal instance, but that's still cheaper than the three engineers + on-call to manage a custom deployment and manage the backups and migrations (along with Conway's law of having DBAs).


I work at a 60 dev shop so not a startup per se.

Our biggest problem is feature environments, or actual integration tests where multiple services have to change. Because infra is in its own repo in terraform and the apps have their own repo we don’t have a good way of creating infra-identical environments for testing code changes that affect multiple services. We always end up with some hack and manual tweaks in staging.

Data engineering is another problem. Managing how to propagate app schema changes to the data warehouse is a pain because it has to happen in sequence across repo borders. If it were all one repo and we got a new data warehouse per PR, it would be trivial.

Not trusting CI to hold secrets is another. As soon as we do anything in CI that needs “real” data we need to trigger AWS ECS tasks, because CircleCI has leaked secrets before, so we don’t trust them and keep all our valuable secrets that can access real data in AWS SSM. The more complex the integrations, the harder they are to test.

If we had a monorepo I think this type of work would be much easier. But that comes with its own set of problems, mainly deployment speed and cost.

If there was a way to snapshot all our state and “copy” it to a clean environment created for each PR that the PR could then change at will and test completely, end to end, that would be the dream.
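
A minimal sketch of the "keep secrets in SSM, not in CI" pattern described above, assuming the ECS task reads its secret from SSM Parameter Store at runtime through a boto3 SSM client. The parameter name and the `load_secret` helper are hypothetical; the client is passed in so the function can be exercised without AWS credentials.

```python
def load_secret(ssm_client, name: str) -> str:
    """Fetch one parameter value from SSM Parameter Store at task runtime.

    `ssm_client` is expected to behave like boto3.client("ssm").
    WithDecryption=True asks SSM to return SecureString values decrypted.
    """
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

Inside the task this would be called as something like `load_secret(boto3.client("ssm"), "/example/db-password")`, so CI only needs permission to start the task, never to read the secret itself.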


> Our biggest problem is feature environments, or actual integration tests where multiple services have to change

OT1H, :fu: terraform, so I could readily imagine it could actually be the whole problem you're experiencing, but OTOH it is just doing what it is told, so that's why I wanted to hear more about what, specifically, the problem is? too many hard coded strings? permission woes? race conditions (that is my life-long battle with provisioning stuff)?

this whole Ask HN is nerd sniping me, but I'm also hoping that we genuinely can try and find some "this has worked for me" that can lift all boats


The main problem is state: it’s too big and scattered. If it was just one db that knew everything, that would be easy enough to snapshot, but there are multiple dbs and s3 buckets and messages on queues, and it's all spread out. We could, with a bunch of work to make stuff actually portable, spin up a complete new environment, but a lot of the work that happens is related to synchronizing state, and since all the state is scattered we can’t make an atomic clone of it.

Then of course there’s always the issue of people taking shortcuts: “i will test this once, then we know it works, then i’ll just hardcode this thing”. Making stuff truly idempotent and portable is extra work for a “nice to have” like a feature env. YAGNI, until you do. Most devs are happy to have something that works so they can move on and ship the next thing.

Personally i’m a big fan of monoliths because they supposedly make this stuff easier. Then again i’ve never worked on a huge one and my colleagues much prefer spinning up an “isolated independent loosely coupled” service to adding it in the main app.


> Our biggest problem is feature environments

I'm always baffled to see how many shops claim to either not have this problem, or sidestep it. Every project I've ever worked on has had multiple enhancements/fixes in flight at the same time, they need to be tested and deployed independently of each other, on their own timelines. For this we need story branches and a fast way of deploying different story branches to different test environments. If you're merging everything into kitchen-sink "dev", "staging", "test" branches because someone drank the Gitflow kool-aid, your confidence goes way down that a specific story branch is "ready to go" and that production will behave exactly like dev and test. And, as you mention, accomplishing the story-branch approach across multiple repos (assuming the change is large enough to affect multiple repos) sounds like swimming with crocodiles.


I think it could be done with a monolithic app with a single db and truly stateless external services in a single repo. I only have experience from a single shop though and that’s not how the stuff we build ends up.


I’d argue the obvious answer is to address the lack of great answers for declarative schema migration in PostgreSQL. There is Skeema https://github.com/skeema/skeema but it doesn’t support Postgres, and Prisma iirc forces you into an ORM. Atlas looks perfect but has a nonstandard license.


We have been using https://www.red-gate.com/products/flyway/ with Postgres (via CloudSQL) for at least 8 years and are happy with it.

Now that I looked at the website to share the link, I would probably be scared away because it looks very enterprise and corporate but it works well and it's just a Docker container built in CI and I don't have to interact with it.


I've used Flyway and it works great, can also recommend Alembic from the Python world - https://alembic.sqlalchemy.org/en/latest/


Flyway and Alembic are imperative, not declarative.

Imperative tools express schema changes as a filesystem of incremental "migrations", whereas declarative tools use a filesystem of CREATE statements and can generate DDL diffs to transition between states.

The declarative flow has a lot of benefits [1]. It is much closer to how source code is typically managed/reviewed/deployed, which is especially important for stored procedures [2].

However, using a purely declarative flow (as opposed to using a declarative tool just to auto-generate incremental migrations, which are then run from another tool) isn't necessarily advisable in Postgres, since Postgres supports DDL in transactions. So there's potentially multiple paths/orderings to reach any given desired state, as well as the ability to mix schema changes with data changes in one transaction. This means in Postgres there's often manual editing of the auto-generated incremental change, whereas in other DBs with fewer features this isn't necessary.

Disclosure: I'm the author of Skeema, mentioned in the GP comment.

[1] https://www.skeema.io/blog/2019/01/18/declarative/

[2] https://www.skeema.io/blog/2023/10/24/stored-proc-deployment...
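
To make the imperative/declarative distinction above concrete, here is a toy sketch of the declarative side: given the current schema and the desired schema, compute the DDL needed to get from one to the other. This is not how Skeema or any real tool is implemented; the `Schema` shape and `diff_schema` function are hypothetical, and it ignores column type changes, constraints, indexes, and statement ordering concerns.

```python
# Hypothetical, simplified model: table name -> {column name: column type}.
Schema = dict[str, dict[str, str]]


def diff_schema(current: Schema, desired: Schema) -> list[str]:
    """Generate DDL statements that move `current` toward `desired`."""
    statements = []
    for table, columns in desired.items():
        if table not in current:
            # Table only exists in the desired state: create it whole.
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        # Table exists on both sides: add and drop individual columns.
        for name, ctype in columns.items():
            if name not in current[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
        for name in current[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name};")
    # Tables that are no longer declared get dropped.
    for table in current:
        if table not in desired:
            statements.append(f"DROP TABLE {table};")
    return statements
```

In an imperative flow you would hand-write (or commit) those statements as a numbered migration file; in a declarative flow you commit only the desired state and let the tool derive them, which is where the manual review mentioned above comes in.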


Yeah, I was speaking more generally, responding to the parent by suggesting good, battle-tested migration tools like Flyway. Declarative/imperative wasn't the focus.


We are using Hasura for schema migration. You don't have to use the graphql stuff: it will just follow your tables and generate .up and .down migrators for you.

Then you can just use pg like normal, directly.


The basic thing when dealing with the "Cloud" is to stay away from their exclusive offerings and vendor-specific stuff to avoid lock-in. Once you're locked in you're gonna get fleeced really hard, as they are aware that moving away in such a situation is much harder.

In general, when you have a solution that is supposed to handle any problem and scenario, as with AWS, you'll eventually end up with some complicated Frankenstein-y creation. There's probably no getting around it if they want such a robust set of features and capabilities.


Networking. Especially when you have to combine on-prem with cloud solutions. At least in Azure it's an outright nightmare with the various private endpoints and whatnot. AWS might be better; I know I much preferred it when we had Route50something instead of the Azure DNS thing. Frankly, a lot of the IT-operations tools are pretty shitty in Azure, but things like networking are where you are less OK with "winging it".

Well that and the constant cost increases. Container apps in Azure went from around $20-25 to $120 on our subscription. Along with all the other price hikes we're looking to move out of the cloud and go into something like Hetzner (but localized to our country).


Managing costs is the single biggest pain point. Be careful and don't start with kubernetes etc. if you just need to deploy a monolith because managed k8s is way more expensive than e.g. Azure Web Apps or AWS Lightsail. Review your architecture carefully to make the deployment as simple as possible. Microservices aren't for the product-market-fit stage but when you really need them (you will know when).


I would say be cognizant that a major cloud provider might just contact you out of the blue, after a few years of usage, and say that they blew away your VPS, have no recovery for it, and that you should rebuild it from scratch off your last backup.

I was already using more than one service, but I cancelled Rackspace when this happened, although I was not making enough yet to be fully and automatically redundant. It was a pain to suddenly have to drop everything and rebuild, as it broke the entire service I was offering. I rebuilt everything on my existing service at Linode (now Akamai), and then got a VPS at Ramnode as my backup. So it was a lot of work suddenly thrown at me that I didn't need. Luckily, while my backup practices are not completely ideal, I do follow the 3-2-1 backup rule enough that it wasn't catastrophic.

Here's the message I got from Rackspace in 2019:

This message is a follow-up to our previous notifications regarding cloud server, "c0defeed". Your cloud server could not be recovered due to the failure of its host.

Please reference this ID if you need to contact support: "CSHD-5a271828"

You have the option to rebuild your server from your most recent server image or from a stock image. Once you have verified that your rebuilt server is online, you must (1) Copy your data from your most recent backup or attach your data device if you use Cloud Block Storage and (2) Adjust your DNS to reflect the new server’s IP address for any affected domains. When you have verified that your rebuilt server is online, with your data intact, you will need to delete the impacted server from your account to prevent any further billing associated with the impacted device.

We apologize for any inconvenience this may have caused you.
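
For anyone unfamiliar with the 3-2-1 rule mentioned above (at least three copies of the data, on at least two different kinds of storage, with at least one copy off-site), here is a minimal sketch of checking a backup inventory against it. The `BackupCopy` type and the labels in the usage note are hypothetical, and the count assumes the live production copy is listed as one of the copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # free-form label for where the copy lives
    medium: str    # kind of storage, e.g. "vps-disk" or "object-storage"
    offsite: bool  # True if stored away from the primary site/provider


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if the inventory meets the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.medium for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite
```

For example, a production VPS disk, a snapshot at the same provider, and an object-storage copy at a second provider would pass; two copies on the same host would not.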


I tend to avoid any PaaS service like Vercel or Heroku unless you're guaranteed to never need scale or complexity. If you can dockerize your main application code, it's easy enough to deploy on ECS or Azure app service. Lean heavily on COTS software for anything outside your core competency (ie CRM or marketing) and even consider outsourcing your on-call to an MSP.


Number one would be: tying your deployment to the cloud provider's way of doing things. ECS, CodeDeploy, Heroku... it's easy to get stuck when your ops requirements grow beyond how they do things, and really hard to set up something more agnostic alongside it (especially when you have a tight budget for cloud resources).

Number two: using tech that you really don't need. Keep it simple, even if it means not using something that the rest of the industry is in love with. Figure out what your requirements are, pick the simplest way of setting it up (taking #1 into account, because using tooling that isn't able to grow with you is worse IMO), and keep it agnostic. Then when you start to hit the edges of its capabilities, either scale up your ops understanding and start using more sophisticated things, or hire someone to help out.

Also, I really do agree with @KaiserPro when they say to separate your infra from your deploy. It makes moving to something else much, MUCH easier when you inevitably need to.


Full disclosure I work at a company [1] attempting to address this very problem.

Without dedicated devops the challenge is allocating developer time to do the integration work required to get the various ops tools playing nicely together. In my experience that work is non-trivial and eventually leads to some form of dedicated ops.

That is also part of a dynamic -- lots of tools available for solving these problems exist because of dedicated ops. It's not easy for a team trying to build software to also take on these extra operational responsibilities.

What we're trying to do is create a highly curated set of what we think of as application primitives. These primitives include CI/CD, Logs, Services, Resources etc. Because they're already integrated it doesn't require developers to figure out API keys, access control, data synchronization, etc.

1. https://noop.dev


for others similarly curious, here's an example of the thing: https://github.com/noop-inc/template-java-spring-boot/blob/m...

they seem to be using the excellent lima <https://github.com/lima-vm/lima#readme> for booting on macOS; I run colima for its containerd and k8s support but strongly recommend both projects $(brew install lima colima)


We were using lima, but switched to directly interface with qemu. We were able to get better control for networking and lots of performance and status notification improvements.

Linked in parent is a blueprint which is how we define applications. That config can be used to stand up an application locally as well as in a cloud environment without additional configuration.


I think without aws it would be painful to be honest.

Use CodePipeline/CodeBuild, then either a container registry and ECS, or Elastic Beanstalk.

You get everything you mentioned for free with either setup.

Seriously it takes moments to write terraform or even do it by hand.


Understood. But what if you have a multi-cloud setup or you want to migrate to a different cloud to efficiently use credits?


I personally think multi-cloud is over hyped and the complexity isn’t worth it for most organizations. The apps I am responsible for are one AWS region / 3 AZ and we have sufficient uptime. We could be multi-region but the added cost wasn’t worth the benefit and complexity. Multi-cloud makes zero sense.


Multicloud makes sense once you get to megascale where you have a dedicated team negotiating with CSPs over your discounts. If you can credibly flip a switch and move your multi-million dollar workload to a competitor, you can save big money.

This comes on top of the benefit that architecting for multi-cloud at that scale usually forces you to simplify/rationalize/automate ruthlessly and so is probably beneficial for a "mature" org with lots of cruft anyway.

That said there may only be a few dozen companies in the world where this makes sense. I happen to work for one, but most engineers don't. Even then, this is at the extreme right of the maturity curve, meaning that there is lots of lower-hanging fruit to pick first


Multicloud is just an inevitable reality, the default. The only time I have run into homogeneous cloud architectures has been in large organizations, and even then I was removed from the multi-cloud concerns because I was focused on a single product/service that was hosted in a single cloud infrastructure.


> Understood. But what if you have a multi-cloud setup or you want to migrate to a different cloud to efficiently use credits?

So yeah, one of the clients we were working with had to move from AWS to GCP (mostly for utilizing cloud credits). The setup was small but the estimated time was nearly 2-3 months. It's really a pain, and I think a lot of organizations don't think of doing it because when they weigh rewards and the effort/time needed, it doesn't make sense.


99% of the time it's just a docker container that connects to a database, an object store and redis.

If we need to move it then we move it... not exactly difficult or time consuming. There are always ways of half moving things too, for example we could move the container deployment to another cloud but keep our build process and object store at the original vendor and so on.

In reality we are not going to be shifting production load between cloud vendors every 2 weeks.
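
A rough sketch of why that kind of container is easy to move: if the app takes its database, Redis, and object store endpoints from the environment, switching vendors is mostly a matter of changing those values. The variable names and the `ServiceConfig` type below are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    database_url: str
    redis_url: str
    object_store_endpoint: str


# Hypothetical variable names; use whatever your deployment already defines.
REQUIRED_VARS = ("DATABASE_URL", "REDIS_URL", "OBJECT_STORE_ENDPOINT")


def load_config(env=None) -> ServiceConfig:
    """Build the config from environment variables, failing fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return ServiceConfig(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        object_store_endpoint=env["OBJECT_STORE_ENDPOINT"],
    )
```

This also supports the "half move" case: the container can run at a new vendor while `OBJECT_STORE_ENDPOINT` still points at the original one.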


I run a profitable SaaS business all on my own.

Keeping it very simple: I push code to Github; then Capistrano (think bash script with some bells and whistles) deploys that code to the server and restarts systemd processes, namely Puma and Sidekiq.

The tech stack is fairly simple as well: Rails, SQLite, Sidekiq + Redis, and Caddy, all hosted on a single Hetzner dedicated server.

The only problem is that I can't deploy as often as I want because some Sidekiq jobs run for several days, and deploying code means disrupting those jobs. I try to schedule deployment times; even then, sometimes, I have to cut some jobs short.


> I can't deploy as often as I want because some Sidekiq jobs run for several days, and deploying code means disrupting those jobs

Sounds like a use case for Cadence/Temporal-style fault-oblivious stateful execution with workflows. At my last job, we did Unreal Engine deployments with pixel streaming at scale on a huge fleet of GPU servers, and it was astounding how hassle-free persisting execution state was: the code would magically resume at the exact line where it got interrupted.


If you can, look into the `sidekiq-iteration` gem.
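
The general idea behind that kind of approach, as I understand it, is to split a long job into small steps and persist a cursor after each one, so a deploy can interrupt the job and a restart picks up where it left off. Below is a language-neutral sketch in Python, not the gem's API; `run_resumable`, the JSON checkpoint file, and the `should_stop` hook are all hypothetical.

```python
import json
from pathlib import Path


def run_resumable(items, process, checkpoint: Path, should_stop=lambda: False) -> bool:
    """Process `items` in order, saving a cursor after every item.

    Returns True when the whole job finished, False if it was interrupted.
    On the next call with the same checkpoint file, work resumes at the
    first unprocessed item instead of starting over.
    """
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_index"]
    for index in range(start, len(items)):
        if should_stop():  # e.g. set by a shutdown signal during a deploy
            return False
        process(items[index])
        checkpoint.write_text(json.dumps({"next_index": index + 1}))
    checkpoint.unlink(missing_ok=True)  # job done: clear the cursor
    return True
```

This only helps if each step is short and safe to repeat: a crash between `process` and the checkpoint write means that one item runs again on resume.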


Costs - predicting what-if changes, debugging unexpected increases, etc.


A monolith on GCP Cloud Run + AlloyDB (Postgres) is as close to Heroku as exists today. No external tools for CI, deployments/rollbacks, logs, or metrics are needed. Zero lock-in, nothing to maintain. If for some reason you get to 10+ services and want to go to Kubernetes, that's easy.

If AWS is required, a monolith on ECS Fargate + Aurora Postgres Serverless is perfectly cromulent. You'll need to string together some open source tools for e.g. ci/cd.


Pay attention to data sovereignty. Think about what you'll do if your customers or your regulator ask you to keep data in their jurisdiction.

"In the cloud" has recently come to mean "NSA gets a copy of whatever they want" unless you choose wisely.


To be honest, with Azure DevOps, web app instances, containers and deployment slots, it's almost bulletproof once you spend the time to get it configured initially. Same with AWS, I imagine.


how much time did it take for you to initially configure it? So with your and other comments, I'm getting a feeling that if you are just on one cloud and don't have a complex architecture, then it's pretty easy to do all these tasks, right?

And even so, how much time would developers be spending on these tasks weekly?


That we still need infra humans that understand it and maintain it.


K8s is the name of the pain.

And the lack of attention to lean alternatives like Docker Swarm is another.


I don’t trust that I can use AWS without exposing myself to the risk of an enormous bill.


What do you do instead? And what if you have cloud credits or a very small setup?


A lot can be done via GitLab Auto DevOps. You get a lot of out-of-the-box features like pipelines, deployments, review environments...

you can quite easily connect it to GCP/AWS/Azure


Vendor lock-in


how often did you feel this pain? Because a lot of people here are also saying that orgs stick to one cloud only.



