Does your startup need complex cloud infrastructure? (hadijaveed.me)
290 points by hjaveed 4 days ago | 385 comments





I went through sweat and tears with this on different projects. People wanting to be cool because they use hype-train tech end up doing things of unbelievably bad quality, because "hey, we are not that many in the team" but also "hey, we need infinite scalability". Teams immature to the point of not understanding what LTS means decided they needed Kubernetes, just because. I could go on.

I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I create a service on a Hetzner VM in 5 minutes whether the VM has 2 cores or 48 cores, and I control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile building snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
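For illustration, a rough sketch of what that flow can look like, with hypothetical names (the server type, hostnames and the Puppet bootstrap commands are placeholders, not the author's actual setup):

# create the VM with Hetzner's hcloud CLI (size and image are placeholders)
hcloud server create --name svc-01 --type cx42 --image ubuntu-24.04 --ssh-key ops

# bootstrap Puppet; 'puppet-agent' assumes the Puppet apt repo is configured,
# otherwise the distro's own 'puppet' package works too
ssh root@svc-01 'apt-get update && apt-get install -y puppet-agent'

# let the host pull and apply its role from the source-controlled manifests
ssh root@svc-01 '/opt/puppetlabs/bin/puppet agent --server puppet.example.internal --test'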

I wrote a more structured opinion piece about this, called The Emperor's New Clouds:

https://logical.li/blog/emperors-new-clouds/


I started my career in a world where we did everything using shell scripts running directly on bare metal servers, usually running Solaris, and later SuSE or Red Hat. I never understood the "how would you reproduce your setup without Docker (or X, where X is some other technology)?" question. The scripts were deterministic. The dependency versions were locked. The configurations were identical. The input arguments were identical. The order of execution was identical. It all ran on a deterministic computational device. How could it not be reproducible?

Well that's exactly the point! Creating complex cloud resources with, for instance, Terraform, is less reproducible than a shell script on an LTS system like Ubuntu or RHEL - that's because the cloud provider interfaces drift and from time to time stop accepting the Terraform manifests that previously worked. And to fix it, you have to interrupt your normal work for yet another unplanned intervention in the Terraform code - this happened to my teams several times.

This does not happen with Puppet + Linux, because LTS distributions have a long release cycle where compatibility is not broken.

I tried to explain this topic in the article linked above. Not sure how far I succeeded.


Leaning into LTS is nice until you near EOL and have to migrate everything in an often Herculean effort to work with the next LTS release.

Like 12 years of life cycle is not enough for you to plan a transition?

You can use the entire life cycle but no one is forcing you to. You can update from one LTS to another every 2 years, or 4 years, or 5 years... you decide.


I don't really think we're in disagreement here. The longer you wait, the harder the transition will be. LTS is a good foundation, and usually the right choice for "enterprise" or "business" settings, but you should not rely overmuch on any one LTS release's way of doing things, when the wider Linux ecosystem moves much faster.

The longer you wait the harder the pain. The less you wait the more frequent the pain. So it depends on the function that converts intensity and frequency to suffering :p But, most importantly, the fact that LTS gives you a choice is what I was highlighting.

For the scope I operate, which is pretty standard Linux packages (PostgreSQL, MariaDB, Nginx, Docker, OpenVPN, OpenSSH) the changes between 16.04 and 22.04 have been quite OK to deal with.


It's a tradeoff: making a big effort once every 4 or 5 years, vs a hopefully smaller effort every year. Sometimes the intermediate smaller steps help you move forward, sometimes it just means more migrations. Sometimes the software/hardware you need means you can't use an LTS OS at all.

If possible, it's nicer to pick established, mature software for as much of your stack as you can, so that there's less of a difference in APIs over longer time frames. But it's not always possible.


It's not terrible in my experience of doing it several times now.

It is definitely less terrible than trying to unfuck tangles of terraform / terragrunt / yaml / bits of cloud infra.


I went through the migration from CentOS 6 to 7 and never want to do anything like that again. The good news, I guess, is that it never will happen again: CentOS is basically dead anyway, and it's not likely that so many core pieces of system software will change that drastically anymore.

I did CentOS 3 -> 4 -> 5 -> 6 -> 7 -> Debian. Very few problems.

(30 nodes)


I can't imagine you leaned into any one of those releases, then. That sequence involves major changes to the kernel, the init system, the configuration management tools, the core libraries, Apache, Python, Perl, etc. Any one of those alone could (and did, in my experience) trigger a major rewrite of configuration and/or code.

I'm glad it was painless for you. In my experience, it was not, and most of the reasons were beyond my control.


What does lean into mean here? A lot of software from 20 years ago compiles (if needed) and runs fine on the latest versions.

Every major release of every major distribution makes choices. These are choices about what software to include in the first place, what versions of that software to pin (especially for LTS releases), what default configuration to provide, recommendations about how to solve certain problems, etc. These choices are made based upon the experience and opinions of the distribution maintainers. However, those maintainers are (usually) not major contributors to the software they're distributing. This means distros can make "bad" choices, choosing for example to focus on software that eventually dies out, or recommending configurations that eventually get deprecated or removed, etc. Sometimes, these choices are even made in a way such that they exclude what will become the winning alternative, leaving no migration path except complete and total overhaul.

If all Linux is to you is a place to run some application software, these choices are mostly irrelevant. As long as the software you care about continues to run, the other things are just picayune details. If this comes off as derisive, I apologize, because I'm actually broadly endorsing that view of things, as much as it is possible to achieve. But if you start really taking advantage of the things which the distribution provides out of the box and recommends, especially around large-scale multi-system operation, you end up buying into the distribution's choices. When a large organization you're a part of does it too, now the sunk costs really start to mount. As the Linux ecosystem continues to evolve, especially in different directions than the distribution chose at the time, the cost of migrating to later releases grows. This is all a good reason to me to not marry oneself so tightly to those particular choices, but that isn't always feasible with deadlines and compliance requirements and so on bearing down on the sysadmin.

There's also an even bigger problem that can arise, the distribution can just end, such as the termination of CentOS, leaving lots of people hanging. In that case, I know some who started to pay Red Hat for RHEL, but most seem to have moved on to other distros, like Ubuntu. That kind of migration has a lot of the same issues, too, once again leaving me to recommend not to lean into the particulars too much.


> But if you start really taking advantage of the things which the distribution provides out of the box and recommends, especially around large-scale multi-system operation, you end up buying into the distribution's choices.

You mean management interfaces and repo mirroring stuff provided by the OS vendor, like cockpitd and Satellite and whatever?


Sure, that's part of it, if those tools are used. Daemons like the particular flavor of syslog and cron are also part of it. Patched kernels used to be more common, too. I listed a bunch of things that actually broke for me before in a sibling thread; sometimes it was down to e.g. the Python packages that were in EPEL vs. the Python packages that were actually being maintained by their original authors in PyPI, or various security tools configured around paths that changed, etc. There were usually workarounds or alternatives, but they were more difficult to set up than doing things the "native" way.

I see! Thanks for referring to your sibling post, that definitely made clearer what you're talking about.

And yeah if you package stuff against, e.g., the Python libs included in the distro (or EPEL), you essentially need to maintain a repo as a downstream repo of the distro, then rebuild the whole repo with whatever subsequent release as a new upstream when it's time to upgrade. That kind of thing is doable but it's substantial integration work, and if it's something you do once a decade nobody is ever going to be fluent in it when it's time to be done.

I think I'd rather just maintain two repos— one against the latest stable release and one against the upstream rolling release (Fedora Rawhide, Debian Unstable, openSUSE Factory or Tumbleweed, etc.)— and upgrade every 6 months or whatever than leap the wider chasms between LTS releases.

And yeah the Python and Python libs shipped in a distro are generally there for the distro's integration purposes, which may involve different goals and constraints than app developers usually have. Building against whatever a distro ships with is not always the best way, as your painful migrations demonstrated.


> There's also an even bigger problem that can arise, the distribution can just end, such as the termination of CentOS

If you are doing something serious you probably want to choose suppliers in such a way that you can demonstrate you have security and business continuity under control. That means you probably want to use RHEL, SUSE or Ubuntu, distributions for which commercial support exists.

(Ubuntu is particularly interesting because you can start with an LTS release for free and activate commercial support if business goes well, without changing your processes.)

You can think about this beforehand or wait until customers require some kind of certification and the auditors ask you for your suppliers list + the business continuity plan, among other things. You will face this if you deliver to a regulated market or if your customers are large enough to self regulate this kind of thing.

LTS not good enough? Well, cloud native does not have an LTS commitment, and PyPI does not provide security fixes separated from logical changes.

Try to keep your Terraform code stable for two years in AWS, or try to understand the lifecycle of AWS Glue versions from the docs. Or trust that Google will not discontinue their offers :-)

I mean, maintaining software is never easy or effortless but I respect the effort done by LTS Linux providers - they sell stability and security for a fraction of what you pay for cloud native.


apache -> nginx. Python versions. postgres. All fine.

Did you crossgrade to Debian in-place?

What is it that people do that breaks so often due to lack of backwards compatibility from the OS?

IMO, the lure of an LTS is that you don't need to keep testing whether your computer still works every week when a set of updates comes. It's not that the details your software depends on remain frozen. If your software depends on the details of something, you should add it as a dependency.


The bigger problem IMO is not that things break, it's that if you depend on one LTS release too heavily, and you wait too long to migrate from one LTS to another, everything breaks all at once.

What should be a gradual migration as new things develop turns into a singular nightmare.


What are you depending on the OS that isn't extremely backwards compatible?

Once in a decade you get something like a breaking upgrade of nginx, or the glibc debacle of 2003. That may take a person-week to fix[1], which can hardly be called "herculean".

1 - That's if you go with 1 person * 1 week; if you try to go with 7 people * 1 day, it will suddenly cost 7 person-weeks. But the only reason upgrading would be in such a hurry is if you borked a lot of things prior to it.


Off the top of my head, some of the things that have broken at an LTS transition that I've been involved with are out-of-tree kernel module builds, C code using OpenSSL, Puppet config, Salt config, RPM specfiles, Python code, Perl code, Apache configs, shell scripts, Java code, bootloader configs, bootstrap scripts, and init scripts/configs (esp. sysvinit to systemd). Any one of these things is not a problem in isolation, the problem is due to having to fix all of them all at once. Too much complexity put into any one of them (often arising from external requirements or rushed implementations) also makes migrating harder. Waiting until the 11th hour on the EOL clock just adds to the stress of the process.

Many of my bad experiences were because of corporate policies and lack of proper prioritization at levels above system administration. However, the sysadmin does have some choice in the matter, especially when greenfielding. You can turn stability into a vice if you're not careful.


You said it: Your versions were locked. Therefore it is not constantly up-to-date.

I got bitten by this myself: security.

- With the cloud threats, everything needs to be constantly up-to-date. Docker images make it easier than permanent servers that need to be upgraded. We used to upgrade every week, now we're up-to-date by default. So yes, sometimes our images don't start with the latest version of xyz. But this is rare, downgrading is easy with Docker, and reproduction on a dev machine is easier.

- With the cloud threats, everything needs to be isolated. Docker makes it easy to have an Alpine with no other executable than strictly necessary, and only open ports to the required services.

I hate the cloud because 4GB/2CPU should be more than enough to run extremely large workloads, but I had to admit that convenience made me switch.


We did upgrades periodically, each time a conscious choice after reviewing the release notes of the dependency. Occasionally a script would need to be updated, but that was it.

What needs to be constant is reviewing the new patches and deciding which ones get rolled out and which stay locked.

The versions that are not locked can be a test or dev environment that constantly updates and checks for errors.

Security threats are a thing, but how we do and don't use technologies, and which ones we choose, also factors into how much is exposed.


A container is locking the whole OS; on this axis it's not an improvement in either direction. You still need a way to update deps.

To be fair there's real issues with this approach, too. For example, shell scripts aren't actually very portable. GNU awk vs nawk vs... multiply that by all your tools, and yeah those scripts don't run deterministically (they rely too much on the environment). This alone was a big reason why systemd exists today.

But there's a middle ground here too. To me there's a HUGE gap between Kubernetes distributed systems and shell script free for all.


reproducibility isn't just for your deployments, it's for development too. got old REAL fast when your fancy build doesn't work the same on every dev's device, or some one-off issue with how a dev has set up their environment steals hours from everyone.

it was a big reason why we moved to containers at the bare minimum, because it's quick and easy to spin up and destroy, and you are guaranteed that what runs locally runs in prod. no more "well it worked on my system!".


>reproducibility isn't just on your deployments, it's for development too

Absolutely. Ad hoc configurations should be forbidden! It is easy to ensure dev env reproducibility when you run Linux. If you have config management, your devs can have VMs that subscribe to the same exact configuration that the staging, prod and dev environments have. They can literally have a deployment server on their machine, as a VM. Since the configuration is stored on a server and applied continuously, it is hard to screw it up.

You can achieve this with Docker as well, if the arrangement is not too complex.

The problem, at least in my experience, comes when you start depending on several cloud native components where local emulations are always different from the real cloud env in tiny details that are going to screw the deploys over and over.


Wouldn't there be slight differences between different Unix flavors, so that the script couldn't run on all of them? If it only worked on Solaris, what would happen if Solaris retired? (Like what happened to CentOS)

You will likely have to adapt your scripts for OS-specific or installation-specific tasks like package management and modifying filesystems. In the past I've used Nix (either via `nix run` and `nix shell` or templating in Nixpkgs' `writeScript` or similar) for this stuff to guarantee that I'm always running the same tools regardless of what's installed on the base system. This can free you up to use a different shell, rely on recent features of Bash, use GNUisms in coreutils, sed, grep, find, etc., fix a specific version of jq, use external templating tools, etc. For systemd-based distros, you can even use Nix to manually install system-wide services: just install a package to the default or system profile, and then symlink the included unit files from the profile (not the direct store path) into /etc/systemd. `systemctl daemon-reload` and you can manipulate them in all the usual ways one would with systemd.
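A minimal sketch of the systemd part of that trick, assuming a system-wide Nix install with the default profile at /nix/var/nix/profiles/default and a package (caddy here, purely as an example) that actually ships a unit file in its output - not all nixpkgs packages do:

# install into the default (system) profile; assumes the nix-command/flakes features are enabled
nix profile install --profile /nix/var/nix/profiles/default nixpkgs#caddy

# symlink the bundled unit from the profile (not the raw store path), so later
# profile upgrades are picked up without touching /etc again
ln -s /nix/var/nix/profiles/default/lib/systemd/system/caddy.service /etc/systemd/system/caddy.service

systemctl daemon-reload
systemctl enable --now caddy.service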

Other Unix distros don't have first-class support with Nix so you may need to take some additional care when working out your script (especially the part of it that installs Nix), but if you don't need to set up services this way you can write portable scripts with few limitations that will work across all Linux distros, macOS, probably FreeBSD and maybe NetBSD.

I've never been so lucky as to work at a place that used any Unix flavors other than Linux and macOS, though.


That's what POSIX was for. Keep your scripts and system calls POSIX compliant and you could move from something like AIX to Linux easily.

POSIX never specified things like disk partitioning or package management, so this still requires something else to give you a working system in the first place.

You know what happened when CentOS retired? Nothing, for us. We still use CentOS 7 at work as we speak.

Depends on where you are in the ecosystem. If you're running your own service, the only flavors that matter are the ones you're using.

If all my machines are FreeBSD 4.11, I don't care if my scripts don't run on Linux or Solaris or SCO or even FreeBSD 4.8 or 14. I might care someday, but not today.

Maintenance scripts need to run on all the versions in the fleet (usually), but setup scripts can often be limited to the latest version, because why not use the latest OS if you're setting up a new machine.

If you're distributing software, yeah you've got to support a lot of variation. If you're at a shop that runs lots of different flavors, you have to support lots of variation. But a lot of people just pick a flavor and update the scripts as needed when the flavor of the day changes.

Trying to keep dependencies and running services as tight and small as possible helps a lot with keeping up to date on security. Don't need to update things that aren't installed, and may not need to update things that are installed but not running (but sometimes you do).


I feel like Kubernetes is always randomly mentioned in rants like this. Instead of saying your hardened VM has Docker you could have just said it has kubelet on it. Then instead of a bunch of ad hoc "docker services" you could pay pennies for a k8s control plane that gives you control over everything on those VMs. I fail to see how your way is anything but worse.

The bad cloud infrastructure is when people try to use every single thing AWS sells and their whole infrastructure is at super high levels of abstraction that they could never migrate to another platform. K8s isn't that at all.


Unfortunately in air-gapped systems you cannot simply pay pennies for a managed k8s platform. In these cases you have to bootstrap and manage k8s on your own in your data centers. While I do not think bootstrapping and managing a cluster is difficult at all (especially if you only handle stateless workloads), it may still not fit or integrate well with a company's overall management infrastructure.

While I am a happy cloud infrastructure user in private, I have to go through some extra hoops to deploy applications at work, regardless of whether k8s is used or not.


I think in either case, if you already have code that's done, using that is going to be less effort than switching.

However, I ran kubeadm on a Hetzner server and it has just sat chugging along forever, basically. I use the cluster to run ephemeral apps where I build and deploy 1 golang service and a couple of node services in about 60 seconds (with cache, obviously).

As someone old enough and skilled enough to do the same with Puppet, why bother, when it's simple enough that even the kids who don't understand TLS can do it with k8s?


100% best comment in this thread.

With k8s you get a way of saying 'WHAT YOU WANT' without 'HOW TO DO IT', and this applies not only to the actual infra aspect, but to the people maintaining it too. Any cloud platform or devops engineer worth their salt can maintain a k8s system. Good luck finding someone to understand what that 'custom Naemon' plugin is doing.


> Good luck finding someone to understand what that 'custom Naemon' plugin is doing.

You Kubernetes people get triggered very easily. I have been lucky enough to find several juniors who worked on this kind of thing with minimal training. The 'custom Naemon plugin' is 30 lines of bash and you can adapt it to any monitoring system.

Of course this is scary and complicated. I might consider switching to 'Kubernetes operators', which sounds simpler :-)


I've done all of this and then some. I used to deploy websites by FTPing into the server and copying files. Then it was bash scripts, then Ansible. IMO Kubernetes hits a very good level of abstraction. You can totally deploy 30 lines of bash to every server, you just have to wrap it in a docker container. That's all k8s asks for for a workload. You don't have to use operators. That would be something to explore much later. Honestly I just think you should be more generous and not assume people have created this stuff just for fun. K8s really does address real problems around deployment and it's very well thought out.

To be fair, in other comments OP made an effort not to get involved in those endless Kubernetes vs VM discussions. However, either side eventually posts a snarky comment and there it goes.

I think everyone just has to acknowledge that there are use cases for both. Also Kubernetes and "classic" configuration management via Ansible (or others) are orthogonal to each other. So these discussions are somewhat misguided in the first place.

For example: you might want to deploy a VM or auto-install and configure a physical machine with custom tooling and something like Ansible or Puppet, and _then_ configure said machine as a Kubernetes node that handles the actual workloads. In other cases some dev might want to install and run an application without the k8s layer, using Nginx as a webserver. In this case, too, Puppet/Ansible might or might not be involved in configuring the application, or might only handle the "OS layer" if there is such a thing. And in yet other cases you get away with a simple cloud-init script that makes your machine a k8s node and leave out other configuration management tools altogether.

Guess what: All of this is fine. Evaluate solutions based on what you need, not what other people working in giant corporations urge you to use. And then go and build it, ideally having fun doing it.

Representing either tool as one-size-fits-all is misleading at best, and an overly simplistic answer to the complex problem of deploying your applications.


> Honestly I just think you should be more generous

I am generous in contexts that call for generosity. Turns out that engineering is not about being generous but rather about choosing the most efficient solution for problems that, in the end, need to be business driven. This requires evaluating requirements, context and tradeoffs. That takes a cold, rational mind more than generosity.

> K8s really does address real problems around deployment and it's very well thought out

It's great where it makes sense. It's less than great elsewhere.

Not everything is SaaS, not everything needs scaling, not everything needs 99.99% uptime, not everything needs a CDN, not every company is VC-backed operating at high risk / high reward, etc, etc. Context is better than ideology. If you read the article I posted you will see that stated clearly.


I completely agree that most people don't need that. This is always what people say when k8s comes up. This is also what people said about git 15 years ago (you're not the kernel etc). But the thing is you don't have to use any of the bits you don't need. At first I listened to the naysayers and was wary of k8s thinking it would create more problems than it solves. That simply hasn't been the case. It's not a no-brainer, there are tradeoffs, but I really think it makes sense especially if you're doing docker anyway. Like I said in another comment, people tend to talk about two different things. There's k8s which can be as little as just a single node k3s server which is basically docker compose with a few extras like automatic rollout etc. Then there's the over the top "cloud native" stuff. One does not imply the other.

How do you monitor this setup?

How do you control access to this setup?

How do you deploy on a different provider to Hetzner?

How do you access logs on this setup?

How do others maintain this setup?

How do you run backups?

How do you run cron jobs?

How do you deal with an offline node?

How do you expose a new ingress?

How do you provision extra storage on this setup?

If any of those is answered with 'something homegrown' or 'just write a script' then you have all the reasons k8s is worth it.


The questions are short but the answers would be long. Puppet manages all fine grained OS resources (files, dirs, repos, cronjobs, sudo declarations, firewall rules, etc) and you aggregate those resources into classes which are then pushed to different machines. The classes are parametrizable for the differences between systems.

If I was to write an idempotent script for each native resource I would finish in some years :-)

You choose whatever monitoring system you like the most.

For offline nodes you use whatever the level of criticality of your node justifies. This is something people struggle to understand: not every business needs 99.99% uptime. That said, I never had downtime on Hetzner. On DigitalOcean I had one short forced reboot in 4 years. YMMV, so protect yourself as much as necessary.

Deploying on a different provider than Hetzner is the same as deploying on Hetzner except the part of launching the machine which is trivial to script - the added value is making the machine work and Ubuntu/Debian/RHEL are the same everywhere. You don't have vendor lock in with this.

If K8s works for you, enjoy it. Nobody is telling you to stop :-)


Hetzner and Kubernetes are not mutually exclusive.

- https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne...

- https://www.hetzner.com/hetzner-summit --> "Managed Kubernetes Insights and lessons learned from developing our own Kubernetes platform"


Serious question for you, why use Docker at all? You can just get rid of the clunky overhead.

You mentioned a Python backend, so literally just replicate the build script directly on the VPS: "pip install -r requirements.txt" > "python main.py" > "nano /etc/systemd/system/myservice.service" > "systemctl start myservice" > tada.

You can scale instances by just throwing those commands into a bash script (build_my_app.sh) = your new Dockerfile... install on any server in xx-xxx seconds.
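A minimal sketch of what such a build_my_app.sh could look like (repo URL, paths and the unit file contents are hypothetical):

#!/usr/bin/env bash
set -euo pipefail

# fetch the code and install pinned dependencies into a venv
git clone https://git.example.com/myapp.git /opt/myapp
python3 -m venv /opt/myapp/venv
/opt/myapp/venv/bin/pip install -r /opt/myapp/requirements.txt

# register and start the service
cat > /etc/systemd/system/myapp.service <<'EOF'
[Unit]
Description=My app
After=network-online.target

[Service]
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python main.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now myapp.service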


I mentioned Docker because it interests many developers but on VMs that I control I do not need Docker at all. Deploying with Docker provides host OS independence which is nice if you are distributing but unnecessary if the host is yours, running a fixed OS.

For Python backends I often deploy the code directly with a Puppet resource called VcsRepo which basically places a certain tag of a certain repo on a certain filesystem location. And I also package the systemd scripts for easy start/stop/restart. You can do this with other config management tools, via bash or by hand, depending on how many systems you manage.
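For reference, a vcsrepo declaration (from the puppetlabs/vcsrepo module) looks roughly like this; shown here as a one-off puppet apply with a hypothetical repo and tag, though normally it would live in a role/profile manifest:

# requires the puppetlabs-vcsrepo module and git on the node
puppet apply --execute '
  vcsrepo { "/opt/myapp":
    ensure   => present,
    provider => git,
    source   => "https://git.example.com/myapp.git",
    revision => "v1.4.2",
  }
'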

What bothers me with your question is Pip :-) But perhaps that is off topic...?


No, you are tied to docker supported operating systems.

Will not run on FreeBSD, for example.


>Will not run on FreeBSD, for example.

Not true:

https://podman.io/docs/installation#installing-on-freebsd-14...

ATM experimental


Yes, so not really supported.

That's the lamest excuse ever, are you a tech guy or a lawyer?

I'll correct myself:

s/host OS independence/a certain level of host OS independence/

And getting containers to run depends on the OS - if you don't control the host, that leads to major ping-pong.

Even within Linux (Ubuntu, Debian, RHEL, etc) when you are distributing multiple related containers there are details to care about, not about the container itself but about the base OS configuration. It's not magic.


OP is talking about substituting a Kubernetes setup. FreeBSD was never in the cards. 99% of companies in the cloud don’t run or care about anything other than Linux.

That may be true, but it’s still not “host OS independence”, which was my point

> No, you are tied to docker supported operating systems

No, you're tied to operating systems using a Linux kernel that supports the features necessary for running images.


You can run Linux under FreeBSD using bhyve, the Linux emulator, or jails. But you cannot run docker.

>But you cannot run docker.

You can -> Podmaaan

https://podman.io/docs/installation#installing-on-freebsd-14...

ATM experimental


Famously, no one has ever had Python environment problems :D

If you really want to open that can of worms, here it goes:

PyPI is an informal source of software that has low security levels and was infested with malware many times over the years. It does not provide security updates: it provides updates that might include security-related changes as well as functional changes. Whenever you update a package from there, there is a chain reaction of dependency updates that inserts untested code into your product.

Due to this, I prefer to target an LTS platform (Ubuntu LTS, Debian, RHEL...) and adapt to whatever python environment exists there, enjoying the fact that I can blindly update a package due to security (ex: Django) without worrying that it will be a new version which could break my app. *

Furthermore, with Ubuntu I can get a formal contract with Canonical without changing anything in my setup, and with RHEL it comes built in with the subscription. Last time I checked, Canonical's security team was around 30 people (whereas PyPI recently hired their first security engineer). These things provide supply-chain peace of mind to whoever consumes the software, not only to whoever maintains it.

I really need to write an article about this.

* exceptions apply, context is king


I've just doubled down on "making my own Debian packages".

There's tons of examples, you are learning a durable skill, and 90% of the time (for personal stuff), I had to ask myself: would I really ever deploy this on something that wasn't Debian?

Boom: debian-lts + my_package-0.3.20240913

...the package itself doesn't have to be "good" or "portable", just install it, do your junk, and you don't have to worry about any complexity coming from ansible or puppet or docker.

However: docker is also super nice! FROM debian:latest ; RUN dpkg -i my_package-*.deb

...it's nearly transparent management.
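Spelled out a little more fully (the .deb has to be copied into the image before it can be installed; run this in the directory that contains the package, name taken from the example above):

cat > Dockerfile <<'EOF'
FROM debian:bookworm-slim
COPY my_package_0.3.20240913_all.deb /tmp/
# apt-get install on a local .deb also pulls in its declared dependencies
RUN apt-get update \
 && apt-get install -y /tmp/my_package_0.3.20240913_all.deb \
 && rm -rf /var/lib/apt/lists/*
EOF
docker build -t my_package:0.3.20240913 .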


I don't mean this as a rebuttal, but rather to add to the discussion. While I like the idea of getting rid of the Docker layer, every time I try to I run into things that remind me why I use Docker:

1. Not needing to run my own PPA server (not super hard, it's just a little more friction than using Docker hub or github or whatever)

2. Figuring out how to make a deb package is almost always harder in practice for real world code than building/pushing a Docker container image

3. I really hate reading/writing/maintaining systemd units. I know most of the time you can just copy/paste boilerplate from the Internet or look up the docs in the man pages. Not the end of the world, just another pain point that doesn't exist in Docker.

4. The Docker tooling is sooooo much better than the systemd/debian ecosystem. `docker logs <container>` is so much better than `sudo journalctl --no-pager --reverse --unit <systemd-unit>.service`. It often feels like Linux tools pick silly defaults or otherwise go out of their way to have a counterintuitive UI (I have _plenty_ of criticism for Docker's UI as well, but it's still better than systemd IMHO). This is the biggest issue for me--Docker doesn't make me spend so much time reading man pages or managing bash aliases, and for me that's worth its weight in gold.
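For what it's worth, the two invocations being compared come down to roughly this (container and unit names are placeholders; journalctl may also need sudo or membership in the systemd-journal group):

# follow the last hour of logs for a container vs. a unit
docker logs --since 1h -f myapp
journalctl -u myapp.service --since "1 hour ago" -f

# the verbosity can be hidden behind an alias if you stay on systemd
alias jlog='journalctl --no-pager --reverse --unit'
jlog myapp.service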


Yuuup! I'm super-small time, so for me it's just `scp *.deb $TARGET:.` (no PPA, although I'm considering it...)

Really, my package is currently mostly: `Depends: git, jq, curl, vim, moreutils, etc...` (ie: my per-user "typically installed software"), and I'm considering splitting out: `personal-cli`, `personal-gui` (eg: Inkscape, vlc, handbrake, etc...), and am about to have to dive in to systemd stuff for `personal-server`, which will do all the caddy, https, and probably cgi-bin support (mostly little home automation scripts / services).

I'm 100% with you w.r.t. the sudo journalctl garbage, but if you poke at cockpit https://www.redhat.com/sysadmin/intro-cockpit - it provides a nice little GUI which does a bunch of the systemd "stuff". That's kindof the nice tag-along ecosystem effects of "just be a package".

I'm definitely relatively happy with docker overall, but there's useful bits in being more closely integrated with the overall package system management (apt install ; apt upgrade ; systemctl restart ; versions, etc...), and the complexity that you learn is durable and consistent across the system.


In situations at work where we use something as an alternative to Docker as a deployment target, it's Nix. That has its own problems and we can talk about them, but in the context of that alternative I think some of your points are kinda backwards.

> 1. Not needing to run my own PPA server (not super hard, it's just a little more friction than using Docker hub or github or whatever)

Docker actually has more infrastructure requirements than alternatives. For instance, we have some CI jobs at work whose environments are provided via Nix and some whose environments are provided by Docker. The Docker-based jobs all require management of some kind of repository infrastructure (usually an ECR). The Nix-based jobs just... don't. We don't run our own cache for Nix artifacts, and Nix doesn't care: what it can find in the public caches we use, it does, and it just silently and transparently builds whatever else it needs (our custom packages) from source. They get built just once on each runner and then are reused across all jobs.

> 2. Figuring out how to make a deb package is almost always harder in practice for real world code than building/pushing a Docker container image

Definitely depends on the codebase, but sure, packaging usually involves adhering to some kind of discipline and conventions whereas Docker lets you splat files onto a disk image via any manual hack that strikes your fancy. But if you don't care about your OCI images being shit, you might likewise not care about your DEB packages being shit. If that's the case, you can often shit out a DEB file via something like fpm with very little effort.

> 3. I really hate reading/writing/maintaining systemd units. I know most of the time you can just copy/paste boilerplate from the Internet or look up the docs in the man pages. Not the end of the world, just another pain point that doesn't exist in Docker.

> 4. The Docker tooling is sooooo much better than the systemd/debian ecosystem. `docker logs <container>` is so much better than `sudo journalctl --no-pager --reverse --unit <systemd-unit>.service`. It often feels like Linux tools pick silly defaults or otherwise go out of their way to have a counterintuitive UI (I have _plenty_ of criticism for Docker's UI as well, but it's still better than systemd IMHO). This is the biggest issue for me--Docker doesn't make me spend so much time reading man pages or managing bash aliases, and for me that's worth its weight in gold.

I don't really understand this preference; I guess we just disagree here. Systemd has been around for like a decade and a half now, and ubiquitous for most of that time. The kind of usage you're talking about is extremely well documented and pretty simple. Why would I want a separate, additional interface for managing services and logs when the systemd stuff is something I already have to know to administer the system anyway? I also frequently use systemd features that Docker just doesn't have, like automatic filesystem mounts (it can do some things fstab can't), socket activation, user services, timers, dependency relations between units, describing services that should only come up after the network is up, etc. Docker's tooling really doesn't seem better to me.


> Docker actually has more infrastructure requirements than alternatives.

I was mostly comparing Docker to system packages, and I was specifically thinking about how trivial it is to use Docker Hub or GitHub for image hosting. Yeah, it's "infrastructure", but it's perfectly fine to click that into existence until you get to some scale. I would rather do that than operate a debian package server. Agreed that Nix works pretty well for that case, and that it has other (significant) downsides. I'm spiritually aligned with Nix, but Docker has repeatedly proven itself more practical for me.

> Definitely depends on the codebase, but sure, packaging usually involves adhering to some kind of discipline and conventions whereas Docker lets you splat files onto a disk image via any manual hack that strikes your fancy. But if you don't care about your OCI images being shit, you might likewise not care about your DEB packages being shit. If that's the case, you can often shit out a DEB file via something like fpm with very little effort.

I'm not really talking about "splatting files via manual hack", I'm talking about building clean, minimal images with a somewhat sane build tool. And to be clear, I really don't like Docker as a build tool, it's just far less bad than building system packages.

> don't really understand this preference; I guess we just disagree here. Systemd has been around for like a decade and a half now, and ubiquitous for most of that time.

Yeah, I don't dispute that systemd has been around and been ubiquitous. I mostly think its user interface is hot garbage. Yes, it's well documented that you can get rid of the pager with `--no-pager` and you can put the logs in a sane order with `--reverse` and that you specify the unit you want to look up with `--unit`, but it's fucking stupid that you have to look that stuff up in the man pages at all, never mind type it every time (or at least maintain aliases on every system you operate), when it could just do the right thing by default. And that's just one small example, everything about systemd is a fractal of bad design, including the unit file format, the daemon-reload step, the magical naming conventions for automatic host mounts, the confusing and largely unnecessary way dependencies are expressed, etc ad infinitum.

> Why would I want a separate, additional interface for managing services and logs when the systemd stuff is something I already have to know to administer the system anyway?

I mean, first of all I'm talking about my preferences, I'm not trying to convince you that you should change, so if you know and like systemd and you don't know Docker, that's fine. And moreover, I hate that I have to choose between "an additional layer" and "a sane user interface", but having tried both I've begrudgingly found the additional layer to be the much less hostile choice.

> I also frequently use systemd features that Docker just doesn't have, like automatic filesystem mounts (it can do some things fstab can't), socket activation, user services, timers

Yeah, I agree that Docker can't do those things. I'm not even sure I want it to do those things. I'm talking pretty specifically about managing my application processes. But yeah, since you mention it, fstab is another technology that has been around for a long time, is ubiquitous, and is still wildly, unnecessarily hostile to users (it can't even do obvious things like automounting a USB device when it's plugged in).

> ... dependency relations between units, describing services that should only come up after the network is up, etc. Docker's tooling really doesn't seem better to me.

Docker supports dependency relations between services pretty well, via its Compose functionality. You specify what services you want to run, how to test their health, and how they depend on each other. You can have Docker restart them if they die so it doesn't really matter if they come up before the network (but I've also never had a problem with Docker starting anything before the network comes up)--it will just retry until the network is ready.

Docker's tooling is better in its design, not necessarily a more expansive featureset. It has sane defaults, so if you do `docker logs <container>` you get the logs for the container without a pager and sorted properly--you don't need to remember to invoke `sudo` or anything like that assuming you've followed the installation instructions. Similarly, the Compose file format is much nicer to work with than editing systemd units--I'm not a huge fan of YAML, but it's much better than the INI format for the kind of complex data structures required by the domain. It also doesn't scatter configs across a bunch of different files, it doesn't require a daemon-reload step, the files aren't owned by root by default, they're not buried in an /etc/systemd/system/foo/bar/baz tree by default, etc.

Like I said, I don't think Docker is perfect, and I have plenty of criticism for it, but it's far more productive than dealing with systemd in my experience.
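As a sketch of the Compose behaviour described above (image names, the health probe and ports are hypothetical), one service can be gated on another service's health check like this:

cat > compose.yaml <<'EOF'
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 12
  app:
    image: registry.example.com/myapp:latest
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped
    ports:
      - "8080:8080"
EOF
docker compose up -d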


This is the way. And truthfully if you can learn to package for Debian, you already know how to package for Ubuntu and you can easily figure out how to package for openSUSE or Fedora or Arch.

Even `alien` or I think ~suckless package manager~ `fpm` for 90% of things.

Option 1: python3 -m venv venv > source project/venv/bin/activate

Option 2: use Poetry

How is this different from a Dockerfile that is creating the venv? Just add it to the beginning, just like you would on localhost. But that is why I love to code Python in PyCharm: it manages the venv in each project on init.


My comment about pip is orthogonal to Docker. This is the same with or without Docker - I added a comment on this thread with more detail.

> why use Docker at all?

We have a simple cloud infrastructure. Last year, we moved all our legacy apps to a Docker-based deployment (we were already using Docker for newer stuff). Nothing fancy—just basic Dockerfile and docker-compose.yml.

Advantages:

- Easy to manage: we keep a repo of docker-compose.yml files for each environment.

- Simple commands: most of the time, it’s just "docker-compose pull" and "docker-compose up."

- Our CI pipeline builds images after each commit, runs automated tests, and deploys to staging for QA to run manual tests.

- Very stable: we deploy the same images that were tested in staging. Our deployment success rate and production uptime improved significantly after the switch—even though stability wasn’t a big issue before!

- Common knowledge: everyone on our team is familiar with Docker, and it speeds up onboarding for new hires.
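The per-environment deploy step in a setup like that can stay this small (a sketch, assuming the CI pipeline has already pushed tested images to a registry the host can pull from):

# run from the directory holding that environment's docker-compose.yml
docker compose pull        # fetch the images CI built and tested
docker compose up -d       # recreate only the containers whose image or config changed
docker image prune -f      # optionally reclaim space from superseded images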


Python, Ruby, and to a much larger extent PHP are the Docker showcase!

For example, if you have a program that uses wsgi and runs on python 2.7, and another wsgi program that runs on python 3.16, you will absolutely need 2 different web servers to run them.

You can give different ports to both, and install an nginx on port 80 with a reverse proxy. But software tends to come with a lot of assumptions that make ops hard, and they will often not like your custom setup... but they will almost certainly like a normal docker setup.


I think a lot of (justifiable) Docker use comes out of being forced to use other tools & ecosystems that are fundamentally messy and not really intended for galactic-scale enterprise development.

I have found that going all-in with certain language/framework features, such as self-contained deployments, can allow for really powerful sidestepping of this kind of operational complexity.

If I was still in a situation where I had to ensure the right combination of runtimes & frameworks are installed every time, I might be reaching for Docker too.


Dockerfiles compose and aren't restricted to running on Linux. Those two reasons alone basically mean I never need to care about systemd again.

Yeah, not caring about systemd is a big win for me. And I don't just mean the cryptic systemd unit syntax, but also the absolutely terrible ux of every CLI tool in the suite. I'm tired of having to pass half a dozen flags every time I want to view the logs of a systemd unit (or forgetting to type `sudo` before `systemctl`). I'm tired of having to remember the path to the systemd unit files on each system whenever I need to edit the files (is it `etc/systemd/system/...` or `etc/system/systemd/...`?). Docker is far from perfect, but at least it's intuitive enough that I don't have to constantly reference man pages or manage aliases.

I would love to do away with the Docker layer, but first the standard Linux tooling needs to improve a lot.


Honestly most people's dockerfile could just as well be a bash script.

I find Dockerfiles even simpler to work with than bash scripts.

Thing is, for many people they are just bash scripts with extra steps.

I am under the impression that those using Docker are those using shitty interpreted languages that fail hard on version incompatibilities, with Docker being used for version isolation as a workaround. How would a bash script help?

You don't run a Dockerfile on every machine, and a bash script doesn't produce an image. They're not even solving the same problem.

So many people only need one machine. And these people certainly don't need an image.

Exactly! This person gets it.

Oh, and not only build their app, they can take it a step further and set up the entire new VPS and the app build in one simple script!


I feel y’all are too focused on the end product.

I deploy to pared down bare metal, but I use containerization for development, both local and otherwise, for me and contributors.

So much easier than trying to get a local machine to be set up identically to a myriad of servers running multiple projects with their idiosyncratic needs.

I like developing on my Qubes daily driver so I can easily spin up a server-imitating VM, but if I'm getting your help, especially without paying you, then I want development to be as seamless as possible for you, whatever your personal preferred setup.

I feel containerization helps with that.


Once you do it for long enough it might be worth it to consider configuration management where you declare native structured resources (users, firewall rules, nginx reverse proxies, etc) rather than writing them in shell.

I use Puppet for distribution of users, firewall rules, SSH hardening + whitelisting, nginx config (rev proxy, static server, etc), Let's Encrypt certs management + renewal + distribution, PostgreSQL config, etc.

The profit from this is huge once you have say 20-30 machines instead of 2-3, user lifecycle in the team that needs to be managed, etc. But the time investment is not trivial - for a couple of machines it is not worth it.


Honestly not having to use Puppet or Ansible are among my reasons for using Docker. I do some basic stuff in cloud-init (which is already frustrating enough) to configure users, ssh, and docker and everything else is just standard Docker tooling.

Which is fine if it works well for you.

The point of this discussion is clear: complexity adds extra ops work, so the gains obtained from additional complexity need to compensate for that extra work.

Detailed config management has a learning curve and pays off only from a certain fleet size on.

Dedicated hardware pays off at a larger scale.

Complex cloud native arrangements pay off when... [left as an exercise for the reader].


> I do some basic stuff in cloud-init (which is already frustrating enough)

What do you find frustrating about cloud-init? I'm relatively new to it.


I'm doing it :)

I split it into multiple scripts that get called from one, just for my own sanity.


Because it may seem non-obvious, but docker always saves you. It's actually quicker than running pip install -r requirements.txt once you get a year in. (Trust me, I used to take your approach.)

Forget about "clunky overhead" - the running costs are < 10%. The Dockerfile? You don't even need one. You can just pull the Python version you want, e.g. Python 3.11, and git pull your files inside the container to get up and running. You don't need to use container image saving systems, you don't need to save images or tag anything, you don't need to write setup scripts in the Dockerfile, and you can pass the database credentials through the environment option when launching the container.

The problem is that after a year or two you get clashes or weird stuff breaking, and modules stop supporting your Python version, preventing you from installing new ones. Case in point: Google's AI module (needed for Gemini and lots of their AI API services) only works on 3.10+. What if you started in 2021? Your Python - then cutting edge - would not work anymore, and it's only 3.5 years since that release. Yeah, you can use loads of curl. Good luck maintaining that for years though.

Numpy 1.19 is calling np.warnings but some other dependency is using Numpy 1.20 which removed .warnings and made it .notices or something.

Your cached model paths for transformers changed their default directory.

You update the dependencies and it seems fine, then on a new machine you try and update them, and bam, wrong python version, you are on 3.9 and remote is 3.10, so it's all breaking.

It's also not simple in the following respect: your requirements.txt file will potentially have dependency clashes (despite running code), might take ages to install on a 4GB VM (especially if you need pytorch because some AI module that makes life 10x easier rather needlessly requires it).

Life with docker is worth it. I was scared of it too, but there are a few key benefits for the everyman / solo dev:

- Literally docker export the running container as a .tar to install it on a new VM. That's one line and guaranteed the exact same VM, no changes. That's what you want, no risks.

- Backup is equally simple: a shell script to download regular backups. Updating is simple: a shell script to update the git repo within the container. You can docker export it to investigate bugs without affecting the running production container, giving you an instant local dev environment as needed.

- When you inevitably need to update python you can just spin up a new VM with the same port mapping on Python 3.14 or whatever and just create an API internally to communicate, the two containers can share resources but run different python versions. How do you handle this with your solution in 4 years time?

- If you need to rapidly scale, your shell script could work fine, I'll give you that. But probably it takes 2 minutes to start on each VM. Do you want a 2 minute wait for your autoscaling? No you want a docker image / AMI that takes 5 seconds for AWS to scale up if you "hit it big".
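A rough sketch of the no-Dockerfile workflow described above (image, repo, credentials and ports are placeholders; note that docker import keeps the filesystem but not the original CMD/ENV):

# run straight from an upstream Python image, pass secrets via the environment
docker run -d --name myapp -p 8000:8000 \
  -e DATABASE_URL="postgres://app:secret@db.internal:5432/app" \
  python:3.11 \
  bash -c "git clone https://git.example.com/myapp.git /app && pip install -r /app/requirements.txt && python /app/main.py"

# flat-file backup of the container filesystem, restorable on another VM
docker export myapp > myapp-backup.tar
docker import myapp-backup.tar myapp:restored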


Clunky overhead from Docker?

Sorry, but you've got no idea what you're talking about.

You can also run OCI images, often called Docker images, directly via systemd's nspawn. Docker doesn't create overhead by itself; at its heart it's a wrapper around kernel features and iptables.

You didn't need docker for deployments, but let's not use completely made up bullshit as arguments, okay?


I have no idea what I am talking about? Docker is literally adding middleware between your Linux system and app.

That doesn't necessarily mean there aren't pros to Docker, but one con of Docker is that it's absolutely overhead and complexity that is not necessary.

I think one of the most powerful features of Docker by the way is Docker Compose. This is the real superpower of Docker in my opinion. I can literally run multiple services and apps in one VPS / dedicated server and have it manage my network interface and ports for me? Uhmmm...yes please!!!! :)


Docker's runtime overheads on Linux are tiny. It's pretty much all implemented using namespaces, cgroups and mounts which are native kernel constructs.

Well designed, well written and efficient... middleware. It's a wrapper around Linux and a middleman between my OS and my app! A spade is a spade.

There are cons beyond performance. For example, Docker complexity - you need to learn a new filetype, a new set of commands, a new architecture, new configurations, and spend hours reading another set of documentation. Buy and read another 300-page O'Reilly book to master and grasp something that, again, has pros and cons.

For me? It's not necessary and I even know some Docker Kung-Fu but choose not to use it. I do use Docker Desktop occasionally to run apps and services on my localhost - it's basically a Docker Compose UI, and I really enjoy it.


> It's a wrapper around Linux and a middleman between my OS and my app

No. Docker doesn't "wrap" anything, and it certainly does not wrap Linux. Please consider looking at the documentation. It uses native kernel features. systemd does a similar thing.

> For example Docker complexity - you need to learn a new filetype, a new set of commands, a new architecture, new configurations, spend hours reading another set of documentation

I can't say I agree.


A wrapper CLI that produces the same outcome wouldn't really be considered middleware - surely middleware would have to affect the runtime?

Docker is native Linux. Your app uses the same kernel as the host. Is "chroot" middleware? No. Neither is docker.

It does require a running daemon. Other solutions, like podman, do not. There is an overhead associated with docker.

Yes, but containers do not incur overhead because of the daemon. It is there for management purposes. In other words, system calls / network access / etc are not going "through" the daemon.

> Docker is literally adding middleware between your Linux system and app.

Not really, no. Docker just uses functionality provided by the Linux kernel for its exact use case. It's not like a VM.

> it's absolutely overhead and complexity that is not necessary.

This is demonstrably wrong. Docker introduces less complexity compared to system-native tools like systemd or Bash. Dockerfiles will handle those for you.

> I have no idea what I am talking about

I wouldn't say that. You seem to have strong puritanical opinions though.


O rly, pray tell, which middleware?

Your most powerful feature is literally a hosts file that Docker generates on container start, saved at /etc/hosts, plus iptables rules

Edit: and if you don't want them, use network mode "host" and voila, none of that is generated


>have it manage my network interface and ports for me

...and bypass the host firewall by default unless you explicitly bind stuff to localhost :-/

I don't particularly love or hate docker, but when I realized this, I decided to interact with it as little as possible for production environments. Such "convenient" defaults usually indicate that developers don't care about security or integrating with the rest of the system.


> docker doesn't create an overhead by itself

Yes it does, the Docker runtime (the daemon which runs under root) is horribly designed and insecure.


Insecure in what way? Rootful docker is a mature product that comes with seccomp and standard apparmor policies ootb!

It runs as root, requires sudo to use, turns off all system firewalls, and has no way of doing security updates for containers.

> It runs as root

A lot of system applications on a standard Linux machine run as root or run with rootful permissions. This problem is solved by sandboxing, confining permissions and further hardening.

> requires sudo to use

Yes. However, this is a security plus and not a disadvantage.

> turns off all system firewalls

This statement makes no sense.

> has no way of doing security updates for containers.

I don't know what you mean by this.


There isn't a "Docker runtime", and the daemon is not a runtime any more than systemd is a runtime. They're both just managing processes. If you want to argue that Docker containers have an overhead, you could maybe argue that the Linux kernel security features they employ have an additional overhead, but that overhead is likely to be marginal compared to a less secure approach and moreover since you're Very Concerned About Security™ I'm sure you would prefer to pay the security cost.

Duplicating a base Linux distribution a thousand times for every installed piece of software absolutely is overhead.

(Theoretically you could build bare images without pulling in Alpine or Ubuntu, but literally almost nobody ever does that. If you have the skills to build a bare Docker image then you don't need Docker.)


> Duplicating a base Linux distribution a thousand times for every installed piece of software absolutely is overhead.

You're not duplicating an entire distribution, just the user land that you want. Typically we use minimal user lands that just have certs and /etc/passwd and maybe `sh`. And to be clear, this is mostly just a disk overhead, not a CPU or memory performance overhead.

> Theoretically you could build bare images without pulling in Alpine or Ubuntu, but literally almost nobody ever does that

Yeah, we do that all the time. Google's "distroless" images are only about 2MiB. It's very commonly used by anyone who is remotely concerned about performance.
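For example (size is approximate and varies by tag):

  docker pull gcr.io/distroless/static-debian12
  docker images gcr.io/distroless/static-debian12   # roughly a couple of MiB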

> If you have the skills to build a bare Docker image then you don't need Docker.

Building a bare Docker image isn't hard, and the main reason to use Docker in a single-host configuration is because Docker utilities are just far, far saner than systemd utilities (and also because it's just easier to distribute programs as Docker images rather than having to deal with system package repos and managers and so on).


I'm with you, but for me Cloud does have one major benefit:

If you use it as IaaS, it's a lot quicker to get prototypes working than if you use anything else, including VPS's from other providers.

Google Cloud in particular has very few vectors for lock-in, and follows the principle of least surprise more than others.

But once you have prototyped, you should ask the question about rebuilding it somewhere that is cheaper.

Near infinite scalability of disk drives is nice, and snapshotting, and cloud in general can allow you to extend your prototype into taking production load and allowing you to measure what you will need; but leaning in to "cloud magick" (cloud run, lambdas, etc) will consume almost as much time to learn and debug as just doing it the old school way anyway. In my lived experience.


I am not against the cloud. VMs are also cloud, unless you run them on your own servers. For instance, the Hetzner Cloud (mostly VMs, plus load balancers and disks) is so cheap and has such a nice CLI API that it competes aggressively with dedicated servers - I would definitely start any new project with VMs, not with iron.

The biggest problem is the so-called cloud native stuff, which is both more expensive and more complex. There are contexts where it makes sense, but for startups it does more harm than good.


Thing is, by the time the cloud native stuff makes sense, most companies are at a scale where it'd be cheaper to just hire a good devops team and start building your own cloud infra on your own hardware.

Probably so. And that would be likely my approach at such scale.

Still, my most benevolent interpretation of current reality is, rather than saying "that cloud native stuff is crap", accepting that there are cases where it may make sense.

For instance, large companies might have trouble hiring a good ops team because they have in general trouble hiring and retaining talent (another conversation topic).

Ops people are a scarce good because univs do not train people for that and most people prefer coding. I am leaving the word devops out because the market completely perverted its meaning.

(my take on the devops funeral: https://logical.li/blog/devops/ )


Reference:

https://survey.stackoverflow.co/2022/#developer-profile-deve...

Only around 11% of all devs identify as devops specialists or cloud infrastructure engineers.

This is why I am saying ops people are a scarce good (unfortunately) from a data driven perspective. Of course my daily life confirms it.


Most of my money comes from companies unable to handle even simple setups - and having trouble finding the right people, so I somewhat agree too. But it's mainly an education problem - it's pretty much impossible to find good people with that skillset, but it is possible to find people straight out of University willing to learn.

I fully agree with you: it is mostly an education problem and you can find people willing to learn right out of univ. Indeed, that is exactly my experience: I successfully onboarded several (carefully selected) junior people into the ops skillset over the years and I have seen them do wonders with customer systems, while enjoying their "ops life", without having fires every day.

The connection of this to the replies above it: I am not sure if these kinds of junior people would be easy to retain in a large corporate environment. We certainly can do that in niche consulting.


We're a tiny company doing ops as services for large corporations - with one customer now coming close to a decade. That solves the retaining problem as we have limited exposure to all that big corporation nonsense, and have the option for individuals to go on a vacation in other projects without losing their knowledge in the organisation.

I had the exact same business for 18 years :-) and yes, without corporate nonsense it is easy to retain intelligent people. Cheers

And somehow I feel these cloud native services keep breaking. Again, Azure Container Instances found an interesting new way to fail. I have to check on Monday whether it is still rebooting itself more often than usual (dev environment, so I have not tried any fixes)...

While the VMs that run some parts of the system have been rock solid, giving zero issues... Should have just thrown the stuff on one of them or added a third one. Cost would have been the same.


Apart from the operations side, there is a parallel on the development side too.

Two examples that I came across

- "Test" mean if it passes on CI, it is good. Failing to run test on local? Who do development on local anyway?

- Teams so reliant on "AI" because this is the future of coding. "how to sort a list in python" became a prompt, rather than a lookup on the official documentation.


I’ve just recently gotten into ansible and find myself building the same thing. I wrote a script to interact with virsh and build vms locally so I can spin up my infra at home to test and deploy to the cloud if and when I want to spend actual money.

I’m still very much an ansible noob, but if you have a repo with playbooks I’d love to poke around and learn some things! If not, no worries, I appreciate your time reading this comment!


> while monitoring configuration compliance with a custom Naemon plugin.

While I absolutely agree with you and your approach, would you mind elaborating what kind of configuration compliance you are referring to in this statement? I suppose you do not mean any kind of configuration that your Puppet code produces as that configuration is "monitored", or rather managed, by Puppet.


I don't mind elaborating - the fact that people are asking me questions reminds me that I need to invest a bit more effort on some articles.

This case is actually pretty simple.

Puppet applies the configuration you declare idempotently when you run the Puppet agent: whatever is not configured gets configured, whatever is already configured remains the same.

If there is an error the return code of the Puppet agent is different from that of the situations above.

Knowing this, you can choose to trigger the Puppet agent runs remotely from a monitoring system (instead of periodic local runs), collect the exit code, and monitor the status of that exit code inside the monitoring system.

Therefore, instead of having an agent that runs silently leaving you logs to parse, you have a green light / red light system regarding the compliance of a machine with its manifest. If somebody broke the machine, leaving it in an unconfigurable state, or if someone broke its manifest during configuration maintenance, you will soon get a red light and the corresponding notifications.

This is active configuration management rather than what people usually call provisioning.

Of course you need an SSH connection for this execution, and with that you need a hardened SSH config, whitelisting, a dedicated unprivileged user for monitoring, exceptional fine-grained sudo cases, etc. Not rocket science.
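The check itself can stay tiny - a simplified sketch of that kind of plugin (not the exact code), mapping puppet's --detailed-exitcodes to monitoring states:

  #!/bin/sh
  # Hypothetical Naemon/Nagios-style check: trigger a remote agent run over SSH
  # and map --detailed-exitcodes (0 = no changes, 2 = changes applied, 4/6 = failures).
  host="$1"
  ssh -o BatchMode=yes monitor@"$host" \
      'sudo /opt/puppetlabs/bin/puppet agent --test --detailed-exitcodes' >/dev/null 2>&1
  rc=$?
  case "$rc" in
    0|2) echo "OK - $host compliant with its manifests"; exit 0 ;;
    4|6) echo "CRITICAL - $host has failed resources (exit $rc)"; exit 2 ;;
    *)   echo "UNKNOWN - agent run did not complete on $host (exit $rc)"; exit 3 ;;
  esac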


Thank you for your thorough explanation. Interesting to see that you basically use your monitoring system as a scheduler to run Puppet and it sounds beneficial to closely integrate it with your monitoring to have it all in one place.

At my place of work we went the "traditional" way of running Puppet locally. It has been our experience that Puppet failures due to user misconfiguration or some such do not require our immediate attention (e.g. after hours), so we just check Puppetboard a few times per day to identify failing nodes.

Another reason why we use Puppetboard to monitor Puppet nodes is that every alert that our Icinga monitoring system produces is automatically interpreted as an incident which needs immediate attention. We are currently in the process of changing that so we are able to process non-critical alerts in a saner way.

Anyway, interesting to see how a fellow Puppet user manages their setup. Keep it up!


Thank you as well, for sharing these notes about your setup. Indeed concentrating everything in the same monitoring system is very helpful as it reduces the cognitive load. You can likely do the same with Icinga.

Feel free to reach out on Linkedin if you need some more details. More than happy to share.


I can't remember the last time I've seen a position description for a software developer (or anything tech related for that matter) that didn't include a requirement for skills in some cloud related tech.

Sometimes the job descriptions are boastful in their reference to those technologies, and other times you can detect some level of despair.


Now I am curious: how do you detect despair regarding cloud tech in job descriptions?

Your first paragraph resonates strongly with what the folks have done at my startup......lol

My thoughts and prayers :-\ Wish you a quick recovery!

Basically doing this for a small startup - there are some complexities around autoscaling task queues with gpus and whatnot, but the heart of it is on a single VM (nginx, webapp, postgres, redis). We're b2b, so there's very little traffic anyway.

The additional benefit is devs can run all the same stuff on a Linux laptop (or Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like to demo or test stuff using all the same setup. Bootstrapping a new system is checking in their ssh key and running a shell script.

Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.


> Bootstrapping a new system is checking in their ssh key and running a shell script.

If it interests you, both major git hosts (and possibly all of them) have an endpoint to map a username to their already registered ssh keys: https://github.com/mdaniel.keys https://gitlab.com/mdaniel.keys

It's one level of indirection away from "check in a public key" in that the user can rotate their own keys without needing git churn
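So bootstrapping can be as small as (using the endpoint above):

  curl -fsS https://github.com/mdaniel.keys >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys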

Also, and I recognize this is departing quite a bit from what you were describing, ssh key leases are absolutely awesome because it addresses the offboarding scenario much better than having to reconcile evicting those same keys: https://github.com/hashicorp/vault/blob/v1.12.11/website/con... and while digging up that link I also discovered that Vault will allegedly do single-use passwords, too <https://github.com/hashicorp/vault/blob/v1.12.11/website/con...>, but since I am firmly in the "PasswordLogin no" camp, caveat emptor with that one


Yeah, I've used the github ssh key thing before, but never heard of key leases - will take a look. Thx!

I did this type of setup but without even redis. Postgres can do anything.

True, I use it mainly for a few convenience things - holding ephemeral monitoring data, distributed locks, redis streams for some pub/sub stuff, sorted sets can be handy - things I could do in Postgres, but are a bit simpler in Redis.
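Rough examples of those conveniences via redis-cli (key names made up):

  redis-cli SET lock:nightly-job worker-1 NX EX 30    # distributed lock with a TTL
  redis-cli XADD events '*' type signup user 42       # append to a stream
  redis-cli ZADD leaderboard 100 alice 80 bob         # sorted set
  redis-cli ZREVRANGE leaderboard 0 2 WITHSCORES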

I love the simplicity of this approach. In your setup, how do you track config and updates of your VMs?

I like this, but one of the issues with this approach is that if you have no Docker images and rely on a traditional configuration management tool instead, you are in for a world of pain. Docker and Docker images have tons of best practices already defined for plenty of use cases. If it's already containerized, then jumping to any orchestrator that supports OCI images is more about adjusting the business to a new set of operations.

I have a custom deployment system which idempotently configures an Ubuntu LTS VM. All the config templates are checked into source control. I don't configure anything by hand - it's either handled in this thing or via a small user-data script run at provisioning time.

Like everything, it's context dependent, but wowzers my life has improved so much since I got on board the Flatcar or Bottlerocket train of immutable OS. Flatcar (née CoreOS) does ship with docker but is still mostly a general purpose OS but Bottlerocket is about as "cloud native" as it comes, shipping with kubelet and even the host processes run in containers. For my purposes (being a k8s fanboy) that's just perfect since it's one less bootstrapping step I need to take on my own

Both are Apache 2 and the Flatcar folks are excellent to work with

https://github.com/flatcar/Flatcar#readme

https://github.com/bottlerocket-os#bottlerocket


Sure, but again, complexity - stuff people have to learn/maintain/upgrade, etc. ymmv

Running and configuring VMs isn't hard to do correctly, it just takes discipline to never "hack it in the moment" - or if you do, capture that change in your config system.


> it just takes discipline to never "hack it in the moment" - or if you do, capture that change in your config system.

Yup, and I'm glad your experience has been different from mine but mine has been that tired and stressed people are anything but disciplined, so nipping a few "I'll just apt-get ..." in the bud goes a long way. So does Reverse Uptime (or its friend, Chaos Engineering)


As usual, I'm stoked to see I'm not the only one using Flatcar. :)

The answer is "no, it doesn't".

I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.

These bare-metal servers are incredibly powerful compared to virtual machines offered by cloud providers (I actually measured several years back: https://jan.rychter.com/enblog/cloud-server-cpu-performance-...).

All in all, this approach is ridiculously effective: I don't have to deal with complexity of things like Kubernetes, or with cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and on my monthly server bills.

The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.

To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.


The scaling thing is a great boogeyman. It preys on this optimism your software is going to be so successful in such a short amount of time which people want to believe.

The answer is "it depends".

Did you read the article or just the headline?

Scroll down to the bottom, under the section "A few considerations" and try not to laugh.

"A few considerations" turns out to be a pretty significant chunk of security work ESPECIALLY if you are storing/transmitting highly sensitive information.

How do you handle something like HIPAA compliance when you're in this situation?

There are 2 types of programmers: those that think they've seen everything and those that know they've seen next to nothing. And as such, these absolute takes are tiring.


I've written a HIPAA-compliant application that was VPS-hostable. It's been a while, but IIRC, it simply involved a combination of TLS everywhere and encrypting the sensitive fields in the DB. I don't remember if there was any other trick involved, but it wasn't difficult. By far the hardest thing about that project was the complexity of the medical codes-- not HIPAA compliance-- and that is something the cloud wouldn't help with at all.

> , it simply involved a combination of TLS everywhere and encrypting the sensitive fields in the DB.

I'm sorry, are you saying securing patient data is simple? No offense, but you might be the only person on this planet to share this sentiment and there's a reason why.

So, it's simpler to secure sensitive information in a database, secure your hosting, maintain security updates to those hosts, undergo audits, keep up with changing regulations, keep up with the latest threat vulnerabilities, staff a full response team in case something happens, etc?

Not trying to be rude, but it's obviously not simple.

What's crazy about your answer is that we had a whole host of "Bitcoin for your data" hacks that were only made possible by setups like the one you're describing.

>By far the hardest thing about that project was the complexity of the medical codes-

Yes, this is also complex. But a totally different problem in a totally different space.


> secure sensitive information in a database, secure your hosting, maintain security updates to those hosts, undergo audits, keep up with changing regulations, keep up with the latest threat vulnerabilities, staff a full response team in case something happens

To be fair, of the things you've described, if you can swing it, you should be doing most of them regardless for a business setup. Specific to HIPAA would be the auditing and 'changing regulations' (and depending on client needs, you'll likely have other audits for business needs).

I'm going through a gap analysis for HIPAA now; would you mind sharing what impactful changing regulations you've seen in the past 5 years?


> To be fair of the things you've described, if you can swing it, you should be doing most of this regardless for a business setup

Not sure how to respond to this. Are you saying I should go out and hire 2-3 people to set up a ton of infrastructure and maintain it for me, instead of relying on the professionals at Azure (who specialize in this), where it's done automatically at a fraction of the cost? We went through 5 years of "bitcoin for your data" fraud in exactly the situation you're describing.

I don't need to hire anybody as of now. None.

> I'm going through a gap analysis for HIPAA now; would you mind sharing what impactful changing regulations you've seen in the past 5 years?

This is my point. I don't know and don't care. I don't have to worry about it at all. I don't have to worry about updating the handful of apps and servers that connect to all the different integrations we use because this field is siloed into a 1,000,000 little pieces. I don't have to worry about PHI getting leaked out of some server I forgot to update somewhere or misconfigured because I made a mistake while installing it or setting it up the first time. That stuff is all handled through Azure's existing cloud infrastructure. It's literally tailored to healthcare solutions. No single person (or 2 or 3 or even 4) full time people could come close to what they offer at the cost.


I don't think I was communicating my first point effectively; I didn't mean to reference you personally or the approach taken (VPS or cloud). If there is a business that needs HIPAA, then most likely, the business should be doing all of those original points because doing them is better (more effective, better security, etc.) than not doing them. I'm trying to say that extending to HIPAA could potentially be 'simple' if there is a business already doing most of this.

I understand that you're using Azure's existing infrastructure to handle your logistical technical management, but I was here asking if you had to make any changes to keep abreast of changing regulations. There seem to be practical business decisions that need to be made that HIPAA impacts, such as what data constitutes PHI (has that changed? Maybe you had to go back and change what data you were keeping because of the above regulation changes - I don't know if that could be the case, that's why I'm asking; I'm not aware of what I don't know). If Azure is somehow keeping track of all "changing regulations" for you (including business needs) and you've never had to worry about it, that's good to know. I would still be interested in any specific details if you're aware of it.


Sorry, totally misinterpreted that.

> but I was here asking if you had to make any changes to keep abreast of changing regulations.

No, we haven't. Not yet.

> If Azure is somehow keeping track of all "changing regulations" for you (including business needs) and you've never had to worry about it, that's good to know. I would still be interested in any specific details if you're aware of it.

I get your question now. So, when I was referring to Microsoft and HIPAA it was primarily around this side of things: https://learn.microsoft.com/en-us/azure/compliance/offerings...

You do bring up a good point, and I shouldn't have implied that it can handle everything for you. So yes, there is a ton of other stuff that isn't magically handled for you, such as identifying PHI. That being said, they have a whole suite of analytical and machine learning tools that will help you do this.

But since you mentioned policy changes, https://www.cms.gov/priorities/key-initiatives/burden-reduct... this is big and will have wide-reaching consequences, and things like the ability to export patient data aren't necessarily baked into Azure.

BUT, they do have this healthcare platform they're building - stuff like this: https://learn.microsoft.com/en-us/dynamics365/industry/healt... - that I would imagine would provide a bit more coverage on those types of changes than something you're building yourself.

Here's a deidentification service that can be integrated: https://learn.microsoft.com/en-us/azure/healthcare-apis/deid...


Awesome, I really appreciate your time and the references. Thank you!

No problem at all. It's such a fascinating and cool field to build software in.

Someone else above had mentioned the complexity of medical coding and I don't know what you do or what you're working on but that's another really interesting part of the puzzle. And starts to get into why it's so hard for one system to communicate with each other in healthcare.


There was a business person in charge of keeping up with any regulatory changes. The regulations at the time were pretty stable, and I can’t think of a single change order that came from it.

The most important things to consider (IIRC) were ensuring that the data was encrypted at rest and in flight, and that access to the data was audit logged and properly authorized.

We had an audit every so often. None of this was hard. Just tedious. It does help to have a HIPAA expert advise.

I don’t think public cloud vs self hosting makes a massive difference. Of all the problems such a project faces, that is not close to the top one.

Keeping machines patched and up to date is also not terribly hard.

Anyway, I’m not saying you’re totally wrong. Our project may have had more hidden risk than I realize. But it’s my opinion based on that experience.


> I don’t think public cloud vs self hosting makes a massive difference.

Right now, I'm the CTO of a medium-sized healthcare company. We're building our own EMR to replace the one we're currently using ON TOP of building out some line-of-business integrations that can help modernize other parts of our office.

Part of that is grabbing data from an FTP EDT source from an HIE, storing that, processing it and then reporting. Our EMR has a bulk data download that we roll through each night, processing data, building reports, etc. These integrations also tie into existing apps we use like Microsoft Teams, Microsoft Forms, Power BI, etc.

With the EMR we're building, I was able to pull on some help early on, set up all environments in Azure (dev, test, prod), all databases, background services (which we use A TON), blob storage, certificates, etc. I can count on one hand the number of times I've had to touch it since.

Prior to me coming on, all our data was stored on a server we hosted ourselves. It was a simple shared drive that constantly needed to be patched and updated. Went down ALL the time. And became a nightmare to manage on top of the 20 other pieces of technology we needed to use to get by. You know what I did? Copied the entire share to OneDrive and shut down the server and I was done. Never had to think about it again. And it's versioned. That's another benefit of cloud infrastructure.

I'm a single dev at a healthcare company that has dozens of things going on all because I can rely on Azure's cloud infrastructure.

And that's not even counting the additional healthcare services they offer like FHIR servers, deidentifications services, pulling out snomed, med, and diagnoses codes from history and physicals, etc.

I couldn't come close to this if I was tasked to do it myself. And the problem is that healthcare changes constantly. So you need to be able to be nimble and fast. Being able to offload those sort of challenges has been super helpful in that regard.

It's not a silver bullet. My biggest issues NOW are people related. Links in emails are the hands down the biggest attack vector I have to worry about (for better or for worse).

As for the coding complexity: while a totally different animal, it's another huge challenge, as you mentioned. And it's not just "how do I translate this to a billing code", it's being able to make sense of unstructured clinical documentation, being able to report on it and analyze it, and most importantly share it. An encounter with a patient could potentially have to collect upwards of 2000 data points that are changing based on the patient, the diagnoses, or what's happening in the world (Covid for instance). It's an insanely challenging problem which it sounds like you have experience with.


Yeah. The unstructured data is a massive PITA.

I’m not opposed to the cloud. I run my current (non-HIPAA) project on Render, and it is really convenient. But, I also run a number of things on VPSs, and they aren’t difficult at all other than the up-front friction. They have been rock solid for us. I think it’s mostly a function of how simple we keep our setup. The cloud is certainly more convenient when managing a big team with lots of dynamic allocations of resources. But, VPSs (which some consider to be the cloud), and physical servers get more shade than I think they deserve.

You can go really far as a business on a single physical server and with a second backup server. With a bit of care, deployments can be simple and reliable, too.


> How do you handle something like HIPAA compliance when you're in this situation?

I'm a dev who hasn't seen anything related to that. Since you bring it up, can you give some pointers on why something like a MySQL db coupled to a monolithic backend isn't good enough? What shortcomings did you experience?

All of the things raised in the article seem possible to solve without the need for microservices.


> All of the things raised in the article seem possible to solve without the need for microservices.

First, this has nothing to do with microservices. Needing cloud infrastructure and building microservices are 2 orthogonal things.

Second, it has nothing to do with the tech you're using. MySQL is irrelevant. So is a monolithic backend.

What IS important is the security and infrastructure behind the data you're storing. Clinical data (and data captured in EMRs) is easily some of the most sensitive stuff you'll come across (unless you work in govt). The idea that I wouldn't use off-the-shelf, already-tested solutions specifically for this problem with a cloud provider is nuts. I pay Azure peanuts compared to what I'd have to pay a full-time person to manage multiple environments, security updates, provisioning new infra, etc. And that's not even considering the actual process you need to go through to connect to outside systems.

Most integrations want you to have SOC audits and stuff. What happens when there is a breach? Do you have the personnel on staff to understand and troubleshoot the issue? Remember the "we have your data and will release it for bitcoin" hacks? That's only made possible by these systems sitting in closets in someone's facility.

And trust me, this isn't just a "large enterprisey" problem. It's a "everyone who wants to build an app in this space" problem.

So you can use MySQL (if you can host it compliantly) and I'm building what you could theoretically call a "monolithic" backend and it's working well. I use MSSQL on Azure though.


That makes sense, cloud infra does reduce risk in that sense. I assume you're allowed to say "we need to be compliant with X, and our cloud provider is compliant with X, therefore we are compliant with X".

When something bad does happen, is the cloud company liable?


Most of it falls on the shoulders of the providers, not cloud companies. One aspect that's really hard to control is the whole human side of things. Most of my time on the "healthcare security" side of things is spent on employees opening emails with viruses in them and their constitutional incapability of not clicking on links in emails.

I'm a developer who is a CTO for a healthcare company (not like a big corp or anything) and also administer an Office 365 tenant while building out custom apps and an EMR. The office side of things is so much harder to get secure.


There is a core 20% of kubernetes, which is deployments, pods, services and the way it handles blue-green deployments and declarative definitions, namespace separation, etc., that is really good. Just keeping to those simple basics, using a managed cloud kubernetes service, and running your state (database) out of cluster is a good experience (IMO).

It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.


For me this is all Kubernetes is. I feel like people are often talking about two different things in discussions like this. For me it's just a uniform way to deploy stuff that is better than docker compose. We pay pennies for the control plane and workers are just generic VMs with kubelet.

But I think for many "kubernetes" means your second paragraph. It doesn't have to be like that at all! People should try setting up a k3s cluster and just learn about workloads, services and ingresses. That's all you need to replace a bunch of ad hoc VMs and docker stuff.
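A sketch of that core workflow (hostname made up; k3s ships with the Traefik ingress controller, so the ingress just works):

  kubectl create deployment web --image=nginx:stable --replicas=2
  kubectl expose deployment web --port=80
  kubectl create ingress web --rule="app.example.com/*=web:80"
  kubectl rollout status deployment/web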


For a lot of companies and projects I worked on, this is the same conclusion I came to. 99% of what we need / want is docker-compose++. Things like 0-downtime deployment out of the box, simple configuration system for replica set and other replication / distribution mechanism, and that is basically it.

I wish there was something that did just that, because kube comes with a lot of baggage, and docker-compose is a bit too basic for some important production needs.


The author posted almost exactly this.

https://github.com/hadijaveed/docker-compose-anywhere


Why not use docker swarm?

Exactly this. Kubernetes has a million knobs and dials you can tweak for any use case you want, but equally they can be ignored and you can use the core functionality and keep it simple.

I can have something with nice deployments, super easy logs and metrics, and a nice developer experience setup in no time at all.


Yeah, I found out my work was using Kubernetes. Given its reputation - having never used it before - when I asked if I could set up a server for some internal tooling I was braced for the worst.

What I actually got was a half an hour tutorial from the guy who set it up, in which he explained the whole concept (I had no clue) and gave me enough information to deploy a server, which I did with zero problems. I had automatic deployment from `git push` working very quickly.

To me this seemed like a no brainer. Unless you literally have one service this is waaay easier to use.

Granted I didn't have to set it up - maybe that's where the terrible reputation comes from?


Who is going to get a new job without k8s on their resume. :)

Seriously, I think a lot of people do things the hard way to learn large scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or we can dynamically scale up on demand.

These are all valid to the people building this, just not as much to founders or professional CTOs.


Excuse my harshness but people doing it needlessly is just unprofessional waste and abuse.

Some people seem to have no concern with the needs and timetables of the would be customers but instead burn through cash building fancy nonsense.

It's like going in to a car mechanic for tires and then finding out it took 3 weeks because the guy wanted to put on low rider hydraulics and spinner hubcaps for his personal enrichment.

The worst part is it's inherently ambiguous to the next people. They don't know if the reason something is there is because it's needed or because it's just shiny bling.


I am certainly not saying that what you say isn't true. My comment is dark humour. I really like your last point. Years ago I replaced a huge hadoop cluster data processing job with a single app on one machine with a few CPUs, which reduced a job that took over 8 hours to 20 minutes. What is even dumber is, it was just a python script and gnu parallel, which used to be perl.

I've seen people do hadoop clusters for a few hundred MB.

It's so insane. Like hiring a long haul truck to pick up a sandwich


…but if the bosses at competing mechanic shops hire based on quality of low riders a mechanic can install, of course they'll practice on the paying customers.

I quit working about 1.5 years ago. I think I still love computers while I simultaneously hate "the web". Don't get me wrong, to my amazement people have called me the best web developer they've ever met and I routinely get put on web like things at every company I go to - hardware, logistics, finance, I've been trying to run away from it but it keeps finding me and I think I hate it.

I've got this allergic reaction to bullshit and fetishize successful products and customer satisfaction. I think we've both changed; I'm different than I was 20 years ago and so is web development.

Tight applications with minimal tools that can be pivoted and changed swiftly which require competence and finesse to administer where you don't create developer debt, these are out of fashion.

All profitable hacker spaces professionalize as romantic magic becomes a liability.

I'm a middle aged divorced man, not divorced from a person, but from a profession and I've been trying to date around with new loves.


Just take a look at the level of complexity in home lab subreddits!

I don’t quite get if people do it for interest, for love of the tech, or if they are technocratic and believe in levelling up their skill to get k8s on their CV like you say.

All I think is “this looks painful to manage”!


K8s is painful to get started, and painful to learn. But once you have it up you can just keep adding stuff to it.

I run a k8s cluster at home. Part of it yes, is to apply my existing skills and keep them fresh. But part of it is that kubernetes can be easier long term.

I've got magical hard drive storage with rook ceph. I can yoink a hard drive out of my servers and nothing happens to my workloads.

I can do maintenance on one of the servers with 0 down time.

All of my config for what I have deployed is in git.

I manage VMs and Kubernetes at work, and I'm not going to pretend that Kubernetes isn't complex, but it's complex up front instead of down the road. VMs run into complexity when things change. I'm sure you can make VMs good, but then why not use something like Kubernetes - you will have to reinvent a lot of the stuff that's already in Kubernetes.

It's a hammer for sure and not everything is a nail, but it can be really powerful and useful even for home labs.


I don't run k8s at home, but I have worked in k8s-heavy environments and studied it deeply. This is the accurate, nuanced take.

Few but not no people will ever run into problems at the kind of scale k8s operates at. Plus, learning how it "expects" the programs running inside its Pods to behave is kind of like learning how Django or Rails "expect" a web app to work - it's a more complicated style than just writing your own totally custom, hermetically-sealed Python apps for your personal use, sure, but it also comes with a slew of benefits in case you ever do hit that level of scale and want to move over.

Or, maybe you look over the app you're writing and say "Fat chance." In which case you can justify e.g. not making everything an API endpoint, keeping a ton of state mucking about, etc. But I still feel that's an improvement over not even realizing the questions are being asked.


What you also can do is starting with just a single node, incredibly easy to install with e.g. https://k3s.io/. You still have to invest the upfront effort to understand how it works but you can already reap a lot of benefits with a lot less complexity.

Kubernetes does not force you into the distributed systems hell, you can go that route later, or never.
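The single-node install really is the one-liner from the k3s docs:

  curl -sfL https://get.k3s.io | sh -
  sudo k3s kubectl get nodes
  sudo k3s kubectl apply -f my-app.yaml   # hypothetical manifest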


Kubernetes/k3s on a single node turns what could have been immutable 1-step upgrades into multi-step mutable upgrades, since kubernetes's software itself and all the management components you need are a mutable layer on top of the operating system.

a) It doesn't have to be mutable. You can easily setup k3s on a single node, install the apps and bake an AMI or equivalent. And using something like ArgoCD or GitOps will ensure that your k8s stack is in sync with a tracked and managed Git repository.

b) In what world is upgrading your entire platform ever a single step? Even for a basic Python app you still have Python itself plus dependencies. And then of course whatever front-end web server you're using.


You can use Talos linux for an immutable (and tiny) OS.

> K8s is painful to get started

Is that really true anymore? Even self hosting k8s these days (e.g with rke/rke2) is a single yaml file and one command to deploy an entire cluster.. Maybe back when we all used kubespray and networking was more complicated (to the user at least) etc.. But today? I don't think so.

Using a hosted offering is even easier, literally a couple of clicks, a ./gcloud-cli or terraform apply -- again not very hard and all the cloud providers provide you with example code you just need to plug some machine sizes etc into..

Dev setup? Install orbstack and click 'kubernetes' and you're done, your IDE (likely) will automagically pick up your kubeconfig and you can go right ahead creating services, deployments, jobs, whatevers...


I'm not talking about setting up a cluster. I'm talking about all the learning you have to do.

I’m sure there are countless other benefits. But how many layers of abstraction, services and things that need configuring are there, compared to basic RAID, to get support for magical hard disks that can be yoinked without affecting workloads?

> Compared to basic RAID to get support for magical hard disks that can be yoinked without affecting workloads?

These things aren't mutually exclusive though. I've spent the last few years working with kubernetes at work and running a 'simple'(but with tons of containers and weird edge cases / uses) unraid server at home for all of my needs. At some point I flipped over from 'jeez kubernetes is just too much, almost nobody should ever use this' to 'wow I have to migrate 99% of my home services to a cluster, this is driving me nuts.' I haven't quite gotten around to that migration, but I do think that k8s cluster for services / temporary storage / parallel jobs and separate unraid box that runs NFS (and doesn't do much else) is going to be a great setup for a home lab.


You get an aligned infra layer. You get a great opensource ecosystem (k8s, argocd, git / gitops, helm, helm charts, grafana, prometheus etc.)

You get basic loadbalancing, health checks, centralized and nearly out of the box logging and monitoring and tracing.

You get a streamlined build process (create a container image, have an image build, create your helm chart, done)
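e.g. roughly (registry, chart path and tag are made up):

  docker build -t registry.example.com/myapp:1.2.3 .
  docker push registry.example.com/myapp:1.2.3
  helm upgrade --install myapp ./chart --set image.tag=1.2.3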

Your RAID comment is quite far from what makes k8s k8s


Aren't disks so large these days that losing a disk almost means you will lose a second disk during resilvering, unless by "basic raid" you're doing not-basic-raid things such as btrfs raid1c3?

> But once you have it up you can just keep adding stuff to it.

I dunno why, but the k8s in my workplace keeps breaking in painful ways. It also has an endless supply of breaking points that makes life painful for anybody that depends on what runs in it, but aren't detected by the people that manage it.

Honestly, that second part may be specific to us, but I have never seen people "just keep adding things to it" in practice.


It depends on how well you know k8s and what your stack is. Rancher is an extra complex version of k8s. Longhorn is pretty fragile in my experience, so is canal. But Cilium and EKS don't really have the same reliability issues in my experience.

Assembling complex systems is just inherently fun as long as you don't have deadlines or performance metrics to hit.

It's a bit like factorio with the extra dopamine hit of getting to unbox stuff.


K8s is painful to manage. It's a lot less painful than getting paged in the middle of the night because your server is down - And much much less than realizing that you've been down for an entire day and didn't notice. (K8s isn't even a complete solution to these problems! Just one part of a complete ~balanced breakfast~ production stack)

You don't need k8s for all of that, but there's not a simpler solution than k8s that handles as much.

Life is full of pain. Deal with it.


It's because it is complex. And in the long run, things become simpler. The only difficulty is the initial setup and once you are past that, the overall maintenance workload just becomes easier compared to a single VM setup

> And in the long run, things become simpler.

aka, you're front loading the complexity.

You can even think of it as paying insurance premiums upfront. You get to "make a claim" if the requirements do grow into the sort of need that suit such a cluster/complex setup.


But, on the same insurance theme: I am not sure paying 10K a year to insure my 5K car makes a lot of sense because, in the long run, I might write my car off.

> I think a lot of people do things the hard way to learn large scale infrastructure

Having seen some of these half-rolled, first-time-understood k8s deployments, and the multi-year projects to unravel the mess that was created, overflowing with anti-patterns and other incorrect ways of doing things, I think I would prefer a narrower scope of true experienced professionals (or at least some experienced pros that can help guide the ship for their mentees) working on and designing k8s infra.

And for those that don't need it (the vast majority of startups, small businesses, regular-sized businesses, etc), just stick to the easier-to-use paradigms out there.


Nubank, the Brazilian bank unicorn, described their approach as “if this works, it’s because we reached massive scale quickly” (paraphrased) and started with an architecture that would support that from the beginning. They were very happy with their choices and have blogged about them in detail.

This is a case where “things will be much easier when we scale to a massive number of clients” turned out to be true.


Resume driven development is worth learning to recognize.

This is a retreaded and often tiresome debate. I'll still throw my 2c in...

Should you pick a complex framework from day one? Probably not, unless your team has extensive experience with it.

My objection is towards the idea that managing infrastructure with a bespoke process and custom tooling will always be less effort to maintain than established tooling. It's the idea of stubbornly rejecting the "complexity" bogeyman, even when the process you built yourself is far from simple, and takes a lot of your time from your core product anyway.

Everyone loves the simplicity of copying over a binary to a VPS, and restarting a service. But then you want to solve configuration and secret management, have multiple servers for availability/redundancy so then you want gradual deployments, load balancing, rollbacks, etc. You probably also want some staging environment, so need to easily replicate this workflow. Then your team eventually grows and they find that it's impossible to run a prod-like environment locally. And then, and then...

You're forced to solve each new requirement with your own special approach, instead of relying on standard solutions others have figured out for you. It eventually gets to a question of sunk cost: do you want to abandon all this custom tooling you know and understand, in favor of "complexity" you don't? The difficult thing is that the more you invest in it, the harder it will be to migrate away from it.

My suggestion is: start by following practices that will make your later transition to the standard tooling easier. This means deploying with containers from day 1, adopting the 12-factor methodology, etc. And when you do start to struggle with some feature you need, switch to established tooling sooner rather than later. You're likely to find that your fear of the unknown was unwarranted, and you'll spend less time working on infra in the long run.


This is a good articulation of the ambivalence I can feel around this.

One approach that I’ve considered is to start with the standard tooling (k8s + gitops) from day one, but still run it in a single VM. Any thoughts?


There's no correct answer here. Your choice seems reasonable _if_ you already have some previous familiarity with managing k8s. If not, you might want to consider starting with a managed k8s solution from a cloud provider. The bulk of the work will be containerizing your stack, and getting familiar with all the concepts. You don't want to do all that while also keeping k8s running. After that you would be able to relatively easily migrate to a self-hosted cluster if you need to.

If you do want to self-host, k3s could also be an option, like a sibling comment suggested. It's simpler to start with, though it still has a learning curve since it's a lightweight version of k8s. I reckon that you would still want to run at least 3 nodes for redundancy/failover, and maybe a couple more for just DB workloads. But you can certainly start with one to setup your workflow, and then scale out to more nodes as needed.


k3s single node + ArgoCD/Flux is what I would if I had to build infrastructure of a small startup by myself.

Unfortunately it's HN so people are more likely to do everything in bash scripts and say a big "fuck you" to all new hires that would have to learn their custom made mess


This is exactly the setup I’ve been considering. Feels like the best of both worlds: you learn the standard tooling and can easily upgrade to full blown distributed k8s, but you retain the flexibility and low cost aspects of single VM.

Also leaning towards putting it behind a Cloudflare tunnel and having managed Postgres for both k3s and application state.

Counterpoints anyone?


No counterpoints from me.

Have been running k3sup provisioned nodes on Hetzner for services and even a Stackgres managed Postgres cluster on another node (yes, it backs up to the cloud). And it's been great. Incredibly low cost and I do not have to think about running out of compute or memory for everything I need for a tiny startup.


The other aspect of this is it's literally impossible to hire someone from industry already familiar with your home grown SDLC systems. But you can find plenty of "cloud engineers" who do understand these "complex" cloud systems who can deploy and maintain them via terraform. It's a turn-key skill set.

VMs, block & blob storage, DNS, IdP, domain registrar.

These are the only things I have ever been comfortable using in the cloud.

Once you get into FaaS and friends, things get really weird for me. I can't handle not having visibility into the machine running my production environment. Debugging through cloud dashboards is a shit experience. I think Microsoft's approach is closest to actually "working", but it's still really awful and I'd never touch it again.

The ideal architecture for me after 10 years is still a single VM with monolithic codebase talking to local instances of SQLite. The advent of NVMe storage has really put a kick into this one too. Backups handled by snapshotting the block storage device. Transactional durability handled by replicating WAL, if need be.
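A minimal sketch of that SQLite side (file names made up):

  sqlite3 app.db 'PRAGMA journal_mode=WAL;'              # enable WAL mode
  sqlite3 app.db ".backup '/backups/app-$(date +%F).db'" # consistent online copy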

Dumbass simple. Lets me focus on the business and customer. Because they sure as hell don't care about any of this and wouldn't pay any money for it. All this code & infra is pure downside. You want as little of it as possible.


> VMs, block & blob storage, DNS, IdP, domain registrar.

This is the most expensive way to build cloud services. When people talk about the cloud being more expensive than on-prem this is often the reason why. If you're just going to run VMs 24/7 there are better options.


Even the book on Microservices says “First build the Monolith”. You don’t know how to split your system until you have actually got some traction with users, and it’s easier to split a monolith than to reorganize services.

You may never need to split your monolith! Stripe eventually broke some stuff out of their Rails monolith but it gets you surprisingly far.

You are not going to get easier to debug than a Django/Rails/etc monolith.

A bit of foresight on where you want to go with your infra can help you though; I built the first versions of our company as a Django Docker container running on a single VM. Deploy was a manual “docker pull; docker stop; docker start”. This setup got us surprisingly far. Docker is nice here as a way of sidestepping dependency packaging issues, which can be annoying in the early stages (e.g. does my server have the right C header files installed for that new db driver I installed? Setup will be different than on your Mac!)
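Something like this, as a hypothetical one-file version of that manual flow (image name made up):

  #!/bin/sh
  set -eu
  IMAGE=registry.example.com/app:latest    # assumed image name
  docker pull "$IMAGE"
  docker stop app 2>/dev/null || true
  docker rm app 2>/dev/null || true
  docker run -d --name app --restart unless-stopped -p 127.0.0.1:8000:8000 "$IMAGE"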

We eventually moved to k8s after our seed extension in response to a business need for reliability and scalability; k8s served us well all the way through series B . So the setup to have everything Dockerized made that really easy too - but we aggressively minimized complexity in the early stages.


Yes! Also, use the damn framework, instead of rebuilding shitty versions of features it offers! One good seasoned person will outperform 10 non-seasoned people in this regard. It will add up over time. I think half the real reason people are soured to monoliths is because they are bad, poorly run monoliths.

> Even the book on Microservices says “First build the Monolith”.

And yet, funnily enough, the book on Monoliths says to break things up into smaller services! It says your data should be stored in its own service (possibly multiple services, if you need multi-paradigm access [e.g. relational, full-text search, etc.]). The user experience should use its own service. And, at the very least, you should have another service in between (this is where Django and Rails usually fit). Optionally, it says, you will probably want to have additional services as well (auth, financial transactions, etc.)


I've run a project for about SIX years on a single $10/month VPS (I pay even less due to a perpetual discount I bagged from lowendtalk) run by a gameserver-focused VPS provider, with about 99.999% reliability if you exclude the one time I fucked up a config and it was down for a whole day because I wanted to do a clean OS reinstall, and one other time when they changed my IP address (they gave me notice).

VPS technology has come a very long way and is highly reliable. The disks on the node are set up in RAID 1 and the VM itself can be easily live migrated to another machine for node maintenance. You can take snapshots etc.

To me, I would only turn to cloud infra not for greater reliability but more for collaboration and the operational housekeeping features like IAM, secrets management, infra-as-code etc, or for datacenter compliance reasons like HIPAA.


Which provider? Sounds great!

It depends. I personally love cloud based solutions because they save me lots of time. But I'm highly selective in what I use and there are some solutions that are clearly counter productive because they are too complicated.

I run a small, bootstrapped startup. We don't have enough money to pay ourselves and I make a living doing consulting on the side. Being budget and time constrained like that I have to be highly selective in what I use.

So, I love things like Google cloud. Our GCP bills are very modest. A few hundred euros per month. I would move to a cheaper provider except I can't really justify the time investment. And I do like Google's UI and tools relative to AWS, which I've used in the past.

I have no use for Kubernetes. Running an empty cluster would be more expensive than our current monthly GCP bills. And since I avoided falling into the micro-services pitfall, I have no need for it either. But I do love Docker. That makes deploying software stupidly easy. Our website is a Google storage bucket that is served via our load balancer and the Google CDN. The same load balancer routes rest calls to two vms that run our monolith. Which talk to a managed DB and managed Elasticsearch and a managed Redis. The DB and Elasticsearch are expensive. But having those managed saves a lot of time and hassle. That just about sums up everything we have. Nice and simple. And not that expensive.

I could move the whole thing to something like Hetzner and cut our bills by 50% or so. Worth doing maybe but not super urgent for me. Losing those managed services would make my life harder. I might have to go back to AWS at some point because some of our customers seem to prefer that. So, there is that as well.


But it's so embarrassing if your startup is running on shared hosting, FCGI, Go programs, and MySQL, costing about $10 per month.

You immediately see there's no load ;)

That's not a joke. Go is a fast compiled language, and Go programs are self-contained executables. So you don't need containers. FCGI is an orchestration system, like Kubernetes. It's single-machine, but will start up and shut down processes as the load changes. A crashed process will be restarted. Host the web pages on a static page server, and use client-side Javascript for any dynamic stuff. Good for maybe 20-100 transactions per second. The database will be the bottleneck.

Boring, but useful.


> 20-100 transactions per second

In all seriousness, that is "no load". I know it fits 99% of all startups, and many larger companies too, but that's kind of the point.

I wouldn't do it differently though, I think it's a perfectly fine architecture :)


  > > 20-100 transactions per second
  > "no load"
Ruby on Rails applications with even a modest amount of ActiveRecord work would like a word xD

Couple thousand per second is expected on my Go services (per node) before any optimizations.

If so, you may want to rent more than one server and set up multiple web servers with a centralized database. Like people did in the 90s!

But that will cost more than $10/month.


That's the peak load of a huge supermarket inventory system. Or rather, the lower end of that range is.

Depends on the service. For b2b that is already a lot.

When you know how much can be done on a $10 VPS, you realise how much of the compute in a Kubernetes cluster is only used to support the cluster itself.

Don't worry, I host serious stuff on a single machine, and am quite happy with it ;) What set me off a bit was the shared hosting. You don't want noisy neighbors, usually. That's worth a few bucks.

Agreed. VPS providers often blind users with super low prices; I didn't even notice this until I started hosting game servers, where realtime performance is important. Always make sure that the "%steal" column of "iostat -c 1" is zero. Luckily there are providers that give guaranteed performance.
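A quick way to eyeball that from a shell, as a rough sketch (assumes the sysstat version of iostat, where %steal is the fifth CPU column):

  # sample CPU stats and print the steal column
  iostat -c 1 2 | awk '/^avg-cpu/ {getline; print "steal:", $5}'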

Honestly, you'd be surprised just how much load a single server Go application can handle.

I've not seen it with Go because I haven't worked with Go in a production capacity; but I've seen C# handle thousands of RPS per node.


Production Go experience here. My go-to estimate is 1-5k HTTP RPS per node with a couple of DB calls, maybe a network call to an internal service or three, and JSON serialization. I use that for server count and cost estimates before building. Some services exceed that; I never saw a server we made that couldn't do 1k RPS.

Friends don't let friends use ruby|python|perl|php|...


I'm more embarrassed about our organization not running on something like that.

Are there shared hosting providers now that support FastCGI generically, that is, not just for PHP?

Dreamhost does.

hahahahahah that was funny :-)

Yeah, the MySQL part of that is kind of faux pas these days.

Thankfully. ;)


I agree that we are overthinking infrastructure. A boring stack (traditional RDBMS, single server with regular backups, a few bash scripts for deployment) is fine for a normal startup that targets non-tech customers. It will serve you well for at least one or two years, and by then you will know what should be improved. One of the big surprises is that a database like PostgreSQL can handle 100 tps very well on cheap hardware; that is over 8 million transactions per day.
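If you want to sanity-check that kind of throughput on your own hardware, pgbench (which ships with PostgreSQL) gives you a rough number in a minute or two; the database name below is a placeholder:

  createdb pgbench_test
  pgbench -i pgbench_test                 # initialize the benchmark tables
  pgbench -c 10 -j 2 -T 60 pgbench_test   # 10 clients, 2 threads, 60 seconds; prints tps at the end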

If you take the time to understand k8s and have a straightforward k8s deployment, these things aren't really a problem, and you don't have to sink time into the custom sysadmin work that the "simple" suggestion requires. What is suggested here is "easy". But it is not simple: it proliferates custom work.

I have had great success with a very simple kube deployment:

- GKE (EKS works well but requires adding an autoscaler tool)

- Grafana + Loki + Prometheus for logs + metrics

- cert-manager for SSL

- nginx-ingress for routing

- external-dns for autosetup DNS

I manage these with helm. I might, one day, get around to using the Prometheus Operator thing, but it doesn't seem to do anything for me except add a layer of hassle.
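For reference, a rough sketch of what a couple of those installs look like with helm (the chart repo URLs and flags are the commonly documented ones, but treat them as assumptions and check the upstream docs, since they change between versions):

  helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
  helm repo add jetstack https://charts.jetstack.io
  helm repo update

  # routing
  helm install ingress-nginx ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace

  # SSL certificates
  helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set installCRDs=true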

New deployments of my software roll out nicely. If I need to scale, or cut a branch for testing, I roll into a new namespace easily, with TLS autosetup, DNS autosetup, logging to a GCP bucket... no problem.

I've done the "roll out an easy node and run" thing before, and I regret it, badly, because the back half of the project was wrangling all these stupid little operational things that are a helm install away on k8s.

So if you're doing a startup: roll out a nice simple k8s deployment, don't muck it up with controllers, operators, service meshes, auto cicds, gitops, etc. *KISS*.

If you're trying to spin a number of small products: just use the same cluster with different DNS.

(note: if this seems particularly appealing to you, reach out, I'm happy to talk. This is a very straightforward toolset that has lasted me years and years, and I don't anticipate having to change it much for a while)


> I manage these with helm. I might, one day, get around to using the Prometheus Operator thing, but it doesn't seem to do anything for me except add a layer of hassle.

One big advantage of the operator is that its custom resources are practically a de facto standard by now. This means helm charts for a lot of software ship them, and integrating that piece of software into your monitoring is a matter of setting a few flags to true. The go-to solution for a k8s monitoring setup is https://github.com/prometheus-community/helm-charts/tree/mai...
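In practice, many charts expose a toggle that looks roughly like this in their values.yaml (the exact key names vary by chart, so this is just the general shape):

  metrics:
    serviceMonitor:
      enabled: true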


yeah, I know, that's the only reason I'm even thinking of using it. but tbqh I don't really install many things, as you can see...

I just hosted a site on Elastic Beanstalk. Didn't really need to do anything, honestly. Upload a zip file with Python code that runs well locally. The database is on RDS. It has worked well, and continues to, for 5+ years, with lots of productivity.

Fwiw I run more than 'a' site. EBS is great for 'a' site. Last I checked, it had serious cost consequences past the one site.

But yeah, if I only wanted a thing, Ebs works.


LOL we have 2 full time people managing the production monitoring stack. And it costs money. And it generates a lot of internal traffic. Nope!

rsyslog + knowing what the fuck you are doing is much better.


Curious, does rsyslog support metrics or traces? My impression has always been it's log lines.

What product needs autoscaling?

I think this goes for any technology group with any stage of company. I work in networking and genuinely of the product I sell, my customers only need a small amount of core functionality and default settings - the rest is “bells and whistles”.

But still, no matter what, the odd customer demands they need all these complexities turned on for no discernible reason.

IMO it’s a far better approach with any platform to deploy the minimum and turn things on if you need to as you develop.

Incidentally, I’ve been exposed to “traditional” cloud platforms (Azure, GCP, AWS) through work and tried a few times to use them for personal projects in recent years and get bewildered by the number of toggles in the interface and strange (to me) paradigms. I recently tried Cloudflare Workers as a test of an idea and was surprised how simple it was.


> ... and Docker Swarm was deprecated..

I thought the same thing until recently. Apparently there's a "Docker Swarm version 2" around, and it was the original (version 1) Docker Swarm that was deprecated:

https://docs.docker.com/engine/swarm/

  Do not confuse Docker Swarm mode with Docker Classic Swarm which is no
  longer actively developed.
Haven't personally tried out the version 2 Docker Swarm yet, but it might be worth a look. :)

Yes, swarm is not deprecated. I haven't used it myself yet, but I read elsewhere that swarm offers an easy way to manage secrets with containers. Some people run their 1 container in a swarm cluster with 1 node just for this feature. I see it's even officially suggested as a Note in the doc:

> Docker secrets are only available to swarm services, not to standalone containers. To use this feature, *consider adapting your container to run as a service. Stateful containers can typically run with a scale of 1 without changing the container code.*

(Emphasis mine. From https://docs.docker.com/engine/swarm/secrets/ )
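A minimal sketch of that single-node pattern (the image name and secret value are placeholders):

  docker swarm init
  printf 'supersecret' | docker secret create db_password -
  # the secret is mounted inside the container at /run/secrets/db_password
  docker service create --name app --replicas 1 --secret db_password myimage:latest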


I use Swarm with Portainer, it’s quite a nice experience!

After reading all the comments here, the conclusion is to start simple, then switch to k8s and later to cloud-native only when your business has grown to 1000 and then 1 million daily customers respectively.

We have around 700 B2B customers. It all runs on a single server (not a VM, though).

Since it's B2B we don't need zero downtime, updates at midnight are all right.

A day before rollout they go through the staging server and the test environment, so no surprises the next morning.

Before updates, the backups kick in, so if we need to recover from a bad update we can roll back.

Sounds very 2000s and not very fancy, but boring and profitable cuts it for us.


The question is if you have so much buffer that it doesn't matter or if you could do a lot more but you just don't know.

My CI/CD does a system test because everything is in containers. I can do full e2e tests and automatic rollouts without downtime.

Whatever I can do, everyone else can do when I'm on holiday.

How fast are you back if your server burns down tomorrow? How often have you tested that?

Are your devs waiting regularly on things?


> The question is if you have so much buffer that it doesn't matter or if you could do a lot more but you just don't know.

Yes, we collect server metrics - that's pretty old-school

> How fast are you back if your server burns down tomorrow? How often have you tested that?

25 minutes. We test it once a year and we have third parties check it. It's called an audit. They also check other cybersecurity-related stuff.

> My ci/cd is doing a system test because everything is in containers. I can do full e2e tests and automatic rollouts without a downtime.

We have a staging system for this.

> What i can do, can everyone else do when i'm on holiday.

We also have documentation; is this really a big thing?

> Are your devs waiting regularly on things?

Code Reviews, these take time

---

Are these real problems organizations have?


If someone thinks rolling their own infrastructure is "starting simple", then I have some land in Antarctica my great-great-uncle is trying to get rid of that they might be interested in.

> rolling their own infrastructure

Huh. I never said to roll one's own [hardware] infrastructure, although even that makes sense if you have a GPU cluster.


Points to be noted.

1. It took the end of the ZIRP era for people to realize the undue complexity of many fancy tools/frameworks. The shitshow would have continued unabated as long as cheap money was in circulation.

2. Most seasoned engineers know for a fact that any abstractions around the basic blocks like compute, storage, memory and network come with their own leaky parts. That knowledge and wisdom helps them make the suitable trade-offs. Those who don't grok them shoot themselves in the foot.

Anecdote on this. A small-sized startup doing B2B SaaS was initially running all their workloads on cheap VPSs, incurring a monthly bill of around $8K. The team of 4 engineers that managed the infrastructure cost about $10K per month. Total cost: $18K. They made a move to the 'cloud native' scene to minimize costs. While the infra costs did come down to about $6K per month, the team needed a new bunch of experts who added about another $5K to the team cost, making the total monthly cost $21K ($6K + $10K + $5K). That, plus a dent in developer velocity and release velocity, along with long windows of uncertainty while debugging complex issues. The original team quit after incurring extreme fatigue, and the team cost alone has now gone up to about $18K per month. All in all, a net loss plus undue burden.

Engineers must be tuned towards understanding the total cost of ownership over a longer period of time in relation to the real dollar value achieved. Unfortunately, that's not a quality quite commonly seen among tech-savvy engineers.

Being tech-savvy is good. Being value-savvy is way better.


Thanks for sharing the story. Despite the whole TCO being higher, I wonder how the 8K to 6K reduction happened.

On AWS, Fargate containers are way more expensive than VMs, and non-Fargate containers are kind of pointless as you have to pay for the VMs where they run anyway. Also, auto-scaling the containers without making a mess is not trivial. Thus, I'm curious. Perhaps it's Lambda? That's a different can of worms.

I'm honestly curious.


After listening to @levelsio on Lex Fridman's podcast, I became obsessed with simplifying my deployments:

Do startups really need complex cloud architecture?

Inspired, I wrote a blog exploring simpler approaches and created a docker-compose template for deployment

Curious to know your thoughts on how you manage your infrastructure. How do you simplify it? How do you balance?


Funny I drew the same conclusion. Previously a cloud architect at Microsoft, now I don't use Azure anymore for the project I am working on right now.

Rather, I have decided to opt for Supabase instead. Over the long term it may cause issues for my startup, but even more realistically my startup is going to fail, and the increased developer velocity from simple tooling like this will let me figure out why my idea doesn't work in a shorter amount of time, so I can go on to my next pursuit.

To be honest I think even using docker is overengineering.


> Curious to know your thoughts on how you manage your infrastructure.

What I quite like about your repo:

  - there is a separate API and background job instance
  - there is a separate web image, to not always couple front end deployments to back end
  - there are specialized data stores like Redis (or maybe RabbitMQ or MinIO in a different type of project)
  - Dozzle seems nice https://dozzle.dev/ (I use Portainer mostly, but seems useful)
What I think works quite nicely in general:

  - starting out with a monolithic back end but making it modular with feature flags (e.g. FEATURE_REPORTS, FEATURE_EMAILS, FEATURE_API), so that you can deploy vastly different types of workloads in separate containers BUT not duplicate your data model and don't need to extract shared code libraries (yet) and if you ever need to split the codebase into multiple separate ones, then it won't be *too* hard to do that
  - having a clear API (RESTful or otherwise) as the contract between a separate back end and front end deployment, so that even if your SPA technology gets deprecated (AngularJS, anyone?) then you can migrate to something, unlike when doing SSR and everything being coupled
  - the same applies to NOT having the same container build process have both the front end and back end build (I've seen a Java project install a specific Node version through Maven and then the build dragging on cause Maven ends up processing thousands of files as a part of the build)
  - using the right tool for the job: many might create full text search, key-value storage, message queues, JSON document storage, even blob storage all with PostgreSQL and that might be okay; others will go for separate instances of ElasticSearch, Redis, RabbitMQ, something S3 compatible and so on, probably a tradeoff between using well known libraries and tools vs building everything yourself against a single DB instance
  - in my experience, many projects out there are served perfectly fine by a single server so Docker Compose feels like the logical tool to start out with, if multiple instances indeed become necessary, there is always Docker Swarm (yes, still works, very simple), Hashicorp Nomad or K3s or one of the other more manageable Kubernetes distros
  - self-hosted (or self-hostable) software in general is pretty cool and gives you a bunch of freedom, though using managed cloud services will also be pleasant for many, more expensive upfront but less so in regards to your own time spent managing the stack; the former also lends itself nicely to being able to launch a local dev environment with the full stack, which feels like a superpower (being able to really test out breaking migrations, look at what happens with the whole stack etc.)
  - having some APM and tracing is nice, something like Apache Skywalking was pretty simple to setup, though there are more advanced options out there (e.g. cloud version of Sentry, because good luck running that locally)
  - having some uptime monitoring is also very nice, something like Uptime Kuma is just very pleasant to use
  - heck, if you really wanted to, you could even self-host a mail server: https://github.com/docker-mailserver/docker-mailserver (though that can be viewed as a hobbyist thing), or have MailCatcher / Inbucket or something for development locally

I'm a big fan of the modular monolith pattern, but haven't used feature flags for the purpose you're describing. Do you use any specific tools or frameworks for that? I'd also imagine there would be calls between features from within the same codebase, do those become network calls? And how does this interact with your Docker Compose/single server recommendation?

> Do you use any specific tools or frameworks for that?

You don't need to, you can just enable/disable certain features during app startup, based on what's in the environment variables/configuration, though many frameworks have built in functionality for something like that, for example: https://www.baeldung.com/spring-conditional-annotations

If I wanted to allow toggling access to the API, then I'd have an environment variable like FEATURE_API and during startup would check for it and, if not set with a value of "true", then just not call the code that initializes the corresponding functionality.

It's really nice when frameworks/libraries make this obvious, like https://www.dropwizard.io/en/stable/getting-started.html#reg... but it might get harder with some of the "convention over configuration" based ones, where you have to fight against the defaults.
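A minimal sketch of the idea in Go; FEATURE_API / FEATURE_REPORTS and the two helper functions are made-up names for illustration:

  package main

  import (
      "net/http"
      "os"
  )

  func enabled(name string) bool {
      return os.Getenv(name) == "true"
  }

  // hypothetical helper that wires up the API endpoints
  func registerAPIRoutes(mux *http.ServeMux) {
      mux.HandleFunc("/api/ping", func(w http.ResponseWriter, r *http.Request) {
          w.Write([]byte("pong"))
      })
  }

  // hypothetical background worker loop
  func runReportWorker() {
      select {} // placeholder: would poll and generate reports
  }

  func main() {
      mux := http.NewServeMux()
      if enabled("FEATURE_API") {
          registerAPIRoutes(mux)
      }
      if enabled("FEATURE_REPORTS") {
          go runReportWorker()
      }
      http.ListenAndServe(":8080", mux)
  }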

> I'd also imagine there would be calls between features from within the same codebase, do those become network calls?

It depends on how you architect things!

There's nothing preventing you from using the service layer pattern for grouping logic, and accessing multiple services in each of your features as needed, and poking the different bits of your data model (assuming it's all the same DB).

If you are at the point where you need more than the same shared instance of a DB, then you'd probably need a message queue of some sort in the middle, RabbitMQ is really nice in that regard. Though at that point you're probably leaning more in the direction of things like eventual consistency and giving up using foreign keys as well.

> And how does this interact with your Docker Compose/single server recommendation?

Pretty nicely, in my experience!

When developing things locally, you can enable all of the needed FEATURE_* flags on your laptop, and then it's more like a true monolith.

Want to deploy it all on a single server when the scale is not too big? Do the same with Docker Compose, or maybe have separate containers on the same node, each with one of the features on, so the logs are more clean and the resource usage per feature is more obvious, and the impact of one feature misbehaving is more limited.
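A rough docker-compose sketch of that "same image, different FEATURE_* flags" layout (the image name and flag names are placeholders):

  services:
    api:
      image: myorg/monolith:latest
      environment:
        FEATURE_API: "true"
      ports:
        - "8080:8080"
    reports:
      image: myorg/monolith:latest
      environment:
        FEATURE_REPORTS: "true"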

The scale is getting bigger? Docker Swarm will let you scale out horizontally (or Nomad/K8s, maybe with K3s) and you can just move some of those containers to separate nodes, or have multiple ones running in parallel, assuming the workload is parallelizable (serving user API requests, vs some centralized sequential process).

At some point you'll also need to consider splitting things further in your database layer, but that's most likely way down the road, like: https://about.gitlab.com/blog/2022/06/02/splitting-database-...


I quit my last job because of these kinds of shenanigans.

I was brought in to help get a full system rewrite across the finish line. Of course the deployment story was pretty great! Lots of automated scripts to get systems running nicely, autoscaling, even a nice CI builder. The works.

After joining, I found out all of this was to the detriment of so much. Nobody was running the full frontend/backend on their machine. There was a team of 5 people but something like 10-15 services. CI was just busted when I joined, and people were constantly merging in things that broke the few tests that were present.

The killer was that because of this sort of division of labor, there'd be constant buck-passing because somebody wasn't "the person" who worked on the other service. But in an alternate universe all of that would be in the same repo. Instead, everything ended up coordinated across three engineers.

A shame, because the operational story that let me really easily swap in a pod for my own machine in the test environment was cool! But the brittleness of the overall system was too much for me. Small teams really shouldn't have fiefdoms.


> There was a team of 5 people but something like 10-15 services

Puff! Talk about microservices! Or is it macropeople?! :-)


If you'll allow me, I'd like to shill my company for a second. We provide all the benefits of "single server deployment" while providing the scalability of the "30 lambdas" solution.

You can even run the whole thing locally.

We actually just did a Show HN about it:

https://news.ycombinator.com/item?id=41502094


Simple is robust.

Focus on product market fit (PMF) and keep things as straightforward as possible.

Create a monolith, duplicate code, use a single RDBMS, adopt proven tech instead of the “hot new framework”, etc.

The more simple the code, the easier it is to migrate/scale later on.

Unnecessary complexity is the epitome of solving a problem that doesn’t exist.


Can you expand on what kind of code duplication you deem reasonable?

Early in a project you see a lot of similar code paths, and so it’s often tempting to take the logic from two or three e.g. API routes and merge the “clean” abstraction into single piece of logic both routes can call.

Over time this "clean" abstraction adopts a bunch of optional parameters based on the upstream API routes, leaving you with an omni-function that is more convoluted, and thus harder to change, than if the API routes hadn't been overly optimized from the get-go.

As a personal rule, I’ll let myself copy something 3 times before taking a step back and figuring out a “better” way.


A very reasonable approach indeed

More of this.

Yeah, I would focus on a better user experience over a beautiful backend architecture.

... and this!

The backend doing the rendering for the 550 e-ink calendars that I have sold so far runs on a small, 10-Euro-a-month Hetzner server.

Low operational costs are essential for a hardware business if you don't want to burden your customers with an ongoing subscription fee. Otherwise the business turns into some kind of pyramid scheme where you have to sell more and more units in order to keep serving your existing customers.

I have a moral obligation towards my customers to keep running even if the sales stop at some point.

So I always multiply the cost of anything by 10 years, and then decide if I am willing to bear it. If not, then I find another solution.


It's funny because OP's solution was his docker-compose-anywhere, which is exactly what, in my experience, I've seen so many start-ups running with. Sure, it works while you're running an MVP, but it's incredibly brittle for running something in production as soon as the application grows in complexity. IMO the primary draw of k8s isn't necessarily "infinite scalability" but its resilience.

I sometimes wonder how many of these posts boil down to "I don't want to learn k8s, can I just use this thing I already know?".


In my experience, having done it both ways, first on VMs, then on lots of fully or mostly managed services, I generally prefer the latter because systems tend to be a lot more "self-healing" - because they're someone else's responsibility. This has had a dramatic effect on improving my sanity and sleeping well at night. I only wish I could migrate to an even more fully managed stack that's more reliable and still less work. The cases where I haven't been able to are either too expensive or would be too difficult to migrate.

> 20-30 Lambda functions for different services

My team of 6 engineers has a social app at around 1,000 DAU. The previous stack has several machines serving APIs and several machines handling different background tasks. Our tech lead is forcing everyone to move to separate Lambdas using CDK to handle each of these tasks. The debugging, deployment, and architecting of shared stacks for Lambdas is taking a toll on me -- all in the name of separation of concerns. How should I push back on this (or should I)?


Does the tech lead have the CTO or CEO's graces for that decision?

Why did the tech lead decide to move everything to lambda when you only have 1k DAU? Can they be reasoned with or is it lambda or the highway?

You can pull out the stats and do a comparison: note the wasted time and how it's not beneficial but rather detrimental. Note how long it now takes to debug such a small codebase, then extrapolate that out.

Having tons of Lambdas is a massive pain in terms of debugging. CloudWatch is not that great for debugging, and the better debug tooling, like Datadog, tends to be rather expensive, so not too much gets invested in it. Or it's too resource-intensive to set up OpenTelemetry.


Yes, use boring technology, I'm all for that.

But an application built in the high pressure environment of a startup also has the risk of becoming unmanageable, one or two years in. And to the extent you already have familiar tools to manage this complexity, I vote for using them. If you can divide and conquer your application complexity into a few different services, and you are already experienced in an appropriate application framework, that may not be such a bad choice. It helps focus on just one part of the application, and have multiple people work on the separate parts without stepping on each other.

I personally don't think that should include k8s. But ECS/Fargate with a simple build pipeline, all for that. "Complex" is the operative word in the article's title.


But it's never just ECS/Fargate, is it? It's ECR, S3, ALB, CF, etc.

And at that point you've assembled a stack just as complex as doing it all inside a single k8s cluster.


Except it's not anywhere near as complex because you need to manage far far less using the AWS services than if you ran all of your own inside a k8s cluster. And even if you use k8s, you're probably already using most of those anyway. Who bothers building their own container hosting and file hosting at a startup?

Hence I said "I personally" and "already have familiar tools".

Also, if you're fair... not all those AWS acronyms you're listing would be displaced by the single k8s cluster. (Maybe you weren't arguing to swap out complexity, rather that the complexity floodgates were open already anyway?)


You can absolutely run object store, container serving, front end load balancing etc from a single k8s cluster.

Very common in fact since many k8s clusters are air-gapped except for a single inbound edge node.


And if one of those services is down, your entire application is down. You basically build a server made of abstract components (ECR, S3, ALB, CF), all of which can fail.

Yah this all sounds good until you realize you have to actually maintain those servers, apply security patches and inevitably run into configuration drift.

Like all things, there's a good middle ground here: use managed services where you can, but don't over-architect features like availability and scaling. For example, Kubernetes is a heavy abstraction; make sure it's worth it. A lot of these solutions also slow down dev cycles, which is not great early on.


Probably not - but by calling out EC2 instances as the way and then failing to mention patching or configuration management, this article loses some credibility for me. These considerations are not optional over any significant length of time, and will cause misery if not planned for.

Bare minimum, script out the install of your product on a fresh EC2 instance from a stock (and up-to-date) base image, and use that for every new deploy.
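One low-effort way to do that is to put the install into the instance user data so every fresh instance builds itself the same way; a rough cloud-init sketch, where the artifact URL and install script are placeholders:

  #cloud-config
  package_update: true
  runcmd:
    # fetch and unpack the application build artifact
    - curl -fsSL https://example.com/releases/myapp-latest.tar.gz | tar -xz -C /opt
    # assumed to install a systemd unit and start the service
    - /opt/myapp/install.sh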


I strongly agree this is the way.

We run Spacelift workers with Auto Scaling Groups and pick up their new image ~monthly with zero hassle since everything is automated.

Raw EC2 is just part of the story...

Edit to add: I also recommend using Amazon Linux unless you _have_ to have RHEL / Cent / Rocky or Ubuntu. Just lean into the ecosystem and you can get so many great features (and yes, I ACK the vendor lock-in with this advice). A really cool feature is the ability to just flip on various AWS services like Systems Manager Session Manager and get SSH-style access without opening ports, a la WireGuard.


For patch management, particularly with EC2s, we use AWS Systems Manager Patch Manager... fairly straightforward to set up once you configure a base image.

obviously, it's not cloud-native... but if you are using AWS EC2 it works


> 20-30 Lambda functions for different services

Yes. This is the basis of privilege separation and differential rollouts. If you collapse all this down into a single server or even lambda you lose that. Once your service sees load you will want this badly.

> SQS and various background jobs backed by Lambda

Yes. This is the basis of serverless. The failure of one server is no longer a material concern to your operation. Well done you.

> Logs scattered across CloudWatch

Okay. I can't lie. CloudWatch is dogturds. There is no part of the service that is redeemable. I created a DynamoDB table and a library which collects log lines into "task records" and puts them into the table, partitioned by lambda name and sorted by record time. Each lambda can configure the logging environment or use the default, which includes a log entry expiration time. Then I created a command line utility which can query and/or "tail" this table.

This work took me 3 days. It's paid off 1000-fold since I did it. You do sometimes have to roll your own out here. CloudWatch is strictly about logging cold start times now.
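For illustration only (this is not the parent's actual library), a table shaped like the one described might be created roughly like this; the table name, key names, and TTL attribute are assumptions:

  aws dynamodb create-table \
    --table-name lambda-task-logs \
    --attribute-definitions AttributeName=lambda_name,AttributeType=S AttributeName=record_time,AttributeType=N \
    --key-schema AttributeName=lambda_name,KeyType=HASH AttributeName=record_time,KeyType=RANGE \
    --billing-mode PAY_PER_REQUEST

  # expire old log records automatically via a TTL attribute
  aws dynamodb update-time-to-live \
    --table-name lambda-task-logs \
    --time-to-live-specification "Enabled=true, AttributeName=expires_at"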

> Could this have been simplified to a single NodeJS container or Python Flask/FastAPI app with Redis for background tasks? Absolutely.

Could this have been simplified into something far more fragile than what is described? Absolutely. Why you'd want this is entirely beyond me.


> Once your service sees load you will want this badly.
> [...]
> The failure of one server is no longer a material concern to your operation.

Elsewhere in thread you say:

> The event volume is not particularly large as we tend to process things in batch and rarely on the edge of an event.

So the service is not actually under load, and it runs in batches so (temporary) failure is not actually a concern.

> This work took me 3 days. It's paid off 1000x fold since I did it.

Since Lambda was introduced less than 10 years ago, what you're saying here is that it'd have been a full-time job for you for the past 10 years to maintain this (3,000 days instead of three) if you had not gone the serverless way, which I find doubtful.

> Could this have been simplified into something far more fragile than what is described? Absolutely.

Considering the hyperboles in the rest of your comment, this sounds more like snark than a considered opinion.


I agree that cloudwatch is dogturds, but want to dive deeper for illustrative purposes:

Your DynamoDB solution isn't foolproof. Its throughput is limited by the partition granularity, in your case the lambda name. It's also relatively expensive and fairly slow to query in bulk (DDB is designed for OLTP).

I don't have direct experience here, but I expect slapping Grafana on top of any disk-backed source is likely to be cheaper, faster, and have better ergonomics. Once your logging is too much for a disk to handle (this will happen later than you would have outgrown DDB, but before you would have outgrown CloudWatch), then you can bring something fancy in.


> has throughput limited

The event volume is not particularly large as we tend to process things in batch and rarely on the edge of an event. I also wouldn't, for example, log API requests using this mechanism. We're nowhere near this being an issue as 20-30 lambdas is not a particular problem for us. Choose a good naming convention and build your own deployment infrastructure and it's no sweat.

> relatively expensive

Large object compression and/or offload to s3 is baked into our dynamodb interface library. Not that this matters as almost all log records end up being less than 4kb anyways.

> slow to query in bulk

Which is why time is part of the key. You're not often looking back more than an hour. There's bulk export back onto campus servers if you want that anyway. The default TTL is 1 day. Running a "tail" is absurdly cheap, much cheaper than CloudWatch's laughable rate for their similar feature: a miss is 1/2 a read unit, and a hit is almost never more than 2.

> slapping grafana

I didn't need "observability." I need current state and recent deltas. This is particularly true when any changes are made. Otherwise my logs are pure annoyance and don't generally provide value. We optimized for the exceptionally narrow case we felt the cloud underserved in and left it at that.


> privilege separation and differential rollouts

What relation does any of those have with load?

(And also, why are people so keen on doing privilege separation by giving full privilege to a 3rd party and asking it to limit what each piece of code can do?)


I've used Kamal for side projects and startups. Easy to deploy, simple commands for logging and configurable.

The downside is it's a one-to-one system. But I just use downsized servers.


For all the people who are saying you don’t need X and Y - what is the simplest way to deploy a web app using TLS on a VPS/VM?

Let’s say I’ve got a golang binary locally on my machine, or as an output of github actions.

With Google Cloud Run/Fargate/DigitalOcean I can click about 5 buttons, push a docker image and I’m done, with auto updates, roll backs, logging access from my phone, all straight out of the box, for about $30/mo.

My understanding with Hetzner and co. is that I need to SSH in for updates, logs, etc. (and now I need to keep SSH keys secure and manage access to them). I need to handle draining connections from the old app to the new one. I need to either manage HTTPS in my app, or run behind a reverse proxy that does TLS termination, whose SSL certs I then manage myself. This is all stuff that gets in the way of the fact that I just want to write my services and be done with it. Azure will literally install a GitHub Actions workflow that will autodeploy to Azure Container Apps for you, with scoped credentials.


> For all the people who are saying you don’t need X and Y - what is the simplest way to deploy a web app using TLS on a VPS/VM?

Depends on your definition of simplest. In terms of set-up, probably something like https://dokku.com/ . It's a simple self-hosted version of Heroku; you can be up and running in literally minutes, and because it's compatible with Heroku you can reuse lots of GitHub Actions / other build scripts.

In terms of simple (low complexity and small-sized components), just install Caddy as your reverse proxy, which will handle SSL certs and proxying for you with extremely little, if any, config. Then just have your GitHub Action push your containers there using whatever container setup you prefer. This is usually a simple script in your build process, like "build container -> push container to registry -> tell machine to get the new image and run it", or, even simpler, have your server check for updated images routinely if you don't want to handle communication between the build script and the server. That's the bare minimum needed. This takes a bit longer than a few minutes, but you can still be done within an hour or two.

Regardless of your choice it shouldn't take more than 1 working day, and it will save you a lot of money compared to the big cloud providers. You can run for as low as €4.51/month with Hetzner, and that includes a static IP and basically unlimited traffic. An EC2 instance with the same hardware costs about $23 a month for comparison (yes, shared vs dedicated vCPU, but even the dedicated offer at Hetzner is cheaper, and this is compared to a serverless set-up where loads are spiky, which is exactly how we can benefit from a shared vCPU situation).
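For the TLS part specifically, the Caddy config really is tiny; a minimal Caddyfile sketch (the domain and port are placeholders), with certificate issuance and renewal handled automatically:

  example.com {
      reverse_proxy 127.0.0.1:8080
  }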


Re: securing SSH keys: nowadays most password managers can store SSH keys and integrate nicely with your SSH agent, making it essentially equivalent to logging in with a password. I use KeePassXC[1], and the workflow consists of opening the database using my master password, then just `ssh machine`, so in my book it's at the same level of comfort as a web interface for your cloud provider.

[1] https://keepassxc.org/docs/KeePassXC_UserGuide#_setting_up_s...


True, I see the allure of not thinking about draining connections. But I also enjoy having full access to the container and I don't really need scaling up and down features

If you don't like SSH you can have a GitLab runner on your VM which will redeploy your stuff on git push / git tag / whatever you want.


You can pretty easily self host a GitLab instance, host a kubernetes runner for your images and use Tailscale for SSH keys.

This will most certainly cost you more than $30, but you can do it.


Use Caddy.

It does automatic certificates.


Completely agree.

Scaling (and relatedly, high availability) are premature optimizations[0] implemented (and authorized) by people hoping for that sweet hockey stick growth, cargo culting practices needed by companies several orders of magnitude larger.

[0] https://blog.senko.net/high-availability-is-premature-optimi...


No. YAGNI

In my time at my current job we've scaled PHP, MySQL and Redis from a couple hundred active users to several hundred thousand concurrent users.

EC2+ELB, RDS (Aurora, Elasticache). Shell script to build a release tarball. Shell script to deploy it. Everyone goes home on time. In my 12+ years I've only had to work off hours twice.

People really love adding needless complexity in my experience.


> People really love adding needless complexity in my experience.

No, people love thinking their experience is the same as everyone else's.

Have you ever worked in healthcare? Do you have any idea what sort of requirements there are for storing sensitive information?

> In my 12+ years I've only had to work off hours twice.

Well, that settles it. Then no one on the planet should need cloud infra if you didn't.

And please, please don't tell me you've spent the last 12 years at the same place and have the gall to extend that to all software development.


That is a misinterpretation of what was said. I did not say all complexity is needless nor did I claim to have the one panacea.

I presented my story of how we've actively kept our architecture simple, and noted we've had very few issues. I did not say our architecture is the architecture for everyone.

Then I said

> People really love adding needless complexity in my experience

If the complexity is legally mandated, as in healthcare, it's by no means "needless". Legal compliance is a need.

If the complexity is justified, has merit or value, it's not "needless".

However, I've known a fair number of people who work on complicated Kubernetes-driven architectures that give them non-stop grief, and whose user base maxes out at ten to twenty active users.

My point is just don't make things more complex than they need to be.


> No. YAGNI

Sounds pretty absolute to me. I mean, when asked "Does your startup need complex cloud infra" (which is a loaded question) and you say "No. YAGNI", that seems pretty unequivocal, and it's not really fair to say I misinterpreted it.

> My point is just don't make things more complex than they need to be.

I agree. I just don't care for the absolute language (that I and others use sometimes). It made learning when I was just getting into this field really tough. My answer to that exact same question would be "It depends".


I've seen more than one start up go tits up because they were too focused on designing "Google scale ready" infrastructure.

And it has to be cloud agnostic because we can’t get locked in!

I like the cloud but it is overused and misused a lot imo.


While this advice is good for micro-SaaS, it’s only good for micro-SaaS. If you’re at any other kind of startup, your revenue is expected to grow by double digits.

Your little startup will become large, and fast.

That hacked together single server is going to bite you way sooner than you think, and the next thing you know you’ll be wasting engineer hours migrating to something else.

Me personally, I’d rather just get it right the first time. And to be honest, all the cloud services out there have turned a complex cloud infrastructure into a quick and easy managed service or two.

E.g., why am I managing a single VPS server when I can manage zero servers with Fargate and spend a few extra bucks per month?

A single server with some basic stuff is great for micro-SaaS or small business type of stuff where frugality is very important. But if we shift the conversations to startups, things change fast.


No product I’ve ever worked on has been successful enough to require the optimizations that microservices can provide.

Part of the reason they weren’t successful was because my managers insist on starting with microservices.

Starting with microservices prevents teams from finding product-market fit that would justify microservices.


Go super minimal:

Postgres for everything including queuing (a queue sketch follows after this list)

Golang or nodejs/TypeScript for the web server

Raw SQL to talk to Postgres

Caddy as web server with automatic https certificates

- No docker.

- No k8s.

- No rabbitmq.

- No redis

- No cloud functions or lambdas.

- No ORM.

- No Rails slowing things down.
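A minimal sketch of the "Postgres for queuing" part mentioned above, using the usual SELECT ... FOR UPDATE SKIP LOCKED pattern (the table and column names are made up):

  -- one-time setup
  CREATE TABLE jobs (
      id         bigserial PRIMARY KEY,
      payload    jsonb NOT NULL,
      done       boolean NOT NULL DEFAULT false,
      created_at timestamptz NOT NULL DEFAULT now()
  );

  -- each worker claims one job at a time; concurrent workers skip rows already locked by others
  BEGIN;
  SELECT id, payload
    FROM jobs
   WHERE NOT done
   ORDER BY created_at
   LIMIT 1
   FOR UPDATE SKIP LOCKED;
  -- ...do the work, then mark the claimed id as done:
  -- UPDATE jobs SET done = true WHERE id = <claimed id>;
  COMMIT;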


Nice, but MySQL is even simpler than PostgreSQL.

I like the power of Postgres and I use many features and I find it simple.

The goal is not for the technologies used to be simple or boring.

The overall architecture is simple, the technologies used are powerful.


Thank you. I appreciate the answer.

I recently moved to a data engineering role where everything uses GCP services (think BigQuery, DataProc, Cloud Storage, ...) and wondered if all of that was really necessary.

What would be a simple yet robust infra for data eng? I haven't thought a lot about it yet, so I am curious whether some of you have any insights.


The same thing that happened to devops from 2017-2024 (see: https://logical.li/blog/devops/) is happening with dataops. Hype train and jargon based decisions are taking place.

In the past years I was solving a data pipeline mess on a project which also had a devops AWS mess. First thing I was told was "what we need is a data lake".

Decisions are sticky so take context into account.


Simple answer. NO.

Everyone is building like they are the next Facebook or Google. To be honest, if you get to that point, you will have the money to rebuild the environment. But, a startup should go with simple. I miss the days when RAILS was king just for this reason.

The added complexity is overkill. Just keep it simple. Simple to deploy, simple to maintain, simple to test, etc. Sounds silly, but in the long run, it works.


“Scalability” seems to be perceived to be the most important thing for startups. It’s dream-driven development.

Also "scalability" is multi dimensional. I've seen, in the same company, infinite scalability in one downstream system whereas the upstream system it depended on was manually feed by fragile human-driven processes because there was not time to fix it. And at the same type the daily ops were "brain frying" because the processes were not automated and not streamlined and the documentation was ambiguous.

So, you had technical scalability in one system but if the customer base grew quickly every other bottleneck would be revealed.

There is more to business operations than technology, it seems.


We often forget that scalability doesn't mean just scaling up. It also means scaling down to avoid wasting money on overprovisioned infrastructure when you don't need it.

All businesses need to think about scalability, regardless of their size. If you're a startup, you likely want to be frugal with your infra costs, while still having the ability to quickly scale up when you need it. Those "simple" approaches everyone loves to suggest have no way of doing this.


A single Hetzner bare metal server is going to be a few times cheaper than all of these scalability gimmicks while offering a significant productivity boost.

A single server of any kind is not a proper production environment, unless you're building a toy or demo service. You want at least one application and one database server, since they have different operational requirements. You might even want to have a separate web server, so that you can isolate your internal network from the internet. This is all web hosting 101, and has been standard practice for several decades.

But wait, don't you want some form of redundancy/failover in case one of these servers catches fire? Alright, let's double this then. Make sure to setup your load balancer as well, which should probably run on a separate server.

But wait, don't you also want some kind of staging environment, so that you can certify your releases before deploying them to production? Alright, let's double this again.

And so on, and so on... Eventually you'll end up rebuilding the same features of those complex gimmicky tools, but do a much worse job at it, and you'll also have to maintain your custom tooling mess.

Of course, if your company fails after a few months, none of this is worth considering. But if you plan to exist for the next few years, I would argue that your productivity would be considerably higher if you had just chosen that gimmicky tool from the start, or a very short time after it.


I really subscribe to this kind of thinking, only I am team Kamal[0] instead of Docker Compose. Kamal 2 is around the corner and I think this version might even convince those that passed on Kamal 1. It's still just Docker but in a nice packaging. I'll be also updating my handbook[1] for the next version.

[0] https://kamal-deploy.org [1] https://kamalmanual.com/handbook


I've been looking at using Kamal for a side project. Seems to be similar in spirit. Has anybody used it, and if so, what do you think?

https://github.com/basecamp/kamal


I am curious about this too but haven't had the time to give it a try. Looking forward to hear about experiences.

Worrying about whether your web or app servers need or should use cloud architecture belies the much, much bigger consideration of how and where to store your data. Specifically, the economics of getting that data out of where you decide to put it first. Everything follows that decision.

Want to run bare metal? OK, guess you're running your databases on bare metal. Do you have the DBA skills to do so? I would wager that an astounding number of founders who find themselves intrigued by the low cost of bare metal do not, in fact, have the necessary DBA skills. They just roll the dice on yet another risk in an already highly risky venture.


I guess what some people do not understand is that K8s grew out of Google's internal systems for managing their services and handling millions of users.

For new projects that, with luck, will have a couple hundred users at the beginning, it is just overkill (and also very expensive).

My approach is usually Vercel + some AWS/Hetzner instance running the services with docker-compose inside or sometimes even just a system service that starts with the instance. That's just enough. I like to use Vercel when deploying web apps because it is free for this scale and also saves me time with continuous deployment without having to ssh into the instances, fetch the new code and restart the service.


A single VM is all fine and well until your hacky go-fast code allows an issue with a single request to take down your service. Serverless requests are isolated and will limit the blast radius of your hacky code as you iterate quickly.

Never saw a single request taking down a whole server. Killed a worker and the connection timed out, but never saw it take down the whole thing.

Faulty input killing your logic - I saw this plenty, would Lambda really help here?


I've seen it plenty. A request to process an Excel file or generate a PDF etc. Basically anything generating or processing documents is a likely candidate. It might only affect a single application, but if you are running multiple apps on a box, it is often enough to cause an outage.

For me (where our BE consists of maybe 100 endpoints) we’ve found the sweet spot to be Google AppEngine. Incredibly simple to deploy, we don’t really need to manage infrastructure or networking (although you can if you want), decent performance, plays well with other GCP services, great logging and observability, etc

We’ve tried deploying services on K8s, Lambda/Cloud Run, but in the end, the complexity just didn’t make sense.

I’m sure we could get better performance running our own Compute/EC2 instances, but then we need to manage that.


Slight OT: I’m shocked at the complexity even for “simple” static hosting options.

I recently attempted to move to a completely static site (just plain HTML/CSS/JS) on Cloudflare Pages, that was previously on a cheap shared webhost.

Getting security headers set up, forcing SSL and www, as well as HSTS, has been a nightmare (and it's still not working).

When on my shared host, this was like 10 lines of config in an .htaccess file before.


Yes, your startup needs complex cloud infrastructure when your organizational infrastructure can afford it in terms of money, other resources and time.

One domain, an idea, and an easy-to-use development stack are more than good enough for a bootstrapped as well as a funded startup to locate product-market fit.

Always remember this quote by Reid Hoffman: "If you are not embarrassed by the first version of your product, you've launched too late."


If your stack is Node.js, I highly recommend SSTv3 [0], which uses Pulumi under the hood and thus lets you deploy to any provider you want, be that cloud or docker in Hetzner.

It's simple and can scale to complex if you want. I've had very good experience with it in medium size TS monorepos.

[0]: https://sst.dev


Exactly. Keep it simple. We're running 1 monolith FastAPI service on EC2 with ECS (1 instance). Very simple, easy to debug and develop. Plus we have a few Lambdas for special tasks (like PDF generation), which run rarely but are needed. Frontends are Vue projects served from a public S3 bucket. This setup might work for many years.

> Pieter has built numerous successful micro-SaaS businesses by running his applications on single server, avoiding cloud infrastructure complexity...

From what I understand he employs a dedicated system administrator to manage his fleet of VPSs (updates, security, and other issues that arise) for thousands of USD per month.


It doesn't matter how we build it if there are no users to use it. This is the real problem for many startups

I'll say the same thing I always say on these kinds of posts. Both of the following can be true:

- A lot of companies and startups can get by with a few modest sized VPSs for their applications

- Cloud providers and other managed infrastructure services can provide a lot of value that justifies paying for them.


In my case (Experience with Azure Development), I definitely would use cloud infrastructure. Cloud providers abstract a lot of difficult things away, have ok-ish documentation and have a UI where I can easily find relevant information or do some debugging. With tooling I have more experience with I move away from the UI, but it's so easy just to get something up and running. The difficult thing is not getting each of these individual tools up and running, but handling the interactions between them and unfortunately I don't feel comfortable enough to do Networking, SSL, Postgres, Redis, VM management and building / hosting containers at the same time.

Cost in my case is not the highest priority: I can spend a month learning the ins and outs of a new tool, or I can spend a few days learning the basics and host a managed version on a cloud provider. The cloud costs for applications at my scale are basically nothing compared to developer costs and time. In combination with LLMs that know a lot about the APIs of the large cloud providers, this allows me to focus on building a product instead of maintenance.


There are relatively few startups (or non-startups) which need complex infrastructure from a technical point of view...

In reality, there is a strong bias in favor of complex cloud infrastructure:

"We are a modern, native cloud company"

"More people mean (startup/manager/...) is more important"

"Needing an architect for the cloud first CRUD app means higher bills for customers"

"Resume driven development"

"Hype driven development"

... in a real sense, nearly everyone involved benefits from complex cloud infrastructure, where from a technical POV MySQL and PHP/Python/Ruby/Java are the correct choice.

One of the many reasons more senior developers who care for their craft burn out in this field.


Building and operating your own car out of simple components is not simpler than buying a car off the shelf.

Operating a bunch of simple low-level infrastructure yourself is not simpler than buying the capabilities off the shelf.


Apples and oranges.

I'd say it is more like: Using a trolley to move some stuff across the street is more simple than using a fleet of drones.


Running your own Postgres on your own server — implementing and testing your own backups, optimizing your own filesystem, managing encryption keys, managing upgrades, etc — is not simpler than using Google Cloud SQL, which does all of this for you at an SLA you will not be able to achieve if you will be focusing on your business, which is what you should do as a startup.

Certainly you should not be running your own K8S cluster, but using Google Cloud Run is simpler than keeping your own server running. Even using Google Cloud Kubernetes Engine with autopilot is simpler than keeping your own server running.


Docker Compose Anywhere looks cool. It looks similar, in principle, to [CapRover](https://caprover.com/), which I highly appreciate.

It's still a badge of honor, bragging rights, for executives to declare that all their tech is in the cloud. Once this wears off we will get our fucking bare metal back.

This is quickly turning into a bozo badge. (Even Gartner will say so.)

So it's a risky thing to brag about right now.


I can't wait for the ooohhs and aaahhs when people start "going hybrid."

https://news.ycombinator.com/item?id=9581862 aww, yourdatafitsinram.com is now domain squatted.

https://yourdatafitsinram.net/ is up (and looks to be approximately the same as the old .com was)

It highly depends on what you are developing. Just because one guy (levels or whatever his name is) is doing it, doesn't mean it fits for everyone

The issue with a long running server is that if your traffic is low, you’re paying for idle time all the time. So I’d prefer a serverless solution.

...and the funny thing is that it is still cheaper than cloud native even while being up all the time, and it provides a predictable cost per month, unlike serverless, where you can have big surprises.

Check:

https://logical.li/blog/emperors-new-clouds/


Interesting. Expecting to read things I'd object to. But this is basically what I do, at least for smaller setups.

Cloud Native is the J2EE of the 2010s and 2020s.

It’s really brilliant. Sun would have been the one to buy Oracle if they’d figured out how to monetize FactorySingletonFactoryBean by charging by the compute hour and byte transferred for each module of that. That’s what cloud has figured out, and it’s easy to get developers to cargo cult complexity.


It looks like the author specifically talks about the infra for an early-stage startup that has not found product-market fit yet. If a startup has a product for consumers and does find product-market fit, then I'd point to two pieces of infrastructure that are hard to come by: EC2 and S3. Yes, EC2, the grandpa's infra that people either ignore or despise. But really, anyone can learn how to set up and run a k8s cluster, yet very few companies can offer something like EC2: full abstraction of the underlying servers, worry-free provisioning of new servers, and a robust and versatile implementation of dynamic autoscaling. After all, all the k8s shit won't scale easily if we can't scale the underlying servers.

And S3. S3 is just a wonderful beast. It's so hard to get something that is so cheap yet offers practically unlimited bandwidth and worry-free durability. I'd venture to say that it's so successful that we don't have an open-source alternative that matches it. By that I specifically mean that no open-source solution can truly take advantage of scale: adding a machine makes the entire system more performant, more resilient, more reliable, and cheaper per unit cost. HDFS can't do that because of its NameNode limitation. Ceph can't do that because of its bottleneck in managing OSD metadata and RGW indices. MinIO can't do that because its hash-based data placement simply can't scale indefinitely, not to mention that ListObjects and GetObject have to poll all the servers. Storj can't do that because its satellite and metadata servers are still the bottleneck. And the list goes on.


Whatever happened to EC2 with web/worker autoscaling? Is it outdated or unfashionable?

The biggest issue with this is when you deploy multiple applications to a server (e.g. 5 apps on IIS or whatever) and one of them kills off the box when it behaves badly. You can auto-scale, but it takes time to provision new machines and until they are up, you are down. Once you've experienced this a few times, the desire to split out applications into micro-services gets pretty strong in order to limit the blast radius.

No matter what auto-scaling solution you pick, it'll take time to start fresh new instances.

Agreed, but if you are running on micro-services and an app crashes, you don't lose everything (hopefully). It's not enough to make me want to use micro-services everywhere, but it is a consideration.

I'd like to see how this plays out with Erlang/Elixir in production and whether it works better, since the BEAM is very good at preventing individual processes from dominating the server and very good at recovering from errors.


Ah there's your problem. Don't use IIS on Windows Server 2008!

We don't always get a choice.

I'm sorry, what? IIS? o_O

The web server that many of us are stuck with unfortunately ;)

Just unfashionable.

TLDR: I'm not too good with infrastructure (and neither are these couple of teams), so you should also go back to steam engines.

Of course it highly depends on the skills of the team. In a startup there may be no time to learn how to do infrastructure well. But having an infrastructure expert on the team can significantly improve the time to market and reduce developer burnout and the tech-debt growth rate.


TL;DR

You shouldn't need to assemble a plane when your startup's journey can be expected to last only a few kilometers and you really only need to carry a few boxes.


Betteridge's law of headlines is an adage that states: "Any headline that ends in a question mark can be answered by the word no."

But this one is fishing for a "no". That law explodes on contact with those.

A compromise people seem to overlook: Use a single Lambda with internal routing.

This is my preferred approach for Lambdas: a larger Lambda that handles URL routing at the "API" level instead of at the individual-endpoint level.
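
For illustration, here is a minimal sketch of the pattern in TypeScript. It assumes a Lambda Function URL / API Gateway HTTP API (payload v2) event shape, and the routes and data are placeholders rather than anyone's actual code:

    // One Lambda, internal routing: new endpoints become new branches here
    // instead of new Lambda functions.
    type HttpEvent = {
      rawPath: string;
      requestContext: { http: { method: string } };
      body?: string;
    };

    type HttpResult = { statusCode: number; headers: Record<string, string>; body: string };

    const json = (statusCode: number, data: unknown): HttpResult => ({
      statusCode,
      headers: { "content-type": "application/json" },
      body: JSON.stringify(data),
    });

    export const handler = async (event: HttpEvent): Promise<HttpResult> => {
      const method = event.requestContext.http.method;
      const path = event.rawPath;

      if (method === "GET" && path === "/health") return json(200, { ok: true });
      if (method === "GET" && path === "/users") return json(200, { users: [] }); // placeholder data
      if (method === "POST" && path === "/users") return json(201, JSON.parse(event.body ?? "{}"));

      return json(404, { error: "no route for " + method + " " + path });
    };

One function to deploy, one log stream and one IAM role to reason about, and you can still split a route out into its own Lambda later if it ever needs separate scaling.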

Is this using the Lambda as your entire service?

I'm sorry, a single Lambda for what exactly?

An EC2 Linux VM with Node, SQLite, a Let's Encrypt cert, and a domain name.
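
Roughly how little code that stack needs, as a sketch: this assumes the better-sqlite3 npm package, with the Let's Encrypt cert and domain handled in front of the app (e.g. by certbot or a reverse proxy) rather than in it.

    import http from "node:http";
    import Database from "better-sqlite3";

    // One file-backed SQLite database on the VM's disk; no separate DB server to run.
    const db = new Database("app.db");
    db.exec("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)");

    const server = http.createServer((req, res) => {
      if (req.method === "GET" && req.url === "/notes") {
        // Synchronous queries are fine at this scale.
        const notes = db.prepare("SELECT id, body FROM notes").all();
        res.writeHead(200, { "content-type": "application/json" });
        res.end(JSON.stringify(notes));
        return;
      }
      res.writeHead(404).end();
    });

    server.listen(3000, () => console.log("listening on :3000"));

Backing up amounts to snapshotting a single file (the sqlite3 CLI's .backup command does this safely while the app is running).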

In "serverless" defense, I'll put a one data point from myself. I built https://crates.live 4-5 years ago. I used a "complex" tech stack. A single page web app. Hosted in Github pages as static HTML/JS. For the server side, I used Cloudflare workers (Wasm) to run a GraphQL server (kind of).

The result: it's still up after 5 years. I never looked back after I created the project. I do remember the endless other projects I did that have simply died by now because I don't have time to maintain a server. And a server almost always ends up crashing somehow.

Another thing: Pieter Levels has successful small apps that rely more on a centralized audience than on infrastructure. He makes cool money, but it's nowhere near startup-expected levels of money/cash/valuations. He is successful in the indie game, but it would be a mistake to extrapolate that to the VC/Silicon Valley startup game.


To counter your point, I have a site running since 2019 that is still up with no input from me or anybody else, and it's a dynamic site too. It's running on Docker on a VPS at DigitalOcean. If you build a rock-solid configuration, it will stand the test of time.

The classic HN catnip blog post:

1. New technology is bad and has no merit other than for resume.

2. Use old technology that I am comfortable with.

3. Insist that everyone should use old technology.


False dichotomy. It is not old vs new, it is simple vs complex. The fact that the older technology is simpler is just a coincidence.

> The fact that older technology is simpler

hahahahahahahahahaha. Yes, back in the days when all you could do on a website was read the text.

Old technology was 1000% not simpler. What an insane & absolute statement to make in an enormous field just because you can't make a solid argument.


I worked on a startup recently that had gone all in on AWS infrastructure, Lambda functions, managed database, IAM security.

Man the infrastructure was absolutely massive and so much development effort went into it.

They should have had a single server, a backup server, Caddy, Postgres and nodejs/typescript, and used their development effort getting the application written instead of futzing with AWS ad infinitum and burning money.

But that's the way it is these days: startup founders raise money and find someone to build it, and that someone always goes hard on the full AWS shebang. Before you know it, you spend most of your time programming the machine and not the application, and the damn thing has become so complex it takes months to work out what the heck is going on inside the layers of accounts and IAM and policies and hundreds of Lambda functions and weird crap.


Same here. The CTO was also engaging in resume-driven development. There was no rational discussion about what tech stack to use. Executives need to be able to point to a modern tech stack as a signal of their relevancy and competence. No one will be caught dead slinging bare metal and running on-prem databases right now. It's just the look.

I built out a POC and was running it on bare metal for serious workloads under my desk at GE (12-factor). Management practically scrambled to get me cloud access. My setup was ephemeral and could be easily reproduced anywhere. The software was easily deployed on, or integrated with, cloud services. I just shrugged.

I didn't care where my code ran; to them it was some epic priority to get it into the cloud and generate extra expenses.


I've seen the same thing. Massive infrastructure for a site that could run on a small VM. More time was spent configuring infrastructure and Terraform and debugging IAM roles than writing the actual code...

No, it needs a simple cloud infrastructure.

> Even GCP VMs and EC2 instances are reasonably priced.

Really? EC2 instances are waaay overpriced. If you need a specific machine for a relatively short time, sure, you can pick one from the vast choice of available configurations, but if you need one for long-running workloads, you'll be much better off picking one up from Hetzner, by an order of magnitude.

For one of many examples, see this 5-year-old summary (even more true today) by the CEO of a hardware startup:

https://jan.rychter.com/enblog/cloud-server-cpu-performance-...


Why would anyone make it complex?

I would say a WAF is also not that useful an addition when you're developing new applications, especially if you use modern frameworks and an ORM.

Most of the crap hitting servers is old exploits targeting popular CMSes.

A WAF is useful if you have to filter traffic and you don't know what might be exposed on your infra, like that WordPress blog marketing set up 3 years ago, stopped adding posts to, and no one ever updated.


Honestly, what you need:

Vulnerability scanning of your images

Fargate

RDS


I dunno. I've seen a _lot_ of business ideas fail that could have failed much less expensively using PHP and MySQL on shared cPanel hosting than they did using AWS/Azure/GCP.

Yeah, that won't scale to a million QPS, or even 10 QPS. But way more businesses fail because they never achieve 100 Queries Per Day than fail because they fell over at 10 or 1,000 or 1,000,000 QPS.

I mean, hell, Twitter (back in the day) was famous for The Fail Whale.

Getting enough traffic is harder and more important than your "web scale architecture" for your startup. Making actual cash money off your traffic is harder and more important than your "web scale architecture" (ideally by selling them something they want, but making cash money through advertising or by impressing VCs with stories of growth and future value counts too).

There is precisely _zero_ chance that, if you ever get within 2 or 3 orders of magnitude of "a million QPS", the code you and your cofounder wrote won't have been completely thrown away and rewritten by the 20- or 100-person engineering department that is now supporting your "1,000 QPS" business.


No

Yeah, but I don't need Python Flask on my resume. I need Docker, Kubernetes, and Terraform on my resume.

I need it on my resume for every 2-year stint, and 2-3 people on the team to vouch for it.

You’re saying “hey, let everyone know you worked on a tiny company’s low-traffic product, and how about you just don’t make half a million a year,” all to save the company I work at a little money?

Until companies start interviewing for that, it’s a dumb idea. I’m rarely building greenfield projects anywhere, and other devs are also looking for maintainers of complex infrastructure.


I think there is a middle ground; to me it seems like this oversimplifies both sides.

For many of the "complex" things like Lambdas, there are frameworks like Serverless that make managing and deploying them as easy as (if not easier than, frankly) static code on a VM.

Not every workload scales the same way, either; we have seen new things that got very successful and crashed right out of the gate because they could not scale up properly.

I agree that you don't need an over-engineered "perfect" infrastructure, but just saying "stick it on a VM" also seems like too far a swing in the other direction.

That also ignores the cost side of running several VMs vs the cost of smaller containers or Lambdas that only run when there is actual use.

Plus there is something to be said for easier local development, which things like Serverless and containers give you.

You may not need to set up a full k8s cluster, but if you are going with containers, why would you run your own servers vs sticking the container in something managed like ECS?


> Does your startup need complex cloud infrastructure?

99.99% of the time. No.


First, Lex Friedman is the dumbest motherfucker in podcasting. No brains at all, terrible, ignorant, thoughtless interactions: just awful in every way.

> But here's the truth: not every project needs Kubernetes, complex distributed systems, or auto-scaling from day one. Simple infrastructure can often suffice,

When the hell can we be done with these self-compromising losers? Holy shit! Enough! It doesn't save you anything doing less. People fucking flock to not-Kubernetes because they can't hack it, because they suck, because they would prefer growing their own far worse, far more unruly monster. A monster no one will ever criticize in public because it'll be some bespoke, frivolous, home-grown alt-stack no one will bother to write a single paragraph on, which no one joining will grok, understand, or enjoy.

It's just so dumb. There's all these fools trying to say, oh my gosh, the emperor has no clothes! Oh my gosh! It might not be needed! But the alternative is really running naked through the woods yourself, inventing entirely novel, unpracticed & probably vastly worse, less good means for yourself. I don't know why we keep entertaining & giving positions of privilege to such shit throwing pointless "you might not need it" scum sucking shits trying to ruin things like so, but never ever do they have positive plans and never ever do they acknowledge that what they are advocating is to take TNT to what everyone else is trying to practice, is collaborating on. Going it alone & DIY'ing your own novel "you might not need to participate in a society" stack is fucking stupid & these people don't have the self-respect to face up to the tall dissent they're calling for. You'd have to be a fool to think you are winning by DIY'ing "less". Fucking travesty.


> First, Lex Friedman is the dumbest motherfucker in podcasting.

Amen.

> I don't know why we keep entertaining & giving positions of privilege to such shit throwing pointless "you might not need it" scum sucking shits t

Amen.



