How we use HashiCorp Nomad (cloudflare.com)
318 points by jen20 on June 6, 2020 | 160 comments



After trying out most of the kubernetes ecosystem in the pursuit of a declarative language to describe and provision services, Nomad was a breath of fresh air. It is so much easier to administer and the nomad job specs were exactly what I was looking for. I also noticed a lot of k8s apps encourage the use of Helm or even shell scripts to set up key pieces, which defeats the purpose if you are trying to be declarative about your deployment.
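
For anyone who hasn't seen one, a Nomad job spec is just a small HCL file. A minimal sketch (names and values made up) looks roughly like this:

    job "hello-web" {
      datacenters = ["dc1"]
      type        = "service"

      group "web" {
        count = 2

        task "nginx" {
          driver = "docker"

          config {
            image = "nginx:1.19"
          }

          resources {
            cpu    = 200 # MHz
            memory = 128 # MB
          }
        }
      }
    }

That one file is the whole deployment description; `nomad job run` applies it.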


https://kapitan.dev/ is a one-stop shop for true declarative configuration using jsonnet, python (kadet), or jinja, with excellent secret management supporting gkms, awskms, gpg, and vault.

It is simpler than other tools because you can get started without even touching jsonnet or python when using our generators.

It does more than all the other tools combined, as it replaces helm+helmfile+gitcrypt or kustomize.

It's universal, so you can use it in non-kubernetes situations where other tools leave you high and dry.

With the new generator library, you can have 1 template and use it to configure many services. Check our examples at https://github.com/kapicorp/kapitan-reference

We have just released:

* https://github.com/kapicorp/kapitan-reference, a repo with quick-start examples. It includes our generator, as explained in https://medium.com/kapitan-blog/keep-your-ship-together-with....

* https://github.com/kapicorp/tesoro, a secret "webhook controller" to seamlessly handle Kapitan secrets in your cluster. Better than sealed-secrets because there is no need to convert secrets, and it supports KMS backends like Google and AWS together.

Get started with our blog: https://medium.com/kapitan-blog or join #kapitan on the kubernetes slack.


Helm charts are a declarative way of deploying app(s) and their accompanying resources.


Helm, however, is objectively terrible with its yaml-based templating language and zero practical modularity.


Indeed. Helm offers great features, but it suffers from kubernetes' unnecessary complexity and from using golang templates in YAML.

When I started with kubernetes I converted my small Docker compose files to kubernetes files. Later I rewrote everything as helm charts. Now there are almost more lines of YAML and golang templates than lines of business logic in my applications.

I'm considering going back to Docker compose files. They're simple, readable, and easy to maintain.


Highly recommend trying Jsonnet (via https://github.com/bitnami/kubecfg and https://github.com/bitnami-labs/kube-libsonnet) as an alternative. It makes writing Kubernetes manifests much more expressive and integrates better with Git/VCS based workflows. Another language like Dhall or CUE might also work, but I'm not aware of a kubecfg equivalent for them.

Jsonnet in general is a pretty damn good configuration language, not only for Kubernetes. And much more powerful than HCL.


If you like those, I'd take a look at Grafana's Tanka [0]. It also uses jsonnet but has some additional features such as showing a diff of your changes before you apply, easy vendored libraries using jsonnet-bundler, and the concept of "environments" which prevents you from accidentally applying changes to the wrong namespace/cluster.

[0] https://github.com/grafana/tanka


I looked at it, but I don't like it for the same reason I dislike many other tools in this space: it imposes its own directory structure, abstractions (environments) and workflow. I'm a fan of the kubecfg-style approach, which lets you use whatever structure makes sense for you and your project.

It's a 'framework' vs 'library' thing, but in the devops context.


I worked on this at a previous gig - https://github.com/cruise-automation/isopod

Same use case, but it uses a Starlark DSL instead of jsonnet.


Funny, I thought jsonnet was an even worse experience than templating yaml


The problem with templating YAML is that you're templating text, in a syntax that is very sensitive to whitespace. By definition, Jsonnet avoids that because it operates on the data structure, not on its stringified representation.


My experience with jsonnet varied: there's good jsonnet code, and there's bad jsonnet code, too. Just like with any programming language, you have to apply good software engineering practices. Text-templated YAML, however, is terrible by design.


I've found the whole k14s ecosystem is pretty great too (ytt + kbld + kapp).


ytt is even more templating-yaml-with-yaml, so it all ends up being a bargain bin Helm. There's no reason to do this over just serializing plain structures into YAML/JSON/...


It avoids the whole set of issues you get with templating yaml with helm, since it's structure-aware.

Also you can actually write pure code libraries in starlark (basically a subset of Python)


I'm aware, but I just don't understand the point of it. How is it in any way better than a purpose-specific configuration language like Dhall/CUE/Jsonnet? What's the point of having it look like YAML at all?


We just moved to the kubernetes ecosystem - specifically for integration with spot instances of AWS (and cut our costs by 60%).

But I hate the UX of kubernetes. Compose was beautiful.

I have hope - since compose files have become an independent specification. https://www.docker.com/blog/announcing-the-compose-specifica...

A Kubernetes distro built around the compose file specification would be a unicorn.


https://kapitan.dev/ is a one-stop shop for true declarative configuration using jsonnet, python (kadet), or jinja, with excellent secret management supporting gkms, awskms, gpg, and vault. It can also render helm charts!

It is simpler than other tools because you can get started without even touching jsonnet or python when using our generators. It does more than all the other tools combined, as it replaces helm+helmfile+gitcrypt or kustomize. It's universal, so you can use it in non-kubernetes situations where other tools leave you high and dry.

With the new generator library, you can have 1 template and use it to configure many services. Check our examples at https://github.com/kapicorp/kapitan-reference

We have just released: * https://github.com/kapicorp/kapitan-reference, a repo with quick-start examples. It includes our generator, as explained in https://medium.com/kapitan-blog/keep-your-ship-together-with...

* https://github.com/kapicorp/tesoro, a secret "webhook controller" to seamlessly handle Kapitan secrets in your cluster. Better than sealed-secrets because there is no need to convert secrets, and it supports KMS backends like Google and AWS together. Get started with our blog: https://medium.com/kapitan-blog or join #kapitan on the kubernetes slack.


It is. But k8s has no convenient way of parameterizing releases that can beat Helm. A simple stateless application needs:

- a deployment
- a service
- an ingress
- a config map (or several)
- a secret (or several)

It's even worse for stateful applications.

And each of the resource definitions is 60% boilerplate, 35% application-specific and 5% release- or environment-specific.

Helm would probably be a nice and neat tool if it had stopped at maintaining a simple map of variable names to values. But since applications need things like "if the user said SQLite, add a PVC, a configmap and a secret and refer to them in the StatefulSet; if she said Postgres, go pull another chart, deploy it with these parameters, then add this configmap and this secret and refer to them in the StatefulSet", Helm is an overcomplicated mess.


k8s has a very convenient built-in way of parametrizing releases called kustomize; it has been supported by kubectl for quite a few versions now.


Sometimes I even wish they could embed a JavaScript interpreter... After all, YAML is almost equivalent to JSON, and the perfect templating language for JSON is JavaScript, tbh.

Otherwise people have to keep inventing half-baked things.


The problem isn't JSON or YAML: it's text templating serialization formats, instead of just marshalling/serializing them.


There is a tool called Pulumi that does exactly that, and it's excellent.


json is already valid yaml


> Helm charts are a declarative way of deploying app(s) and their accompanying resources.

How do you make helm chart deployment declarative? `helm install` is not declarative (in my understanding `kubectl apply` is declarative and `kubectl create` is not. Let me know if my understanding of declarative is wrong). Thanks.


I think you're confusing "declarative" with "idempotent"! The helm equivalent to "kubectl apply" is "helm upgrade --install".


Yes. Thank you.


`helm template | kubectl apply` ?


This is one way of doing declarative deployment using helm, but it doesn't guarantee resource creation order. `helm install` creates resources in order, while `kubectl apply` creates them in alphabetical order by filename. This may lead to some unexpected issues.


Helm charts, round these parts, are a glorified yet byzantine templating system to spit out K8s manifests.


Maybe I was doing it wrong but every guide was verb based - “helm install X”. My declarative ideal ended up being a text file full of helm install commands and that wasn’t what I wanted.


Quick-start guides take the easiest path to getting something running, which is `helm install` in the Helm world.

If you want complete control over what you're pushing to the API, use Helm as a starting point instead: run `helm template`, save the YAML output to a file, and publish it using `kubectl` or some other rollout tool. I recommend using `kapp` [1] for rollouts.

[1] https://get-kapp.io/


This is precisely how Spinnaker behaves with Helm. It renders the template into manifests and then deploys them.

Kinda loses the benefit of Helm hooks, but if one is using Spinnaker, there are other ways to do the same thing hooks can do.


Great writeup, even if I'm longing for more details about things like Unimog and their configuration management tool.

Pay close attention to the section where they described why they went with Nomad (Simple, Single-Binary, can schedule different workloads and not just docker, Integration with Consul). Nomad is so simple to run that you can run a cluster in dev mode with one command on your laptop (even MacOS native where you can try scheduling non-Docker workloads). I'd go so far as saying that it would even pay off to use Nomad as a job scheduler when you only have one machine and might have used systemd instead. You can wget the binary, write a small systemd unit for Nomad, then deploy any additional workloads with Nomad itself. By the time you have to scale to multiple machines you just add them to the cluster and don't have to rewrite your job definitions from systemd.
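
To make that concrete, the agent config on a single box can be tiny. A sketch (paths arbitrary) with both server and client enabled, so the machine schedules its own workloads:

    # /etc/nomad.d/nomad.hcl
    datacenter = "dc1"
    data_dir   = "/opt/nomad/data"

    server {
      enabled          = true
      bootstrap_expect = 1
    }

    client {
      enabled = true
    }

Scaling out later is mostly a matter of pointing new clients at the existing servers; the job specs themselves don't change.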


The biggest hurdle with adopting nomad is the kubernetes ecosystem and related network effects. Things like operators only run on Kubernetes, and they're driving an entirely new paradigm of infrastructure and application management. HashiCorp is doing their best to keep up while supporting standard kube interfaces like CSI/CNI/CRI, but I don't know how they can possibly stay relevant with Kubernetes momentum.

In my opinion, HashiCorp should look at what Rancher did with K3s and offer something like that, integrated with the entire Hashi stack. The only reason most people choose nomad is the (initial) simplicity of it (which quickly goes away once you realize how "on an island" you are with solutions for ingress etc). Deliver kube with that simplicity and Integration and it's a much more compelling story than what Nomad delivers today.


This is what Nomad is though... It works without their other products and also natively integrates into them. Deploying Vault, Consul, and Nomad gives you a very nice experience.

Also with Consul 1.8 and Nomad 0.11 you’ll get Consul Connect with Ingress gateways which solves some of those problems you mentioned.


Nomad uses consul as the kv store, so it doesn't work without their other products.


Please make an attempt to understand a technology before making comments like this. Nomad has no requirement to store kv's, nor secrets. It's an entirely separate thing from Consul and Vault. It simply integrates with Consul and Vault. It even runs its own raft store, so each product's backend is totally separate (Vault dropped its reliance on other backends as of 1.4 and can run its own raft store now).

Nomad template stanzas are simply consul-template (https://github.com/hashicorp/consul-template#plugins), you could use `{{ with secret "" }}` and never touch Consul. You also have every function beyond service and the key* set of functions. You could build pretty static, or dynamic configurations using these blocks without ever touching Consul. On top of that, both Terraform templates and Levant work well for templating job specs out themselves, which contain template stanzas in them.
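
To make that concrete, a template stanza pulling from Vault might look roughly like this (a sketch: the secret path and field names are made up, and it assumes the task already has Vault access configured); nothing in it touches Consul:

    template {
      data = <<EOT
    {{ with secret "secret/data/myapp" }}
    DB_PASSWORD={{ .Data.data.password }}
    {{ end }}
    EOT
      destination = "secrets/app.env"
      env         = true
    }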

An example of something helpful would be if you wanted to drop a very small binary that changes with each deploy, you could use https://github.com/hashicorp/consul-template#base64decode and just change the contents each time you deploy the job.

If you wanted to use redis keys in your consul-template, simply drop a binary on each server as a plugin to consul-template, then for example: `{{ "my-key" | plugin "redis" "get" }}`.

Why not try running it yourself?

`nomad agent -dev` will get you a server running, then `nomad job init` will give you a full spec which you can run with `nomad job run example.job`.


I understand nomad, and what you're saying is not how anyone runs it. Directly from their documentation:

"Nomad schedules workloads of various types across a cluster of generic hosts. Because of this, placement is not known in advance and you will need to use service discovery to connect tasks to other services deployed across your cluster. Nomad integrates with Consul to provide service discovery and monitoring."

So one of the very basic features of an orchestrator, which is service discovery, requires consul. Sure, I can use it without consul to just start jobs that don't communicate, but obviously that's not the normal use case, and you can see that by looking at their issues list.


So basically, when CloudFlare was making the decision to adopt Nomad, they had already adopted Consul and had already built in-house a custom scheduler (unimog) for their customer traffic.

It's rather disingenuous to compare to Kubernetes-based installations by pointing out that Nomad is a single Go binary that's easy to install, because they're also running Consul and unimog. For what it's worth, kubelet (which schedules jobs, like Nomad) is also a single Go binary that's easy to install, as is kube-proxy (which helps services running on a node to send traffic to the right node, so, roughly analogous to Consul).

If anything, the author practically makes the case against Nomad. Practically speaking, the only reason why CloudFlare adopted Nomad is because the stars aligned on their stack. Most companies will actually reduce complexity by adopting Kubernetes instead, compared to running two different schedulers for jobs and externally-managed service discovery.


I agree with everything you said, but it’s important to note that the blogpost isn’t comparing nomad to K8s, but describing why nomad makes sense for cloudflare.

From cloudflare’s perspective, K8s is a riskier choice, precisely because it’s large and multifunctional. If Nomad (or Consul) upstream went south, CF would probably be able to maintain it without seriously growing their headcount. And they’d probably be playing the lead role in that ensemble.

If kubernetes upstream went south, the above statements would almost certainly not hold.

For them, picking a stack means picking something they need to stick to, potentially for the next decade or two. When your decisions are made with that long term in focus, the biggest, shiniest, and most exciting community is not always the right choice.

Like you said, it seems like this is the right choice for CF. It’s probably not the right choice for $TECH_STARTUP, for whom pinning the future of business critical infrastructure to Google and RedHat’s stewardship of K8s is the least of their concerns, and that’s OK.


Having worked extensively with both the k8s and hashicorp stack let me tell you that there is something to be said regarding the simplicity of the former.

First - companies will want service discovery for all managed resources. Consul is just outstanding, so there’s that.

On-prem, nomad is simple to manage and it gives you a form of freedom that k8s locks up in a black box.

After all, it’s a process scheduler, go ahead and schedule what you want. A container, raw exec or batch job. Add parameters to the batch job and dispatch them at will.

If you start from scratch or can move absolutely everything to k8s? Well sure, but not a lot of places you'd call "enterprise" are in that position.


The single binary thing is a tired trope. Kubernetes has had hyperkube, a single binary, for many years.

But really, a single binary should be at the bottom of your list of concerns.


Hyperkube is several binaries bundled in one container, no?


It's often _distributed_ as a container, but it ~~is~~ was a single monolithic go binary. However, while I was looking for supporting evidence for that claim, I discovered that it was recently[1] evicted from the github.com/k/k repo, and the docker image that is now built uses a shell script named hyperkube; the Dockerfile just untars the underlying binaries into /usr/local/bin.

I wasn't able to immediately track down the alleged new repo for hyperkube, though

1 = https://github.com/kubernetes/kubernetes/pull/83454


Do people that aren't at Cloudflare scale really see the need for kubernetes and/or Nomad?

Of the two Nomad seems much more sane because it does one thing only and is much simpler to manage and deploy.

That said, having used it, we are mostly moving away from it. Consul + Docker/Docker-compose with systemd in a "service per VM" model has proved much easier to administer at our scale (a couple of datacenters, around the ~1k VM mark). It is actually so simple that it's boring. Instead of fiddling with infrastructure, the developers spend time solving business problems...

Our stack is really boring, developers can find their way around ansible and share galaxy roles to configure / provision infrastructure... Other cloud native projects like Prometheus and Fluentd give us all the visibility we need in a very straight forward and boring way.

p.s: Great article!


From my personal experience, yes, but not just for scale. I've used Kubernetes in a 300k person organization and now in a 35 person organization. Kubernetes isn't the simplest solution, but it's just reliability in a box. You can control the network, and the servers, and the load balancing all through the API which means for most things I can install a fully clustered solution with a handful of commands. Set the number of instances I want to run and forget the whole thing exists. Suddenly I'm cloud agnostic by default and the underlying K8s management is handled by the cloud provider.

In the 300k person org it was for scale and reliability. In the 35 person org it's so that 1 guy can manage everything without trying.


I see. My last work experiences have been introducing devops workflows into companies that already had a decent tech body (~200 people) and the traditional divide between IT/Dev.

I think the approach I mention ends up being a good compromise from both sides. If I were starting a company from scratch I would certainly consider it differently.


Imo kubernetes isn't about "scale". Folks like cloudflare often run a very limited number of services (proxy plus control plane), which can make features like quotas/affinities/limits less useful because 100% of the machine installed in the pop is dedicated to one process.

I see kubernetes as more useful when organizationally you have more heterogeneous services and separate dev/sre groups. This allows them to divide up responsibilities more easily.


Increases velocity as well.

I have a set of developers that can focus on their code and not have to worry about the infrastructure, TLS certs, DNS, storage or whatever. That all gets abstracted away.

For SRE, that gives one a common set of orchestration tooling one can develop against.


Interesting, I went the other direction recently, from systemd to Nomad.

I was motivated by a move away from config management in favor of commands wrapping Nomad API calls. For a "devs on-call" model this was preferable to being gatekeepers of PRs against config management.

To glue the whole thing together I got Consul Connect going in the Nomad jobs, so service config complexity was comparable to docker-compose.

Saying this out loud makes me realize it was the organizational model I was pursuing that led me here (so called "production ownership"). And I'm not a big config management fan ;). I take it you have dedicated Ops teams or Devs willing to learn Ansible well?


Our devs do their own ops using our blessed ansible galaxy roles, plus creating their own when necessary.

It is all pretty standardized; it's rare that someone needs to create a new galaxy role, for example.

We expect our teams to know their way around Ansible (which hasn't been a problem); we have internal docs and training.


I'm surprised Hashicorp hasn't repositioned this product.

Terraform was a huge breath of fresh air after Cloudformation. If you ask me, deploying apps via k8s is even better.

Everyone wants a free lunch and for me, momentum + cloud vendor support + ultimately the Nomad Enterprise features that come free with K8s made the choice easy.


> Terraform was a huge breath of fresh air after Cloudformation.

Humans shouldn't write CloudFormation templates. If you're doing that, here is an important free clue: use CDK.

One advantage of CDK over Terraform is that the 'state' is the CloudFormation stacks themselves as opposed to Terraform's state system. Another is that you get a real programming language (your choice of several) using CDK.

The disadvantage of CDK is that it's AWS only.


TF can be often suffocating still. Pulumi runs circles around it, when it comes to user experience in writing complex, modular, composable and reusable configurations.


By going with a "fork and patch" model, rather than shelling out to the existing terraform providers, they are putting themselves in a constant race to overtake TF's popularity.

Their okta provider, for example, is based on some absolutely ancient, non-standard provider (https://github.com/pulumi/pulumi-okta/blob/v2.1.2/provider/g... ) instead of using the official one which receives regular updates: https://github.com/terraform-providers/terraform-provider-ok... ; and now I guess it's on me to do some git diff to find out if there are meaningful changes between the imported provider and the upstream one?

To say nothing of the more esoteric providers that one can cheaply build and place in the `.terraform.d/plugins` directory and off you go, in contrast to trying to find the dark magick required to use some combination of https://github.com/pulumi/pulumi-terraform-bridge and https://github.com/pulumi/pulumi-tf-provider-boilerplate but ... err, ... then one has to own that new generated pulumi provider code? So like a fork but worse? Maybe they should make a bot that tracks the terraform-provider topic on GH and uses their insider knowledge to generate pulumi providers for them.

Don't get me wrong: I anxiously await the death of TF, but until there is a good story to tell my colleagues about an alternative to it, they'll continue to use it.


Except now you have to deal with javascript. It's hard enough to turn some operators towards an IaC approach, but throwing Javascript at them isn't going to help.


Pulumi does not require JavaScript at all - it is one of a range of options, including JavaScript, TypeScript, Python, Go, C# and F# (currently).

Disclaimer: I worked on Terraform at HashiCorp and am a contributor to Pulumi also.


Pulumi is so good (even if it is just typed-SDK's over Terraform). Only sane approach to IaC IMO, using an actual programming language with types and IDE integration.

Really dig using it with Cloud Run.


It’s also not typed SDKs over Terraform. The best Pulumi providers are native and offer a much richer experience than the wrapped Terraform providers (see the Kubernetes support for example). I for one look forward to those rolling out across the board!


I actually have one burning question if you don't mind answering:

Pulumi has two different k8s modules, the standard one (pulumi-kubernetes) and "kubernetes-x" which has a much nicer API that reduces boilerplate:

https://github.com/pulumi/pulumi-kubernetesx

The second one doesn't seem to be used in the Pulumi tutorial content and has less stars/recognition.

Are these interchangeable? If so, why not use @pulumi/kubernetesx instead of the standard k8s module everywhere?


They're not interchangeable - kubernetes-x builds higher-level abstractions on top of the pulumi-kubernetes API, which is faithful to the OpenAPI spec supplied by Kubernetes.

To my mind, the real power of having general purpose languages building the declarative model instead of a DSL is that libraries like k-x (or indeed AWS-x) can actually be built!


What language would be good for operators if not JavaScript?


Speaking as a 25-year sys/netadmin who writes a fair bit of code too: nearly anything.

I'd rather learn Go than deal with JavaScript. Python would be fine. Elixir would be great (I'd much prefer Erlang but there are only so many miracles I'm allotted in this lifetime). Perl, please.


Pulumi has support both for Python and Go along with TypeScript and JavaScript.

I never understand when people say operators don't like programming languages. Always seems weird to me that a person that constantly works with computers would prefer a static markup language instead of full programming language for managing the complexity of operating software in production environments.

My original comment was about this attitude. I personally think programming languages are much better than markup languages for almost everything. If you are presenting static content then markup languages are fine but if you're doing anything more than presenting static content then YAML just makes no sense.


A full programming language gives too much flexibility. In particular, this includes the flexibility to shoot oneself in the foot.

A person who constantly works with computers is painfully aware of the people's ability to screw things up by an honest mistake when programming them. The more complex the thing, the easier it is to make a mistake, because human attention is finite.

Whenever you can limit the language to the task at hand, and build in guard rails that would prevent you from doing things you should not be doing, it's usually a productivity win, even if it precludes implementation of 3% of some most advanced and complex solutions. Reliability and simplicity go hand in hand, and reliability is often the thing devops people value very highly.


I've worked with several declarative config systems at a big tech company that tried to provide these guard rails. They erode!!

And it makes sense. As we demand more flexibility to not repeat ourselves (DRY) we get clever and add little features.

Suddenly you can stick an entire cluster config (or several!!) inside a lambda that returns the declared resources... And you've gone straight to hell. You're more than Turing complete by then and engineers are reinventing conditionals and loops with lambdas all over your config codebase.

Might as well have just started and left it at Python/etc.


I can build DSLs (domain specific languages) with programming languages. I can't build DSLs with YAML. And when someone says "You will use YAML because you can't be trusted with a programming languages" then I find that very disrespectful and condescending.

The solution to finite attention isn't less powerful tools. The solution is more powerful tools that augment operators and let them extend their finite attention spans into longer time horizons.

How is your description and viewpoint different from what I just described?


A good DSL (as opposed to a raw general-purpose language) would be ideal. It's the guard rails I'm after, when implemented well.

What I won't like is the "here are some classes and functions for accessing our API, go build the stuff somehow" approach.


Then we're in agreement.


When I tried it late last year, python was a second-class citizen: if you wanted to use all the features, you had to use JS.


Which features were missing?


Cloudflare, make sure you upgrade to 0.11.3, the new scheduling behavior is awesome for large clusters.

Also, a massive warning to anyone wanting to use hard cpu limits and cgroups (they do work in nomad, it's just not trivial): they don't work like anyone expects and need to be heavily tested.
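
For anyone wanting to experiment with this, the Docker driver exposes it per task, something roughly like the sketch below (image name made up; as far as I know the option is `cpu_hard_limit`). Test the resulting CFS behaviour under load before trusting it:

    task "api" {
      driver = "docker"

      config {
        image          = "example/api:latest" # hypothetical image
        cpu_hard_limit = true                 # enforce a CFS quota instead of just CPU shares
      }

      resources {
        cpu    = 1000 # MHz; with the hard limit this becomes a ceiling, not just a weight
        memory = 256  # MB
      }
    }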


What problem did you observe regarding CPU limits? Is it latency due to CFS CPU period?


are you serious? you've got a major company touting a technology and a key component of it is broken?


I haven’t tested hard CPU quotas with Nomad but I suspect the issue mentioned above is due to cgroups/CFS and is also applicable to Kubernetes. See these slides for more details: https://www.slideshare.net/mobile/try_except_/optimizing-kub...


I recall that there's been some improvements to CFS to fix this since 2018.


There's been a ton of improvements over time. That doesn't make it any less foot-gun-y, unfortunately.

I had the exact same problem with Kubernetes (and even just straight up containers), I just want to make sure folks are extremely aware of how big of a footgun it is, and to really, really, really test it well.


Yes, you know, that small organization, Linux, which manages the cgroups that everything else calls into.

I have only tested it with Docker; again, that company Moby, with its very little-used software.

How dare their core systems include anything that isn't extremely simple to understand or has caveats.


Ha, have you ever tried kubernetes?


Nice that they did a kernel patch to fix the pivot root on initramfs problem, but sadly looking at replies it won't get merged https://lore.kernel.org/linux-fsdevel/20200305193511.28621-1...


Article is missing the title's leading "How"


One point I never get for companies operating their own hardware: if your problem was having a number of well-known servers for the internal management services, and you then move to a Nomad cluster or Kubernetes to schedule them dynamically, you end up with the same problem as before of scheduling the well-known Nomad servers or Kubernetes masters. So is the only advantage here that the Nomad server images update less often than the images of the management services?


Not associated with CloudFlare but I've built similar stuff with Nomad.

The cost/maintenance trade-off works when you have more SPOF management hosts than Nomad servers (5). You decrease host images down to 2, Nomad server and client, versus N management images.

Though it does sound a bit like they're using config management rather than pre-built images.

Bonus, Nomad servers are more failure resistant using Raft consensus versus any N management hosts. And for discovery I found the optimal pattern is to put all of the Nomad servers in a "cluster" A record for clients to easily join (pattern works well for Consul too)
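
Concretely, the client side of that pattern is just a few lines (a sketch; the DNS name is made up), since retry_join accepts a name that resolves to all of the servers:

    client {
      enabled = true

      server_join {
        retry_join = ["nomad-servers.example.internal"]
      }
    }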


Off topic but... I've always wondered how Cloudflare can not charge for the bandwidth. Even the free plan is super generous (CDN, SSL, etc).

How are they making a profit when AFAIK all other CDNs charge you for the bandwidth (and I assume they have to pay for to their providers)?


"2.8 Limitation on Serving Non-HTML Content

The Service is offered primarily as a platform to cache and serve web pages and websites. Unless explicitly included as a part of a Paid Service purchased by you, you agree to use the Service solely for the purpose of serving web pages as viewed through a web browser or other functionally equivalent applications and rendering Hypertext Markup Language (HTML) or other functional equivalents. Use of the Service for serving video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited."

In other words, as soon as you start using significant CDN bandwidth for images, video, audio, etc. they will contact you and ask you to upgrade your account.


See the second answer for a statement from cf https://webmasters.stackexchange.com/questions/88659/how-can...


Thank you! That answer is from 2016, and yet this is the first time anyone has posted it, despite similar questions being raised on HN over the past few years.


I'm familiar with those terms but still.

There are people with free accounts moving GBs every month through their network and I imagine those free users must account for a very large percentage of their traffic.


I’d imagine at this point they are heavily peered in most markets, driven by said free users, so there isn’t a significant opex hit bandwidth -wise. Space/power opex plus network/compute hardware capex probably dominates their spend.


> How are they making a profit

They aren't. In Q1 they made a loss of $33M on $91M in revenue.


Oh wow.

In fact it seems they have been losing more and more money over the years.

https://finance.yahoo.com/quote/NET/financials/

What are they counting on happening here? Obviously free customers won't switch to paid plans.


According to cloudflare.net, their GAAP gross margin was 77% in Q1 which is pretty good. Seems like they spend most on sales, marketing, and r&d.


CloudFlare does not charge for bandwidth? Their paid plans used to start somewhere around $2000.

Colocation providers charge hundreds of dollars to have an (allegedly) unmetered 100 Mbps network link.

AWS charges network usage per GB. It's the same flat number if you use half the capacity half the time or so.

What do all providers have in common? They charge enough to cover the costs of providing the service and make a profit.


One reason is "settlement-free peering".


So what you're saying is that Cloudflare is so big they can reach these sort of deals with connectivity providers where they don't pay for bandwidth themselves?


Short answer:

Well, to a certain extent. I didn't mean to imply that their monthly spend for bandwidth is zero -- I'm sure they aren't anywhere close to peering 100% of their traffic, they aren't in the DFZ, and, of course, they've gotta pay somebody (or, more correctly, several somebodies) to connect all of those datacenters together!

---

Long answer:

About six years ago, they described their connectivity in "The Relative Cost of Bandwidth Around the World" [0]:

> "In CloudFlare's case, unlike Netflix, at this time, all our peering is currently "settlement free," meaning we don't pay for it. Therefore, the more we peer the less we pay for bandwidth."

> "Currently, we peer around 45% of our total traffic globally (depending on the time of day), across nearly 3,000 different peering sessions."

Remember, that was about six years ago. I wouldn't be surprised if both their peers and peering sessions have increased by an order of magnitude since that article was published -- just think of all the datacenters that they're in today that weren't back then, especially outside of North America!

Additionally, they've got an "open" policy when it comes to peering, as well as a presence damn near everywhere [1,2]. Since they're "mostly outbound", the eyeball networks will come to them, wanting to peer.

Running an anycast network and being "everywhere" also has some other benefits. They perform large-scale "traffic engineering" -- deciding which prefixes they advertise where, when, and to who, and the freedom to change that on the fly -- so they've got tremendous control over where traffic comes in to and, perhaps more importantly, exits from their network (bandwidth is ~15x more expensive in Africa, Australia and South America than North America, example).

So, yes, CloudFlare is still paying for transit but, at their level, it's relatively "dirt cheap". Plus, in addition to the increases mentioned above, bandwidth is likely an order of magnitude cheaper -- at least -- than it was six years ago!

---

EDIT:

Two years later, in August 2016, CloudFlare published an update [3] to the article linked above. A few highlights:

> "Since August 2014, we have tripled the number of our data centers from 28 to 86, with more to come."

> "CloudFlare has an “open peering” policy, and participates at nearly 150 internet exchanges, more than any other company."

> "... of the traffic that we are currently able to serve locally in Africa, we manage to peer about 90% ..."

> ".... we can peer 100% of our traffic in the Middle East ..."

> "Today, however, there are six expensive networks that are more than an order of magnitude more expensive than other bandwidth providers around the globe ... these six networks represent less than 6% of the traffic but nearly 50% of our bandwidth costs."

---

[0]: https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a...

[1]: https://www.peeringdb.com/asn/13335

[2]: https://bgp.he.net/AS13335#_ix

[3]: https://blog.cloudflare.com/bandwidth-costs-around-the-world...


Thanks for the info!


They have paid services, and despite what cloud providers and CDNs might have you believe, bandwidth is exceedingly cheap.


"... here is the CPU usage over a day in one of our data centers where each time series represents one machine and the different colors represent different generations of hardware. Unimog keeps all machines processing traffic and at roughly the same CPU utilization."

Still a mystery to me why "balancing" has SO MUCH mindshare. This is almost certainly not the optimal strategy for user experience. It is going to be much better to drain traffic away from older machines while newer machines stay fully loaded, rather than running every machine at equal utilization factor.


I'm an engineer at Cloudflare, and I work on Unimog (the system in question).

You are right that even balancing of utilization across servers with different hardware is not necessarily the optimal strategy. But keeping faster machines busy while slower machines are idle would not be better.

This is because the time to service a request is only partly determined by the time it takes while being processed on a CPU somewhere. It's also determined by the time that the request has to wait to get hold of a CPU (which can happen at many points in the processing of a request). As the utilization of a server gets higher, it becomes more likely that requests on that server will end up waiting in a queue at some point (queuing theory comes into play, so the effects are very non-linear).
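
As a rough illustration (a textbook M/M/1 queue, not our actual traffic model): if requests arrive at rate lambda and a server completes them at rate mu, the expected time in the system is

    T = 1 / (mu - lambda) = (1 / mu) * 1 / (1 - rho), where rho = lambda / mu

so at 50% utilization a request takes about 2x its bare service time, at 90% about 10x, and the curve blows up as utilization approaches 1.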

Furthermore, most of the increase in server performance in the last 10 years has been due to adding more cores, and non-core improvements (e.g. cache sizes). Single thread performance has increased, but more modestly.

Putting those things together, if you have an old server that is almost idle, and a new server that is busy, then a connection to the old server will actually see better performance.

There are other factors to consider. The most important duty of Unimog is to ensure that when the demand on a data center approaches its capacity, no server becomes overloaded (i.e. its utilization goes above some threshold where response latency starts to degrade rapidly). Most of the time, our data centers have a good margin of spare capacity, and so it would be possible to avoid overloading servers without needing to balance the load evenly. But we still need to be confident that if there is a sudden burst of demand on one of our data centers, it will be balanced evenly. The easiest way to demonstrate that is to balance the load evenly long before it becomes strictly necessary. That way, if the ongoing evolution of our hardware and software stack introduces some new challenge to balancing the load evenly, it will be relatively easy to diagnose it and get it addressed.

So, even load balancing might not be the optimal strategy, but it is a good and simple one. It's the approach we use today, but we've discussed more sophisticated approaches, and at some point we might revisit this.


Thanks for the detailed reply. It would be interesting to see your plots of latency broken down by hardware class and plotted as a function of load. I'd be pretty surprised if optimal latency was achieved near idle, since in my experience latency is a U-shape with respect to utilization: bad at 100% but also bad at 0% since it takes time to wake resources that went to sleep.

I'm sure your system has its benefits, I just get triggered by "load balancing" since it is so pervasive while also being a highly misleading and defective metaphor.


What Grafana dashboards do you all use/like for Prometheus + Nomad?


Actually, I think K8S is much simpler to operate than nomad.

With nomad, you have too many options: you can run your service/job as a process or within containers. With Kubernetes, there is only one way to do things -- containers -- which is really a simple choice to make.

Besides, Kubernetes has etcd built in, so I don't have to deploy and maintain consul.

Last but not least, I still see containers mysteriously gone, and have no idea how nomad did that. With kubernetes, such a thing never happened.


I love nomad and architected a large system that used it for several years. We are finally moving away from it, however, because of its lack of native support for autoscaling. I know there are third party solutions for it, but that doesn't work for us. I suspect a large number of k8s users could use nomad instead with way less overhead.


I'm curious if you tried the new autoscaler [1] that is in Tech Preview? Is there something it's missing for your scenarios?

[1] https://www.hashicorp.com/blog/hashicorp-nomad-autoscaling-t...


As you probably know, decisions like changing infrastructure architecture are team decisions, made with the information available at the time. The nomad autoscaler just wasn't announced/released in time for us to seriously consider it. That, coupled with the uncertainty around the future of Fabio, made us look elsewhere. We remain staunch supporters of Hashicorp products overall. We will continue to use Terraform, Consul and a little Vault.


Hope they allow users to access documentation for past versions of nomad. Currently there is no easy way to find out whether a particular configuration option mentioned in the documentation is available or not.


What can Nomad do that Terraform can't? I'm pretty sure you can blue-green rolling deploy services inside of Docker containers with Terraform too.


They’re completely different classes of tool - Nomad is a runtime component, Terraform is not.


but terraform can talk to the docker daemon and orchestrate containers.

what can you achieve with nomad that you can’t achieve with terraform?


- run a non-dockerised job

- have a restart policy outside of what the docker daemon offers

- automatically schedule across multiple machines based on constraints including affinity and anti-affinity, bin packing etc

- “dispatch” jobs

The Terraform Docker provider effectively provides a tiny subset of the functionality of Nomad, which provides similar functionality across many drivers, not just Docker.
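
A sketch touching each of those points in a single job spec (names and paths made up; raw_exec has to be enabled on the client):

    job "report" {
      datacenters = ["dc1"]
      type        = "batch"

      parameterized {                 # enables `nomad job dispatch report ...`
        meta_required = ["month"]
      }

      constraint {
        attribute = "${attr.kernel.name}"
        value     = "linux"
      }

      group "gen" {
        restart {
          attempts = 2
          mode     = "fail"
        }

        task "generate" {
          driver = "raw_exec"         # a non-dockerised job
          config {
            command = "/usr/local/bin/generate-report"   # hypothetical binary
            args    = ["--month", "${NOMAD_META_month}"]
          }
        }
      }
    }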


I worked at a hedge fund that also used nomad. The problem, however, is not how well it scales or whatever, but the fact that all the accompanying deployment info and literature is for kubernetes, and k8s has far more features.

I like the quality of products from HashiCorp, but k8s is far, far, ahead of where nomad is.

What I really want is better integration for Terraform and kubernetes. The current TF k8s support leaves much to be desired, too many things are missing or broken and I find there are several bugs that result in deployment flapping (i.e., constantly re-deploying the same thing when there are no changes).


"Kubernetes but with less stuff" is a valuable niche that I'm glad someone is targeting.


That's a fair point, I guess it depends on your use case. The risk, however, is that the powers that be at HashiCorp one day decide to abandon Nomad once they realize it will never be a profit centre for them.


The CEO addressed a similar comment on Twitter recently. From what he's saying, it seems Hashicorp is standing by Nomad for the foreseeable future. https://twitter.com/mitchellh/status/1247581788706197504


It would be very odd if he said the opposite, since customers could be spooked. But saying it's going to be around is very different from saying it will always have a lot of resources on it going forward, especially compared to their cash cows of consul and vault.


Nit: Mitchell is their CTO. Dave McJannet is their CEO. (source: https://www.hashicorp.com/about/ )


Micro-nit: Mitchell is Co-CTO w/ Armon being the other


"standing by" until they're not. If their VCs start applying pressure to dump the unprofitable projects and seek more profits, their tune will change in an instant. At the end of the day these are for-profit entities, and the only thing that matters is profit.

The only reason they're pursuing this strategy is because they think they can get a piece of the kubernetes market. If that doesn't pan out, they will dump nomad like a bad habit.

For-profit companies aren't open source charities.


Nomad is open source (or, at least, a significant subset of it is). Anyone is able to continue to improve it, even if Hashicorp is no longer paying people to work on it.


That's not enough, Basho (creators of Riak database) also made it open source. In fact after they went out of business Bet365 even purchased their proprietary code and made it open source, but the database is still considered dead.


I used to do sales for an enterprise "open source" software product, so I get it, but the truth is as soon as someone stops paying people to keep the project going it will die.


Seems like some projects survive their parent company abandoning them like illumos (successor to opensolaris). It's not common, but also not impossible.


It's hard to say illumos is in good shape compared to when OpenSolaris was being worked on by Sun or Joyent, however.


Who pays for k8s development?


Google. Commoditizing the cloud is their strategy to make GCP relevant.


Question was rhetorical, I still appreciate the effort though.


Not at all. Hashicorp literally pays for all development of nomad. They make virtually all of the commits, short of a small number of PRs. Kubernetes commits are from a wide array of companies, and Google is only one of them.


And just as with Kubernetes and Google, Nomad development can continue outside of Hashicorp if Hashicorp no longer decides to support it. Which org has a longer track record of deprecating almost everything they release? Not Hashicorp, and frankly, I’ll always trust Hashicorp versus Google based on the historical behavior and forward incentives of both.


It's not quite the same. Hashicorp controls whether PRs get merged. Google does not control whether PRs get merged into kubernetes. There's a long list of companies that do, including IBM, redhat, Huawei, etc. Sure, you can fork it, but now you have a separate repo that requires people to know about it.


I’m not trying to convince you. If k8s works for you, that’s great! I choose to work at places that pick Nomad instead. Best tool for the job.


Agreed, thanks!


Crossplane https://crossplane.io might be what you’re looking for with its bunch of controllers and a nice composition API.

One of the best features is that you can bundle a CR to request a MySQL database, and it will be satisfied by whatever config is in your cluster, so the app only declares the need and doesn't care how it's done.

Disclaimer: I’m one of the maintainers of Crossplane.


Any chance you plan to integrate with Google's Config Connector? It's very similar to crossplane (but gcp specific).

https://cloud.google.com/config-connector/docs/overview


I think that'd be a bit challenging because Config Connector is highly opinionated; for example, a Kubernetes Namespace corresponds to a GCP project. Though it might become usable as part of a Composition once we support namespaced CRs as composition members.


Every config connector resource can be annotated with the project the GCP resource should be created in. Namespace to GCP project mapping is encouraged but not enforced; you are still free to create resources in multiple projects from a single namespace, as well as in a single project from multiple namespaces.


Oh, thanks for letting me know!


Why would you want TF in k8s? That hurts my soul.

Isn't exactly declarative, nor really imperative, mostly but not entirely idempotent, and worst of all, it has faux-state that may or may not reflect the actual state of the services.

I used to love Terraform, but it broke my heart. Now we're done.


Which features is Nomad missing? Feature count comparisons are meaningless unless the features are tied to actual important use cases. Lots of software is encrusted with rarely used features that just add complexity.


The free version of Nomad is missing, IIUC:

- no quotas for teams/projects/organizations

- no preemption (ie higher priority job preempts lower priority job)

- no namespacing

So generally it's somewhat useless in organizations where multiple different teams should be able to coexist on a cluster without stepping on each other's toes, or even where you want a CI system to access the cluster in a safe manner.


Preemption is going OSS! https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md

We fully intend to migrate more features to OSS in the future -- especially as we build out more enterprise features. As you can imagine building a sustainable business is quite the balancing act, and there's constant internal discussion.

(I'm the Nomad Team Lead at HashiCorp)


This is going to make our AI team very happy, because they can just dump experiments into the cluster at low priority, so they'll be done when they're done.

It's also going to make the operators very unhappy because it'll be harder to monitor actual memory utilization (allocated memory vs memory really in use) in order to plan cluster extensions. Are there some tools around or work planned to make this kind of scaling and utilization easier?


I was going to say "yes!" because we do have some telemetry/metrics improvements coming up, but then I realized those won't address your use case. It seems like aggregating cluster resource usage by job priority is what we need for that. Our UI might be able to calculate that, but I don't think our metrics include job priority so there's no way for your existing tooling to display it.

Please file an issue or link me to an existing issue if you have time. This seems really compelling!


Enterprise version has all of those things ;)

For my personal use I don't need them and for a business Enterprise won't break the bank. You'd be surprised how much your management might be interested in having an escalation path if everything goes south and you need a hot fix. I can vouch that Enterprise support is worth it if you're on a small team that can't spend all day on this stuff


We run everything in Nomad for all our teams, but we don’t use any of the features you mention, and it’s not causing issues. We wrote some metrics for how much memory all the allocations are claiming, and we throw another EC2 instance in the cluster when we’re running out. Works pretty well. Preemption seems like a nice feature, but I imagine getting teams to coordinate their preemption values would be a political nightmare.


For me the ability to squeeze in some jobs in between the cracks at best effort priority is crucial (ie. any sort of high latency batch processing / experiments). I wouldn't want these extremely low priority jobs to compete with, I don't know, an actual customer facing service.

Also, we run on bare metal, so there really isn't a way to request extra capacity within seconds.


Preemption is coming in the next release in OSS :)


A few long-standing feature requests I've noticed a lot:

* No autoscaling

* Can't reserve entire CPU core: https://github.com/hashicorp/nomad/issues/289

* No way to run jobs sequentially: https://github.com/hashicorp/nomad/issues/419


Autoscaling is in Tech Preview as of March: https://www.hashicorp.com/blog/hashicorp-nomad-autoscaling-t...

Disclaimer: I work at HashiCorp in Product Management.


Nice! It's definitely been one of the bigger holes.

I hope you'll also consider sequential tasks in a single jobfile. Without it it's kind of awkward because you have to be passing shell scripts around just to run a setup step before your actual workload.


We’ve also started to implement this with a new task dependencies feature in the latest release. Check out: https://www.hashicorp.com/resources/preview-of-nomad-0-11-ta...

https://learn.hashicorp.com/nomad/task-deps/interjob
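
Roughly, it lets you mark a task as a prestart step that runs before the main task in the same group; a sketch (images made up):

    group "app" {
      task "migrate" {
        lifecycle {
          hook    = "prestart"
          sidecar = false               # run to completion before the main task starts
        }
        driver = "docker"
        config {
          image = "example/migrate:1.0" # hypothetical setup step
        }
      }

      task "server" {
        driver = "docker"
        config {
          image = "example/server:1.0"  # main workload
        }
      }
    }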

There’s more work going on there to further improve the feature in upcoming releases.

Disclaimer: I’m an engineer on the Nomad team.


For us it's not that a feature is missing; the mindshare is low, and there aren't really any prebuilt examples of how to run and maintain services reliably long-term.


Does it have "easy" side-cars/init/post? Last time I checked those were missing for example.



I'll check that out. I really think K8s is becoming a bit too complex for many customers. Thanks!


The main missing feature is that it's not kubernetes.


You have an extra word there.



Thanks for sharing this excellent article.



