Show HN: Datree (YC W20): Prevent K8s misconfigurations from reaching production
144 points by shimont on Oct 19, 2021 | 26 comments
Hi HN, this is Shimon and Eyar of Datree (https://www.datree.io/).

When I was an Engineering Manager of Infrastructure at ironSource (NASDAQ:IS), supporting 400 developers, a developer made a mistake that let a misconfiguration reach production and caused major problems for the company's infrastructure.

Mistakes happen all the time - you learn from them and hope never to make them again. But how do you prevent a production issue from recurring? Or, a bigger challenge: how do you prevent the next one from the get-go?

In our case, we tried sending emails to our devs, writing Wikis, and hosting meetups and live sessions to educate our developers, but I felt that it just wasn’t driving the message home. How can developers be expected to remember to configure a liveness probe or to put a memory limit in place for their Kubernetes workload when there are so many things that a dev must remember? Infra just isn’t their primary focus.

Today, organizations want to delegate infra-as-code responsibilities to developers, but face a dilemma — even a small misconfiguration can cause major production issues. Some companies lock up infra changes and require ops teams to review all changes, which frustrates both sides. Developers want to ship features without waiting for infra. And infra teams don't want to “babysit” developers by reviewing config files all day long, essentially acting as human debuggers for misconfigurations.

That’s why I teamed up with Eyar to found Datree. Our mission is to help engineering teams prevent Kubernetes misconfigurations from reaching production. We believe that providing guardrails to developers protects their infra changes and frees up DevOps teams to focus on what matters most.

Datree provides a CLI tool (https://github.com/datreeio/datree) that runs automated policy checks against your Kubernetes manifests and Helm charts, identifies any misconfigurations within, and suggests how to fix them. The tool comes with dozens of preset, best-practice rules covering the most common mistakes that could affect your production. In addition, you can write custom rules for your policy.

Our built-in rules are based on hundreds of Kubernetes post-mortems and cover issues such as missing resource limits/requests (MEM/CPU), missing liveness and readiness probes, missing labels on resources, Kubernetes schema validation, API version deprecation, and more.
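
To make that concrete, a rule of this kind boils down to a fairly simple structural check over the parsed manifest. Here is a minimal Python sketch (an illustration only, not our actual implementation) that flags containers with no memory limit or no liveness probe:

    # Illustration only -- not Datree's implementation. Flags containers in a
    # parsed Deployment that have no memory limit or no liveness probe.
    manifest = {  # a hand-written example, as if loaded with yaml.safe_load()
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "demo"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "app", "image": "demo:1.0"}  # no limits, no probe
                    ]
                }
            }
        },
    }

    def check_containers(manifest):
        findings = []
        pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
        for c in pod_spec.get("containers", []):
            if not c.get("resources", {}).get("limits", {}).get("memory"):
                findings.append(f"{c['name']}: missing memory limit")
            if "livenessProbe" not in c:
                findings.append(f"{c['name']}: missing liveness probe")
        return findings

    for finding in check_containers(manifest):
        print(finding)
    # app: missing memory limit
    # app: missing liveness probe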

Datree comes with a centralized policy dashboard enabling the infra team to dynamically configure rules that run on dev computers during the development phase, as well as within the CI/CD process. This central control point propagates policy checks automatically to all developers/machines in your company.

We initially launched Datree as a general purpose policy engine (see our YC Launch https://news.ycombinator.com/item?id=22536228) in which you could configure all sorts of rules, but the market drove our focus toward infrastructure-as-code and, more specifically, Kubernetes, one of the most painful points of friction between developers and infrastructure teams.

When we adjusted to a Kubernetes-focused product, we pivoted our top-down sales-driven model to a bottom-up adoption-driven model focused on the user.

Our new dev tool is self-serve and open source. Hundreds of companies are using it to prevent Kubernetes misconfigurations and, in turn, are helping the tool improve by opening issues and submitting pull requests on GitHub. Our product is well suited for self-evaluation and immediate value delivery. No demo calls - just 2 quick steps to try the product yourself!

TechWorld with Nana did a deep technical review of our product, which can be viewed at https://www.youtube.com/watch?v=hgUfH9Ab258.

We look forward to hearing your feedback and answering any questions you may have. Thank you :)




I can't believe how complex a tool has to be for several startups to get funding purely to try to stop it doing bad things. Kudos to Datree for making the most of this, but it feels like something's wrong with k8s for this to be such a thing.


Have you ever tried to set up HA environments before Kubernetes and friends, let alone anything that can do failover and some form of auto-scaling? Kubernetes might have some unwarranted complexity, but let's not pretend it was a walk in the park setting environments up with ad-hoc shell scripts, HAProxy, IP failover, manual package updates and more fun stuff. Tools like Kubernetes and their general abstractions are vastly better than anything I could come up with in a time before even Puppet and Ansible existed. You have to take into consideration all that orchestrators like Kubernetes are trying to replace before you can meaningfully criticise their complexity.

Sadly enough it's all YAML, but at least it's all YAML and not random mostly undocumented arcane configuration formats...


> Have you ever tried to set up HA environments before Kubernetes and friends, let alone anything that can do failover and some form of auto-scaling?

Those were simpler days for sure.

1) Set up multiple fileservers w/ DRBD or some sort of clustered filesystem like glusterfs
2) Set up multiple database servers in a cluster
3) Set up multiple web servers, all of which can connect to the NAS and DB
4) Optionally set up a cache layer like Varnish
5) Set up a hardware load balancer (or even varnish/nginx/haproxy)

Autoscaling on bare metal wasn't really a thing for most DCs, but setting up HA environments has been possible for a long time. Things were actually a lot less complicated because there weren't as many options imo. You'd generally overprovision to handle your peak loads, or maybe buy a few more servers before busy seasons/days and cancel them later.

Managing a bare metal kubernetes installation feels like it has a lot more moving pieces and ways that it can break. Cloud providers and managed services do take away a lot of the burden though.


My first job was managing bare metal for a variety of companies including fairly large multinational labs, all the way down to plugging in the RAM and fiddling with the routers, mostly for VoIP systems but also data backups. I didn't architect any HA environments but worked on some.


The problem is right here in the blurb. It's not that k8s itself is complex. Ops is complex. Companies used to have entire dedicated IT departments full of sysadmins, storage engineers, network engineers, and security engineers with decades of experience configuring, deploying, maintaining, and monitoring servers, networks, hypervisors, and data centers.

Newer companies just push this onto their application developers, expecting them to figure this stuff out on top of being developers, "full-stack" now meaning you need to understand everything down to filesystems, overlay networks, container runtimes. This is not a reasonable expectation. Nobody can be an expert in everything.

Of course, I'm not sure full automation can really replace human expertise, either.


It's true that a developer can't be an expert in every aspect of the stack, but that's where services like Datree or many of AWS's services come in: they bring the domain expertise and only require the developer to be familiar with the subject. The experts have moved on to become domain experts, working for the companies that develop the tools.

You don't really need a resident storage expert in every company, since most companies have similar needs.


But suddenly you need to keep up with a ton of different tools and if they break you either dig into the topic or you're f**ed. This is such a productivity killer it's crazy.


IMO Kubernetes is at the level of complexity of a programming language or an OS. It just operates at a larger scope, and we don't have a lot of things in that space, so we don't have well-defined concepts like "language" or "OS" to encompass them.

There are basically entire industries dedicated to stopping programming languages from doing bad things (static analysis vendors, auditing consultancies, formal verification tool vendors, etc.) and to stopping OSes from doing bad things (application-focused monitoring tools, security-focused intrusion detection tools, policy enforcement / device management vendors, etc.), and we don't really say "Wow, something is wrong with C++" or "Wow, something is wrong with Linux." We understand that they are high-power tools and you do want additional tools to focus that power.

(Well, to be fair, I say something is wrong with C++, but my preferred solution to that is even more complicated programming languages :) )


Don't we? :). I say[0] something is wrong with programming itself, on a more fundamental level than the evolutionary history of C++.

I can't give a coherent and detailed analysis of it yet[1] - but I have this growing feeling that we're drowning in accidental complexity all across the board, at every abstraction layer. Like an inverse iceberg - where we see this whole, humongous mountain of tooling required to build and maintain software systems, but you can't shake the impression that we should be able to do the job with just the bit that's sticking above the waterline.

Speaking of k8s being "of the level of complexity of a programming language or an OS", I bet there's some formal way to show some isomorphism here - them being different incarnations of the same abstract structure. It's another kind of feeling I get when jumping up and down the software stack[2]. Maybe one day we'll figure it all out.

--

[0] - https://news.ycombinator.com/item?id=28568053

[1] - But I am collecting observations and trying to mull the problem over in my subconscious mind.

[2] - Like e.g. code is data is code; your config parser is an interpreter of a programming language. Often enough, it grows to look like a typical PL, then gets replaced by one[3]. If your config happens to describe infrastructure, at some point you might realize you're writing "function calls" for business logic that are implemented in terms of spinning clusters up and down. Or e.g. the realization that DCOM is essentially microservices, so for some two decades or more, every Windows installation had something similar to k8s deep inside its bowels.

[3] - https://mikehadlow.blogspot.com/2012/05/configuration-comple...


That may be true, but programming languages are there to give you the power to develop any idea into a working software solution. I think K8s differs because I see it as something that simplifies infra and abstracts vendor-specific infra concepts. The complexity in K8s doesn't add power, just confusion ;)


Kubernetes does simplify and abstract - just like C++ simplifies and abstracts compared to writing assembly. :) In both cases, the scope of what you can do expands significantly, which means that the systems you build with the higher-level tool are significantly more complex, which means that you see much more confusing C++ programs and Kubernetes deployments than assembly programs and five-lovingly-handcrafted-servers deployments.... but that's because you can successfully start with a more complex idea and make it happen.


I know! I think the fact that developers are dealing more and more with infra is very empowering, but on the other hand it brings new challenges. It is no longer Dev vs. Ops - now devs also need to learn infra best practices, so tools like ours help them :) Thank you for the kudos! <3


Not much different than Windows, Linux, etc (operating systems)

K8s basically implements a distributed OS that schedules across machines instead of CPU cores

Not too long ago, AWS had a significant outage due to misconfigured ulimits (a Linux OS setting)


As someone who works in a similar space (K8s configuration management and IaC), I'm curious what drove you to develop a CLI tool for enforcing policies as opposed to something that is able to integrate with K8s more closely such as OPA Gatekeeper or Kyverno?

As I understand it, the primary users of policy tools are platform teams, infrastructure teams, or some other entity that needs to be able to create, manage, and enforce policies over the domains they're responsible for.

When I look at Datree from the POV of a platform team, I see a tool that I must trust dev teams to use to enforce policies.

Yes, I can hide my K8s cluster behind a CI/CD pipeline that runs Datree, but this is limiting for organizations that actually want to let their dev teams access their K8s clusters directly or run workloads that can themselves create K8s resources (e.g. operators).

By contrast, OPA Gatekeeper or Kyverno do not have such limitations because they allow policies to be enforced at the cluster itself.

Both also allow platform teams to create new policies and detect if there are any K8s resources _already_ in the cluster that are in violation of the new policies (i.e. Day 2 operations).

Lastly, both even offer CLI tools for dev teams to use to detect issues earlier during development.

I would argue though that dev teams are actually secondary to platform teams in terms of who to focus on when building policy tools since platform teams usually have more of an interest/responsibility in enforcing policies and therefore more of a say in what policy tools to adopt for an organization.

Hence, I was curious why you started with a CLI tool which seems to be more of a dev-centric approach rather than platform-centric.

Also, more specifically, what makes Datree a better option over OPA Gatekeeper or Kyverno?


Hey, this is a great question.

We are big believers in "shift-left" and in trying to fix/avoid issues as early as possible. We started with a CLI tool because it is agnostic and can be run in the dev's IDE (like VS Code), in the terminal, and finally in the CI/CD process.

We love OPA and think that Gatekeeper is a good solution, but we want to provide feedback as early as possible, whereas Gatekeeper only blocks a deployment to the Kubernetes cluster at the end of the development process.

As a developer myself, I would rather be notified of an issue as early as possible and not find out about it at the very last second before it goes live to production.

We might add support similar to Gatekeeper in the future, but we wanted to be shift-left first :)

I hope this answers your question. Thank you!


Looks like this is a policy-as-code tool in the vein of Terraform Cloud's Sentinel Policies, or more generally Open Policy Agent, but specifically targeted at k8s use cases.

From the custom rules overview at https://hub.datree.io/custom-rules-overview (though the docs are still a WIP), I noticed these are defined as YAML/JSON somehow. That's a contrast to HashiCorp's Sentinel https://docs.hashicorp.com/sentinel/concepts/language and OPA's Rego https://www.openpolicyagent.org/docs/latest/policy-language/ . Is this an intentional design decision?


We just released support for custom rules :) From interviewing our users, we decided to start with JSON Schema [0], as it is very easy to write rules with it and you do not have to learn Rego.
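
For example, a rule like "every container must declare a memory limit" can be written as a plain JSON Schema and checked with any standard validator. Here is a simplified Python sketch using the jsonschema package (the exact rule format in our product may differ):

    # Simplified sketch with the "jsonschema" package; the exact custom-rule
    # format in Datree may differ, but the idea is the same: describe what a
    # valid resource looks like and let a generic validator do the work.
    from jsonschema import Draft7Validator

    rule = {  # "every container must declare a memory limit"
        "properties": {
            "spec": {
                "properties": {
                    "containers": {
                        "items": {
                            "required": ["resources"],
                            "properties": {
                                "resources": {
                                    "required": ["limits"],
                                    "properties": {
                                        "limits": {"required": ["memory"]}
                                    },
                                }
                            },
                        }
                    }
                }
            }
        }
    }

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {"containers": [{"name": "app", "image": "demo:1.0"}]},
    }

    for error in Draft7Validator(rule).iter_errors(pod):
        print(error.message)  # 'resources' is a required property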

Having said that, we might add OPA .rego support in the near future :)

What would be your preferred way to write custom policy rules?

[0] - https://json-schema.org/


Is it me, or is the dashboard only for paying customers?

What does this add compared to Polaris by Fairwinds?


The dashboard is also part of our freemium offering :) We offer 1,000 policy checks per month for free, including the dashboard.

In terms of what we offer compared to Polaris: we offer pre-defined policies that come out of the box, along with the ability to write custom rules for your policy yourself.

Take us for a spin and let me know what you think! Thank you.


FYI - Polaris open source has both these things :)

(Disclosure - I'm a maintainer)


K8s manifests are YAML files that translate into tens of K8s actions changing the production environment. Misconfigurations that can be prevented by integrating the Datree CLI into the CI/CD cycle can save hours of production unreliability. For me it's a must-have phase in the K8s release flow.


Thank you Varchol :) Also, you can use Helm, which might help on top of Kubernetes manifests.


Why is this whole thread dead? Just because of the blatant appearance of astroturf or what?


I killed all the booster comments. We tell YC startups not to do that. All startups, of course, but especially YC startups.

Sometimes it happens inadvertently (e.g. users find out about the thread and rush in to 'help'), but obviously we want the discussion here to be substantive.


Hey, some of our friends were overeager to help hehe :)

I look forward to hearing your feedback! Thank you


Thanks, makes sense, appreciate the reply!



