The case of the recursive resolvers: What happened during Slack’s DNSSEC rollout (slack.engineering)
132 points by usrme on Nov 29, 2021 | 47 comments



Of the described mistakes, two come from a lack of understanding of how exactly DNS works. But I agree it's in fact hard, see [1].

1. "This strict DNS spec enforcement will reject a CNAME record at the apex of a zone (as per RFC-2181), including the APEX of a sub-delegated subdomain. This was the reason that customers using VPN providers were disproportionately" - This is non intuitive and maay people are surprised by that. You cannot create any subdomain (even www.domain.tld) if you created "domain.tld CNAME something...". Looks like not every server/resolver enforces that restriction.

2. "based on expert advice, our understanding at the time was that DS records at the .com zone were never cached, so pulling it from the registrar would cause resolvers to immediately stop performing DNSSEC validation." - like any other record, they can be cached. DNS has also negative caching (caching of "not found responses". Moreover there are resolvers that allow configuring minimum TTL that can be higher that what your NS servers returns (like unbound - "cache-min-ttl" option) or can be configured to serve stale responses in case of resolution failures after the cached data expires [2]. That means returning TTL of "1s" will not work as you expect.

[1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye-pow...

[2] https://www.isc.org/blogs/2020-serve-stale/


My (basic and conservative) mental model that "in DNS, everything, including the absence of a thing, can be cached" is why I'm very cautious before rolling out anything from DKIM to DNSSEC. A deep understanding of the specifications is vital. I'm somewhat surprised an organization of Slack's scale didn't have a consultant on the level of "I designed DNSSEC" on hand for this.


DNS is a bit like network engineering, in that simple errors have a tendency to cause impacts large enough to rule out trial and error. Before working as a sysadmin I thought that experimental lab setups were something only researchers and students did, but when you have an old system up and running, it can be quite difficult to get in there and make changes unless you are very sure about what you are doing.

Like networking, there can also be existing protocol errors and plainly broken things that have, for one reason or another, been seemingly working for decades without causing a problem. The DNS flag day is one of those things that pokes at those problems, and maybe one day we will see a test for CNAME at the apex.


It's worth noting that this by itself is a reason not to do ambitious security things (and a global PKI is nothing if not ambitious) at the layer of DNS. It's an extension of the end-to-end argument, or at least of the logic used in the Saltzer, Reed, and Clark paper: because it's difficult and error-prone to deploy policy code in the core of the network (here: the "conceptual" core of the protocol stack), we should work to get that policy further up the stack and closer to the applications that actually care about that policy.

The Saltzer, Reed, and Clark paper, if I'm remembering right, even specifically calls out security as one of those things you don't want to be doing in the middle of the network.

See also: Zero Trust / BeyondCorp.


When people start to implement security at the BGP layer, which will likely occur sometime soon, we will see things break. We will also see BGP fail if we don't do anything, as the protocol is ancient, has an untold amount of undefined behavior between different devices and vendors, and is extremely fragile.

Many have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative. DNS, BGP, IP, UDP, TCP, and HTTP, to name a few, are seeing incremental changes, and the cost is preferable to the alternative of doing nothing. Ambitious security things would be much less costly if we had working redundancy in place, which is one of those things that flag days tend to illustrate. With good redundancy, people won't notice when HTTP becomes HTTP/2, which later becomes HTTP/3. It also helped development at Google that when they added QUIC, they controlled both ends of the connection.


> Many have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative.

See second-system effect:

> https://en.wikipedia.org/wiki/Second-system_effect


Yep - in this, as in many things in life, expert knowledge is knowing what experiments and tests you should be doing as much as which ones you can avoid.


> I'm somewhat surprised an organization of Slack's scale didn't have a consultant on the level of "I designed DNSSEC" on hand for this

If it takes a designer of DNSSEC to implement it, then how should I, a peasant, implement DNSSEC for my infra?


From what I can tell, the problem was not caused by DNSSEC directly. It was caused by:

1. A bug in Route 53 that caused wildcard records not to work with DNSSEC signing. Anyone not using Route 53 would not have had any problems with DNSSEC.

2. Slack decided to revert the DNSSEC rollout, but botched the process badly, effectively locking themselves in the trunk and throwing away the key. If they hadn’t tried to revert the DNSSEC rollout, or if they had been a bit more deliberate and careful while doing it, this would not have happened.


The issue wasn't really a lack of deliberateness or care but rather that they had adopted a culture of over-automation. Why are they using Terraform instead of the GUI to make changes that are rare and critical? The whole point of a GUI is to make computers easier and safer to use, but if you insist on automating everything then you lose that.


Additional discussion, indirectly spurred by this, is here:

https://news.ycombinator.com/item?id=29381778

That thread, which is big, is probably the right place to take general discussion of DNSSEC itself, though I'll snipe at DNSSEC here too. :)


Seems like an organizational failure, as they got conned by their 3PAO into believing that DNSSEC was a requirement for FedRAMP Moderate when it's not. The disproof of this belief is that Google has FedRAMP High (for Google Cloud and Workspace) but does not use DNSSEC for google.com.
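You can check this from any shell with dig: a zone is only validated if its parent publishes a DS record for it, and google.com has none (cloudflare.com, which last I checked is signed, makes a handy contrast):

    $ dig +short google.com DS        # empty: no DS record, so no DNSSEC validation
    $ dig +short cloudflare.com DS    # prints the DS record(s) for a signed zone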


The ultimate arbiter of whether a cloud service gets used isn't FedRAMP, it's the Agency Authorizing Official. FedRAMP just makes much of the work reusable. With GCP, you can build something that obeys and uses DNSSEC without needing google.com to participate in DNSSEC.

Google Workspace is a good point though. I know there are many users of it in government... maybe some AOs are fine signing off on it even without the needed security controls, which is an option they have in their discretion with and without FedRAMP.


If you use https everywhere, you will have a server certificate with the hostname embedded in it. This is how TLS knows you’re talking to the right server.
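A quick way to see that binding in practice (openssl 1.1.1+ assumed; slack.com is just an example host):

    $ openssl s_client -connect slack.com:443 -servername slack.com </dev/null 2>/dev/null \
          | openssl x509 -noout -subject -ext subjectAltName

The client checks that the hostname it was asked to connect to matches one of the names in the certificate's subjectAltName before it trusts the connection.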


In addition to the other note that DNSSEC is _not_ required for FedRAMP certification (it's even discouraged by cloud.gov! https://cloud.gov/docs/compliance/domain-standards/ ), this is some weirdly intellectually dishonest phrasing (linking to tptacek's article Against DNSSEC: https://sockpuppet.org/blog/2015/01/15/against-dnssec/ ):

> While we are aware of the debate around the utility of DNSSEC among the DNS community, we are still committed to securing Slack for our customers.

The argument is specifically that it doesn't provide that security. At least it's neat to see actual begging the question in the wild, I guess.


FedRAMP is designed to provide reusable cybersecurity work against the NIST security controls that your Federal agency's Authorizing Official deems your Federal IT system must implement.

Those security controls come from a document, NIST SP 800-53, two of which (SC-20 and SC-21, which Slack linked to in the post-mortem) effectively seem to me to conspire to require DNSSEC. Both are included in the "Low" baseline of security controls, so they are effectively required for all Federal IT systems unless your Agency Authorizing Official wants to walk on the wild side.

So even if you get a FedRAMP certification without fully implementing SC-20 and SC-21, your customer still needs to either convince their Agency Authorizing Official to sign off on an ATO despite the missing controls, convince them to sign off on some sort of Plan of Action and Milestones where Slack commits to fix this in the future (which is just kicking the can down the road), or somehow implement the same effect entirely on the customer end without help from Slack. All you would have done is spend a lot of money on FedRAMP paperwork without making it appreciably easier for potential customers who have to deal with compliance regimes to buy your product.

Cloud.gov's argument is valid, but all they posted is that they don't implement SC-20 or SC-21 for their government customers, and that the OMB M-08-23 mandate for DNSSEC is no longer operative (not that no other DNSSEC mandate applies). Indeed, they even explain how their customers should work to enable it (presumably by refusing to use the non-DNSSEC-compliant .app.cloud.gov services and instead using only their DNSSEC-compliant custom domains).

FWIW I fully agree with tptacek's arguments against DNSSEC, and will note that I recently stopped being able to navigate to literally the entire .mil on my Linux host until I disabled DNSSEC in systemd, for reasons that are still unclear to me even now.
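For anyone hitting the same wall, the relevant knob in systemd-resolved is a one-line config change (a sketch; DNSSEC=allow-downgrade is the softer middle ground between yes and no):

    # /etc/systemd/resolved.conf
    [Resolve]
    DNSSEC=no

    $ sudo systemctl restart systemd-resolved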


> intellectually dishonest phrasing

Not everyone agrees with the linked argument. For example, I disagree that browsers can't take advantage of DNSSEC, since many are using DoH, and the rest of the article reads like someone complaining that we need to wait for the perfect protocol or nothing at all.

That's the thing about a debate... it's got arguments on both sides.


I mean, I agree with you and don't find the language disingenuous (I felt like it was more of a tell that the people working on this cursed project weren't super read into DNSSEC and DNS security in general, which isn't a knock; it's a boring thing to keep up with, especially when the best-practice answer is so simple --- just don't bother with DNSSEC).

But I'd also say that DoH (1) largely obviates any need for DNSSEC (the last-mile DNS problem is the only on-the-wire DNS security problem that needs solving) and (2) doesn't enable DANE in browsers, which is what people are talking about when they talk about DNSSEC intersecting with browsers in any way other than randomly making sites fall off the Internet.


It's fine to disagree with the linked argument, but you actually have to do so. This is them presupposing that "securing Slack for [their] customers" requires DNSSEC -- it's not engaging with the argument at all.


I know we’ve all collectively accepted that DNSSEC is a terrible, complicated blight on the world, but I still find it incredible that an organisation with Slack’s resources and access to expertise can’t make it work.


You say Slack, and I agree, that's telling, but you have to add to that AWS itself, which had a DNSSEC bug in its wildcard record support as well. Slack and AWS together couldn't make this feature work. Further: the open source tooling Slack (like most places) relies on for deployment is also DNSSEC-hostile: one of their problems is that Terraform's Route53 provider doesn't safely disable DNSSEC once enabled. It's a mess everywhere you look.
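For context, the Route 53 DNSSEC surface in Terraform is two resources (the resource names are real; everything else here is a hypothetical sketch):

    # Hypothetical sketch of enabling DNSSEC signing on a Route 53 zone.
    # The KMS key must be an asymmetric ECC_NIST_P256 signing key.
    resource "aws_route53_key_signing_key" "example" {
      hosted_zone_id             = aws_route53_zone.example.id
      key_management_service_arn = aws_kms_key.dnssec.arn
      name                       = "example-ksk"
    }

    resource "aws_route53_hosted_zone_dnssec" "example" {
      hosted_zone_id = aws_route53_key_signing_key.example.hosted_zone_id
    }

The unsafe part is the reverse direction: a naive "terraform destroy" of these resources doesn't sequence "pull the DS record, wait out every cache, deactivate the KSK, then disable signing", and getting that order wrong is exactly the trunk-lock failure mode described upthread.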

I think another interesting question here is why Slack bothered in the first place. As was pointed out on the other DNSSEC thread today: practically nobody in the technology industry uses DNSSEC at all. Presumably, Slack did DNSSEC (they don't anymore!) in service of FedRAMP compliance. Why? Slack has one of the most popular products in all of computing. What bad thing was going to happen if they said "nah, we're going to go with Cloud.gov's recommendation and not this FedRAMP document"?


> Presumably, Slack did DNSSEC (they don't anymore!) in service of FedRAMP compliance. Why? Slack has one of the most popular products in all of computing. What bad thing was going to happen if they said "nah, we're going to go with Cloud.gov's recommendation and not this FedRAMP document"?

As just one example, it's tremendously difficult, if not impossible, to sell your cloud-based SaaS to Navy customers if you have open FedRAMP compliance issues that you aren't at least working to address.

I say "compliance" instead of "security" for a reason as well, as "compliance" truly runs the show in Navy cybersecurity. And if you want to sell to that market (and it's hardly just Navy who runs this way), it's easier to check the checkboxes than it is to argue about whether NIST is right or cloud.gov is right.


Gotta be FedRAMP compliant to do business with the US government. Even worse, you have to be FedRAMP compliant to work with anyone who works with the US government. From a business (if not an engineering) standpoint, there's plenty to gain in going through the motions.


As was pointed out downthread, there are tech companies that are "more" FedRAMP compliant (FedRAMP "High") without DNSSEC support.

(Kenn White points out on Twitter that some of this may be due to grandfathering --- though, the FedRAMP DNSSEC requirement is pretty old.)


I don't know about FedRAMP, but with other government requirements, the easiest way to get an exception was to fail badly at implementing the requirement.

When the DOD tried to mandate Ada, lots of projects were bid as Ada, then switched to C++ at the very first sign of any trouble whatsoever. I would 100% believe it if someone told me that this horrible rollout could be leveraged into an exemption from needing DNSSEC.


We had to do DNSSEC (for a couple of "system relevant" services) too.

Was it a hard requirement? No, but the fat-fingered audit companies really like to tick that "should" box green and would be more lenient with other debatable findings, so it was suddenly "in our best interests" to comply.


It's a business decision. Good luck selling software subscriptions to federal agencies without FedRAMP compliance.

I'm pretty surprised that Slack doesn't have a more robust testing network. Is it really that hard to set up another DNS zone on Route53 for staging these changes? Idk, but that type of thing is the least you can do if you want some FBI agents to discuss active investigations on your chat platform...
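Spinning up a throwaway zone for staging is genuinely cheap (AWS CLI assumed; the domain name is hypothetical):

    $ aws route53 create-hosted-zone \
          --name dnssec-staging.example.com \
          --caller-reference "staging-$(date +%s)"

The hard part isn't creating the zone, though; it's that production resolver behavior (caching, strictness, VPN providers' resolvers) doesn't show up against a staging zone nobody queries.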


None of this, none of it at all, has anything to do with Slack's ability to safely host conversations from the FBI. Whatever challenges they have with that are entirely orthogonal to this stupid performative stunt DNS configuration.

(There's a whole thread here, and more on Twitter, getting into the actual details of what FedRAMP and NIST require here, and engaging with the fact that Slack is the only large tech company in the past several years to have attempted to flip the DNSSEC switch on.)


I work for a company that maintains DNSSEC on our FedRAMP deployment. It's not unreasonable to ask for signed DNS records if feds are going to hit them.

Your blog post makes the supposition that DNSSEC is only being pushed as an alternative to the CA system for TLS. While it makes a compelling case that this isn't realistic, there are other security concerns that arise from the compromise of DNS records. If the government is going to use a DNS record, it should be signed by the zone owner.

Slack is actually a good use case for this security enforcement, because they maintain a handful of domains that are extremely authoritative for their messaging service[1]. If you can't maintain a security protocol on four domains that are crucial to the operation of your service, you maybe aren't cut out to supply software for the government.

1: https://slack.com/help/articles/360001603387-Manage-Slack-co...


I've done security work for products deployed at DOD and in other sensitive agencies, and had firsthand experience with USG infosec, and the idea that the USG sets any kind of useful standard for infrastructure security is risible.

Unfortunately, the GSA product market is its own bubble, as are the people who work in IT for the USG in any capacity, and so it's easy to see how people with limited exposure to modern industry practice --- experience almost wholly gated through vendors that snake through the GSA acquisition process --- might believe themselves to be operating several levels above where they actually are.

I would take Slack's security practice --- their infrasec, their corpsec, their software security, the whole shebang --- over anything done in any USG agency. Slack is better at this than their USG clients are, full stop. And Slack, while strong, is far from the S tier of industry security teams.


Well, the Slack security team seems to think that DNSSEC is important. Even for their workspace domains.

I just want to hammer home the point that requiring service providers to get their DNS records signed by DNS zone owners is a reasonable ask for USG software service vendors. Even if DNSSEC isn't capable of securing the whole internet.


DNSSEC is utterly unimportant. Practically no major security team on the Internet enables it --- not Amazon's, not Google's, not Facebook's, not Microsoft's, not Apple's, not Oracle's, not IBM's, not Cisco's. The argument that DNSSEC is somehow necessary for secure infrastructure is an extraordinary claim, and it requires extraordinary evidence.


Well, the operational requirements of commercial entities may be different from those of the federal government. Many of the companies you mentioned offer FedRAMP services (with maybe the exception of Apple and Facebook), and they probably reckon with the spec on some level, even if they are not employing it internally. It is also pretty clear that Slack is going to implement it soon - they are going through all this trouble to provide signed records for their every-workspace-gets-a-subdomain feature on FedRAMP. They really don't have to do that. Or maybe they do, in which case I would argue that it is probably good practice to be able to interrogate the DNS records they maintain.

Either way, this argument is starting to become political. Is Facebook a role model for cybersecurity, and keeping data out of the wrong hands? Or do NIST researchers know better? Neither - the government outlines its security requirements, and private companies play ball to compete for their business. And if a federal agency wants to be able to prove a DNS record's authenticity, even if it is maintained by a vendor, even if that isn't sufficient to secure their infrastructure, that's their prerogative.


Facebook is better --- more competent, more effective --- at cybersecurity than the US Government by a factor of $lots.


Because FedRAMP compliance is required for many US federal (and now some state) customers, for which Slack can charge a premium.


No tech company is infallible. All of them have outages, some lasting hours, even days.

Complex systems can and will fail. Try to do better, of course, but let’s acknowledge that perfection will always exceed our grasp. The world will continue to turn regardless.

One day it might just be your turn to break production.


The subtext here isn't that Slack is bad at this (they are not), but that DNSSEC is somehow intrinsically unsafe (it probably is).


I agree with your points about DNSSEC (disclaimer: I have not had the pleasure of having to implement it myself in infra), but was attempting to communicate that DNSSEC isn't the only area of ops in which folks get exposed to these sorts of unknowns or edge cases, and that no amount of resourcing enables you to avoid these issues. For Slack, it was DNSSEC. For Roblox, Consul. For Facebook/Insta, software-defined BGP. For Akamai, DNS.

Perhaps I did not read the room appropriately. Mea culpa.


Did Roblox finally come out with their postmortem blaming Consul? As far as I know we just assumed it, but have had no update since October.



"It turned out that some resolvers become more strict when DNSSEC signing is enabled at the authoritative name servers, even while signing was not enabled at the root name servers (i.e. before DS records were published to COM nameservers). This strict DNS spec enforcement will reject a CNAME record at the apex of a zone (as per RFC-2181), including the APEX of a sub-delegated subdomain"

Slack's second attempt wasn't a DNSSEC problem. Slack depended on a permissive fallback of resolvers when encountering a plain DNS protocol error. It is similar to how some websites in the past relied on permissive browser implementations when facing broken HTML/JS/CSS. Slack fixed their broken DNS as a result of this.

Slack's third attempt was not the fault of Slack but rather of a software bug at Amazon. I would make the argument that Amazon's primary product isn't DNS services, but they did fix their bug after this.

The general conclusion I get from the article is not that DNSSEC is broken, nor that it is too complicated. It is that when making changes to your core infrastructure to make it more secure, bugs that may have been lying dormant can pop up and bite. I am sure some people have had that experience in domains outside of DNS.
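On the strict-versus-permissive resolver point above, a quick way to see what a given validating resolver thinks of a zone (dig assumed; 1.1.1.1 is just one example of a validating resolver, and example.com happens to be a signed zone):

    $ dig @1.1.1.1 example.com A +dnssec   # 'ad' in the flags line means the answer validated
    $ dig @1.1.1.1 example.com A +cd       # 'checking disabled': skips validation, handy for
                                           # telling a DNSSEC failure from a plain DNS failure

If a name resolves with +cd but fails without it, DNSSEC validation is what's breaking you.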


You are not wrong, but had they steered clear of DNSSEC, Slack would not have had the outage they did.

What one can't ignore is the underlying chicken-and-egg problem that DNSSEC must overcome: there aren't many DNSSEC deployments, and hence not much of it has been tested in the real world, which results in colossal outages despite the attention of some of the most qualified engineers, including the ones running one of the largest nameserver deployments in the world.

TLS and the WebPKI have had a similar, perhaps even more painful route to ubiquity, so this problem isn't unique to DNSSEC. What isn't working in DNSSEC's favour is that the world has not just moved on, but has built solutions atop DNS's weaknesses, like it once did with IPv4 and NAT. The Internet's strong network effects, coupled with its heterogeneity, make battling "the System" an even harder proposition.

See also: System design explains the world: Vol 1, https://apenwarr.ca/log/20201227


I know HN has collectively accepted that DNSSEC isn't worth deploying, but every time I'm associated with an organisation that pays for a penetration test, its absence comes in as a high-risk finding - so much so that I've given in to deploying it to avoid sitting with non-technical managers doing the "here's why I disagree" routine all over again. Outside of this group I definitely feel like I'm on my own in that view.


It's always DNS.


This is a dirty lie.

Sometimes it's BGP.


And sometimes (as in the Facebook outage), it's both!


In that case, it was AWS.



