Interesting choice to spend the bulk of the article publicly shifting blame to a vendor by name and speculating on their root cause. Also an interesting choice to publicly call out that you're a whale in the facility and include an electrical diagram clearly marked Confidential by your vendor in the postmortem.
Honestly, this is rather unprofessional. I understand and support explaining what triggered the event and giving a bit of context, but the focus on your postmortem needs to be on your incident, not your vendor's.
Clearly, a lot went wrong and Flexential needs to do their own postmortem, but Cloudflare doesn't need to make guesses and do it for them, much less publicly.
If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.
It might also be an effort to get out in front of the story before someone else does the speculating.
In any case, with at least three parties involved, with multiple interconnected systems… if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.
Edit to add: I for one am grateful for the information Cloudflare is sharing.
>If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.
It's been 2 days. I doubt PGE or Flexential even have root caused it yet, and even if they have, good communication takes time.
You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.
You also don't publicly share what "Flexential employees shared with us unofficially" (quote from the article) - what a great way to burn trust with people who probably told you stuff in confidence.
>if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.
They can do all of that without smearing people on their company blog. In fact, they can do all of that without even knowing what happened to PGE/Flexential, because per their own admission they were already supposed to be anticipating this, but failed at it. Power outages and data center issues are a known thing, and they are exactly why HA exists. HA which Cloudflare failed at. This post-mortem should be almost entirely about that failure rather than speculation about a power outage.
> You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.
1. When you’re paying them the kind of money I imagine they’re paying and they don’t reply for 2 days, yeah, that’s crazy if true. I’d expect a client of this size could talk to an executive on their personal number.
2. Telling the facts as you know them to be especially regarding very poor communication isn’t a smear.
They aren't telling the facts as they know them. Cloudflare themselves say that the information in the article is "speculation" (the article literally uses that term).
Publicly casting blame based on speculation isn't something you do to someone that you want to have a good working relationship with, no matter how much money you pay them.
The post you're replying to is pointing out that multiple days without reporting out a preliminary root cause analysis is so absurdly below the expected level of service here that it would prompt them to reconsider using the service at all.
2 days is outrageous here, I have to imagine whoever thinks that is acceptable is approaching this from the perspective of a company whose downtime doesn't affect profits.
Agreed. Our DC sends us notifications any time power status changes. We had a dark building event once, due to something similar-sounding: a power failover caused an arc fault in the HV gear that took out the failover switchgear. We received updates frequently.
UPS failing early sounds like it may be a battery maintenance issue.
We have no idea what their contract is. But two business days without a reply isn’t exactly a long time. Especially if they are conducting their own investigation and reproduction steps.
My impression from reading the writeup is that CF did receive support and communication from Flexential during the event (although not as much communication as they would have liked), but hasn't received confirmation from Flexential about certain root cause analysis things that would be included in a post-mortem.
Two days without support communications would be a long time, but my original comment about the two day period is about the post-mortem. It's totally reasonable IMO for a company to take longer than two days to gather enough information to correctly communicate a post-mortem for an issue like this, and IMO its unreasonable for CF to try to shame Flexential for that.
Especially since it shouldn't matter why the DC failed — Cloudflare's entire business model is selling services allegedly designed to survive that. 99% of the fault lies with Cloudflare for not being able to do their core job.
So why spend so much time trying to shift blame to the vendor? They could've just started the article with something like:
> Due to circumstances beyond our control the DC lost all power. We are still working with our vendors to investigate the cause. While such a failure should not have been possible, our systems are supposed to tolerate a complete loss of a DC.
Because a small handful of decisions probably led to the Clickhouse and Kafka services still being non-redundant at the datacenter level, which added up to one mistake. But a small handful of mistakes were made by the vendor. Calling out each one of them was bound to take up more page space.
The ordering that they list the mistakes would be a fair point to make though, in my opinion. They hinted at a mistake they made in their summary, but don't actually tell us point blank what it was until they tell us all the mistakes that their vendor made. I'd argue that was either done to make us feel some empathy for Cloudflare as being victims of the vendor's mistakes, misleading us somewhat. Or it was done that way because it was genuinely embarrassing for the author to write and subconsciously they want us to feel some empathy for them anyway. Or some combination of the two. Either way, I'll grant that I would have preferred to hear what went wrong internally before hearing what went wrong externally.
Slightly less than half, and the bottom half, so that people just skimming over it will mostly remember the DC operators' problems, not Cloudflare's own. This is very deliberately manipulative.
It is of course possible they've shuffled things around since this was posted but it seems that the first part addresses their system failings.
5th paragraph to the 9th are Cloudflare's "we buggered up" before they get to the power segment. They then continue with the "this is our fault for not being fully HA" after the power bit.
Each to their own, I'm going to read it as a regular old post mortem on this one.
Yeah I agree. The data center should be able to blow up without causing any problems. That's what Cloudflare sells and I'm surprised a data center failure can cause such problems.
Going into such depths on the 3rd party just shows how embarrassing this is for them.
You are way off here, this is 100% on Flexential. They have a 100% power SLA, which means the power will always be available, right? They also clearly hadn't performed any checks on the circuit breakers, and this is a NEWER facility for them. The UPS batteries also didn't even deliver HALF of the roughly 10 minutes they were rated for while the generators were being brought up. They also DEFINITELY should have fully moved to generators during this maintenance; they clearly couldn't because they were MORE than likely assisting PGE. Cloudflare's CEO is right on here, you pay for data center services to be fully redundant. They have 18MW at this location and from what I can see they have (2) feeds? That I can't find? Do they? If (1) feed goes down, the 2N they have should kick in, and with generators there should be NO issues.
I actually disagree, and think that the post mortem clearly defines that there were things that were disappointing that happened with the vendor, _as well as_ things that were disappointing that happened internally. I don't think that it's unfair to point out everything in an event that happened; I do think it would be unfair to ignore all the compounding issues that were in the power of the vendor, and just swallow all of the blame for an event, when a huge reason that businesses even go through vendors at all is to have an entity responsible for a certain set of responsibilities that the business in question doesn't feel they have the expertise to do themselves. Which implies a relationship built on trust, and it's fair to call out when trust is lost.
And even though Cloudflare did put some of the blame, as it were, on the vendor, the post mortem recognizes that Cloudflare wasn't doing their due diligence on their vendor's maintenance and upkeep to verify that the state of the vendor's equipment is the same as the day they signed on. And that's ignoring a huge focus of the post mortem where they admit guilt at not knowing or not changing the fact that Kafka and Clickhouse were only in that datacenter.
Furthermore, we do not know that Cloudflare didn't get the vendor's blessing to submit that diagram to their post mortem. You're assuming they didn't. But for what it's worth as someone that has worked in datacenters, none of this is all that proprietary. Their business isn't hurt because this came out. This is a fairly standard (and frankly simplified for business folk) diagram of what any decently engineered datacenter building would operate like. There's no magic sauce in here that other datacenter companies are going to steal to put Flexential out of business. If you work for a datacenter company that doesn't already have any of this, you should write a check to Flexential or their electrical engineers for a consultancy.
And finally, the things that Cloudflare speculated on were things like, to paraphrase, "we know that a transformer failed, and we believe that its purpose was to step down the voltage that the utility company was running into the datacenter." Which, if you have basic electrical engineering knowledge, just makes sense. The utility company is delivering 12470 volts, of course that needs to be stepped down, somewhere along the way, probably multiple times, before it ends up coming through the 210 volt rack PDUs. I'm willing to accept that guess in the absence of facts from the vendor while they're still being tight lipped.
However, that's not to say I'm totally satisfied by this post mortem either. I am also interested in hearing what decisions led to them leaving Kafka and Clickhouse in a state of non-redundancy (at least at the datacenter level) or how they could have not known about it. Detail was left out there, for sure.
That isn't a voltage change where you'd generally use multiple transformers in series, let alone at the same site for the main/primary feed (a redundant feed counts the same, just to be clear). It's more that some low-power, "control plane of the electrical switchyard" applications may use a lower voltage if one is conveniently available, even if that means a second transformation step between the generators/grid and the load.
That said, the existence of the 480V-labeled intermediary does suggest they have a 277/480 V outside system and a 120/208 V rack-side system.
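For what it's worth, the numbers hang together if you assume standard three-phase relationships (this is inference from the labels in the writeup, not anything either party has confirmed):

```latex
V_{\text{line-neutral}} = \frac{V_{\text{line-line}}}{\sqrt{3}}, \qquad
\frac{480}{\sqrt{3}} \approx 277\ \text{V}, \qquad
\frac{208}{\sqrt{3}} \approx 120\ \text{V}
```

That would make the utility feed a single 12,470 V to 480 V step (roughly 26:1), with one further 480 V to 208 V transformation for rack-side distribution, i.e. one main step-down plus a smaller downstream one rather than a long chain of transformers.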
It's replies like these that make companies not want to share detailed postmortems. It's not crazy for many things in an incident to go wrong and for >0 of them to be external. It would be negligent for Cloudflare to not explicate what went wrong with the vendor, which, I would note, reflects poorly on them: who picked the vendor? If anything, I would have liked to hear more on how Cloudflare ended up with a subpar vendor.
(none of this takes away from the mistakes that were wholly theirs that shouldn't have happened and that they should fix)
> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
> The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.
> A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure.
So the root cause for the outage was that they relied on a single data center. I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.
> I find that pretty embarrassing for a company like Cloudflare, which powers such relevant parts of the internet.
Bah, who cares about such unimportant details, what's important is that ~dev velocity~ was reaaally high right until that moment!
> We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.
Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
> Complete and utter management failure. And customers apparently are sold what Cloudflare internally considers to be alpha quality software?
This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
I've been involved with some new service launches at AWS, and it's a strict requirement that everything goes through some rigorous operational and security reviews that cover exactly these issues before the service can be launched as GA. Feature-wise people might consider them "alpha", but when it comes to the resilience and security of the launched features, they are held to much higher standards than what is being described in this post-mortem.
Your operational reviews at AWS must be lacking then (surprise surprise), because there are so many instances where something will be released in alpha yet the documentation will still be outdated, stale, and incorrect LOL.
I think you misunderstand what's being talked about in this thread. "Operations" in this context has nothing to do with external-facing documentation, and instead refers to the resilience of the service and ensuring it doesn't for example, stop working when a single data center experiences a power outage.
"It stopped working because you did XYZ which you shouldn't have done despite it not being documented as something you shouldn't do" isn't different to a customer than a data center going down. For example, I'm sure the EKS UI was really resilient which meant little when random nodes dropped from a cluster due to the utter crap code in the official CNI network driver. My point wasn't that every cloud provider released alpha level software by the same definition but that by a customer's definition they all released alpha level software and label it GA.
> This has been my experience with AWS and GCP as well. Assume anything that's under 3 years old is not really GA quality no matter what they say publicly.
GCP run multi-year betas of services and features, so I'm doubtful there were still things not ironed out for GA. Do you have some examples?
Having worked at companies with varying degrees of autonomy, in my experience a more flexible structure allows for building systems that are ultimately more resilient. Of course, there are ways to do it poorly, but that doesn’t mean it’s a “complete and utter management failure”.
I’m going to leave out some details but there was a period of time where you could bypass cloudflare’s IP whitelisting by using Apple’s iCloud relay service. This was fixed but to my knowledge never disclosed.
There was a time when they were dumping encryption keys into search engine caches for weeks, and had the audacity to claim here, the issue was "mostly" solved. Until they were called out on it by Google Project Zero team...
There still exist many bypasses that work in a lot of cases. There's even services for it now. Wouldn't be surprised if that or similar was a technique employed.
> While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
It's amazing that they don't have standards that mandate all new systems to use HA from the beginning.
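It doesn't even have to be heavyweight. A simple automated gate in the release pipeline would flag this class of problem before a GA label gets applied. A minimal sketch of the idea, with entirely hypothetical manifest fields (not Cloudflare's actual tooling):

```python
# Hypothetical sketch of a pre-GA readiness gate, not Cloudflare's actual process.
# Idea: a product can't be flipped to "GA" unless its manifest declares
# datacenter-level redundancy and a tested disaster-recovery runbook.

REQUIRED_MIN_DATACENTERS = 2

def check_ga_readiness(manifest: dict) -> list[str]:
    """Return a list of blocking findings; empty means the gate passes."""
    findings = []
    if len(manifest.get("datacenters", [])) < REQUIRED_MIN_DATACENTERS:
        findings.append("service runs in fewer than two datacenters")
    if not manifest.get("in_ha_cluster", False):
        findings.append("service is not registered with the HA cluster")
    if not manifest.get("dr_runbook_tested", False):
        findings.append("disaster-recovery runbook has never been exercised")
    return findings

# Example: a service pinned to one facility would be caught before GA.
single_dc_service = {"name": "example", "datacenters": ["PDX-04"], "in_ha_cluster": False}
for finding in check_ga_readiness(single_dc_service):
    print(f"BLOCKED: {finding}")
```

The point is less the specific checks and more that "GA" becomes a state a machine refuses to grant until the redundancy story exists, instead of a label a team applies.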
The combination of "newer products" and then having "our Stream service" as the only named service in the post-mortem is very odd, since Stream is hardly a "newer product". It was launched in 2017 and went GA in 2018[2]. If after 5 years it still didn't have a disaster recovery procedure I find it hard to believe they even considered it.
From what I was reading on the status page & customers here on HN, WARP + Zero Trust were also majorly affected, which would be quite impactful for a company using these products for their internal authentication.
Those customers were impacted on the config plane until the DC was back up (1-2 hours?).
The data plane (which I mentioned) had no issues.
It's literally in the title what was affected: "Post Mortem on Cloudflare Control Plane and Analytics Outage"
Eg. The status page mentioned the healthchecks not working, while everything was fine with it. There were just no analytics at that time to confirm that.
Source: I watched it all happen in the cloudflare discord channel.
If you know anyone that is claiming to be affected on the data plane for the services you mentioned, that would be an interesting one.
Note: I remember emails were also more affected though.
> Those customers were impacted until the DC was back up ( 1-2 hours?) On the config plane.
Which was still like ~12+ hours, if we check the status page.
>Eg. The status page mentioned the healthchecks not working, while everything was fine with it. There were just no analytics at that time to confirm that.
What good is a status page that's lying to you? Especially since CF manually updates it, anyway?
>Source: I watched it all happen in the cloudflare discord channel.
Wow, as a business customer I definitely like watching some Discord channel for status updates.
This wasn't about status updates going to discord only.
There is literally a discussion section on the discord, named: #general-discussions
Not everything was clear in the discord too ( eg. The healthchecks were discussed there), that's not something you want to copy-paste in the status updates...
Priority for cloudflare seemed to get everything back up. And what they thought was down, was always mentioned in the status updates.
Oh, I just looked it up and I thought you mean that CF engineers were giving real time updates there. That's not the case.
However, I still fail to see your argument regarding Zero Trust and not being impacted. The status page literally mentioned that the service was recovered on Nov 3, so I don't understand what you mean by:
>The data plane ( which I mentioned) had no issues.
There's literally a section with "Data plane impact" on all over the status page, and ZT is definitely in the earlier ones. And this is given the fact that status updates on Nov 2 were very sparse until power was restored.
> Tbh. As far as I can see, their data plane worked at the edge.
Arguably, it's best to think of the edge as a buffering point in addition to processing. Aggregation has to happen somewhere, and that's where shit hit the fan.
? That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.
Cloudflare's data lives in the edge and is constantly moving.
The only things on the data plane not living in the edge (as was noticed) are Stream, Logpush, and new image resize requests (existing ones worked fine).
>That would mean their data is at the core cluster. That's not true or I haven't seen any evidence to support that statement.
You're being loose in your usage of 'data'. No one is talking about cached copies of an upstream, but you probably are.
Read the post mortem a bit more closely. They explicitly state that the control plane(s) source of truth lives in core, and that logs aggregate back to core for analytics and service ingestion. Think through the implications on that one.
That’s my interpretation as well. There is one central brain, and “the edge” is like the nervous system that collects signals, sends it to the brain, and is _eventually consistent_ with instructions/config generated by the brain.
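Right, and that shape is why the data plane largely kept working: each edge location can keep serving its last-known-good configuration while the brain is unreachable, and converge once it comes back. A rough sketch of that pattern (illustrative names, not Cloudflare's implementation):

```python
# Rough sketch of the "eventually consistent edge" pattern described above.
# Names and structure are illustrative, not Cloudflare's code.
import time

class EdgeConfigCache:
    def __init__(self, fetch_from_core, refresh_interval_s: float = 30.0):
        self._fetch = fetch_from_core          # callable that talks to the core/"brain"
        self._interval = refresh_interval_s
        self._config = {}                      # last-known-good configuration
        self._last_refresh = 0.0

    def get(self, key, default=None):
        now = time.monotonic()
        if now - self._last_refresh > self._interval:
            try:
                self._config = self._fetch()   # core reachable: converge toward its state
                self._last_refresh = now
            except Exception:
                pass                           # core down: keep serving last-known-good config
        return self._config.get(key, default)
```

The trade-off is exactly what customers saw: existing config keeps flowing at the edge, but anything that needs the brain (changes, analytics, new provisioning) stalls.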
Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently cloudflare is getting the blame, which is presumably part of what they're paying for.
>Sounds like chatgpt doesn't want your business and tuned their cloudflare settings accordingly. Conveniently cloudflare is getting the blame, which is presumably part of what they're paying for.
The issue is fixed now. But as I mentioned, Cloudflare still has a shit captcha, and the accessibility one was broken.
Yep, it's easy to spot folks who have never configured Cloudflare's WAF when they suggest Cloudflare is blocking their browser of choice instead of the website itself.
As someone who was slightly affected by this outage, I personally also find this post-mortem to be lacking.
75% of the post-mortem talks about the power outage at PDX-04 and blames Flexential. Okay, fair - it was a bit of a disaster what was happening there judging from the text.
But by the end of November 2 (UTC), power was fully restored. It still took ~30 hours according to the post-mortem for Cloudflare to fully recover service. This was longer than the outage itself, and the text just states that too many services were dependent on each other. I wish they'd gone into more detail here on why the operation as a whole took that long. Are there any take-aways from the recovery process, too? Or was it really just syncing data from the edges back to the "brain" that took this long?
Also one aspect I am missing here is the lack of communication - especially to Enterprise customers.
Cloudflare support was basically radio silent during this outage except for the status page. Realistically, they couldn't do much anyway. But at least any attempt at communication would be appreciated - especially for Enterprise customers, and even more especially after the post-mortem blames Flexential for a lack of communication.
While I like Cloudflare since it's a great product, I think there are still a few more things that should be taken as a conclusion for CF to take away from this incident.
That being said, glad you managed to recover, and thanks for the post-mortem.
I'm not that surprised at the relative lack of detail, given how quickly they released this; I'm surprised they published this much info so quickly. Calling it a postmortem is a bit of a misnomer, though. I'd expect a full postmortem to have the kind of detail you mention.
> In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.
This paragraph similarly leaves out juicy details. Exactly what services fail if logging is down? Were they built that way inadvertently? Why did no one notice?
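Agreed, that's the detail I'd most like to see. The usual way to avoid this class of failure is to treat the log/analytics pipeline as a soft dependency: best-effort delivery with a bounded local buffer, so the request path never blocks on Kafka/ClickHouse being reachable. A rough sketch of that shape (hypothetical names, not their code):

```python
# Sketch of treating an analytics/logging dependency as "soft": if the log
# pipeline is unreachable, callers degrade instead of failing.
# Hypothetical names; not taken from Cloudflare's services.
import collections
import logging

class BestEffortLogSink:
    def __init__(self, send, max_buffer: int = 10_000):
        self._send = send                                     # e.g. a Kafka producer call
        self._buffer = collections.deque(maxlen=max_buffer)   # bounded: drop oldest, never block

    def emit(self, event: dict) -> None:
        try:
            self._send(event)
        except Exception:
            # The log pipeline being down must not take the request path down with it.
            self._buffer.append(event)
            logging.warning("log pipeline unavailable; buffered event locally")

    def flush(self) -> None:
        while self._buffer:
            event = self._buffer.popleft()
            try:
                self._send(event)
            except Exception:
                self._buffer.appendleft(event)  # still down; try again later
                break
```

Whether the affected services were built without that separation inadvertently, or whether nobody ever exercised the "Kafka is gone" path, is exactly the juicy part the post leaves out.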
Even "we don't know why our data center is failing, but we're sending a team over to physically investigate now" would have been A+ communication in the moment.
Everything was on the status page since the start?
DC related updates:
> Update - Power to Cloudflare’s core North America data center has been partially restored. Cloudflare has failed over some core services to a backup data center, which has partially remediated impact. Cloudflare is currently working to restore the remaining affected services and bring the core North America data center back online.
Nov 02, 2023 - 17:08 UTC
> Identified - Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.
We will keep providing regular updates until the issue is resolved, thank you for your patience as we work on mitigating the problem.
Nov 02, 2023 - 13:40 UTC
As an enterprise customer, I would expect a CSM reaching out to us informing us about the impact, getting into more details about any restoration plans and potentially even ETAs or rough prioritization to resolution on them.
In reality, Cloudflare's support team was essentially completely unavailable on Nov 2, leaving only the status page. And for most of the day, the updates on the status page were very sparse except "we are working on it", and "We are still seeing gradual improvements and working to restore full functionality.".
Yet clearer status updates were only given starting on Nov 3. However, I still don't think I heard anything from support or a CSM during that time.
1) Were you affected on the data plane? Which product?
As far as I can tell, while the outage was in the core DCs, the impact was minor.
2) Both examples were exactly from 2 November. Not 3 November.
3) What method of support did you try? I thought that their support was impacted ( email?).
The status page explicitly mentioned to get in contact with your account manager for some config changes on some products, if you wanted changes.
4) I have never heard of Enterprise customers being contacted by a cloud company during an outage.
Which company does that? Do you have an example?
5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, that doesn't work at the moment."
Since most customers weren't affected and some others were minorly impacted.
There is not a single cloud company that does that.
ssl for saas -> custom hostnames are not working for new domains or changes to current ones.
also page rules -> redirects are not working for new rules or changes to current rules.
which are game-stoppers for our business.
we contacted via enterprise email support + ccing our managers and assigned engineers.
first they tried to tell us the product was working and sent us some details on how to do this and that; a couple of hours later they understood the issue was bigger than they thought and said "the product is affected by the api outage".
then in another email we asked them when this could be solved, but the only answer we got was "please follow the status page for updates".
and after a day, ssl for saas & ssl services finally took their place on the status page.
for a day nobody noticed whether it was working or not except customers.
so as we understand from these emails, even the team internally had no idea what was working and what was not!
>1) Were you affected on the data plane? Which product?
No, but we needed to make urgent changes.
>2) Both examples were exactly from 2 November. Not 3 November.
Neither message contains anything clear about remediation and so on. They also didn't state clearly which products were failed over. I noticed that at this point I could at least log in to the dashboard, but most stuff was still severely broken, and I had no idea whether changes made with the few semi-functional components were actually applied or not.
Updates to single products with a more clear status were given only at the end of November 2nd (UTC).
(Also, one of the messages says data centres, plural, not just data center. Not sure what happened there.)
>3) What method of support did you try? I thought that their support was impacted ( email?).
Emergency line + contacting our CSM.
The emergency line was shut down and replaced with voice mail (WTF?), and our CSM did not reply at all (or the message somehow made it to the wrong person, I'll find out next week, I guess).
So in our case, the communication was essentially non-existent, even though I raised a support case (or wanted to).
>4) I have never heard of Enterprise customers being contacted by a cloud company during an outage. Which company does that? Do you have an example?
I can remember Datadog reaching out to us for their 2023-03-08 incident. Not sure if it was just our CSM being nice or someone did a support request on another communication channel, but looking back in history that came without asking, plus the post mortem. Same when stuff happens such as vulnerabilities in one of their packages: they reach out to us proactively and notify us.
To be fair, this is a bit of a wishlist and definitely not necessary for a 30-minute hiccup, but for a 2 day outage... I don't know.
At the bare minimum, I'd expect at least their support team to be replying and not shutting down the communication channels.
>5) I would think it's absolutely a nogo to contact every preemptively Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".
I don't know... At least at the time I raise an urgent support case about an issue, I expect to be kept up-to-date.
> Since most customers weren't affected and some others were minorly impacted.
What does it mean that they were not affected? Yes, their core service was still functioning (thank god, after all they advertise a 100% (!) SLA on that), but you can see on the same Discord channel you mentioned people failing to renew TLS certificates, people who couldn't make Vercel deployments, and more. So it did affect quite a bunch of downstream customers of their products, and those customers might also sell SLAs to their own customers...
I cannot really comment on whether that just affected us, or if other customers had better support experiences here.
But I expect better in terms of communication here. It doesn't have to be as much outreach as I described in my last message, but stuff like shutting down the emergency line and not giving any comment is not really acceptable for an Enterprise contract.
Our policy (playbook) in case of an issue is to update the status page as quickly as possible, and customers can subscribe via RSS.
There was one issue in the past where we wanted to inform the clients. But it's not easy, as only some were impacted and we decided against it.
5 minutes later ( it was out of our hands) it was solved...
Our playbook is to update the status page as soon as possible to let clients know that something is up and we are aware.
There shouldn't be too much info on it, since sometimes you just aren't 100% sure about what's exactly going on.
We also decided that we won't provide durations on it, since you then create a commitment that's possibly dependent on external factors.
Tbh. I can completely understand the approach from Cloudflare here. With an issue, support is overwhelmed. That's why you use the status page ASAP.
Technical details happen in the post-mortem, once we can be sure whether any data was lost (normally nothing is lost, but it's possible we need to requeue some actions)
=> this is when we contact our clients and bring them up to date.
Depending on the SLA this is included or, e.g., paid extra (a lot of the time an external provider fails and we can fix something from our end, e.g. resending some data).
I've got no knock on the status page. Cloudflare is disappointed in the lack of notification from their data center provider, and Cloudflare customers are disappointed in the lack of notification from their service provider.
Instead of defending what was done and calling that good enough, Cloudflare should use this as an opportunity to commit to reevaluating the strategy for customer outreach during major service failures. If that's what Cloudflare expects from its service providers, that's what Cloudflare should provide to its customers.
I love how thorough Cloudflare post mortems are. Reading the frank, transparent explanations is like a breath of fresh air compared to the obfuscation of nearly every other company's comms strategy.
We were affected but it’s blog posts like these that make me never want to move away. Everyone makes mistakes. Everyone has bad days. It’s how you react afterwards that makes the difference.
I would generally agree with you, but this post mortem was 75% blaming Flexential even though it took them almost two days to recover after power was restored. The power outage should have been a single paragraph and then pivoted - DC failures happen, its part of life. Failing to properly account for and recover from it is where the real learnings for Cloudflare are.
It was more of an incident report. The efforts to get back online mostly revolved around Flexential, so it makes sense to dive into their failings. That said, it is clear there were major lapses of judgement around the control plane design, since it should be able to withstand an earthquake. That they don't have regular disaster recovery testing of the control plane and its dependencies seems crazy. I wonder if it is more that they hoped to eliminate some of those dependencies and replace them with in-house technology, and hedged their bets on the risk.
The issue is when you start having bad days every other day though.
We use and depend on CloudFlare Images heavily, it has now been down more than 67 hours over the last 30 days (22h on October 9th, 42h Nov 2 - Nov 4 and a sprinkle of ~hour long outages in between). That's 90.6% availability over the last month.
Transparency is a great differentiator between providers that are fighting in the 99.9% availability range, but when you are hanging on for dear life to stay above the one 9 availability, it doesn't matter.
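Taking the 67 hours at face value over a 30-day (720-hour) window:

```latex
\frac{720 - 67}{720} \approx 0.907
```

i.e. roughly 90.7%, in line with the figure above, and barely clinging to a single nine.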
They are a younger company than these other providers. Microsoft, Google, and AWS had their own growing pains and disasters. Remember when Microsoft accidentally deleted all the data (contacts, photos, etc.) off all their customers' Danger phones and had no backup? Talk about naming their product a self-fulfilling prophecy.
> AWS was the public release of tooling that amazon had been building for almost 20 years at that point.
No, even at the onset AWS was an entirely-from-the-ground-up build. The only thing it could even be argued to sit on top of was the extremely crufty VMs and physical loadbalancers from the original Prod at that point, and those things were not doing anybody any favors.
I agree, but I also think that for security purposes they should leave out extraneous detail. Also, I know they want to hold their suppliers accountable, but I would hold off pointing fingers. It doesn't really improve behavior, and it makes incentives worse.
I really appreciate that they're going to fix the process errors here. But as they suggested, there's a tension between moving fast and being sure. This is typically managed like the weather, buying rain jackets afterwards (not optimal). I'd be curious to see how they can make reliability part of the culture without tying development up in process.
Perhaps they can model the system in software, then use traffic analytics to validate their models. If they can lower the cost of reliability experiments by doing virtual experiments, they might be able to catch more before roll-out.
Maybe, but I think that their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing on what might have happened. Instead, state the facts you know and move onto your response and lessons learned.
Yeah, that part really rubbed me the wrong way. If this was a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.
It’s only been a couple of business days, and it’s likely that they themselves will need root cause from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.
I expect them to start reporting out what they know immediately, and update as they learn more. If they're not doing that, and indeed haven't reported anything in days, that is a huge failure.
Imagine if the literal power company failed, and took days to tell people what was going on. You can see why people are reading the postmortem that exists, rather than the one that doesn't.
Cloudflare vowed to be extremely transparent since the start of their existence. I'm very happy with the fact they have managed to keep this a core company value under extreme growth. I hope it continues after they reach a stable market cap. It isn't like Google that vowed not to be evil until they got big enough to be susceptible to antitrust regulation and negative incentives related to ad revenue.
What "security purposes"? Good security isn't based on ignorance of a system, it is on the system being good. We create a self fulfilling prophecy when we hide security practices because what happens is then very few will properly implement their security. Openness is necessary for learning.
It's weird that upon reading this post I have less confidence in Cloudflare. They basically browbeat Flexential for behaving unprofessionally, which, yes, they probably did. However, the fact that this caused entire systems that people rely on to go down is a massive redundancy failure on Cloudflare's part; you should be able to nuke one of these datacentres and still maintain services.
Very worrying is they start by stating their intended design:
> Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon
You need way more geographic dispersion than that; this control plane is used by people across the world. And we are still talking about the intended design, not the flawed implementation, by the way, which is wild to me.
> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
I don't understand why this would ever be done in this way. If Cloudflare is making a new product for consumers shouldn't redundant design be at the forefront here? I am surprised that it was even an option. For the record I do use Cloudflare for certain systems and I use it because I assume it has great failovers if events like this occur making me not have to worry about these eventualities, but now I will be reconsidering this, how do I actually know my cloudflare workers are safe from these design decisions?
> When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services.
Yeah, I'll bet; it's because Cloudflare's core design is not redundant.
Really disappointed in this blog post trying to shift the blame to Flexential when this slapdash architecture should be the main problem on show. As a customer I don't care if Flexential disappears in an earthquake tomorrow, I expect Cloudflare to handle it gracefully.
I'm also a bit surprised about Hillsboro. FEMA assumes that when (not if) The Big One hits, everything west of I-5 is going to be toast.
Is placing the entirety of such a critical cluster in a known earthquake and tsunami zone a good idea? It looks like their disaster recovery to Europe didn't really work either...
Yeah. Moreover, looking at the map, the DCs around Hillsboro are terrifyingly close to each other.
By the way, assuming an ideal control plane (in contrast to data plane) would be 3 DCs at a distance of about 20-40 miles, are there any mitigation techniques so that a seismic event which destroys a single DC doesn't also sever the comms between the remaining two?
> However, we had never tested fully taking the entire PDX-04 facility offline.
That is a painful lesson, but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster.
You can point fingers at the facility operators, but at the end of the day you have to be able to recover from a dc going completely offline and maybe never coming back. Mother Nature may wipe it off the face of the earth.
This is a fair point. Imagine there had been a serious fire like OVH suffered or flooding that destroyed the data center. Would Cloudflare have been able to recover?
Most likely, yes. They have enough customer lock-in that enough customers would stick with them even if it took them a week to rebuild everything from in other DCs.
> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.
I liked this - the human element is underemphasised often in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes.
I don’t know how it would work for an org of Cloudflare’s size, but I know we have plans for a significant outage for staff to work/sleep in shifts, to try to avoid that problem as well.
Issue there is that you need a way to hand over the current state of the outage to new staff as they wake up/come online.
The biggest key to implementing these types of plans is that when the shit hits the fan, you send a third of the people home, so they can come back in 10-20 hours and relieve those who are still there.
If you don't do that, you're still going to be scrambling.
Somewhat amazed at the structure of this article: after discussing the third party for 75% of the blog post, the first-party recovery efforts were detailed in considerably fewer paragraphs. It's promising to see a path forward mentioned, but I can't help but wonder why this was published instead of acknowledging the failure/circumstances now and publishing a complete post-mortem later, after the dust fully settles (i.e. without speculation).
To make sure their stonk doesn't drop at market open next week. Investors will read this (or get the sound bites) and shrug it off as some vendor issue rather than a deep issue that will require months of rework (millions of dollars, and thus impacting earnings).
Poor doc:
You had a high availability 3 data center setup that utterly failed. Why spend the first third of the document blaming your data center operator? The management of the data center facility is outside of your control. You gambled that not appropriately testing your high-availability setup (under your control) would not have consequences. You should absolutely discuss the DC management with your operator, but that's between you and them and doesn't belong in this post mortem.
Wow they REALLY buried this important part didn't they! This took a ton of scrolling:
"Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04."
There’s also the part where the disaster recovery site apparently fell over under the load (which, OK, is a thing that might happen) and they needed to code up limits on the fly (and that is not OK; I don’t have the slightest idea how one might test this, but if you’re building a “disaster” site it seems like you’d need to figure it out):
> When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services. We implemented rate limits to get the request volume under control.
This seems not to be mentioned in the bullet points at the end of the text (which are otherwise reasonable).
And now I’m curious—how do you design cold failover when the system is complex enough to be metastable[1] and you can’t afford to test it on live traffic? I can guess which techniques you could use to build it, it’s the design and testing part (knowing the techniques actually work in your situation) that’s the problem.
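For the herd specifically, the standard ingredients are jittered backoff on the failing callers plus admission control in front of whatever is coming back up, so the retry wave arrives spread out and gets shed before it melts the recovering service. A rough sketch of both halves (illustrative only, not what Cloudflare actually shipped that day):

```python
# Sketch of the two usual thundering-herd mitigations after a failover:
# jittered exponential backoff on clients, token-bucket admission on the server.
# Illustrative names; not Cloudflare's implementation.
import random
import time

def call_with_backoff(request, max_attempts: int = 6, base_s: float = 0.5, cap_s: float = 30.0):
    """Client side: spread retries out so failed callers don't return in lockstep."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))  # "full jitter"
    raise RuntimeError("gave up after repeated failures")

class TokenBucket:
    """Server side: admit only what the recovering backend can actually absorb."""
    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed load: reject early rather than melt down
```

Proving that this actually holds under a real failover, without live traffic, is the genuinely hard part, which is presumably why the rate limits only got written during the incident.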
One other thing that seems to have gone completely unmentioned:
> Beginning on Thursday, November 2, 2023, at 11:43 UTC Cloudflare's control plane and analytics services experienced an outage. [... W]e made the call at 13:40 UTC to fail over to Cloudflare's disaster recovery sites located in Europe.
Why did the decision take so long? I can imagine it can’t be made lightly, but two hours seems like too much hesitation, even if there was an expectation that power would be restored imminently for most of that time. There has to be a (predetermined?) point when you hit the switch regardless of any promises. Was it really set that far?
> While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA).
I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA. But most companies really struggle with this, especially with ensuring that customers are never exposed to a/b/p software by default, requiring opt-in from the customer, allowing the customer to easily opt-out, and ensuring that using a/b/p software never endangers GA features they depend on. Building that out, if it's even on a company's internal Platform/DevX backlog, is usually super far down as a "wishlist" item. So I'm super interested to see what Cloudflare can build here and whether that can ever get exposed as part of their public Product portfolio as well.
> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.
Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.
Between the Pages outage and the API outage happening in one week, I was considering selling my NET stock, but reading a postmortem like this reminds me why I invested in NET in the first place. Thanks Matt.
> I really like the model where a single team in a company, with Product + Dev, can quickly ship, iterate on a new product, and prove market demand without going through layers and layers of internal bureaucracy (Ops/Infra, Security, Privacy/Legal, Finance approval for production-scale), with the main stipulation being that such work is marked as alpha/beta/preview, and only going through the layers of internal bureaucracy once it's ready to go GA.
Speaking from personal experience, what you're claiming as 'good' meant, for CF, that SRE (usually core, but edge also suffered) got stuck trying to fix a fundamentally broken design that was known to be faulty, and called faulty repeatedly, but forced through anyway.
Nothing about this is desirable or will end well.
This reckoning was known and raised by multiple SREs nearly a decade before this occurred, and there were multiple near misses in the last few years that were ignored.
The part that's probably funny- and painful- for ex-CF SRE is that the company will do a hard pivot and try to rectify this mess. It's always harder to fix after, rather than building for, and they've ignored this for a long while.
I'm not sure if you understood my argument? I'm arguing that it's fine to ship a "fundamentally broken design" as long as the company makes abundantly clear that such software is shipped as-is, without warranty of any kind, MIT-license-style. Ramming that kind of software through to GA without unanimous sign-off from all stakeholders (infra/ops, sec, privacy/legal, etc.) is fundamentally unacceptable under such a model. Maybe there's an argument to be made that such a model is naïve, that in practice the gatekeepers for GA will always be ignored or overruled, but I would at least prefer to think that such cases are examples of organizational dysfunction rather than a problem with the model itself, which tries to balance between giving Product the agility it needs to iterate on the product, Infra/Sec/Legal concerns that really only apply in GA, and Ops (SRE) understanding that you can't truly test anything until it's in production; the same production where GA is.
> We need to use the distributed systems products that we make available to all our customers for all our services so they continue to function mostly as normal even if our core facilities are disrupted.
>> Super excited to see this. Cloudflare Workers is still too much of an "edge" platform and not a "main datacenter" platform, at least because D1 is still in beta and even if it wasn't, Postgres is far more feature-ful, and that pulls more software into a traditional single-datacenter model. So if Cloudflare can really succeed at this, then it'll be a much stronger statement in favor of building out software in an edge-only model.
On the other hand, when a company dogfoods its own products you can end up in a dependency hell like the one AWS is apparently in, where a single Lambda cell hitting full capacity in us-east-1 breaks many services in all regions.
I'm sure there is a right way to manage end-to-end dependencies for 100% of your services, past, present, and future, but increasingly I'm of the opinion that it's not possible in our economic system to dedicate enough resources to maintain such a dependency mapping system, since that takes developer time away from customer-facing products that show up in the bottom line. You just limp along and hope that nothing happens that takes out your whole product.
Maybe companies whose core business is a money printing machine (ads) can dedicate people to it but companies whose core business is tech probably don't have the spare cash.
They don't always know it, but all large systems are gradually moving towards a dependency management system with logic rules that covers "everything": physical, logical, human, and administrative dependencies. Every time something new and not yet covered is discovered, new rules and conditions are added. You can do it with manual checklists, multiple rule checkers, or put everything together.
I suspect that in the end it's just easier to put everything into a single declarative formal verification system and check whether each new change to the system passes, whether the transition between configurations passes, etc.
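Even a toy version of those rules would have caught the specific failure here: something in the HA tier transitively depending on a service pinned to one datacenter. A minimal sketch (hypothetical service names):

```python
# Toy version of "declarative rules over the dependency graph".
# Rule: nothing in the HA tier may depend, even transitively, on a service
# placed in only one datacenter. Service names are hypothetical.
def single_dc_dependencies(service, deps, placement, seen=None):
    """Return transitive dependencies of `service` that live in only one DC."""
    seen = seen if seen is not None else set()
    bad = []
    for dep in deps.get(service, []):
        if dep in seen:
            continue
        seen.add(dep)
        if len(placement.get(dep, [])) <= 1:
            bad.append(dep)
        bad.extend(single_dc_dependencies(dep, deps, placement, seen))
    return bad

deps = {"api": ["auth", "analytics"], "analytics": ["kafka"], "kafka": []}
placement = {"api": ["pdx-01", "pdx-02", "pdx-04"], "auth": ["pdx-01", "pdx-02", "pdx-04"],
             "analytics": ["pdx-04"], "kafka": ["pdx-04"]}
print(single_dc_dependencies("api", deps, placement))  # ['analytics', 'kafka'] -> rule violated
```

The checker is the easy part; keeping the declared dependency graph honest as teams ship is where the real cost sits.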
This is such an interesting way of putting it. I think this has been the subconscious reason I've been gravitating towards defining _everything_ I manage personally (and not yet at work) with Nix. It's not quite to the extent you're talking about here, of course, but in a similar vein at least.
Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon. The three data centers are independent of one another, each have multiple utility power feeds, and each have multiple redundant and independent network connections. The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted, while still close enough that they could all run active-active redundant data clusters.
If the three data centers are all around Hillsboro, Oregon, an earthquake could probably take out all three simultaneously.
Wikipedia's entry for Hillsboro: "Elevation 194 ft (60 m)"
Between that, and being ~50 miles inland - I'd say there's ~zero threat of Cascadia quakes or tsunamis directly knocking out those DC's. (Yeah, larger-scale infrastructure and social order could still be killers.)
OTOH - Mt. St. Helens is about 60 miles NNE of Hillsboro. If that really went boom, and the wind was right...how many cm's of dry volcanic ash can the roofs of those DC's bear? What if rain wets that ash? How about their HVAC systems' filters?
50% of roads and near 75% of bridges damaged on the west coast and the I5 corridor.
Refer to PDF page #93 where over 70% of power generation is highly damaged on the I5 corridor and 60% in the coastal areas with 0% undamaged.
Highly damaged: "Extensive damage to generation plants, substations, and buildings. Repairs are needed to regain functionality. Restoring power to meet 90% of demand may take months to one year."
"In the immediate aftermath of the earthquake, cities within 100 miles of the Pacific coastline may experience partial or complete blackout. Seventy percent of the electric facilities in the I-5 corridor may suffer considerable damage to generation plants, and many distribution circuits and substations may fail, resulting in a loss of over half of the systems load capacity (see Table 22). Most electrical power assets on the coast may suffer damage severe enough as to render the equipment and structures irreparable"
Good backup generators at their colos could handle the lack of utility power for days to weeks. More and better generators could be hauled in and connected.*
The two big problems I'd see would be (1) Social Order and (2) Internet Connectivity. DC's are not fortresses, and internet backbone fibers/routers/etc. are distributed & kinda fragile.
*After all the large-scale power outages & near-outages of recent decades, Cloudflare has no excuse if they lack really-good backup generators at critical facilities. And with their size, Cloudflare must support enough "critical during major disaster" internet services to actually get such generators.
And most of their SREs. Spending 30 hours to recover from the worst natural disaster in recorded history is slightly different than recovering from a ground fault on a single transformer.
>While there were periods where customers were unable to make changes to those services, traffic through our network was not impacted.
They're just going to straight up lie like that? We definitely weren't able to get "traffic through [their] network" during the outage at many different random points.
So if the CF team is under the impression traffic was not impacted, dig deeper.
I think overall Cloudflare did a decent job on this. Clearly the DC provider cocked up big time here, but Cloudflare kept running fine for the vast majority of customers globally. No system is perfect and it’s only apocalyptic scenarios like this where the vulnerabilities are exposed - and they will now be fixed. Hope the SRE guys got some rest after all that stress.
> We are a relatively large customer of the facility, consuming approximately 10 percent of its total capacity.
I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?
CF is probably a lot smaller than you realize, especially per data center. As someone who works at a different CDN, I am guessing they only have a few hundred machines per data center around the world. That is way too small to be able to run your own DC.
I have no idea how many DCs they have or operate in. Where does "300" come from?
> Colo is much more flexible, cheaper and quicker to start. Definitely since they sit close to the end-user on the data plane.
I understand that, but it has the disadvantage of reduced control and observability - particularly in the event of an outage such as that described in the blog post.
I kind of assumed that top-tier cloud platforms like AWS/Azure/GCP operate out of dedicated DCs, and that CF are similar because of their well-known scale of operations. Since my original comment has been downvoted†, someone presumably thinks it was a naive or trivial question - although I don't understand why.
(† I don't much care about downvotes, but I do take them to be a signal.)
Probably most of us follow Cloudflare a bit more closely.
They want DC's close to every big city. I think most of us knew that they can't launch > 300 DC's in such a short amount of time.
The large number of DC's is mentioned a lot (social networks, blogs, here).
There is a distinction between e.g. AWS / Azure / ..., which work with a couple of big DC's, and Cloudflare, which is spread across many more locations.
Your comment did make me realize it may not be that clear from an outsider viewpoint though (fyi, I'm an outsider too)
> I'm surprised that CF are renting space in colocation facilities. I would have expected a business of their size to have their own DCs. Is this common practice for cloud providers?
Google for one has both. Some GCP regions [0] are in colos, while others are in places where we already had datacenters [1]. We also use colo facilities for peering (and bandwidth offload + connection termination).
I'm under the impression that most AWS Cloudfront locations are also in colo facilities.
I'm a little surprised too; I figured they would have their own DCs for their core control plane servers. Colos for their 300+ PoPs makes sense, though.
I asked a sales rep once about services going out and how that would affect CF For Teams. They said it would be virtually impossible for CF to go down because of all their data centers around the world. Paraphrasing, “if there’s an outage, there’s definitely something going wrong with the internet.”
FWIW, I'm a Cloudflare Enterprise customer and we had zero downtime. Only thing that was temporarily unavailable was the cloudflare dashboard.
I feel like a lot of people in this thread are commenting under the impression that all of Cloudflare was down for 24 hours when in reality I wouldn't be surprised if a lot of customers were unaffected and unaware of the incident.
I wouldn't even have known of the outage had it not been for HN..
2nd this. We had zero downtime on anything in production. The only reason we knew is because we are actively standing up a transition to R2 and ran into errors configuring buckets.
Even honest engineers cannot foresee the exact cascading effects of such outages. Sales reps are paid to be neither competent on such issues nor honest.
While debatably unprofessional to blame your vendor, I found this read to be fascinating. I'm sure there are blog posts that detail how data centers work and fail but it's rare to get that cross over from a software engineering context. It puts into perspective what it takes for an average data center of this class to fail: power outage, generator failure, and then battery loss.
I think what it really does is emphasise how common it is for crap to hit the fan when things go wrong - even with the best laid plans.
The DC almost certainly advertises the redundant power supplies, generator backups and battery failover in order to get the customers. But probably doesn't do the legwork or spend the money to make those things truly reliable. It's a bit like having automated backups - but never testing them and discovering they're empty when they're really needed.
I'm ultimately glad this happened because it very effectively helps illustrate how we are assigning a centralized gatekeeper to the internet at the infrastructure level and why it's a bad thing.
Contrary to others here, I find the postmortem a bit lacking.
The TLDR is that CF runs in multiple data centers, one went down, and the services that depend on it went down with it.
The interesting question would be why those services did depend on a single data center.
They are pretty vague about it:
> Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha.
If I was the CEO, I would look into the specific decisions of the engineers and why they decided to make services depend on just one data center. That would make an interesting blog post to me.
Designing a highly available system and building a company fast leads to interesting tradeoffs. The details would be interesting.
> why they decided to make services depend on just one data center
In my experience, no engineers really decided to make services depend on just one data center. It happened because the dependency was overlooked. Or it happened because the dependency was thought to be a "soft dependency" with graceful degradation in case of unavailability but the graceful degradation path had a bug. Or it happened because the engineers thought it had a dependency on one of multiple data centers, but then the failover process had a bug.
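To make that concrete, here's a toy Python sketch (every name is hypothetical, none of this is Cloudflare's actual code) of how a "soft" dependency quietly becomes a hard one when the degraded path is never exercised:

```python
import logging

# All service names here are hypothetical, for illustration only.

class AnalyticsUnavailable(Exception):
    pass

def fetch_live_analytics(zone_id: str) -> dict:
    # Stand-in for a call to a backend that happens to run in only one data center.
    raise AnalyticsUnavailable("analytics backend unreachable")

def get_dashboard_analytics(zone_id: str) -> dict:
    """Intended 'soft' dependency: serve stale data if the backend is down."""
    try:
        return fetch_live_analytics(zone_id)
    except AnalyticsUnavailable:
        logging.warning("analytics degraded for zone %s, serving stale data", zone_id)
        # If this branch has a bug, or the stale cache lives in the same DC,
        # the soft dependency silently becomes a hard one - and nobody notices
        # until that one DC goes dark.
        return {"zone": zone_id, "stale": True, "requests": None}

if __name__ == "__main__":
    print(get_dashboard_analytics("example-zone"))
```

The fallback branch is exactly the code that only ever runs during an incident, which is why it is the part most likely to be broken.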
Reminds me of that time when a single data center in Paris for GCP brought down the entire Google Cloud Console albeit briefly. Really the same thing.
> In my experience, no engineers really decided to make services depend on just one data center.
Partially true in this case; I can't speak to modern CF (or won't, moreso) but a large amount of internal services were built around SQL db's, and weren't built with any sense of eventual consistency. Usage of read replicas was basically unheard of. Knowing that, and that this was normal, it's a cultural issue rather than an "oops" issue.
Flipping the whole DC data sources is a sign of what I'm describing; FAANG would instead be running services in multiple DC's rather than relying on primary/secondary architecture.
Everywhere I've worked requires a DR drill per service, but I've never seen anything where the whole company shuts down a DC at once across all services.
But probably we should. It's an immensely larger coordination problem, but frankly, it's probably the more common failure mode.
Isn't this sentence a bit further down more clear?
> This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.
and
> It [PDX-04] is also the default location for services that have not yet been onboarded onto our high availability cluster.
This is a good reminder to myself to transfer domains registered on Cloudflare to another provider and only use Cloudflare for DNS, or vice versa. I was effectively locked out of making any changes to domains registered and DNS-hosted on Cloudflare during the entire outage due to a single point of failure (Cloudflare) on my part.
Classic distraction maneuver. This postmortem is a prime example of tech porn that diverts attention from the main issue: many at Cloudflare didn't do their job properly.
The electricity provider is fine, it's Flexential that looks incredibly opaque and non-communicative in a stressful situation.
While Cloudflare should have been better prepared for this, it seems to be amateur hour in that particular Portland data-center. Other customers (Dreamhost, etc) were impacted too, and I can't imagine they don't also have some very pointed questions.
A lot of mud-slinging on here about HA setup and CF's dealing with the problem but I can only assume people are armchair experts with no real experience of HA at the scale of CF.
"So the root cause for the outage was that they relied on a single data center.". No. Root cause was that data centre operator didn't manage the outage properly and didn't have systems in place in which case they could have avoided it + some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.
"Cloudflare has a shit reputation in my eyes, because their terrible captchas". You don't like one product so they have a shit reputation? Enough said.
"but unless you are physically powering off the dc or at least disconnecting the network from the outside world you are not testing a real disaster." If you have ever had to do this, you know that it is never a good feeling. On-paper, yes, you should try your DR but in reality, even if it works, you lose data, you get service blips, you get a tonne of support calls and if it doesn't work, it might not even rollback again. On top of that, it isn't a case of just disconnecting something, most problems are more complicated. System A is available but not system B. Routers get a bad update but are still online, and on top of all of that, you would need some way to know that everything is still working and some problems don't surface for hours or until traffic volume is at a certain level etc. If you trust that a data centre can stay online for long periods of time and that you would then be able to migrate things at a reasonable rate if it doesn't, then you have to trust that to an extend.
All in all, CF are not attempting to blame someone; even though a lot is down to Flexential, the last paragraph of the first section says, "To start, this never should have happened...I am sorry and embarrassed for this incident and the pain that it caused our customers and our team."
> some systems knowingly and unknowingly had dependencies on the centre that went down because CF did have systems in place to allow that centre to fail.
I mean you're contradicting yourself in the same sentence. Had CloudFlare had such a system in place that allowed that particular center to fail, there would have been no outage in the service. The truth is that they didn't account for it, and because they missed it, that center became a single point of failure, which is what brought the whole CloudFlare service down. The power outage was just a trigger that exposed a weakness in their system design, not the root cause.
Many of these comments sound like they’re coming from some mythical alternate universe where bugs don’t exist and people and orgs have 100% flawless execution every time.
It reminds me a little of someone sitting at a sports bar yelling about a “stupid” play or otherwise criticizing a 0.0001% athlete who is playing at a level they can’t possibly fathom.
> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning
A minor point, but this feels like not the most efficient way to manage an emergency - some form of staggered shifts or another approach versus just having everyone pile on. If a lot of knowledge resides in specific individuals, such that they are vital to an effort like this and cannot be substituted, then that seems like a risk in its own right.
They did, after two hours. After the first hour they assumed the generators would be back, but then they ran into the breaker issue, which caused the full-day delay.
My question too, although possibly failing over seemed like the greater risk at first. BTW, is there any unexpected GDPR implication of that? Assuming that failing over means restoring US backups in the EU.
Why would any supplier want to do business with Cloudflare now? You have 1.8MW of datacenter space to lease, you have a few interested parties, how could you not view Cloudflare as a huge reputational risk? Why even do the business? Why not lease that space to someone else? Moreover, why renew the existing deals with Cloudflare?
Does Cloudflare have a plan to move 200+ racks in Oregon if that supplier decides just not to renew that deal whenever it comes up next? Are Cloudflare claiming they were able to build a technical plan which gets their architecture away from this site being a SPOF before the deal is up to renew, or is the CEO making a gamble here again?
Cloudflare have demonstrated their willingness to create reputational issues for suppliers by publicly shaming two of them recently, and here, only about 2 days after the incident. One interpretation of this blog would be that Cloudflare are a very unreasonable customer, and one willing to post incomplete or informal information from their suppliers. Cloudflare also chose to focus the first half of a lengthy postmortem on blaming the supplier and only then on their own culpability for the outage, despite it clearly being a shared responsibility.
One of the diagrams Cloudflare have posted is clearly marked "Proprietary and Confidential". Do Cloudflare have permission to post that? It's not clearly stated that they do. Should other suppliers expect when the sh*t hits the fan that any sensitive information they've shared will be part of a blog?
Most of the "Lessons and Remediation" section is stuff Cloudflare could have worked on at any point in advance of a major incident, and Cloudflare's senior management have quite clearly chosen not to prioritize that work until today, when forced to by this major incident.
When signing large deals, Cloudflare will frequently have to complete 'Supplier Disclosures', and they are also making claims through industry-standard certifications [1] like ISO, SOC and FedRAMP. Most of those will ask questions about the disaster recovery and business continuity plans and Cloudflare will have (repeatedly) attested they are adequate, something that this blog clearly demonstrates was a misrepresentation of their true capabilities.
Will there be an SEC disclosure coming out of this considering it could have material impacts on the business, which is publicly traded? Was there any requirement that the SEC disclosure come first, or be concurrent with a blog?
You can run ClickHouse cluster across multiple datacenters. It will survive the failure of a single datacenter while being available for writes and reads, and the failure of two out of three datacenters while being available for reads. It works well when RTT between datacenters is less than 30 ms. If they are more distant, it will still work, but you will notice a quite high latency on INSERTs due to the distributed consensus.
I've run a ClickHouse cluster with hundreds of bare-metal machines distributed across three datacenters in two countries at my previous job. It survived power failures (multiple), a flood (once), and network connectivity issues (regular). This cluster was used for logging and analytics :)
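For what it's worth, the real replication work in that kind of setup is server-side (ReplicatedMergeTree tables coordinated through ZooKeeper/ClickHouse Keeper spread across the DCs); on the client side, failover can be as simple as trying replicas in other datacenters over ClickHouse's HTTP interface (default port 8123). A rough sketch, with hypothetical hostnames:

```python
# Client-side failover sketch: try ClickHouse replicas in other datacenters over
# the HTTP interface if the nearest one is unreachable. Hostnames are invented.
import urllib.parse
import urllib.request

REPLICAS = [
    "http://clickhouse-dc1.example.internal:8123",
    "http://clickhouse-dc2.example.internal:8123",
    "http://clickhouse-dc3.example.internal:8123",
]

def query_any_replica(sql: str, timeout: float = 5.0) -> str:
    last_err = None
    for base in REPLICAS:
        url = base + "/?" + urllib.parse.urlencode({"query": sql})
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
        except OSError as err:  # connection refused, DNS failure, timeout, HTTP error
            last_err = err
    raise RuntimeError(f"all replicas unreachable: {last_err}")

if __name__ == "__main__":
    print(query_any_replica("SELECT count() FROM system.tables"))
```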
The guy I genuinely feel sorry for is "an unaccompanied technician who had only been on the job for a week".
Regardless of any and all corporate spin on the issue, a newbie was dumped into an event at the worst possible time.
I really, really, really hope he gets a decent bit of counselling to make sure that he is fully aware that the issue with the data centre had NOTHING to do with him unplugging the coffee maker to plug in the recharger for his iPhone. Absolutely nothing at all.
Whenever we design and build a data center (in addition to disaster recovery, business continuity, etc.), we always test all possible scenarios of power interruption and never assume that the power utility will get in touch with us in advance for pre-planned maintenance, including the expected outcome AND the time to restore, etc. It's kind of shocking that you just assumed the UPS would last 10 minutes but never put it to the test... especially at the size of CF!
For me it's basically summed up as "we didn't test turning the power off" and making sure things worked the way we planned.
Yes it is hard and very expensive to do these types of tests. And doing it regularly is even more $$$ and time.
Like most customers, we seem to be okay with a cheap price hidden behind a facade of "high availability", since I don't really want to pay for true HA. Because if I knew the real cost, it would be too expensive.
> It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators.
Are the data centers compensated or anything for this? I'd imagine generator-only operation might cost more in terms of fuel and wear-and-tear/maintenance/inspections.
edit:
> DSG allows the local utility to run a data center's generators to help supply additional power to the grid. In exchange, the power company helps maintain the generators and supplies fuel
I'm not very well versed in this space but I've been told Progressive Insurance in Cleveland, OH has a similar (sounding) agreement. According to PGE's website, they basically pay for everything https://portlandgeneral.com/save-money/save-money-business/d...
Of the incident? Someone on my team called me about 30 minutes after it started. It was challenging for me to stay on top of because it was also the same day as our Q3 earnings call. But the team kept me informed throughout the day. I helped where I could. And they handled a very difficult situation very well. That said, lots we can learn from and improve.
What I find bizarre is that the Cloudflare share price jumped when the outage happened!
Having read the post mortem, I do not think it could have been handled any better. I think the decision to extend the outage in order to provide rest was absolutely correct.
I always enjoy reading these reports from Cloudflare as they are the best in the business.
I was surprised we didn't get a single question about it from an analyst or investor, either formally on the Q3 call or on any callbacks we did after. One weird phenomenon we've seen — though not so much in this case because the impact wasn't as publicly exposed — is that investors after we've had a really bad outage say: "Oh, wow, I didn't fully appreciate how important you were until you took down most of the Internet." So… ¯\_(ツ)_/¯
There's a class of investor (and their trade bots presumably) that sees outrage over a service outage as proof the provider is now mission critical, hence able to "extract value" from the market.
Someone should make a web series about this incident. It will be a nice story to tell.
Name: Modern Day Disaster.
Directed by: Matthew Prince
Releasing on: 25th December at Netflix
Based on a true story.
We should absolutely blame them, just as "victims" of ransomware should be blamed. Hardening against system failure is the same process as security hardening.
To preface: I am not qualified to talk about this in the slightest.
> Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04
What do you mean you discovered? How could you not know? Surely when you were setting this high availability cluster up years ago and migrating services over, you double checked that all crucial dependencies had also been moved, right? And surely, since you had been "implementing" this for four years now, you've TESTED what would happen if one of the three DCs went completely offline, right???
> And surely, since you had been "implementing" this for four years now, you've TESTED what would happen if one of the three DCs went completely offline, right???
They discussed this. They had been running tests where they disabled the high availability cluster in any one (and any two) of the three DCs. Those tests didn't involve disabling the rest of the (non-HA) services in the PDX-04 DC (oops).
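Even a crude audit of the dependency graph would flag this class of problem before a drill does. A toy sketch (service names and data-center placements are invented for illustration, not Cloudflare's real topology):

```python
# Find services that are supposed to be HA but transitively depend on
# something running in only one data center.

DEPS = {
    "dashboard": ["auth", "config-api"],
    "config-api": ["analytics-db"],
    "auth": [],
    "analytics-db": [],
}

LOCATIONS = {
    "dashboard": {"PDX-04", "PDX-02", "PDX-05"},
    "auth": {"PDX-04", "PDX-02", "PDX-05"},
    "config-api": {"PDX-04", "PDX-02", "PDX-05"},
    "analytics-db": {"PDX-04"},  # the overlooked single-DC dependency
}

def transitive_deps(service: str) -> set:
    seen, stack = set(), [service]
    while stack:
        for dep in DEPS.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def single_dc_risks(service: str) -> set:
    """Transitive dependencies that disappear if one data center goes down."""
    return {d for d in transitive_deps(service) if len(LOCATIONS.get(d, set())) == 1}

if __name__ == "__main__":
    print("dashboard is pinned to one DC via:", single_dc_risks("dashboard"))
```

The hard part in practice isn't the graph walk, it's keeping the dependency and placement data accurate, which is presumably why these things get discovered during incidents instead.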
appreciate the status updates and the quick report. as a person that handles the tiniest datacenter, it's impossible to predict every potential event. best you can do is to recover as quickly as possible and learn the lesson.
believing that this doesn't or can't happen to another vendor is being naive.
it has happened to all of them and it'll happen again. can only hope it's super rare.
TLDR: Major flooding, riots, earthquake, asteroid, nuke goes off in Portland and Cloudflare is down because they decided they could put their entire control infrastructure in a single location for ease of use.
I am really upset about this situation on behalf of CF; however, why don't they think about generating their own electricity from renewable energy sources?
It doesn't change anything fundamentally. A complex product is only as good as the weakest link. I have worked with various employers, some world leaders at the time. All of them had seriously weak links.