The title of this post is incorrect. It's not just Confluence that is down, but JIRA and all Atlassian Cloud services for impacted customers.
Some customers have also been told to expect up to another two weeks of downtime, which would make the outage about three weeks in total [1]. 400 companies are affected by this outage, and at least one of them is a YC company, Bitrise, which is still down. Atlassian is not telling them anything specific, and data loss is a possibility for them [2].
It's strange that the outage has lasted this long, as it contradicts Atlassian's own statements about how they plan for resilience and how quickly they can restore customer data. They specifically say that they use multiple data centers, have zero data loss scenarios, and have a recovery time objective (RTO) of 6 hours for their Tier 1 products (like JIRA). They are failing this objective badly, which raises questions about what is happening behind the scenes [3].
The outage is especially ironic given that Atlassian has discontinued its self-hosted Server product and is forcing all customers to move to Atlassian Cloud, the very cloud that is down for many customers. They stopped selling Server licenses in Feb 2021, stopped allowing tier changes for Server products this February, and are ending support for Server products in Feb 2024.
An outage this long for software that companies depend on is mind-boggling. The only comparably long outage I can recall is the 2011 PlayStation Network outage, which lasted 23 days.
Good luck with the restoration, which sounds like manual work. Hopefully we get a public postmortem from Atlassian, though they are not a company known for publishing these.
They don’t even back up that often. Why bother having a six-hour target for your daily backups? I mean, really… I’d rather you lose less of my data, thanks.
The RTO for products is listed at 6 hours. RPO (recovery point objective) is the maximum data loss; RTO (recovery time objective) is how long the recovery process will take. So this says they should be able to recover the products in less than 6 hours, to a point in time less than one hour before the incident.
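To make the RTO/RPO distinction concrete, here is a minimal Python sketch that checks a hypothetical recovery against a 6-hour RTO and a 1-hour RPO. The function name `meets_objectives` and all timestamps are illustrative assumptions, not anything Atlassian publishes; the point is only that the two objectives measure different intervals.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumed targets for illustration: 6-hour RTO, 1-hour RPO.
RTO = timedelta(hours=6)   # max time from incident until service is restored
RPO = timedelta(hours=1)   # max window of data (before the incident) that may be lost


def meets_objectives(incident: datetime,
                     restored_at: datetime,
                     restored_to: datetime) -> Tuple[bool, bool]:
    """Return (rto_met, rpo_met) for a hypothetical recovery.

    incident    -- when the outage began
    restored_at -- when service came back online
    restored_to -- the point in time the restored data reflects
    """
    downtime = restored_at - incident           # what RTO constrains
    data_loss_window = incident - restored_to   # what RPO constrains
    return downtime <= RTO, data_loss_window <= RPO


# Hypothetical incident start time.
incident = datetime(2022, 1, 1, 9, 0)

# Restored in 5 hours from a snapshot 30 minutes old: both objectives met.
print(meets_objectives(incident, incident + timedelta(hours=5),
                       incident - timedelta(minutes=30)))

# Same snapshot, but restored three weeks later: RPO met, RTO badly missed.
print(meets_objectives(incident, incident + timedelta(weeks=3),
                       incident - timedelta(minutes=30)))
```

The second case returns `(False, True)`: a recovery can lose almost no data and still blow through its RTO by weeks. Conversely, restoring quickly from a day-old backup would meet the RTO but miss the RPO.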
I know it sounds pedantic, but this bit me in the ass once. You are talking about a retrospective, not a post-mortem. I asked for a post-mortem after a mistake with a product I was responsible for. The CEO set up a meeting where he asked pointed questions about whether I still had faith in the product, and finally asked me straight out if I thought it would fail. The reason: he came from pharma, where a post-mortem is an autopsy. It's not performed unless you have a corpse to dissect. I explained that in software engineering we hold post-mortems about why the coffee ran low. He was not amused.
[1] https://twitter.com/kjartanmuller/status/1513462616030683138...
[2] https://twitter.com/gabornadai/status/1513481270411636738?s=...
[3] https://www.atlassian.com/trust/security/data-management