Google Cloud Is Down
1395 points by markoa on June 2, 2019 | 593 comments
https://status.cloud.google.com

https://status.cloud.google.com/incident/compute/19003

Status page reports all green; however, the outage is affecting YouTube, Snapchat, and thousands of other users.




Disclosure: I work on Google Cloud (but disclaimer, I'm on vacation and so not much use to you!).

We're having what appears to be a serious networking outage. It's disrupting everything, including unfortunately the tooling we usually use to communicate across the company about outages.

There are backup plans, of course, but I wanted to at least come here to say: you're not crazy, nothing is lost (to those concerns downthread), but there is serious packet loss at the least. You'll have to wait for someone actually involved in the incident to say more.


To clarify something: this outage doesn’t appear to be global, but it is hitting us particularly hard in parts of the US. So for the folks with working VMs in Mumbai, you’re not crazy. But for everyone with sadness in us-central1, the team is on it.


It seems global to me. This is really strange compared to AWS. I don't remember an outage there (other than S3) impacting instances or networking globally.


You obviously don't recall the early years of AWS. Half of the internet would go down for hours.


Back when S3 failures would take down Reddit, parts of Twitter... Netflix survived because they had additional availability zones. I remember when the bigger names started moving more stuff to their own data centers.

AWS tries to lock people in to specific services now, which makes it really difficult to migrate. It also takes a while before you get to the tipping point where hosting your own is more financially viable... and then if you try migrating, you're stuck using so many of their services you can't even do cost comparisons.


Netflix actually added the additional AZs because of a prior outage that did take them down.

"After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day."

https://www.networkworld.com/article/3178076/why-netflix-did...


We went multi-region as a result of the 2012 incident. Source: I now manage the team responsible for performing regional evacuations (shifting traffic and scaling the savior regions).


That sounds fascinating! How often does your team have to leap into action?


We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.


Do you do any chaos testing? Seems like it would slot right in, there.


I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey
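
Not the actual Chaos Monkey implementation, but the core idea fits in a few lines. A toy sketch (hypothetical instance names) of the "randomly kill something during business hours and see if the service notices" approach:

```python
import random
from datetime import datetime

def pick_victim(instances, now=None):
    """Toy chaos experiment: pick one instance at random to terminate,
    but only during business hours so engineers are around to respond."""
    now = now or datetime.now()
    if now.weekday() >= 5 or not 9 <= now.hour < 17:
        return None  # nights and weekends: do nothing
    return random.choice(instances)

# Hypothetical fleet; the real Chaos Monkey targets actual instance groups.
fleet = ["api-1", "api-2", "worker-1", "worker-2"]
victim = pick_victim(fleet)
if victim:
    print(f"terminating {victim} to verify the service tolerates the loss")
```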


Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering



They invented the term, so probably yes :)


I think some Google engineers published a free MEAP book on service reliability and uptime guarantees. Seemingly counterintuitively, scheduling downtime without other teams’ prior knowledge encourages teams to handle outages properly and reduce single points of failure, among other things.


Service Reliability Engineering is on O'Reilly press. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the best three books from O'Reilly in the past ten years.


Derp, Site Reliability Engineering.

https://landing.google.com/sre/books/


I think you’re misremembering about Twitter, which still doesn’t use AWS except for data analytics and cold storage last I heard (2 months ago).


Avatars were hosted on S3 for a long time, IIRC.


I am not sure a single S3 outage pushed any big names into their own "datacenter". S3 still holds the world record for reliability, which you cannot challenge with your in-house solutions. Prove otherwise if you can. I would love to hear about a solution that has the same durability, availability, and scalability as S3.

For the downvoters, please just link here the proof if you disagree.

Here are the S3 numbers: https://aws.amazon.com/s3/sla/


It's not so much AWS vs. in-house as AWS (or GCP/DO/etc.) vs. multi/hybrid solutions, the latter of which would presumably have lower downtime.


I don't see why multi/hybrid would have lower downtime. All cloud providers, as far as I know (though I know mostly AWS), already have their services in multiple data centers and their endpoints in multiple regions. So if you make yourself use more than one of their AZs and Regions, you would be just as "multi" as with your own data center.


Using a single cloud provider with a multiple region setup won't protect you from some issues in their networking infrastructure, as the subject of this thread supposedly shows.

Although I guess depending on how your own infrastructure is setup, even a multi cloud provider setup won't save you from a network outage like the current Google cloud one.


Hum, I'm not an expert on Google cloud, but for AWS, regions are completely independent and run their own networking infrastructure. So if you really wanted to tolerate a region infrastructure failure, you could design your app to fail over to another region. There shouldn't be any single point of failure between the regions, at least as far as I know.
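
A rough sketch of what that cross-region failover can look like from the client side (hypothetical endpoints; real deployments usually push this into DNS health checks rather than application code):

```python
import urllib.request
import urllib.error

# Hypothetical regional endpoints; in practice DNS failover (health-checked
# records) usually does this instead of the client looping over regions.
ENDPOINTS = [
    "https://api.us-east-1.example.com/health",
    "https://api.us-west-2.example.com/health",
]

def first_healthy(endpoints, timeout=2):
    """Return the first regional endpoint that answers its health check."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # region unreachable or unhealthy, try the next one
    raise RuntimeError("all regions unreachable")
```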


Why would you think that self-managed has lower downtime than AWS using multiple datacenters/regions?


Actually, I imagine that if you could go multi-regional then your self-managed solution may be directly competitive in terms of uptime. The idea that in-house can't be multi-regional is a bit old fashioned in 2019.


For several reasons, most notably: staff, build quality, standards, and knowledge of building extremely reliable datacenters. Most of the people who are the most knowledgeable about datacenters also happen to be working for cloud vendors. On top of that: software. Writing reliable software at scale is a challenge.


Multi/hybrid means you use both self managed and AWS datacenters.


Cannot challenge with your own inhouse solutions, you say?

Challenge Accepted... and defeated: https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastr...

but to be fair, storage is core to Dropbox's business... this is not true for most companies.

disclaimer: I work for Dropbox, though not on Magic Pocket.


> For the downvoters, please just link here the proof if you disagree.

> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

99.9%

https://azure.microsoft.com/en-au/support/legal/sla/storage/...

99.99%


>> Here are the S3 numbers: https://aws.amazon.com/s3/sla/

> 99.9%

(single-region)

There doesn't seem to be an SLA on S3-cross-region-replication configurations, but I am not aware of a multi-region S3 (read) outage, ever.

> https://azure.microsoft.com/en-au/support/legal/sla/storage/....

> 99.99%

99.99% is for "Read Access-Geo Redundant Storage (RA-GRS)"

Their equivalent SLA is the same (99.9% for "Locally Redundant Storage (LRS), Zone Redundant Storage (ZRS), and Geo Redundant Storage (GRS) Accounts.").
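
For scale, a back-of-the-envelope conversion of those monthly availability tiers into allowed downtime (assuming a 30-day month):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for sla in (0.999, 0.9999):
    allowed = (1 - sla) * MINUTES_PER_MONTH
    print(f"{sla:.2%} uptime allows ~{allowed:.1f} minutes of downtime per month")
# 99.90% uptime allows ~43.2 minutes of downtime per month
# 99.99% uptime allows ~4.3 minutes of downtime per month
```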


Azure is a cloud solution. The thread is about how a random datacenter with a random solution is better than S3.


Wow, he’s comparing the storage SLAs of the two biggest cloud services in the world. Pedantic behavior should hurt.


> For the downvoters, please just link here the proof if you disagree.

https://wasabi.com/


How can they possibly guarantee eleven nines? Considering I’ve never heard of this company and they offer such crazy-sounding improvements over the big three, it feels like there should be a catch.


11 9s isn't uncommon. AWS S3 does 11 9s (up to 16 9s with cross-region replication?) for data durability, too. AFAIK, AWS published papers about their use of formal methods to ensure bugs from other parts of the system didn't creep in and affect durability/availability guarantees: https://blog.acolyer.org/2014/11/24/use-of-formal-methods-at...

This is a pretty neat and concise read on ObjectStorage in-use at BigTech, in case you're interested: https://maisonbisson.com/post/object-storage-prior-art-and-l...


You have to be kidding me. 14 9's is already microseconds a year. Surely below anybody's error bar for whether a service is down or not.

16 9's and AWS should easily last as long as the great pyramids without a second's worth of outage.

What a joke


The 16 9's are for durability, not availability. AWS is not saying S3 will never go down; they're saying it will rarely lose your data.


This number is still total bullshit. They could lose a few kb and be above that for centuries


It's not about losing a few kb here and there.

It's about losing entire data centers to massive natural disasters once in a century.


None of the big cloud providers have unrecoverably lost hosted data yet, despite storing vast volumes, so this doesn't seem BS to me.


AWS lost data in Australia a few years ago due to a power outage I believe.


on EBS, not on S3. EBS has much lower durability guarantees


Not losing any data yet doesn't give justification for such absurd numbers


Those numbers probably aren't as absurd as you think. 16 9s is, I think, 10 bytes lost per exabyte-year of data storage.

There's perhaps the additional asterisk of "and we haven't suffered a catastrophic event that entirely puts us out of business". (Which is maybe only terrorist attacks). Because then you're talking about losing data only when cosmic-ray bitflips happen simultaneously in data centers on different continents, which I'd expect doesn't happen too often.


This is for data loss. 11 9s is like 1 byte lost per terabyte-year or something, which isn't an unreasonable number.
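
A quick back-of-the-envelope check of those figures, treating the durability number as an annual per-byte loss probability (a simplification; providers quote it per object). It puts both parent estimates in the right ballpark, within an order of magnitude:

```python
def expected_bytes_lost_per_year(nines, stored_bytes):
    """Expected annual loss if the yearly loss fraction is 10**-nines."""
    return stored_bytes * 10 ** -nines

TB, EB = 10 ** 12, 10 ** 18
print(expected_bytes_lost_per_year(11, TB))  # 10.0  -> ~10 bytes per terabyte-year
print(expected_bytes_lost_per_year(16, EB))  # 100.0 -> ~100 bytes per exabyte-year
```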


This is why I linked the SLA page which you obviously have not read. There are different numbers for durability and availability.


For data durability? I believe some AWS offerings also have an SLA of eleven 9's of data durability.


11 9s of durability, barely two 9s of availability

I'm sure that's okay if you do bulk processing / time-independent analysis, but don't host production assets on wasabi.


I was asking for numbers on reliability, durability, and availability for a service like S3. What does Wasabi have to do with that?


Always in Virginia, because US-east has always been cheaper.


I know a consultant who calls that region us-tirefire-1.


I and some previous coworkers call it the YOLO region.


The only regions that are more expensive than us-east-1 in the States are GovCloud and us-west-1 (Bay Area). Both us-west-2 (Oregon) and us-east-2 (Ohio) are priced the same as us-east-1.


I would probably go with US-EAST-2 just because it's isolated from anything except perhaps a freak tornado and is better situated in the eastern US. Latency to/from there should be near optimal for most of the eastern US/Canada population.


One caveat with us-east-2 is that it appears to get new features after us-east-1 and us-west-2. You can view the service support by region here: https://aws.amazon.com/about-aws/global-infrastructure/regio....


Fair point. It depends on what the project is.


And for those of us in GST/HST/VAT land, hosting in USA saves us some tax expenditures.


How?

At least in EU services bought from overseas are subject to reverse charge, i.e. self-assessment of VAT (Article 196 of https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02... ).

Though note that if you are an EU AWS customer, you are not buying from outside EU, you are buying from Amazon's EU branches regardless of AWS region. If Amazon has a local branch in your country, they charge you VAT as any local company does. Otherwise you buy from an Amazon branch in another EU country, and you again need to self-assess VAT (reverse charge) per Article 196.


My experience is with Canadian HST.

Since AWS built a DC in Canada, I’m paying HST on my Route53 expenses, but not on my S3 charges in non-Canadian DCs.

I’m not an HST registrant (small supplier, or if you’re just using services personally), so there’s nothing to self-assess.

Even if self-assessment was required, you get some deferral on paying (unless you have to remit at time of invoice?).


Makes sense.

I believe it works differently in EU (i.e. US DCs taxed) as per Article 44 the place of supply of services is the customer's country if the customer has no establishment in the supplier's country.


AWS is registered for Australian GST - they therefore charge GST on all(ish) services[0].

IBM/Softlayer, Rackspace, Google Cloud, Microsoft and I imagine everyone else large enough to count also does, too.

For Australian businesses, at least, being charged GST isn't a problem - they can claim it as an input and get a tax credit[1].

[0] https://aws.amazon.com/tax-help/australia/

[1] https://www.ato.gov.au/Business/GST/Claiming-GST-credits/


You know, normally you still have to pay that tax - just through a reverse charge process


Not the case in Canada if you’re not an HST registrant (non-business or a small enough business where you’re exempt).

Even if you did have to self-assess, better to pay later than right away.


Mostly because those sites were never architected to work across multiple availability zones.


Years ago, when I was playing with AWS in a course on building cloud-hosted services, it was well known that all the AWS management was hosted out of a single zone. There were several days we had to cancel class because us-east-1 had an outage: while technically all our VMs hosted out of other AZs were extant, all our attempts to manage them via the web UI or API were timing out or erroring out.

I understand this is long-since resolved (I haven't tried building a service on Amazon in a couple years, so this isn't personal experience), but centralized failure modes in decentralized systems can persist longer than you might expect.

(Work for Google, not on Cloud or anything related to this outage that I'm aware of, I have no knowledge other than reading the linked outage page.)


> it was well-known that all the AWS management was hosted out of a single zone, and there were several days we had to cancel class because us-east-1 had an outage

Maybe you mean region, because there is no way that AWS tools were ever hosted out of a single zone (of which there are 4 in us-east-1). In fact, as of a few years ago, the web interface wasn’t even a single tool, so it’s unlikely that there was a global outage for all the tools.

And if this was later than 2012, even more unlikely, since Amazon retail was running on EC2 among other services at that point. Any outage would be for a few hours, at most.


Quoting https://docs.aws.amazon.com/general/latest/gr/rande.html

"Some services, such as IAM, do not support Regions; therefore, their endpoints do not include a Region."

There was a partial outage maybe a month and a half ago where our typical AWS Console links didn't work but another region did. My understanding is that if that outage were in us-east-1 then making changes to IAM roles wouldn't have worked.


The original poster said that no AWS services are hosted in a single AZ; the quote you referenced says that IAM does not support regions.

Your quote could mean two things:

- that IAM services are hosted in one region (not one AZ)

and/or

- that IAM is scoped to the entire account, not per region like other services (which is true)


Just this year an issue in us-east-1 caused the console to fail pretty globally.


Quite possibly, it has been a number of years at this point, and I didn't dig out the conversations about it for primary sourcing.


Where are you based? If you’re in the US (or route through the US) and trying to reach our APIs (like storage.googleapis.com), you’ll be having a hard time. Perhaps even if the service you’re trying to reach is say a VM in Mumbai.


I am in Brazil, with servers in southamerica. Right now it seems back to normal.


I have an instance in us-west-1 (Oregon) which is up, but an instance in us-west-2 (Los Angeles) which is down. Not sure if that means Oregon is unaffected though.


us-west-1 is Northern California (Bay area). us-west-2 is Oregon (Boardman).


Incorrect. GCE us-west1 is the Dalles, Oregon and us-west2 is Los Angeles.


What I said is correct for AWS. In retrospect I guess the context was a bit ambiguous.

(I will note that I was technically more right in the most obnoxiously pedantic sense since the hyphenation style you used is unique to AWS - `us-west-1` is AWS-style while `us-west1` is GCE-style :P)


EUW doesn't seem to be affected.


My instance in Belgium works fine


Some services are still impacted globally. Gmail over IMAP is unreachable for me. (Edit: gmail web is fine)


+1- imap gmail is down for me in Australia


Yes, same here in UK (for some hours now).


Quick update from Germany, both youtube and gmail appear to work fine


I’m from the US and in Australia right now. Both me and my friends in the US are experiencing outages across google properties and Snapchat, so it’s pretty global.


Fiber cut? SDN bug that causes traffic to be misdirected? One or more core routers swallowing or corrupting packets?


It seemed to be congestion in the North East US.


> including unfortunately the tooling we usually use to communicate across the company about outages.

There's some irony in that.


Edit: and I agree!

I’m not in SRE so I don’t bother with all the backup modes (direct IRC channel, phone lines, “pagers” with backup numbers). I don’t think the networking SRE folks are as impacted in their direct communication, but they are (obviously) not able to get the word out as easily.

Still, it seems reasonable to me to use tooling for most outages that relies on “the network is fine overall”, to optimize for the common case.

Note: the status dashboard now correctly highlights (Edit: with a banner at the top) that multiple things are impacted because of Networking. The Networking outage is the root cause.


> the status dashboard now correctly highlights that multiple things are impacted because of Networking.

this column of green checkmarks begs to differ: https://i.imgur.com/2TPD9e9.png


This is a person who's trying to help out while on vacation...can we try being more thankful, and not nitpick everything they say?


Thanks! I’ll leave this here as evidence that I should rightfully reduce my days off by 1 :).


The banner at the top. Sorry if that wasn’t clear.


While not exactly google cloud, G suite dashboard seems accurate: https://www.google.com/appsstatus#hl=en&v=status


For me, at least, that was showing as all green for at least 30 minutes.


AWS experienced a major outage a few years ago that couldn't be communicated to customers because it took out the components needed to update the status board. One of those obvious-in-hindsight situations.

Not long after that incident, they migrated it to something that couldn't be affected by any outage. I imagine Google will probably do the same thing after this :)


The status page is the kind of thing you expect to be hosted on a competitor network. It is not dogfooding but it is sensible.

Reminds me of when I was working with a telecoms company. It was a large multinational company and the second largest network in the country I was in at the time.

I was surprised when I noticed all the senior execs were carrying two phones, of which the second was a mobile number on the main competitor (ie the largest network). After a while, I realised that it made sense, as when the shit really hit the fan they could still be reached even when our network had a total outage.


> Not long after that incident, they migrated it to something that couldn't be affected by any outage.

Like the black box on an airplane, if it has 100% uptime why don’t they build the whole thing out of that? ;)


Was just reading it, they made their status page multi-region.


Even more irony: Google+ shown as working fine: https://i.imgur.com/52ACuiY.png


G+ is alive and well for G Suite subscribers, not the general users.


> including unfortunately the tooling we usually use to communicate across the company about outages.

So memegen is down?


I'm guessing this will be part of the next DiRT exercise :-) (DiRT being the disaster recovery exercises that Google runs internally to prepare for this sort of thing)


Well, lots of revenue is lost, that's for sure.


>nothing is lost

except time


Can't use my Nest lock to let guests into my house. I'm pretty sure their infrastructure is hosted in Google Cloud. So yeah... definitely some stuff lost.


You have my honest sympathy for the difficulties you're now suffering through, but it bears emphasizing: this is what you get when you replace what should be a physical product under your control with an Internet-connected service running on third-party servers. IoT as seen on the consumer market is a Bad Idea.


It's a trade-off of risks. Leaving a key under the mat could lead to a security breach.


I am pretty sure there are smart locks that don't rely on an active connection to the cloud. The lock downloads keys when it has a connection, and a smartphone can download keys too. This means they work even if there is no active internet connection at the time the person tries to open the door; it would only fail if the connection was dead the entire time between creating the new key and the person trying to use the lock.

If there are no locks that work this way, it sure seems like there should be. Using cloud services to enable cool features is great. But if those services are not designed from the beginning with a fallback for when the internet/cloud isn't live, that is a weakness that is often unwise to leave in place, imo.


FWIW - The Nest lock in question doesn't rely on an active internet connection to work. If it can't connect, it can still be unlocked using the sets of PINs you can setup for individual users (including setting start/end times and time of day that the codes are active). There's even a set of 9V battery terminals at the bottom in case you forget to change the batteries that power the lock.

This does mean you need to set up a code in advance of people showing up, but it's an under-30-second setup that I've found simpler than unlocking once someone shows up. The cameras dropping offline are a hot mess though, since those have no local storage option.


It may not be worth the complexity to give users the choice. If I were to issue keys to guests this way I would want my revocations to be immediately effective no matter what. Guest keys requiring a working network is a fine trade-off.


You can have this without user intervention - have the lock download an expiration time with the list of allowed guest keys, or have the guest keys public-key signed with metadata like expiration time.

If the cloud is down, revocations aren't going to happen instantly anyway. (Although you might be able to hack up a local WiFi or Bluetooth fallback.)
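
A minimal sketch of that idea, using a stdlib HMAC as a stand-in for the public-key signature described above (a real lock would verify an asymmetric signature so the lock itself never holds signing material); all names here are hypothetical:

```python
import hashlib
import hmac
import json
import time

SECRET = b"lock-provisioning-secret"  # stand-in for a real signing key

def issue_guest_key(guest_id, valid_seconds=86400):
    """Issued by the owner's app while it has connectivity."""
    payload = json.dumps({"guest": guest_id,
                          "expires": int(time.time()) + valid_seconds})
    tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, tag

def lock_accepts(payload, tag):
    """Runs entirely on the lock; needs no network connection."""
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False                                    # forged or tampered
    return json.loads(payload)["expires"] > time.time() # reject expired keys
```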


So can a compromise of a "smart" lock.

It's a fake trade-off, because you're choosing between a low-tech solution and bad engineering. IoT would work better if you made the "I" part stand for "Intranet", and kept the whole thing a product instead of a service. Alas, this wouldn't support user exploitation.


Yeah, my dream device would be some standard app architecture that could run on consumer routers. You buy the router and it's your family file and print server, and also is the public portal to manage your IoT devices like cameras, locks, thermostats, and lights.


You can get a fair amount of this with a Synology box. Granted, a tool for the reasonably technically savvy and probably not grandma.


I love my Synology, I wish they would expand more into being the controller of the various home IOT devices.


I don't use the features, but I know my QNAP keeps touting IoT, so they might be worth checking out as well.

It's also my Plex media server, file server, and VPN, and I run some containers on there. I used to use it as a print server, but my new printer is wireless so I never bothered.


Don't be ridiculous. Real alternatives would include P2P between your smart lock and your phone app, or a locally hosted hub device which controls all home automation/IoT, instead of a cloud. If the Internet can still route an "unlock" message from your phone to your lock, why do you require a cloud for it to work?


Or use one of the boxes with combination lock that you can screw onto your wall for holding a physical key. Some are even recommended by insurance companies.


At least you can isolate your security risk to something you have more control over than a random network outage.


Any key commands they have already set up will still work. Nest is pretty good at having network failures fail to a working state. Not being able to actively open the lock over the network is the only change.


One of the reasons why I personally wanted a smart-lock that had BLE support along with a keypad for backup in addition to HomeKit connectivity.


Sure you can, but you'll need to give them your code or the master code. Unless you've enabled Privacy Mode, in which case... I don't know if even the master code would work.


You should have foreseen this when you bought stuff that relies on "the cloud".


Everyone talking about security and not replacing locks with smart locks seems to forget that you can just kick the fucking door down or jimmy a window open.


Or just sawzall a hole in the side of the house...


After you've cut the power, just to be safe? ;)


Except kicking the door down is not particularly scalable or clandestine


Too bad we don't have Google cars yet.


"Cloud Automotive Collision Avoidance and Cloud Automotive Braking services are currently unavailable. Cloud Automotive Acceleration is currently accepting unauthenticated PUT requests. We apologise for any inconvenience caused."


Our algorithms have detected unusual patterns and we have terminated your account as per clause 404 in Terms And Conditions. The vehicle will now stop and you are requested to exit.


Phoenix Arizona residents think otherwise


They weren't wearing Batman t-shirts were they?

http://www.ktvu.com/news/mistaken-identity-nest-locks-out-ho...


I wonder if in the future products will advertise that they work independently (decoupling as a feature).


holy shit lmao. I'm sorry that sucks.


and a nice Sunday afternoon


And lots of sales on my case


And the illusion of superiority over non cloud offerings.


I keep trying to explain to people that our customers don’t care that there is someone to blame; they just want their shit to work. There are advantages to having autonomy when things break.

There’s a fine line or at least some subtlety here though. This leads to some interesting conversations when people notice how hard I push back against NIH. You don’t have to be the author to understand and be able to fiddle with tool internals. In a pinch you can tinker with things you run yourself.


> I keep trying to explain to people that our customers don’t care that there is someone to blame; they just want their shit to work. There are advantages to having autonomy when things break.

There are also advantages to being part of the herd.

When you are hosted at some non-cloud data center, and they have a problem that takes them offline, your customers notice.

When you are hosted at a giant cloud provider, and they have a problem that takes them offline, your customers might not even notice because your business is just one of dozens of businesses and services they use that aren't working for them.


Of course customers don't care about the root cause. The point of the cloud isn't to have a convenient scapegoat to punt blame to when your business is affected. It's a calculated risk that uptime will be superior compared to running and maintaining your own infrastructure, thus allowing your business to offer an overall better customer experience. Even when big outages like this one are taken into account, it's often a pretty good bet to take.


What does NIH stand for?


Not Invented Here


How come?


The small bare metal hosting company I use for some projects hardly goes down, and when there is an issue, I can actually get a human being on the phone in 2 minutes. Plus, a bare metal server with tons of RAM costs less than a small VM on the big cloud providers.


> a bare metal server with tons of RAM costs less than a small VM on the big cloud providers

Who are you getting this steal of a deal from?


Hetzner is an example. Been using them for years and it's been a solid experience so far. OVH should be able to match them, and there's others, I'm sure.


Hetzner is pretty excellent quality service overall. OVH is very low quality service, especially with the networking and admin pane.


hetzner.de, online.net, ovh.com, netcup.de for the EU-market.


Anywhere. Really.

Cloud costs roughly 4x what bare metal costs for sustained usage (of my workload). Even with the heavy discounts we get for being a large customer it's still much more expensive. But I guess op-ex > cap-ex.


Lots of responses, and I appreciate them, but I'm specifically looking for a bare metal server with "tons of RAM", that is at the same or lower price point as a google/microsoft/amazon "small" node.

I've never seen any of the providers listed offer "tons of ram" (unless we consider hundreds / low thousands of megabytes to be "tons") at that price point.


I've had pretty good luck with Green House Data's colo service and their cloud offerings. A couple of RUs in the data center can host 1000s of VMs across multiple regions with great connectivity between them.


Care to name names? I've been looking for a small, cheap failover for a moderately low traffic app.


In the US I use Hivelocity. If you want cheapest possible, Hetzner/OVH have deals you can get for _cheap._


I've a question that always stopped me going that route: what happens when a disk or other hardware fails on these servers? Beyond data loss I mean - physically, what happens? Who carries out the repair, and how long does it take?


For Hetzner you have to monitor your disks and run RAID-1. As soon as you get the first SMART failures you can file a ticket and either replace ASAP or schedule a time. This has happened to me a few times in the past years; it has always been just a 15-30m delay after filing the ticket and at most 5 minutes of downtime. You have to get your Linux stuff right though, i.e. booting with a new disk.

If you don't like that you can order a KVM VM with dedicated cores at similar prices and the problem is not yours anymore.
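
A rough sketch of the kind of disk monitoring that implies, parsing /proc/mdstat for degraded arrays (in practice you'd also watch SMART attributes with smartctl and wire this into whatever alerting you already use):

```python
import re

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return md arrays whose member-status marker shows a failed disk,
    e.g. '[U_]' instead of '[UU]' for a two-disk RAID-1."""
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            elif current and (status := re.search(r"\[(U|_)+\]", line)):
                if "_" in status.group(0):
                    degraded.append(current)
                current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("file a ticket with the host; degraded arrays:", bad)
```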


Most bare metal providers nowadays contact you just like AWS and say "hey your hardware is failing get a new box.". Unless it's something exotic it's usually not long for setup time, and in some cases just like a VM it's online in a minute or two.


thanks!


Thanks a million. Those prices look similar to what I've used in the past, it's just been a long time since I've gone shopping for small scale dedicated hosting.


You weren't kidding, 1:10 ratio to what we pay for similar VPS. And guaranteed worldwide lowest price on one of them. Except we get free bandwidth with ours.


There are some who will argue that the resiliency of cloud providers beats on-prem or self-hosted, and yet they're down just as much or more (GCP, Azure, and AWS all the same). Don't take my word for it; search HN for "$provider is down" and observe the frequency of occurrences.

You want velocity for your dev team? You get that. You want better uptime? Your expectations are gonna have a bad time. No need for rapid dev or bursty workloads? You’re lighting money on fire.

Disclaimer: I get paid to move clients to or from the cloud, everyone’s money is green. Opinion above is my own.


Solutions based on third-party butts have essentially two modes: the usual, where everything is smooth, and the bad one, where nothing works and you're shit out of luck - you can't get to your data anymore, because it's in my butt, accessible only through that butt, and arguably not even your data.

With on-prem solutions, you can at least access the physical servers and get your data out to carry on with your day while the infrastructure gets fixed.


Any solution would be based on third parties; the robust solution is either to run your own country, with fuel sources for electricity and an army to defend the datacenters, or to rely on multiple independent infrastructures. I think the latter is less complex.


This is a ridiculous statement. Surely you realise that there is a sliding scale.

You can run your own hardware and pull in multiple power lines without establishing your own country.

I’ve ran my own hardware, maybe people have genuinely forgotten what it’s like, and granted, it takes preparation and planning and it’s harder than clicking “go” in a dashboard. But it’s not the same as establishing a country and source your own fuel and feed an army. This is absurd.


Correct. Most CFOs I've run into as of late would rather spend $100 on a cloud VM than deal with capex, depreciation, and management of the infrastructure. Even though doing it yourself with the right people can go a lot further.


The GP's statement is about relying on third parties; multiple power lines with generators you don't own on the other end fall under it.

Fun related fact: my first employer's main office was in a former electronics factory in Moscow's downtown, powered by 2 thermal power stations (and no other alternatives), which had the exact same maintenance schedule.


Assuming you have data that is tiny enough to fit anywhere other than the cluster you were using. Assuming you can afford to have a second instance with enough compute just sitting around. Assuming it's not the HDDs, RAID controller, SAN, etc which is causing the outage. Assuming it's not a fire/flood/earthquake in your datacenter causing the outage.

...etc.


Ah, yes, I will never forget running a site in New Orleans, and the disaster preparedness plan included "When a named storm enters or appears in the Gulf of Mexico, transfer all services to offsite hosting outside the Gulf Coast". We weren't allowed to use Heroku in steady state, but we could in an emergency. But then we figured out they were in St. Louis, so we had to have a separate plan for flooding in the Mississippi River Valley.


Took me a second.

I didn’t know the cloud-to-butt translator worked on comments too. I forgot that was even a thing.


Oh that’s weird, because it totally worked for me with “butts” as a euphemism for “people”, as in “butt-in-seat time” — relying on a third-party service is essentially relying on third party butts (i.e. people), and your data is only accessible through those people, whom you don’t control.

And then “your data is in my butt” was just a play on that.


I keep forgetting that I have it on, my brain treats the two words as identical at this point. The translator has this property, which I also tend to forget about, that it will substitute words in your HN comment if you edit it.

But yeah, it's still a thing, and the message behind it isn't any less current.


There is a cloud I've developed that is secure and isn't a butt :P

https://hackaday.io/project/12985-multisite-homeofficehacker...

I made an IoT setup using cheap parts (Arduino, nRF24L01+, sensors/actuators) for local device telemetry, with MQTT, Node-RED, and Tor for connecting clouds of endpoints that aren't local.

Long story short, it's an IoT that is secure, consisting of a cloud of devices only you own.

Oh yeah, and GPL3 to boot.


And reputation. With this outage the global media socket is going to be in gCloud nine.


and reputation.


Seems to be the private network. The public network looks fine to us from all over the world?


Not on my end. Public access in us-west2 (Los Angeles) is down for me.


Hmmm... why is our monitoring network not showing that?

Edit: ah, looks like the LB is sending LA traffic to Oregon.


Our Oregon VMs are up.


> but there is serious packet loss at the least.

Can confirm with Gmail in Europe. Everything works but it's sluggish (i.e. no immediate reaction on button clicks).


We are also hosted on GCP but nothing is down for us. We are using 3 regions in the US and 2 in the EU.


What could be the reason for the outage? Could it be a cyber attack on your servers?


go/stopleaks :)


Hm, isn't releasing go links publicly also verboten? :)


This happened to Amazon S3 as well once. The "X" image they use to indicate a service outage was served by... yup, S3, which was down obviously.


One of the projects I worked on was using data URIs for critical images, and I wouldn’t trust that particular team to babysit my goldfish.

Sounds like Google and Amazon are hiring way too many optimists. I kinda blame the war on QA for part of this, but damn that’s some Pollyanna bullshit.
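
For anyone who hasn't seen the trick: a data URI inlines the image bytes into the page itself, so the outage graphic can't be taken out by the same object-store failure it's reporting. A quick sketch of generating one (the file name is hypothetical):

```python
import base64

def to_data_uri(path, mime="image/png"):
    """Inline an image so the status page has no external dependency for it."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# e.g. embed directly in the status page template:
# <img src="data:image/png;base64,..." alt="service outage">
print(to_data_uri("outage_x.png")[:60])  # hypothetical file
```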


You're brave to jump on here when on holiday!

Shouldn't that outage system be aware when service heartbeats stop?

Could this be a solar flare?


Now is a good time to point out that the SLA of Google Cloud Storage only covers HTTP 500 errors: https://cloud.google.com/storage/sla. So if the servers are not responding at all then it's not covered by the SLA. I've brought this to their attention and they basically responded that their network is never down.


Ironically I can't read that page because, since it's Google-hosted, I'm getting an HTTP 500 error... but which means at least that service is SLA-covered...

Cloud services live and die by their reputation, so I'd be shocked if Google ever tried to get out of following an SLA contract based on a technicality like that. It would be business suicide, so it doesn't seem like something to be too worried about?



This should be voted higher up.

According to https://twitter.com/bgp4_table, we have just exceeded 768k Border Gateway Protocol routing entries, which may be causing some routers to malfunction.


Isn't it weird that it's happening now even though that number was surpassed nearly a month and a half ago?


Different locations see different counts because of aggregation/de-aggregation.


Will this affect more than just Google? I haven't seen any outages from other cloud providers.


packet.net was hit. Specifically, also their San Jose DC. Internet only. It took less than an hour to recover - more than 20 minutes. I didn't ping it continuously, but I can say that the traceroute got stuck in Frankfurt (where my ISP and their ISP first meet, as seen from me).

I was actually surprised, as they tend to have excellent networking. Now I'm not nearly as distrusting as I was initially, knowing it was likely their ISP getting screwed by routing table overflow.


There goes 3 nines for June and for Q2. I guess everyone gets a 10% discount for the month? https://cloud.google.com/compute/sla


Remember to request the credit!

From that linked page:

"Customer Must Request Financial Credit

In order to receive any of the Financial Credits described above, Customer must notify Google technical support within thirty days from the time Customer becomes eligible to receive a Financial Credit. Customer must also provide Google with server log files showing loss of external connectivity errors and the date and time those errors occurred. If Customer does not comply with these requirements, Customer will forfeit its right to receive a Financial Credit. If a dispute arises with respect to this SLA, Google will make a determination in good faith based on its system logs, monitoring reports, configuration records, and other available information, which Google will make available for auditing by Customer at Customer’s request."
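
Given that evidence requirement, it's worth having something like this trivial probe writing timestamped connectivity failures to a log you could attach to a credit request (a sketch; the target and interval are arbitrary choices):

```python
import logging
import time
import urllib.request
import urllib.error

logging.basicConfig(filename="external_connectivity.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

TARGET = "https://storage.googleapis.com"  # any external dependency you rely on

while True:
    try:
        urllib.request.urlopen(TARGET, timeout=5)
        logging.info("reachable %s", TARGET)
    except urllib.error.HTTPError:
        # An HTTP error status still proves the endpoint is reachable.
        logging.info("reachable (non-200) %s", TARGET)
    except (urllib.error.URLError, OSError) as exc:
        # These timestamped failures are the evidence the SLA asks for.
        logging.error("loss of external connectivity to %s: %s", TARGET, exc)
    time.sleep(60)
```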


A couple more hours and everyone will get 25% off for June.


Does that apply to the rest of June?

Might be a good month to rebuild all your models ;)


The vultures are circling.


Ironically, the SLA page returns a 502 error.


The discount seems way too small.

I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.


Any cloud provider offering those terms would go out of business VERY quickly. Outages happen, all providers are incentivized to minimize the frequency and severity of disruptions - not just from the financial hit of breaching SLA (which for something like this will be significant), but for the reputational damage which can be even more impactful.


How often does amazon or google go down for ten minutes?

But let's work backwards from the goal instead.

If you charge twice as much, and then 20-30% of months are refunded by the SLA, you make more money and you have a much stronger motivation to spend some of that cash on luxurious safety margins and double-extra redundancy.

So what thresholds would get us to that level of refunding?


> Any cloud provider offering those terms would go out of business VERY quickly

Minimum spends and a 50,000% markup based on adding that term to your contract.


I think you're proving the parent comment's point. The number of businesses willing to pay a 500x markup is exceedingly small (potentially less than 1), and at that point the cost is high enough where it's probably cheaper to just build the redundancy yourself using multiple cloud providers (and, to emphasize, that option tends to be horribly expensive).


And all cloud providers will emphasize how you yourself should design your software and architect your infrastructure to be available in multiple regions to achieve the highest availability.


Just take the premium that you'd be willing to pay and put it in the bank -- the premium would be priced such that the expected payout of the premium would be less than or equal to what you'd be paying.

Besides, a provider credit is the least of most company's concerns after an extended outage, it's a small fraction of their remediation costs and loss of customer goodwill.


> Just take the premium that you'd be willing to pay and put it in the bank

In my country, when companies are hired to do overnight rail maintenance, they face very stiff fines if they over-run and delay trains the next morning.

The fines are large enough that (for example) companies will have a heavy plant mechanic on site who does nothing on the vast majority of jobs - they're just standing by, to mitigate the risk of a breakdown leading to such a fine. Some business analyst with a spreadsheet has worked out the heavy plant breakdown rate, the typical resulting delays, the expected fines, and the cost of having the mechanic on standby... and they've worked out it's a good business decision.

The purpose of having an SLA isn't to get yourself money when your provider fails. The purpose is to make costly risk mitigation a rational investment for your suppliers.
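
The spreadsheet logic behind that standby mechanic is just an expected-value comparison; with purely illustrative numbers:

```python
# Purely illustrative numbers, not from any real contract.
p_breakdown = 0.02         # chance a job is hit by a plant breakdown
fine_if_delayed = 250_000  # penalty if the breakdown delays morning trains
standby_mechanic = 1_500   # cost of keeping a mechanic on site for the job

expected_fine = p_breakdown * fine_if_delayed  # 5,000 per job
# Paying 1,500 to (mostly) avoid an expected 5,000 loss is the rational choice.
print(standby_mechanic < expected_fine)  # True
```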


> The discount seems way too small.

> I would pay a premium for a cloud provider happy to give 100 percent discount for the month for 10 minutes downtime, and 100 percent discount for the year for an hour's downtime.

It takes a lot of effort (exponential) to build something reliable (i.e. designed to keep working as components fail) that is guaranteed to have this level of uptime at these penalties.

So I'm sure that I can build something that works like this, but would you pay me $100 per GB of storage per month? $100 per wall-time hour of CPU usage? $100 per GB of Ram used per hour? Because these are the premium prices for your specs.


You know, this reminds me of the bad taste the Google sales team left when I asked about some billing for resources I was unaware were running after following a quickstart guide.

AWS refunded me in the first reply on the same day!

The GCP sales rep just copy-pasted a link to a self-support survey that essentially told me, after a series of YES or NO questions, that they can't refund me.

So why not just tell your customers how it is? Google Cloud is super strict when it comes to billing. I have called my bank to do a chargeback and put a hold on all future billing with GCP.

I'm now back to AWS and still on the Free Tier. Apparently the $300 trial with Google Cloud did not include some critical products. The AWS Free Tier makes it super clear, and even still I sometimes leave something running and discover it in my invoice....

I've yet to receive a reply from Google and its been a week now.

I do appreciate other products such as Firebase but honestly for infrastructure and for future integration with enterprise customers I feel AWS is more appropriate and mature.


The thing that worries me most about Google Cloud and these billing stories is that I’m assuming if you chargeback or block them at your bank then they’ll ban all Google accounts of yours - and they’re obviously going to be able to make the link between an account made just for Google Cloud and my real account.


They WILL absolutely block and suspend all accounts indefinitely. They have terminated accounts over failed credit card transactions.

I really wanted to try out their new AutoML, but I was paranoid about entering my credit card and getting banned from Google.


oh man....so ALL of my gmail gets banned?

this is FUCKED. It's akin to holding my YouTube and Google Play accounts hostage.


For any Google service with billing, always set up an entirely new account, in incognito mode, without using the same recovery phone number or email address.

That way Google won't ban your main account for non-payment.

It's the only way, especially considering Google cloud has no functionality to cap spending.


They will link the accounts by IP (and other fingerprinting), and you will get banned.


My experience shows they do not.

For a ban, they need something concrete like using the same browser cookies, recovery email address or phone number.

IP isn't enough alone - you could be on shared WiFi.

Also, after account creation, you can log in from the same place without risk, or even use multi-login to log into both accounts at the same time.


I’m so tired of this widespread tracking. You can’t even buy a burner phone anymore because there exist multiple fingerprinting methods.

Phone and desktop OSs should grow a pair and create a virtualization protocol to randomize tracking info to keep PII anonymous.


I moved off of gmail after reading stories of people getting locked out of their google accounts and how difficult it made things due to email basically being an internet passport, so I sympathise with your fear here.

With that said, if I delivered you services and then you did a credit card chargeback against me, I'd cut all relations with you as well.


all your bases belong to us.


Are you seriously complaining about having to pay for using their resources? I understand that you're surprised some things aren't covered in the free trial or free credit or whatever, but getting $300 free already sounded a little too good to be true (I heard about it from a friend and was dubious; at least in Europe, consumers are told not to enter deals that are too good to be true). You could at least have checked what you're actually getting.

I think it's weird to say you get credit in dollars and then not be able to spend it on everything. That's not how money works. But that's the way hosting providers work and afaik it's quite well known. Especially with a large sum of "free money", even if it's not well known, it was on you to check any small print.


> Are you seriously complaining about having to pay for using their resources?

I didn't read it that way. I thought they were complaining about poor customer service that made it difficult to understand the bill or respond to it appropriately.


I read it that way too, but it's sort of understandable that a free tier user is not going to get the same "customer care" as someone who regularly leaves let's say 50K USD with them.


Google is well known for not caring about small shops; only if you are a multi-million-dollar customer with a dedicated account manager can you expect reasonable support. That's been the case forever with them.


Does Amazon treat smaller customers any better? I am genuinely asking, as I have no clue.


Absolutely. I've seen them wipe a number of bills away for companies that have screwed up something. They definitely take a longer view on customer happiness than GCP. Azure also tends to be pretty good in this regard.


Depends.

AWS is mostly easy going.

Only some people in the partner program can vary.

I had a guy who wanted to help me out even though I was just a one-person shop. After he left I got a woman who threw me out of the program faster than I could blink.


Yes. 100%. We don’t pay AWS much but their help is top notch. We accidentally bought physical instances instead of reserved instances. AWS resolved the issue and credited us. I’ll prob never touch GCE. Google just isn’t a good company at any level.


I'm scared to put anything serious on GCE. One super bill from a DDoS, one tiny billing dispute, something else tiny and unpredictable, etc. Suddenly my entire 10+ years of Gmail is gone. Or Google cancels my friend's or SO's Gmail for whatever Orwellian reason, and suddenly my production app is inaccessible or deleted.

There's too much liability. And no support.


Definitely. A previous small company I worked in had some S3 Snafu and AWS Support was super helpful.


I've got a personal account with an approximately $1/mo bill (just a couple things in S3) and a work account with ~$1500/mo AWS bill (not a large shop by any means) and I've always felt very positive about my interactions with AWS support


Yes


Their ecommerce and AWS both have fantastic help and followups (and also aggressive marketing).


If you buy their support (which isn't that expensive). Holy fuck it's good. You literally have an infrastructure support engineer on the phone for hours with you. They will literally show you how to spend less money for your hosting while using more AWS services.


>I asked for some of my billing that I was unaware of running

>I have called my bank to do a chargeback

You're issuing a chargeback because you made a mistake and spent someone else's resources? And you're admitting to this on HN? I'm not a lawyer, but that sounds like fraud and / or theft to me.


It’s not, read the terms of your credit card. It’s basically “I didn’t intend to buy this. I tried in good faith to contact the merchant for return and support. I was ignored. I’m contacting you.”

It’s pretty convenient for companies like Comcast and Google that have poor customer service.


Imagine leaving a 1000 watt space heater on by accident in a spare room for a month, then trying to get a "refund" from the power company because you "didn't intend" to purchase all that power you used. That's effectively what this is - signing a service agreement and forgetting to turn off a service you don't need, causing an irreversible loss of resources. You're not entitled to a refund for a service you agreed to pay for and actually used, just because you forgot about it.


My power company will literally give you a refund for this if you call them. I don’t think I’m entitled to a refund, but good customer service gives me one.

Of course, I get one free pass at that and if I did it over and over, I’m hosed. The difference is that my utility is regulated and has a phone number and a human whose job it is to talk to all customers.


GCE charged me for "Chinese egress" but doesn't provide me a way to block China via firewall or other methods. They have the ability to check and bill me for it, but if I want to use the same logic for a firewall rule I'm on my own. That sounds like theft and/or fraud to me.

OP sounds like they're just defending themselves from ambiguous, draconian billing robots.


You may not like it, but that doesn't remotely resemble theft or fraud in any way.


What was the quickstart guide?


Anything created in-house at Google (GCP) is typically created by technically-proficient devs, those devs then leave the project to start something new and maintenance is left to interns and new hires. Google customer service basically doesn't care and also has no tools at their disposal to fix any issues anyway.

The infinite money spout that is Google Ads has created a situation in which devs are at Google just to have fun - there really is no incentive to maintain anything because the money will flow regardless of quality.

Source: I interned at Google.


Isn't it also that promotions at Google are based on creating new products/projects rather than maintaining existing ones? So engineers have a negative incentive to maintain things since it costs them promotions.


That explains the proliferation of chat services and why they all get actively worse over time until they're replaced


From what I’ve been told, the issue is that the people with political capital (managers, PMs, etc) are quick to move after successful launches and milestones. No matter how many competent engineers hang around, the product/team becomes resource and attention starved.


I'm not sure why you are downvoted - seems like a reasonable insight and explanation for the drop in quality and weird decisions Google is making recently.


It’s not insightful at all. Just one intern’s very brief observations of something way more complicated and nuanced than is deserving of such a dismissive comment.


I'll take brief comments that shed partial light on something over no comments at all and no insight at all.


I have mentioned this multiple times: any criticism of Google is met with a barrage of downvotes. I guess all the Googlers hang around here, and they are usually commenting with throwaways.


Google seems to be sort of like a sect of narcissists.


The GCP status page is worthless, as it's always happy and green when production systems are down, and then they might acknowledge something an hour later.


Just like AWS, then. "Some users are experiencing increased error rates" = "Everything has been down for hours"


"Everything is fine, unless you're Carl. There's a massage outage, but only at Carl's house. Sorry, Carl."


I'm also experiencing a massage outage. Please send masseuse.


Goddammit some (most?) days I can't type. "Massive"


I got the missive. Thanks.


I remember when S3 was down and the status was green because the updates for the status page were pushed via S3.


That's not just ironic, that's stupid. How do you count on S3 to update S3 status? Isn't that a huge design fault?


Yes and they fixed that. :)


Azure too. During the most recent outage a couple weeks ago their Twitter account acknowledged the incident an hour before the status page did.

So no matter where you go for your cloud services, you're guaranteed a useless status page. Yippee.


AWS is no better. Something from 2015 I remember: https://twitter.com/SIGKILL/status/630684777813684224?s=19


I swear most status pages are run by folks who aren't "there".


It’s an easy problem to fix: services should openly emit performance data, and the status page should just summarize it. If a service doesn’t report in, it’s assumed to be down or erroring out.

Having an Excel file where people enter statuses is not very useful to me as a customer. That’s more like a blog.
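
A minimal sketch of that "no heartbeat means down" aggregation (hypothetical service names and thresholds):

```python
import time

HEARTBEAT_TTL = 120  # seconds of silence before a service is shown as down
last_seen = {}       # service name -> (unix timestamp, healthy flag)

def record_heartbeat(service, healthy=True):
    """Called by each service (or its monitor) whenever it reports in."""
    last_seen[service] = (time.time(), healthy)

def status(service):
    """What the public page would show: silence is treated as an outage."""
    entry = last_seen.get(service)
    if entry is None or time.time() - entry[0] > HEARTBEAT_TTL:
        return "down (no heartbeat)"
    return "ok" if entry[1] else "degraded"

record_heartbeat("compute")                       # hypothetical service
print(status("compute"), "/", status("storage"))  # ok / down (no heartbeat)
```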


I haven’t written a status page in a while, but the rest of my infrastructure starts freaking out if it hasn’t heard from a service in a while. Why doesn’t their status page have at least a warning about things not looking good?


In my experience public status pages are "political" and no matter how they start tend to trend towards higher management control in some way... that leads to people who don't know, aren't in the thick of it, don't understand it, and / or are cautious to the point that it stops being useful.


Not only political, but with SLAs on the line they have significant financial and legal consequences as well. Most managers are probably happier keeping the ‘incident-declaring power’ in as few hands as possible to make sure those penalty clauses aren’t ever unnecessarily triggered.


That’s fraud in other industries.


Same with most corporate twitter feeds. I’d like to follow my public transit/airport/highway authority, but it’ll be 10 posts about Kelly’s great work in accounting for every service disruption.

And No, I don’t want to install a separate app to get push notifications about service disruptions for every service I use.


A good Twitter account is a wonderful thing....the bad ones hurt so bad.


Ugh. I guess that just goes to show that any metric can be politicized.


It's just Goodhart's law in effect: If a status page is used as a target metric in an SLA the status page ceases to be a useful measure.


Status pages are the progress bars of the cloud.


I worked on the networking side for years.

Now I'm on the web development side and I'm all "Wait a minute... are there any progress bars that are based on anything real!?!?"

I should have known...


DigitalOcean has the same issue: the status pages are manually updated and no live data is fed into them.


Was noticing massive issues earlier and thought that maybe my account was blocked for breaching the TOS, as I was heavily playing with Cloud Run. Then I noticed GitLab was also acting up, but my Chinese internet was still surprisingly responsive. Tried the status page, which said everything was fine, and searched Twitter for "google cloud" and also found nobody talking about it. Typically Twitter is the single source of truth for service outages, as people start talking about them right away.


I think this might be a static page they are hosting on Akamai?



They update the page manually.


Google Cloud is the number 4 most monitored status page on StatusGator and Google Apps is number 12. In addition, at least 20 other services we monitor seemingly depend on Google Cloud because they all posted issues as soon as Google went down.

It's always interesting to see these outages at large cloud providers spider out across the rest of the internet, a lot of the world depends on Google to stay up.


This feels like the '80s.

When the mainframe is down, the terminals are useless.


Yep. The cloud is just a lot of cheap hardware acting together as a shitty mainframe.


Server hardware is actually quite expensive. End users' "smart" phones are cheap hardware, running dumb software which renders them as terminals for the cloud. That's sad because smartphone hardware is quite capable of doing useful work.

(For instance, I have a 500GB MicroSD card in my phone which contains a copy of my OwnCloud)


"a lot of the world depends on Google to stay up."

Yup, I'm trying to check the Associated Press News right now and it's having trouble connecting to "storage.googleapis.com".


I guess we know what Steam uses (the store at least).


I don't know about Steam, but I know Apple must use Google Cloud: https://www.apple.com/support/systemstatus/


Less than 1% of users are affected

Is there any reason to presume these statuses are correlated?

Apple's issue is

> Users may be experiencing slower than normal performance with this service.


I'm just assuming they are because it's been previously reported that Apple uses GCP (and also AWS).

https://techcrunch.com/2018/02/27/apple-now-relies-on-google...


Could be the only users who were affected were ones caught right in the failover between redundant clouds


Guess they don't eat their own dog food; no racks of proprietary Apple servers anywhere (unless they somehow run Darwin images in Google Cloud)


Can't tell if you know the answer to your own question and just can't talk about it due to NDA...


Apple runs Linux on the vast majority of the servers behind their cloud offerings.


No issues for me. Maybe they have a failover mechanism?


yeah, maybe it was coincidence. seems to be back up for me as well: https://steamstat.us


...and only the paranoid survive?


Just because you're paranoid, it doesn't mean they're not out to get you.


And thus were ruined hundreds or thousands of pleasant Sunday afternoons.

I don’t miss being on pager duty one bit. I see it looming in my headlights, sadly.


Spare a thought for the pleasant Australian early Monday mornings too! Always a rude awakening...


It's the Queen's birthday, a Monday off here in New Zealand...

... but not for everybody now.


So what happens when the crown changes? They change the holiday? Immediately? For the next year? Sounds like a bit of a nightmare.


The holiday is on the official birthday. The sovereign's actual birthday has been separate from the official birthday for centuries, so the holiday does not need to change.


Nah, it's not even her actual birthday. Different countries with the same queen even celebrate it on different days. Presumably it'll be renamed to "king's birthday" but the day kept the same when the monarch changes. Or done away with/re-purposed - there's a general feeling in Australia at least that once the queen dies there will be less support for the monarchy.


If you think that's a hassle, in Japan the calendar changes with the emperor:

https://www.theguardian.com/technology/2018/jul/25/big-tech-...


Australia celebrates the Queen's Birthday public holiday on different dates in different states already.


It's not actually the Queen's birthday.

In Australia, many states have different dates for the Queen's Birthday.

So not a nightmare at all.


The only response is to wait for Google to fix it.

Nothing you or I or the pager can do will speed that up.

I am aware some bosses won't believe that and I am not trying to make light of it. But there really isn't much else to do except wait.


Either you wait for Google or you frantically try to move everything you've got to AWS.


If you wait, you get back to 100% with no effort or stress on your part.

If you try to be heroic, you get back to 100% with a bunch of wasted effort and stress on your part.

Because it will be fixed by Google, regardless of what you do or don't do.

After the incident is over would be the time to consider alternatives.


So, for some companies, failing over between providers is actually viable and planned for in advance. But it is known in advance that it is time consuming and requires human effort.

The other case is really soft failures for multi-region companies. We degrade gracefully, but once that happens, the question becomes what other stuff can you bring back online. For example, this outage did not impact our infrastructure in GCP Frankfurt, however, it prevented internal traffic in GCP from reaching AWS in Virginia because we peer with GCP there. Also couldn't access the Google cloud API to fall back to VPN over public internet. In other cases, you might realize that your failover works, but timeouts are tuned poorly under the specific circumstances, or that disabling some feature brings the remainder of the product back online.

Additionally, you have people on standby to get everything back in order as soon as possible when the provider recovers. Also, you may need to bring more of your support team online to deal with increased support calls during the outage.
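
To give a flavour of what "timeouts tuned poorly under the specific circumstances" means in practice, here's a rough sketch (endpoints are hypothetical, not our real setup): the failover logic is trivial, but during heavy packet loss a generous default timeout means every request waits out the dead path before trying the healthy one.

  import requests

  # Hypothetical endpoints, in order of preference.
  ENDPOINTS = [
      "https://api.us-east.example.com/health",   # primary, reached via provider peering
      "https://api.eu-west.example.com/health",   # secondary in an unaffected region
  ]

  def first_healthy(timeout=2.0):
      # A 30s "safe" timeout that is fine day to day turns a regional incident
      # into 30 extra seconds of latency on every single request.
      for url in ENDPOINTS:
          try:
              if requests.get(url, timeout=timeout).ok:
                  return url
          except requests.RequestException:
              continue
      return None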


Multi-cloud for those times when you really need that level of availability and can afford it.


It's not even about being able to afford it. Some things just don't lend themselves to hot failover. If your data throughput is high, it may not be feasible or possible to stream a redundant copy to a data center outside the network.


All parts of the system should be copied (if you've decided to build a multi-cloud system), not just some of them.


Do you work at G?


Nope. I was more thinking of everyone else.


That feeling when you open https://console.cloud.google.com and see that you don't have your Kubernetes clusters and CloudSQL databases, but a CTA to create your first one.


Gosh, this was so scary... I thought someone had hacked in and deleted everything...

I hope they come back. This is still pretty scary


Same, my manager called me and said "everything is down".

So I wandered over to my Firebase console, and there was no database loading. Thank god for Twitter and for people saying that they have the same issue, or I would have for sure thought we'd been hacked.

I hope this is a good wake up call for everyone. I know that I'm going to think more about how we do backups and fail-safes


[I am the Cloud SQL tech lead]

This is a networking issue, and your data is safe. Cloud SQL stores instance metadata regionally, so it shares a failure domain with the data it describes. When the region is down or inaccessible, instances are missing from the list results, but that doesn't say anything about the instance availability from within region.


That's good to know. What confuses me is why they're saying "We continue to experience high levels of network congestion in the eastern USA" when I'm in us-west2 (Los Angeles) and none of my CloudSQL instances, nor my k8s cluster, is showing up or contactable...


Same. I was thinking, oh, my db cluster must be having trouble recovering. Couldn't get any response through kubectl. Logged in to the cloud console and it looks all brand new, like I have no clusters setup at all.

Of course, this is 2 weeks after switching everything over from AWS.


And here I thought I was having a bad day with Google Play not loading


My VM instances are all still there; I can even log in via SSH in the Compute Engine tab. Looks like they got a reboot 15 min ago. Just restarted some processes, but I lost about 12 hours of computing time, and I'm guessing it's going to be hard to get a refund.


You'll get a 25 percent refund of all costs for the month if you ask support


Mumbai region here, and GKE seems to be fine. (Accessing it from Bangalore)


Have services running in europe-west1; everything loads and works fine for me. Looks like europe-west1 is not affected.


Nest is down too, not surprising given they are part of Google. What I don't understand is why I can't still control my devices over my local network. Why does the system even require access to Google servers?


Let's make the following the tech mantra of the next decade: "the Internet Of Things can never possibly work until the LAN Of Things does".


the IoT trend really needs to stop


LANoT?


That made me rage the other day when I had an internet outage. I am honestly fed up with Nest, especially after this new Google Nest branding, and just disconnected my thermostats from the internet. I rarely changed it from my scheduled settings via the app anyway.


Makes me wonder what your thermostats are doing on the internet in the first place.


It doesn't, but it sure was easier implementing one connection through Google APIs instead of two: one through Google and one through the local network.


I have not implemented additional APIs by running a local DNS server.


This is your yearly reminder to resist centralization of the internet.


Yes, I haven't read a "Bitcoin is down" or "BitTorrent is down" status yet.


Because Internet Exchanges and carriers never had network issues.


Because then at least you and your colleagues have a chance to get some work done.


It seems the AdWords anti-spam system is down, which means anyone can put a billion dollar bid on every keyword and get their ads showing on every Google search for every query.

Systems that fail 'open'...
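
Assuming that's what happened, it's the classic fail-open pattern; a tiny illustration (names are made up):

  def accept_bid(bid, spam_check):
      # Fail-open: an exception in the safety check is treated as "allow".
      try:
          return not spam_check(bid)
      except Exception:
          return True   # the checker is down, so everything gets through

  # e.g. accept_bid({"keyword": "anything", "bid_usd": 1e9}, broken_check) -> True

Failing closed would reject everything while the checker is down, which trades abuse for lost revenue; presumably someone decided open was the lesser evil.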


Level 3, one of the biggest backbones, has issues too; might be related:

https://downdetector.com/status/level3


Funny how as soon as I realized that Gmail and Google Sheets weren't working properly, I rushed to HN to figure out what's going on. I love this community!


The two Google Cloud networking incidents are:

Incident #19008 began at 2019-06-02 12:48. https://status.cloud.google.com/incident/cloud-networking/19...

Incident #19009 began at 2019-06-02 12:53. https://status.cloud.google.com/incident/cloud-networking/19...

Times are US/Pacific


Looks like they are having trouble updating their statuses. 19008 was supposed to be updated over an hour ago. Meanwhile, 19009 has the same comment posted three times. I'm guessing internal tools are barely working at best.


That is normal for Google status pages.

They don't want to admit fault or place blame because there can be legal and commercial ramifications, so they stick to canned responses.


Updating a status page, sure. They aren't going to say "JamesBondService is having issues because a bug was deployed", but they usually don't repeat the same message 3 times in the same minute and they are usually pretty good about sticking to an update "within SLA"


dang et al.: I'd make 19009 the URL for this (GCE just reported first, but this is an outage / networking issue in the Eastern US).


#19008 mentioned by parent is not the GCE one, there are two networking ones: #19008 and #19009.

The #19008 (networking) says there will be an update by "13:30 US/Pacific", but as of 17:05 there is no update.

Similarly, #19003 (GCE) says there will be an update by "16:00 US/Pacific", but no update as of 17:05.

All the latest updates seem to only be in the third incident #19009 (networking).


They both seem to say the same thing....


I also noticed that Google Search stopped indexing news articles.

So I searched for "gmail down" on bing and I got some results [1]. But searching on Google for "gmail down" does not return any results [2].

[1] https://www.bing.com/news/search?q=gmail+down&qs=n&form=QBNT

[2] https://www.google.com/search?q=gmail+down&source=lnms&tbm=n...


And Gmail doesn't feel very well today either.

  [21:55:19] POP< +OK send PASS
  [21:55:19] POP> PASS ********
  [21:55:21] POP< +OK Welcome.
  [21:55:21] POP> STAT
  [21:55:21] POP< -ERR [SYS/TEMP] Temporary system problem.
  Please try again later.



At least G+ is working. /s


IMAP as well - for some considerable time now.


Yes experiencing IMAP auth/connectivity issues here as well, along with outgoing SMTP


Anyone using both AWS and GCP that can form an opinion on availability of both? As a GCP customer I am not very happy with theirs.


I use both services heavily at work. The networking in GCP is terrible. We experience minor service degradation multiple times a month due to networking issues in GCP (elevated latency, errors talking to the DB, etc). We've even had cases where there was packet corruption at the bare metal layer, so we ended up storing a bunch of garbage data in our caches / databases. Also, the networking is less understandable on GCP compared to AWS. For instance, the external HTTP load balancer uses BGP and magic, so you aren't in control of which zones your LB is deployed to. Some zones don't have any LBs deployed, so there is a constant cross-zone latency hit when using some zones. It took us months to discover this after consistent denials from Google Cloud support that something was wrong with a specific zone our service was running in.

AWS, on the other hand, has given us very few problems. When we do have an issue with an AWS service, we're able to quickly get an engineer on the phone who, thus far, has been able to explain exactly what our issue is and how to fix it.


> We've even had cases where there was packet corruption at the bare metal layer,

I'd love to know how this happens in the modern world. I've seen it myself only once (not GCP, but our own network with Cisco equipment).

Is something in the chain not checking the packet's CRC?


Had something similar last year because of a core router fabric issue. A few years ago, there was a batch of new servers with buggy motherboards corrupting/dropping packets, can't begin to imagine how hard it was to diagnose.

That's in own datacenters, not cloud.


> can't begin to imagine how hard it was to diagnose.

Yeah, when it happened to me, it completely threw me for a loop. We had reports of corruption in video files, which started the debug cycle. It was shocking when we isolated the box causing the issue.

But I guess your bigger comment has to be right: About the only way to have this sort of error is at the hardware level, because basic CRC checking should otherwise raise some sort of alarm.


Keep in mind that hardware runs on firmware. What is called a hardware issue can actually be a software issue.

It wasn't just one box for us. Basically, the part number was defective (motherboard NIC), every single one that was manufactured. This affected a variety of things, since servers are bought in batches and shipped to multiple datacenters; damn near impossible to root cause.

CRC can be computed by the OS (kernel driver) or offloaded to the NIC. I think it's unlikely for buggy CRC code to be shipped in a finished product; it would be noticed that nothing works.
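
This is also the argument for an end-to-end check computed in software over the payload you actually store, on top of whatever the NIC or TCP does. A sketch (using CRC32 for brevity; a real system might prefer a stronger hash):

  import zlib

  def pack(payload: bytes) -> bytes:
      # Append a CRC32 computed end to end, in the application, over the payload.
      return payload + zlib.crc32(payload).to_bytes(4, "big")

  def unpack(blob: bytes) -> bytes:
      payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
      if zlib.crc32(payload) != crc:
          raise ValueError("payload corrupted in transit or at rest")
      return payload

A buggy NIC that mangles data before computing the frame checksum produces packets that look perfectly valid to every hop downstream, which is exactly how garbage ends up in caches and databases unnoticed.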


Just curious, is this on a specific region(s)?


GCP is incredibly bad at communicating when there are problems with their systems. Just terrible. It's only when our apps start to break that we notice something is down, then look at the green dashboard, which is even more infuriating.


AWS is often the same way. No one seems to be good at communicating outage details.


I suspect there's a correlation between outages that are easy to detect and communicate and outages that automation can recover from so easily that you hardly notice.


I really don’t get this. There’s a huge number of complaints about poor communication from companies like Google and AWS during every outage. Yet they remain seemingly indifferent to how much customer trust they are losing, and the competitive edge the first one to get this right could gain.


I don't think they are losing any kind of customer trust.

Unless something is really fucked (like both GCP and AWS being down for us-east) incidents like these are not going to impact them at all.

The cost of either migrating to the other provider or, even worse, migrating to more traditional hosting companies is enormous and will require much more than "service was down for 2 hours in 2019". The contracts also cover cases like this and even if they don't, Google and Amazon can and will throw in some free treat as an apology.

On one hand I find this quite sad, but from a pragmatic point of view it makes sense.


If 20% of Google Cloud's customers leave after this outage because of poor communication, they'll prioritise accordingly and apply all those nice SRE theories to their infra. But this isn't happening, because <various reasons>, so... who cares?


I mean, I care. All else being equal I’m not sure why you wouldn’t want good communication to your customers.


How much cloud spend do you control? That's the reality of how decisions are made.


Many millions of dollars per year. I care about how my providers behave when they have issues, and I can't see why you think it's not at all relevant.


> "why you think it's not at all relevant"

Nobody said this.

> "I care about how my providers behave when they have issues"

We all do.

As the other commenters stated, the communication is poor because the clouds are still growing rapidly and there's not much reason to be better. We might also be underestimating just how much more a better service would cost and whether it's worth the revenue loss (if any). Are you really going to shift all of your spend overnight because of an outage? And where are you going to go?

The reality of these decisions is far more nuanced than it may seem and the current state of support is probably already optimized for revenue growth and customer retention.


Their dashboard does show red on GCE and networking right now, for what it's worth. https://status.cloud.google.com/


Why aren't these on separate systems? I never had the impression that Google cheaps out on things, but this sounds exactly like the sort of shit that happens when people cheap out. Not even a canary system?


The idea that Google spends big on expensive systems is a huge lie.

Google started using a Beowulf cluster that the founders wired themselves. From the very beginning, the goal of metrics collection was to optimize costs. While today it’s seen as the cash cow, the focus has always been on cheap components strung together, relying on algorithms and code for stability and making the least possible demands of underlying hardware.

To think that they won’t try to save money any time they can seems implausible.


AWS has what feel like monthly AZ brownouts (typically degraded performance or other control plane issues) with a yearly-ish regional brown/blackout.

GCP has quarterly-ish global blackouts, and generally on the data plane at that which makes them significantly more severe.


Are there any services that track uptime for various regions and zones from various providers? It's rare that everything goes down and thus the cloud providers pretend they have almost no downtime.


CloudHarmony used to track this at some level for free, but it looks like you now need to sign-up or pay to get more than 1 month of history?

The last time I looked at it (back when it showed more info for free, IIRC), AWS had the best uptime of the three big cloud providers, with Azure in 2nd and GCP in 3rd.

IIRC, the memorable thing was that, shortly afterwards, the head of Google Cloud made a big announcement that CloudHarmony showed that GCP had the best uptime when CloudHarmony showed that it actually had the worst. Google was calculating this by computing downtime = downtime per region * number of regions, but at the time, Azure had ~30 regions and AWS had ~15 vs. ~5 for Google and if you looked at average region downtime or global outage downtime, Google came out as the worst, not the best.


I can't imagine that being easy or cheap to make given the staggering number of product offerings across even the few big providers and how subtle some outages tend to be.


Obviously we don't know what the extent of the issue is yet, but afaik there has never been an AWS incident that has affected multiple regions where an application had been designed to use them (like using region specific S3 endpoints). GCP and Azure have had issues in multiple regions that would have affected applications designed for multi-region.


> like using region specific S3 endpoints

AWS had the S3 incident affecting all of us-east-1: “Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”

https://aws.amazon.com/message/41926/


That's one region, not the multiple regions that the OP mentioned.


Services in other regions depended transitively on us-east-1, so it was a multiple region outage.


Which services in other regions? I remember that day well, but I had my eyes on us-east-1 so I don't remember what else (other than status reporting) was affected elsewhere.


There was a massive push after that to have everything regionalized. It's not 100% but it's super close at this point.


S3 buckets are a global namespace, so control plane operations have to be single-homed. As an example, global consensus has to be reached before returning a success response for bucket creation to ensure that two buckets can't be created with the same name.


The availability of CreateBucket shouldn't affect the availability of customers' apps. This tends to be true anyway because of the low default limit of buckets per account (if your service creates buckets as part of normal operation it will run out pretty quickly).

The difference with Google Cloud is that a lot of the core functionality (networking, storage) is multi-region and consistent. The only thing that's a bit like that in AWS is IAM; however, IAM is eventually consistent.
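
To make the data plane / control plane split concrete, a rough boto3 sketch (bucket and key names are made up): reads go to a single-region endpoint and keep working when another region's control plane is having a bad day, while bucket creation is the globally-coordinated part you keep off the hot path.

  import boto3

  # Data plane: pinned to one region, unaffected by incidents elsewhere.
  s3 = boto3.client("s3", region_name="us-west-2")
  body = s3.get_object(Bucket="my-app-data-usw2", Key="config.json")["Body"].read()

  # Control plane: bucket names are globally unique, so creation needs
  # global consensus; do this at provisioning time, not per request.
  s3.create_bucket(
      Bucket="my-app-data-usw2",
      CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
  )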


But isn't CreateBucket the single s3 operation where you need global consistency?


As far as I know bucket policy operations also require global consistency.


I find GCP quicker to post status updates about issues than AWS, but GCP also seems to run into more problems that span across multiple regions.

I'm overall happy with it, but if I needed to run a service with a 99.95% uptime SLA or higher, I wouldn't rely solely on GCP.


AWS has better customer service and I don’t remember the last time there was a huge outage like this besides the S3 outage


There was a terrible day 2-3 months back in us-west-2 where CloudWatch went down for a couple of hours and took out AutoScaling with it, causing a bunch of services like DynamoDB and EC2 to improperly scale in tables and clusters, and then 12 hours later Lambda went down for a couple of hours, degrading or disabling a bunch of other AWS services.


I've heard from people who have worked with both AWS and GCP that AWS has far better availability.


I've also heard similar from a teammate who previously worked with GCP. That said I know several folks who work for GCP and they are expending significant resources to improve the product and add features.



I see a lot of Nest complaints, isn't there any security issue if Nest goes down?


My whole house is covered by a Nest Secure alarm system. At the moment my Nest Guard is telling me "it's offline". But if someone breaks in the house, the alarm will still ring in the house, however I believe I won't be notified on my phone (can't even log into the Nest app), and the alarm monitoring center (operated by Brinks under a partnership with Google) probably won't be notified as well, which sucks. I'd love to test it right now, but my daughter is sleeping and I don't want the alarm to wake her up...

I wonder how often outages occur with other alarm monitoring companies. They certainly do occur, but customers don't have a lot of visibility into them.


Why did you install this in your house?


I see people complain about Nest and the only thing I can think of is "what on earth were you thinking to have a door or thermostat that doesn't function without internet?!"


Nest thermostats can function without Internet. You can walk up to it and adjust the temperature just like any old thermostat.


I live in a house with a Nest thermostat and Nest Secure and Nest x Yale door locks. The AC is on just fine, and the door unlocked just fine. (The door lock doesn't require an internet connection, unless you've enabled privacy mode for some reason.)


I thought there were versions that don't work offline; my bad if this is not the case.


Internet isn't the issue, it's the endpoint (Google)

It's reasonably easy to run a router/gateway that has a 4G backup to get the ping out. Whether the ping works...


Instagram is on GCloud? Facebook has a heavy investment in their own DCs, I thought they would have transitioned to them by now.


No, the graphs are highly misleading because they autoscale the y axis to the highest point, and they do very loose string matching to detect errors.

If you click through to something not on Google cloud, you see moderately elevated error rates (e.g., Instagram is up by 4x) but if you click through to something actually on Google, you see very highly elevated error rates (e.g., roughly 50000x for Snapchat).

If you read the "error reports", they actually report that Instagram isn't down (same for Twitter if you check Twitter). The error report detection seems to be just string matching. Here's an actual "error report" from downdetector that's the cause of the allegedly elevated error rates:

> my twitter timeline: why isn’t snapchat working? anyone’s snapchat not working? snapchat’s being dumb. rip snapchat.

The Twitter "error report" is literally a report that Twitter isn't down.
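
My guess at roughly what that "detection" amounts to (keyword lists invented), which would explain why a tweet complaining about Snapchat bumps the Twitter graph too:

  SERVICES = ["twitter", "instagram", "snapchat"]
  PROBLEM_WORDS = ["down", "not working", "outage", "rip"]

  def error_reports(tweet: str):
      text = tweet.lower()
      if not any(w in text for w in PROBLEM_WORDS):
          return []
      # Every service merely *mentioned* gets credited with an "error report".
      return [s for s in SERVICES if s in text]

  # The quoted tweet (trimmed):
  print(error_reports("my twitter timeline: why isn't snapchat working? rip snapchat."))
  # -> ['twitter', 'snapchat']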


Ah. I was looking at it on phone and it wasn't clear. Thanks for clarifying.


it's no clearer on desktop. that's actually pretty funny that the instagram graph spikes just because people are complaining about snapchat being down.

i always thought downdetector was doing something a bit more clever than just reporting the rate of tweets containing the word "instagram" or "facebook" or something. but apparently not.


Just because you run your own RAM and compute doesn't mean you run your object store, too. Running drives is a commodity now, and crypto is cheap since AES-NI.

No idea if this is what instagram does or not, just in general. Drives are hot and need lots of power and it’s expensive to out-S3 S3.


Object store won't cause Instagram to go down though, no? Assets might fail to load.


What's left of instagram without the assets? And some of the assets required to fully load are probably in object store as well.


Wow. Everyone depends on Google.


I was playing around this afternoon with App Engine, and thought I broke one of my projects when I started getting 502s back.

There appear to be some irregularities on consumer services as well that are of course related; YouTube was behaving a bit oddly for me.

The impact seems to be cascading down from just GCE to other services as well - that status page certainly does not reflect the reality of the situation. You can't even sign into GCP right now, and things that run on GCE, like App Engine, seem impacted.


Nest is down for me right now.

It's amazing how far-reaching outages can be these days.


There are dozens of us.


Code reuse is a wonderful thing, until it's not.


Code reuse? A couple of years ago some people started calling servers "the cloud", but the cloud is just that: computers managed by someone else. If the whole service layer relies on three big companies, there is a problem.


Yep, I can no longer see my Cloud SQL database - it's as if I've never created one at all. Really hoping this is just an issue displaying it and that Google hasn't punted my infrastructure and backups.


Praying isn't working. Now, I'll try sobbing :(


Systematic problem solving. I like it


Reports claim that it's a network congestion issue primarily in the northeast United States. It seems doubtful that any data has been lost, but you probably can't see resources due to requests not getting through. I hope that hearing this helps you to feel better.


[I am the Cloud SQL tech lead]

This is a networking issue, and your data is safe. Cloud SQL stores instance metadata regionally, so it shares a failure domain with the data it describes. When the region is down or inaccessible, instances are missing from the list results, but that doesn't say anything about the instance availability from within region.


When talking about GCE being down please also mention what regions you are talking about


In this case you're lucky if any are working correctly; the problem is global, with some exceptions.


There seem to be some comments here about some regions functioning OK, although perhaps it's not 100% in all regions.


us-central1 us-west1 us-west2

is what I’ve heard so far. east seems to be OK, and Europe too


South America is down for me. But YouTube works OK; I can watch Google I/O and they're talking about all of Google's greatness while their outage impacts second-class clients on their cloud.



Good that Google+ is up again


Phase 2 of mitigation was completed a few minutes ago and GCE is expecting improvement shortly.


It seems crazy to me that Google Cloud can have this level of instability but I, on the other hand, can never remember google.com going down.

Why are they operating one with a different networking infrastructure from the other?


They did have a major outage for a few minutes in 2013: https://www.theregister.co.uk/2013/08/17/google_outage/


So this is now the longest outage ever, right?


It's a completely different mode of failure - when part of search is down somewhere, traffic is just routed to the same service in another location. If part of GCP is down anywhere, some customers are affected.

Since the original Google infrastructure was developed specifically for the first kind of service, the cloud org still has problems adapting it to its needs.


Sounds like a good design, but in the early days of Google Cloud (2008) you'd get the Google homepage on a bad request.


AFAIK google.com is running independently on different infrastructure, but some of the other services rely on GCP, hence the problem affects them too.


Right now this is at the "good, at least it's not mine to fix or worry about" stage, kind of like "and that's the reason I chose IBM" [1]. I can just sit back and wait for Gmail to work correctly. Now, at the point it starts to last what I would consider a long time, well, then I will have things to worry about.

One thing with Gmail though: when it's down it's similar to a snow storm if you only do business in a city. Everyone is impacted and everyone understands a missed deadline is unavoidable.

[1] For those not old enough to know what I mean read this: https://www.ibm.com/ibm/history/ibm100/us/en/icons/personalc...


Looks like my Gmail is back, but I don't have any emails from while it was down. Yikes.

Edit: just got one email from the downtime, so perhaps my initial conclusion was incorrect


Whoa, if confirmed this would be really bad. Downtime happens, but data loss is much worse.


This is normal handling for SMTP - the sending server just tries again later.
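
Roughly: a 4xx reply means "temporary failure, requeue and retry later", while 5xx means give up and bounce. A crude sketch of the sender side (relay hostname and timings are made up):

  import smtplib
  import time

  def send_with_retry(msg, sender, rcpt, host="smtp.example.com",
                      attempts=5, base_delay=60):
      for attempt in range(attempts):
          try:
              with smtplib.SMTP(host) as smtp:
                  smtp.sendmail(sender, rcpt, msg)
              return True
          except smtplib.SMTPResponseException as e:
              if 400 <= e.smtp_code < 500:        # temporary: back off and retry
                  time.sleep(base_delay * 2 ** attempt)
                  continue
              raise                               # permanent: bounce it
      return False

Real MTAs keep retrying on a schedule for days before bouncing, which is why mail sent during the outage should mostly turn up late rather than disappear.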


So just got a new phone and trying to resync my 2FA for AWS but I can't because Gmail is down. Ffs.


Just 2 weeks after I migrated a DB cluster from Azure to Google Cloud thinking things would get better.


Just remember that "better" can be entirely based on when you pick the starting and ending points of your graph ;)


It's just like the stock market.


It's the yearly outage. You will be fine for next year.

Better than the monthly outage from Azure.


FWIW they might still be.


We're hosting an open global Zoom call for all engineers affected by the outage, join us at https://zoom.us/j/793450725


One click on this link and it instantly starts streaming your webcam footage to everyone in the chat room.


This is weird, I had Zoom installed a long time ago but uninstalled according to their instructions [1]. I'm a macOS user.

As soon as I clicked that link, the client downloaded a PKG file, installed itself and launched itself without asking me if I wanted to share my camera or audio.

I uninstalled according to their instructions again, searched for all "zoom" files in my disk and rebooted.

This leads me to believe that following their uninstall instructions is insufficient, and there are hidden files left on my computer.

Sorry in advance for the off topic message

[1] https://support.zoom.us/hc/en-us/articles/201362983-How-to-u...


Don't be sorry, same thing happened to me just now and I'm trying to figure out how they are installing locally from a URL click with no further input from me.

edit: Found this thread with details but no resolution it seems: https://apple.stackexchange.com/questions/358651/unable-to-c...


I think I found out: there's a daemon process that I somehow missed the first time around.

Deleting the .app file as instructed is not enough.

This StackExchange reply [1] showed me how to solve it, at least on macOS.

[1] https://apple.stackexchange.com/questions/358651/unable-to-c...


Do you have auto-run enabled for downloaded files?


Nope. Not even for safe (image, music, etc) files.

And even if I had, PKG files don't install themselves on macOS, they open the installer interface, AFAIK.

EDIT: And, as I mentioned, the file was downloaded by some Zoom client, not by my browser.


This is a local zoom setting, you can change it.


Use 'Turn off my video when joining meeting' in Zoom.


Put some clothes on.


Why? Do you expect to be able to do something about it or did you just want somewhere classier than Twitter to complain?


Does anyone know if this is a regional or global outage?

I can see my GKE clusters in one region but not in another, so I am guessing it's the former.

Looks like we'll need a cluster in each region going forward...


I cannot access Vimeo from the EU; not sure if it's related to Google though.


To help others: which two regions?


us-central1 is up

us-east4 is down


us-west2 is down


I tried multiple times to setup a Google Wifi router today. Wifi would work but the app said it was offline. Perhaps I am not insane or incompetent after all


https://status.cloud.google.com/incident/compute/19003 Does a status update mean only the status? Won't there be words like "sorry", "apologize" and "inconvenience"? Is only PR responsible for those words?


Apologies are unnecessary on status pages. You can be sure that they are sorry.


The status page took a while to show issues. My app was down, and Twitter knew google cloud was down before the official status page.


It's funny how searching for "Google Down" on Google and filtering for results within 1 hour yields nothing.


Is Shopify on Google Cloud? I noticed they are having issues too.


Sunday evening is a pretty big time for ecommerce. Losing 1000s of dollars of sales here :-(


Yes it’s affecting their system as well.


yes


I'd reported a networking issue to Shopify yesterday, which they said was resolved upstream (GCP?). Could it be?


Not sure if related, but I was going to a BBQ yesterday and myself and 3 other people got lost because Google Maps app glitched out, directing us to the wrong places. If you search twitter for #googlemaps tons of people have the same issue. Surprised no one has posted about it.


https://status.cloud.google.com/incident/compute/19003

>We will provide an update at 16:00 US/Pacific.

It's 16:22 and no update has been posted. A bit unprofessional.



Zero issues with Compute, reporting from europe-west3-b.


It appears that logging into the webmail solves at least the POP mail problems. I tried my mail client and failed, then attempted a login to the webmail, which worked. Gmail then asked me to confirm my recovery address and cellphone, which I did, and finally loaded the inbox page. I immediately attempted a connection through the POP client and this time it worked.

It might be something security related if it triggers a mandatory identity confirmation.

edit: I tried to send myself a mail from another account and it worked, but out of 4 or 5 mail checks at least two failed giving the same error.

[23:44:27] POP< -ERR [SYS/TEMP] Temporary system problem. Please try again later.

The problem seems much more complex.


The GCE console is also affected; I couldn't send a support ticket, just getting errors.


That explains why my Google Home thing thinks it's sub-zero even though I'm warm in shorts and a t-shirt.

https://pasteboard.co/IhBsyrsO.jpg


The last big outage, IIRC, was because Google didn't test their rollback procedures for router upgrades. I'll be very interested to hear if it's yet another change control problem that caused this outage.


I was finishing a university assignment with the deadline 90 minutes away.

I wanted to upload a video of the project to YouTube and add a link to it in the report. YouTube takes a long time to process the video, and then says it's unavailable.

I go to Vimeo: it's down.

I upload the video to Dropbox, and copy its link to the report.

But my report was a Google Doc. And when I tried to export it as a PDF (which I had not done yet) it couldn't do it. I never hated Google more.

Eventually the video went through to YouTube, and I could export the PDF on the third try, but this really made me conscious of my dependence on Google.


Everything looking normal on our GKE / CloudSQL stuff (eu-west1)


gcloud tells me:

WARNING: The following zones did not respond: us-west2, us-west2-a, southamerica-east1-c, us-west2-b, southamerica-east1, us-east4-b, us-east4, us-east4-a, northamerica-northeast1-c, northamerica-northeast1-b, us-west2-c, southamerica-east1-b, northamerica-northeast1, southamerica-east1-a, northamerica-northeast1-a, us-east4-c. List results may be incomplete.

Luckily for us eu-west1 seems to be working normally.


Confirming issues on our end. I'm able to load up my console but when I go to Kubernetes Engine, I don't see my clusters. I'm monitoring closely on twitter


Can't wait for the postmortem!


My money is on config push.


I would question why they are doing a rollout on a weekend?


A config push on a weekend seems pretty unlikely. Given that it's apparently a network congestion issue and showed up on a weekend, my guess is that it's probably a bizarre networking hardware failure like the one that took out CenturyLink[1] last December.

[1] https://news.ycombinator.com/item?id=18789071


Was certainly an interesting alert when my Cloud Functions started reporting downtime. Among the many things that dip in and out on what seems like a monthly basis, I’ve not seen them just drop out in quite a while. Hopefully they get things sorted out. I can’t really imagine what it looks like internally when this level of outage is going on, but I want to think everyone is fairly collected


Once again HN proves to be the best status monitor.


This might be all in my head but I've been experiencing really bad latency for like an hour, while browsing, and then I read this.


https://downdetector.com/

Pretty much every service is down


It's frightening how many services rely on Google Cloud and what impact this downtime has.


My builds are failing because they cannot download Chromium.

> Error: Download failed: server returned code 502. URL: https://storage.googleapis.com/chromium-browser-snapshots/Li...


And Thunderbird suddenly kept throwing prompts for my Gmail passwords, even after I signed back in. I hope it's related.


It says "could not reach imap.gmail.com" for me.


Google Colaboratory is back. At least I can access my GitHub notebooks and public notebooks from Google Drive.


G suite was broken for me 20 minutes ago (in Europe) but is working now. Perhaps things are starting to come back?


Github contribution graphs are also gone


Does Google also have some sort of listing of which consumer apps are particularly affected (e.g. Gmail, Hangouts, Docs, Sheets, etc.)?

The cloud components may be directly affected, but for consumers there's nothing that provides info on which consumer-facing services are having issues.


> We will provide more information by Sunday, 2019-06-02 12:45 US/Pacific.

I'm not seeing anything at 12:47.


They are updating the root cause issue, which is networking, here: https://status.cloud.google.com/incident/cloud-networking/19...

Next update is in about 25 minutes.



Cloud status dashboards seem to be hosted on the same cloud, which doesn't say much about redundancy.


AWS changed the internals of the Service Health Dashboard after they couldn't update it when S3 went down in us-east-1 (https://aws.amazon.com/message/41926/)

edit: wording


Someone had to design the status page, and failed to anticipate the issue of the status page depending on the systems it is reporting on.

Unlike the rest of AWS, a cached web page does not require much complexity.


The majority of G Suite services are suffering outages: https://www.google.com/appsstatus#hl=en&v=status


Keep on centralizing the internet with your stupid clouds and this is what happens.


Anyone experiencing issues with GCS? Seems highly intermittent and dependent on the location the request is coming from (maybe that's because it's a networking bug).

The status page says GCS is fine but that's highly unlikely.


It seems that it's focused on the database stuff like Firebase and Firestore.


Apparently (from further up the thread) it's a network congestion issue causing extremely high rates of packet loss. I imagine pretty much anything that's homed in the affected regions will be degraded or inaccessible.


Took me a while to track the latency issues to GCP. Wasn't expecting it. This also seems to affect some GAE instances and some of their products like Google Photos. At least according to my observations.


I see this as well.


I haven't been able to reach Google apps on my HTC M9 since yesterday. I am in West Africa. My WhatsApp crashed too and I lost all my previous threads. Is my issue related to the cloud being down?


Playing "is Google behind this website or not?" just became so much easier: simply see if the website works or not.

Scary stuff. What happens when Murphy's law decides to crash things even more?


Yeah I was having trouble accessing my Gsuite apps, had a couple of 502s, which led me to check HN. While it doesn't give me 502 now, it's abnormally slow.


Was trying to set up SSL on a GKE cluster today. Guess I'll have to wait for tomorrow if I want to be able to tell my mistakes apart from Google's.


US West: all our cloud compute is inaccessible right now... our API is down, we can't SSH into the servers, and we also can't see them on the dashboard.


This must be why GitLab is giving me shit. They recently made the switch. Wonder if there is any second guessing going on over there right now.


You can find some more data about GitLab availability after the move to GCP here: https://about.gitlab.com/2018/10/11/gitlab-com-stability-pos.... As we're trying to stay transparent as always, we'll definitely let everyone know if we are going to / thinking about a change.



Using Google Docs to report problems with Google, including Google Docs. Irony at its finest.


They do have an internal ticket as well, but Gitlab.com is also hosted on Google, so... https://gitlab.com/gitlab-com/gl-infra/production/issues/862


I was just thinking the same. One of their reasons (besides being paid a lot by Google) was stability.


Wondered why Snapchat was being weird today. Thought it was my pi-hole setup blocking something from working, but nope, it's Google!


In India, I could access YouTube, Gmail (web), the Google Cloud console, and GKE and Compute Engine instances in the south-east Asia region.


Looks like only GCE is down according to the status page now. I'm able to access my console for instances and GKE clusters.


They initially opened an incident against GCE (https://status.cloud.google.com/incident/compute/19003) then opened one against Networking (https://status.cloud.google.com/incident/cloud-networking/19...).

The networking incident looks like the one to follow for updates now.


I happened to be initializing a GKE pool upgrade just as this occurred. The upgrade is now stuck according to the console.

The interesting thing is that a couple of minutes before everything went wrong, kubectl returned a "error: You must be logged in to the server (Unauthorized)" error


My site runs on Google App Engine and it's down as well.


btw, Google Analytics realtime is down as well.


I wasn't aware of the outage and had a small heart attack when I saw a huge drop in visitors. I think other metrics are also affected.


Thanks, I noticed I didn't have any users after pushing an update and thought I broke something.


Weird for Twitter to still be up and fully functioning. I thought they migrated everything to GCP this/last year?


Not the main functionality of the service, just lots of data analysis tooling. nothing that end users would notice


Interesting. Thought I had read some posts of them migrating their data, but you could definitely be right.


You look up. A single ray of light has made its way to the earth. Some day, you hope, the sky will be clear again.


Google services are also f'ed up here in central Europe; I cannot reach anything Google-related from Hamburg, Germany.



Seems to have coincided almost exactly with my Chromecast stopping displaying my photos (in ambient mode).


Ah, wonder if this is affecting Google's SSO. It was super slow when I was trying it just now.


No wonder; I was trying Webcodesk just now and it's not working. It's all Firebase, yay!


So that's why YouTube was being weird. I thought it was an extension problem or something.


We are on region us-east1 and our systems are still up. Specifically, we are on us-east1-b.


YouTube streaming is also down.


I don't think anybody has mentioned it, but AdSense also hasn't been updating for a couple of hours.


I've noticed problems on GDrive (GSuite) and YouTube as well. Connected?


FWIW, Youtube is fine for me, but I'm seeing intermittent errors saving updates to a document in GSuite. I had thought the latter error was a problem with the wifi where I am, until I saw this. Now I'm not so sure. HN is loading fine on the same wifi...


Couldn't load the support console to "me too" this one either!


There go the nine nines for this year? Is it more like four nines or so now?


Anything to do with China?


Hug your on-call engineer.


With Google Cloud incidents, most of the time multiple whole regions fail, while with AWS generally only a single region fails. Of course there are exceptions, but Google Cloud does not make me feel safe as an outsider (and a user of multi-region AWS).


These things happen. That's OK. Here's what's not OK:

> We are investigating an issue with Google Compute Engine. We will provide more information by Sunday, 2019-06-02 12:45 US/Pacific.

The next update is at 12:59. Just ... no.


Working here but slow (I'm based in Central Canada).


So far the KO list:

GCE, GKE, BQ, Pub/Sub, GAE

asia-south1 us-west1 us-central1 us-west2


Our GKE stuff in asia-south1 (Mumbai) is up, and the GKE console works fine here. (Bangalore).


Looks like Google Analytics isn't reporting stats either?


Snapchat is fixed but snap maps is still disabled


gitlab is slow too


Slow is an understatement... some pages on gitlab.com take minutes to load, and jobs take tens of minutes to start.

EDIT: It's been like that since at least 12h ago though. Not sure if it's connected to Google Cloud?


Yes, it was definitely related to GCP services.

GitLab is no longer seeing errors and Google Cloud has resolved the issue as of 23:00 UTC yesterday. Any further information can be found on the issue at https://gitlab.com/gitlab-com/gl-infra/production/issues/862


The timeline seems a bit short, but ok. :)

(the problems with runners and the UI started at least at 2019-06-02 7:48 UTC, though they were hit-and-miss at the time)

Still, happy this is solved and we can use the (fantastic otherwise) service again!


Google Play is also experiencing massive issues.


I can't see any GKE cluster in Brazil, or any VM.


I'm seeing that with northamerica-northeast1. I can't access anything over the network in that region and most of the GKE clusters and VMs in that region aren't listed in the console


A demo of “too big to fail [via antitrust]”?


Snapchat is back up but snap maps is down


So is Youtube.


Could this be the result of another BGP hijack? Cyberwarfare? I am just speculating here, big time.


GCP has been down since 11:50am and they acked it 35 mins later. They're great at leaving their customers in the dark.


Not much different from AWS, from what I've heard.


Yeah, Amazon is the master of having their status page read all Green while half of US-East is in the toilet


Definitely the case. Neither are super great at this. One issue is that issues that may 100% impact individual clients may only impact a vanishingly small amount of their overall service load. That mismatch between customer and provider experience is one of the ugly aspects of public cloud providers.


That's why AWS is all about their Personal Health Dashboard (PHD). They can post specific issues for your account in there. Also, they get to keep the public page looking nice and green to show to executives of prospective customers.


Also, it's one which gets hugely understated when people "move to the cloud".

Especially if your business provides B2B services. Stuff like this could make you lose your business, especially if some entity like Google doesn't communicate and, as a result, you don't have an answer for your own customers.

Medium sized private cloud providers are a lot better at this, considering the communication lines are a lot shorter.


On the flip side - a customer is more willing to be understanding if 'Google is down' instead of 'our server is down'.


gmail is down in Australia...


gmail also down/super slow atm for me (East Coast, USA)


Let's see if perfect leetcode skills will save the day. /s


vimeo.com is down.


Huawei just flexing its muscles. Nothing to see here, move along.


No, not really.


Ironically, I moved all of my objects off GCS today.


There is a loop in the spanning tree.....


forbes.com is down also?


My Gmail is down!


seems like only us down


Prediction: the final postmortem will say "someone pushed a bad config", just like most of the previous postmortems (and most of the internal postmortems as well, for Borg-based services). This is the cause of most other outages in other cloud providers as well. A really hard to solve problem.


Multiple regions seem to be affected though. Wouldn't it make more sense to start out config pushes with a single region before expanding it to avoid these types of wide outages?


It's a networking outage. Google is a well known user of SDN. Networks, by virtue of connecting things, necessarily affect more than one region.


So that's why I can't login to YouTube this morning...


Just had Roll20 in the USA blow out a game; I wonder if it is affected.


Looks to be working in the UK


Network peering probs in Canada - looks like an unrelated problem


On days like these I’m glad I don’t use any of the affected services


What I've realized from this: Google doesn't have an official status page for GCP. There are a few unofficial ones, but nothing official that I could find.



I meant Twitter Status Page (in my defense it was 3 am). The only ones are: @gcpstatus and @gcp_incidents, both unofficial



