On one hand, this is really cool and I'd love this type of info out of AWS.
On the other hand, the whole point of using a cloud provider is to lower the amount of ops work, and this feels like I'm helping do ops work for Google, and paying them for the privilege.
This is purely additive. If you don’t want to look at these metrics, you don’t need to, but if they would be helpful to you, then you can. Speaking as a GCP customer, I’m very glad these are available. It means that I will more easily be able to tell the difference between issues in my service, and issues in the APIs I depend on.
> the whole point of using a cloud provider is to lower the amount of ops work
In a sense. You shift a lot of what ops work used to look like to someone else. You no longer have to worry that there is some issue with a fiber optic transceiver. That is someone else's problem.
This is an abstraction though, and abstractions leak. Even with the best teams and monitoring available, they are not going to catch everything. But at least you can see whether there is an issue, instead of filing a ticket just to find out if there was one to begin with. And they won't be able to tell you that "everything looks fine on our end".
At the end of the day, it is still their problem to fix.
I think this is a pretty bold move. I wouldn't want to expose internal data like this to end customers unless there was complete confidence in the monitoring systems. Issues with the data have the potential to generate an immense amount of grief from customers.
There have been multiple times that I have wondered how much latency a call out to some AWS API was going to add. Something like this would give me a baseline (is it 20ms or 200ms?) without having to profile it myself. The ability to monitor service degradation is a value add.
The previous responses to this comment are spot on. I would add two things:
(1) When you write an app that relies on a cloud service and your app has trouble, you have to make the choice to either dig into your app or call the cloud vendor for help. If you make the wrong choice, you lose precious time in the resolution process. This data makes that triage process simple. If your indicators and the GCP service's indicators go bad at the same time, call Google. If the Google indicators look good and yours are struggling, look at your app.
(2) When you do call tech support from a service provider, there is often some back and forth to show that it is the service provider that is causing the issue. With this data, that back and forth goes way down. Show them the chart that illustrates the Google service level drop: everyone agrees on the data, and support can jump right on the issue.
One could object, "Shouldn't the cloud vendor already know that there is a problem, fix it, and let me know?" That is true for the infrequent big things. But the smaller, more frequent issues tend to be specific: some method in some service for some customers in some region, and so on. When that happens, the few customers who are hit can have a hard time knowing it's happening, because it does not show on the global dashboard, and then a hard time getting tech support to see that this really is a change in service for them. This data solves those issues.
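To make the triage rule in (1) concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `triage` function, the `Triage` enum, and the 1% error-rate thresholds are hypothetical, not part of any GCP API. You would feed it whatever error rates you read off your own dashboard and the provider's per-customer metrics.

```python
from enum import Enum


class Triage(Enum):
    """Hypothetical outcomes of the triage rule described in (1)."""
    CALL_PROVIDER = "call provider"
    CHECK_APP = "check app"
    NO_ACTION = "no action"


def triage(app_error_rate: float,
           provider_error_rate: float,
           app_threshold: float = 0.01,
           provider_threshold: float = 0.01) -> Triage:
    """Decide where to look first, given two error rates in [0, 1].

    The thresholds are illustrative assumptions; pick values that
    match your own service-level objectives.
    """
    app_bad = app_error_rate > app_threshold
    provider_bad = provider_error_rate > provider_threshold

    if app_bad and provider_bad:
        # Both indicators went bad at the same time: call the provider.
        return Triage.CALL_PROVIDER
    if app_bad:
        # Provider looks healthy but the app is struggling: look at the app.
        return Triage.CHECK_APP
    # The app is healthy, so there is nothing to triage right now.
    return Triage.NO_ACTION
```

For example, `triage(0.08, 0.05)` points at the provider, while `triage(0.08, 0.001)` points back at your own app.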