Transparent SLIs: See Google Cloud the way your application experiences it

robertp · on July 27, 2018

I think you have to hand it to Google, that is really next level deep analytics. I feel it would be hard for most companies without huge investment to do this (or maybe not).

It is like the reverse of a "customer survey": instead of asking and getting an arbitrary number, you get a really detailed view of service performance.

wora · on July 28, 2018

People often consider using this information for reliability and performance, but you can do much more with the data. For example, if a method has low latency, you can use a short deadline with fast retries to improve reliability. If you see a sudden jump in certain usage, you can consider batching and caching to reduce your cost. If you see unexpected usage of a service, you know someone introduced a new dependency into your system. Google teams often use the same data to understand how large services work and how they are correlated.
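A minimal sketch of the short-deadline, fast-retry idea, assuming your client call accepts a `timeout` argument and raises `TimeoutError` when it is exceeded. The function names, the percentile, and the headroom factor are all illustrative assumptions, not anything from the Stackdriver or GCP APIs:

```python
import math
import time


def deadline_from_latencies(samples_ms, percentile=0.99, headroom=2.0):
    """Pick a per-call deadline (seconds) from observed latencies.

    Uses the nearest-rank percentile of the samples, times a headroom factor.
    Both defaults are assumptions chosen for illustration.
    """
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * headroom / 1000.0


def call_with_fast_retry(rpc, deadline_s=0.05, max_attempts=3,
                         backoff_s=0.01, sleep=time.sleep):
    """Call rpc(timeout=deadline_s), retrying quickly on TimeoutError.

    `rpc` is a hypothetical callable standing in for a real client method.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(timeout=deadline_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Short, linearly growing pause before the next attempt.
                sleep(backoff_s * attempt)
    raise last_error
```

This only makes sense for calls that are safe to repeat (idempotent), and the retry count should stay small so a real outage does not turn into a retry storm.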

Disclosure: I worked on this feature at Google.

jonny_eh · on July 27, 2018

This is amazing, kudos to Google. It's astounding that most cloud services only show status info for their entire cloud (or big chunks of it). I don't care if 99.999% of your customers are fine, if the server running my servers is down for no reason.

haimez · on July 27, 2018

Any idea when Stackdriver is actually going to be usable for production? The latency of events is way too high to drive alerts from, and the UI has basically always had issues. Clicking through any link seems to have about a 50% chance of resulting in "this page was not found", which only means you have to somehow find a different navigation path to get to that page.

markcartertm · on July 28, 2018

We (the GCP Ops Management and Stackdriver teams) are working hard on multiple fronts to deliver innovation (such as the service graph highlighted in the day 1 keynote at Next, and GKE monitoring) and at the same time deliver first-class scale and reliability. It's a journey, but we have made a lot of improvements over the last 12 months, and will continue to raise the bar in the next 6-12 months. We have many very large customers as well as startups using Stackdriver as the core of their Ops and SRE command and control center. I can personally guarantee that our users getting great UX and reliability is top of mind for the entire team. I would appreciate it if you could flag to our team any time you see a page not found or any other experience in Stackdriver that you feel does not meet the bar. We listen, and we will resolve bugs one by one to meet your expectations. We have an email list, bug tracker and a feature request forum, all listed here: https://cloud.google.com/stackdriver/docs/contact-us . You can also submit in-context feedback from the Stackdriver and GCP consoles, which will be reviewed by the team. Finally, please feel free to DM me on Twitter @markcartertm . We care deeply and would love to hear and respond.

kenhwang · on July 27, 2018

On one hand, this is really cool and I'd love this type of info out of AWS.

On the other hand, the whole point of using a cloud provider is to lower the amount of ops work, and this feels like I'm helping do ops work for Google, and paying them for the privilege.

dantiberian · on July 28, 2018

This is purely additive. If you don’t want to look at these metrics, you don’t need to, but if they would be helpful to you, then you can. Speaking as a GCP customer, I’m very glad these are available. It means that I will more easily be able to tell the difference between issues in my service, and issues in the APIs I depend on.

outworlder · on July 28, 2018

> the whole point of using a cloud provider is to lower the amount of ops work

In a sense. You shift a lot of what ops work used to look like to someone else. You no longer have to worry that there is some issue with a fiber optic transceiver. This is someone else's problem.

This is an abstraction though, and abstractions leak. Even with the best teams and monitoring available, they are not going to catch it all. But at least you can see if there is an issue, instead of filing a ticket to even find out if there was an issue to begin with. And they won't be able to tell you that "everything looks fine on our end".

At the end of the day, it is still their problem to fix.

I think this is a pretty bold move. I wouldn't want to expose internal data like this to end customers unless there was complete confidence in the monitoring systems. Issues with data have the potential to generate an immense amount of grief from customers.

btashton · on July 28, 2018

There have been multiple times that I have wondered how much latency a call out to some AWS API was going to add. Something like this would give me a baseline (is it 20ms or 200ms?) without having to profile it myself. The ability to monitor service degradation is a value add.

the-sceptic-dot · on July 28, 2018

This level of detail effectively holds them to a higher standard, not lower. I'd pay for that.

judkowitz · on July 28, 2018

The previous responses to this comment are spot on. I would add two things: (1) When you write an app that relies on a cloud service and your app has trouble, you have to make the choice to either dig into your app or call the cloud vendor for help. If you make the wrong choice, you lose precious time in the resolution process. This data makes that triage process simple. If your indicators and the GCP service's indicators go bad at the same time, call Google. If the Google indicators look good and yours are struggling, look at your app. (2) When you do call tech support from a service provider, there is often some back and forth to show that it is the service provider that is causing the issue. With this data, that back and forth goes way down. Show them the chart that illustrates the Google service level drop and everyone agrees on the data and support can jump right on the issue.

One can object saying, "Shouldn't the cloud vendor already know that there is a problem and just fix it and let me know?" That is true for the infrequent big things. But, for the smaller, but more frequent issues, there is a specific issue for some method in some service for some customers in some region, etc... When that happens, the few customers who are hit by that issue can have a hard time knowing it's happening because it does not show on the global dashboard and then have a hard time getting tech support to see that this is really a change in service for them. This data solves those issues.
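The triage rule in point (1) can be sketched as a small decision function. This is only an illustration: the error-rate inputs, the 1% thresholds, and the function name are assumptions, and real indicators would come from your own monitoring plus the per-project service metrics discussed in this thread.

```python
def triage(app_error_rate, provider_error_rate,
           app_threshold=0.01, provider_threshold=0.01):
    """Suggest where to look first, given two error rates in [0, 1].

    Thresholds are hypothetical defaults, not recommended values.
    """
    app_bad = app_error_rate > app_threshold
    provider_bad = provider_error_rate > provider_threshold

    if app_bad and provider_bad:
        # Both degraded at the same time: likely the dependency.
        return "contact provider support"
    if app_bad:
        # Provider looks healthy, so start with your own code and config.
        return "investigate your app"
    if provider_bad:
        # Provider degraded but your indicators are fine so far.
        return "watch provider, no action yet"
    return "healthy"
```

For example, `triage(0.08, 0.06)` points at the provider, while `triage(0.08, 0.001)` points at your own app.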

OP9000 · on July 27, 2018