
Yeah. I remember years ago working at a shop that had hundreds of baremetal hypervisors in a bunch of colos. We had a problem where clocks would sometimes skew massively (maybe dodgy hardware?). The thing was, ntpd would be able to correct it sometimes, but not always. It has a cut-off point where it throws its little arms up and says "I dunno!", and the skew just got larger until someone manually fixed it. So we added a remediation step to the Sensu time check if the skew exceeded a threshold. Then one time we got blacklisted by the public NTP servers we used in one region because we had a shitload of servers hitting them directly with ntpd and Sensu checks. So we had to set up our own servers (and monitor them). And we still had occasional skew, even though the remediation action would (eventually) force a correction, and these incidents would cause odd failures with authentication or database replication. We eventually ditched ntpd and moved to chrony (which will continually adjust the clock regardless of the drift). But that took research, testing, Puppet code, scheduled deployment, documentation, etc. The whole episode was boring and stupidly time-consuming and wasn't even some cool thing that "moved the needle forward" for the company. It's just the fucking time on your servers. Now, take any number of stupid little things like this and sprinkle them over every single sprint, and see how the infrastructure/sre/devops/dogsbody team's promotion cycle works. "Why was the database upgrade delayed this time?"
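
(For the curious, the check-plus-remediation step was roughly this shape. A minimal sketch only, not the actual check; the hostname and the 0.5s threshold are placeholders, and the real thing obviously wasn't TypeScript.)

    // Sketch of a Sensu-style time check with a remediation step.
    // ntp.internal.example.com and THRESHOLD_SECONDS are made-up placeholders.
    import { execSync } from "node:child_process";

    const THRESHOLD_SECONDS = 0.5;

    // `ntpdate -q` queries the server without touching the clock and prints the offset.
    const output = execSync("ntpdate -q ntp.internal.example.com", { encoding: "utf8" });
    const match = output.match(/offset\s+([-+]?\d+\.\d+)/);
    if (!match) {
      console.error("UNKNOWN: could not parse offset from ntpdate output");
      process.exit(3); // Sensu treats exit code 3 as unknown
    }

    const offset = Math.abs(parseFloat(match[1]));
    if (offset > THRESHOLD_SECONDS) {
      // Remediation: force a step. With ntpd that meant stopping the daemon,
      // stepping the clock, and starting it again; chrony can step in place
      // (`chronyc makestep`).
      execSync("systemctl stop ntpd && ntpdate ntp.internal.example.com && systemctl start ntpd");
      console.warn(`WARNING: clock was ${offset}s off, forced a correction`);
      process.exit(1); // Sensu treats exit code 1 as a warning
    }

    console.log(`OK: clock offset ${offset}s`);
    process.exit(0);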



I've never witnessed precisely that kind of hardware problem, but just think about it: what would happen if the same situation occurred on an AWS instance? How would you go about debugging anything and/or fixing it? It's not even certain you could diagnose the problem in the first place, let alone deploy a workaround. You'd have to send dozens of emails to tech support, who of course would say there is no problem because their machines have 999999999999999999% uptime and nothing could be wrong on their side, but hey, if you've got way too much money they can sell you advanced support/engineering to help you find the problem in your code.

Some commenter mentioned the dark days of Oracle/Microsoft/SAP ruling over the server market. But at least these companies had the most basic decency to let you house/access your own hardware and diagnose things yourself if you had the skills. Now in the "cloud" you can just go to hell and suffer all the problems/inconsistencies, or rather your users can suffer, since you as a sysadmin have zero control/monitoring over what's happening. Oh, and bonus point: if users report some problems, they will be reproducible 0% of the time, since there's a good chance your users are connected to a different availability zone. Yeah, it's easy to reach more than 5 9's when "ping works from a specific location" counts as uptime for a global service.

So in the end, is it better to have a silly answer to "Why was the database upgrade delayed this time?"? Or is it better for the answer to be "I don't know, but upgrading the database cost us 37k€ in compute, 31k€ for storage, and 125k€ in egress for backing up the previous db"? I much prefer silly answers, but maybe that's because I don't have tens of thousands of euros to be shorted of, even if I wanted to :-)


> It's not even certain you could diagnose the problem in the first place, let alone deploy a workaround.

It's easy to detect: ntpdate -q will tell you the drift, and your logging would tell you when ntpd gave up because the skew was too large.

Correction would depend on why it was happening: you might be able to tell ntpd/chrony to adjust more frequently or to accept larger deltas, but at that point I'd also say that the best path would be pulling that instance from service and replacing it so it's not critical.

> You'd have to send dozens of emails to tech support, who of course would say there is no problem because their machines have 999999999999999999% uptime and nothing could be wrong on their side, but hey, if you've got way too much money they can sell you advanced support/engineering to help you find the problem in your code.

Has that been your experience? Based on mine, it would be more like “we confirmed the problem you showed and have contacted the service team”, followed by an apologetic call from your TAM. I've opened plenty of issues over the years but have never had someone insist that something is not a problem because their systems are perfect; they'd basically have to be saying you faked the evidence in your report.


> ntpdate -q will tell you the drift, and your logging (...)

Sure, that's if you've got root on the machine. I was not talking about VPS hosting, which we've been doing for a long time, but rather about so-called "serverless" services.

> Has that been your experience?

That's been my experience with most managed hosting I've had over the years, though I've had some good experiences too. I've never dealt with Amazon, but since you can't run diagnostics (you have no shell), and since, like any other business, their first level of customer support is likely inexperienced and reading from a script, I'm guessing you're gonna have a bad time if you encounter weird/silent errors from your cloud services.

Some hardware errors are hard enough to diagnose with root access and helpful customer support; I can't imagine doing it without those.


While it's true that you can't just shell into a managed host, you can run diagnostics in many cases[1]. I would also say that it's _far_ less common for those services to have hardware issues; part of what they're doing is optimizing to make things like health checks and rebuilds easy, since the places with durable state are very clearly identified and isolated.

I've never seen a hardware problem in those environments where the platform didn't detect a fault first (e.g. you see latency spike on some requests for a minute before a new instance comes online; it doesn't say “underlying hardware failure”, but clearly something changed on the host, or perhaps a neighbor got very noisy), and that includes things like the various Intel exploits, where you could see everything rotate a week before the public advisory went out. I will say that I've had a few poor run-arounds with first-level support, but I've never had them refuse to escalate or switch to a different tech when I said the first response wasn't satisfactory.

1. e.g. on AWS, “Fargate” refers to the managed Firecracker VMs, as opposed to the traditional ECS/EKS modes using servers you can access, but you can use https://docs.aws.amazon.com/AmazonECS/latest/developerguide/... to run a command inside the container environment.
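
To make that concrete, here's a rough sketch of what ECS Exec looks like from the SDK side. Illustrative only; the cluster, task, and container names are placeholders.

    // Sketch: starting an ECS Exec session with the AWS SDK for JavaScript v3.
    // The AWS CLI equivalent is:
    //   aws ecs execute-command --cluster example-cluster --task <task-id> \
    //     --container app --interactive --command "/bin/sh"
    import { ECSClient, ExecuteCommandCommand } from "@aws-sdk/client-ecs";

    const ecs = new ECSClient({});

    const result = await ecs.send(
      new ExecuteCommandCommand({
        cluster: "example-cluster", // placeholder
        task: "<task-id>",          // placeholder
        container: "app",           // placeholder
        interactive: true,          // ECS Exec currently requires interactive mode
        command: "/bin/sh",
      })
    );

    // The response contains an SSM session (stream URL + token); the CLI
    // attaches to it for you via the Session Manager plugin.
    console.log(result.session?.sessionId);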


I've seen Lambdas run into that aplenty on AWS. There's clock skew between the Lambda and S3, resulting in measurable signature mismatch errors. There are even settings on the S3 client in JS to sync times to account for the skew.
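
For reference, in the v2 JavaScript SDK that setting would be the correctClockSkew client option. A minimal sketch, with placeholder bucket and key names:

    // Sketch: opting into clock-skew correction on the v2 JS SDK's S3 client.
    // When a request fails with a skew-related error, the SDK uses the
    // server-reported time to adjust its offset and retries the request.
    import AWS from "aws-sdk";

    const s3 = new AWS.S3({ correctClockSkew: true });

    s3.putObject({ Bucket: "example-bucket", Key: "example.json", Body: "{}" })
      .promise()
      .then(() => console.log("uploaded"))
      .catch((err) => console.error(err.code, err.message)); // e.g. RequestTimeTooSkewed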



