Sincere apologies to all GitHub users for the downtime this morning, and the brief outages last week as well. We take reliability very seriously, and will publish a full RCA in the near future.
When I tried to log in I was prompted to "upgrade my account" to a "Bitbucket Cloud" account. After doing so, all of my repositories were gone. It seems that my repositories remained on my regular "Bitbucket" account, but my email address was no longer associated with it, giving me no way of logging in to it. I emailed support 6 hours ago and have yet to get a response.
For the record, Azure DevOps did the same thing to me when we switched over to Azure AD. My account and repositories ended up in an entirely corrupt state. Support was eventually able to resolve most of it, but I’m still discovering problems.
This is 100% not the Dunning-Kruger effect. How on Earth would it be?
edit - from wikipedia: In the field of psychology, the Dunning–Kruger effect is a cognitive bias in which people assess their cognitive ability as greater than it is.
I interpreted GP as saying the aforementioned management has just enough cursory knowledge to want to apply the same hammer that worked on a simple system to a complex one, but not enough knowledge to realize the unknown-unknowns they aren't even aware of.
This seems to be the third or so day in the past week I've had issues with GitHub around this time in the morning. They've typically been really good. I'm a bit surprised there hasn't been more talk about it on HN.
They seem to be doing heavy work on it. Now on mobile you can't see repos in "Desktop Mode", which is unfortunate; I have to tell my browser to pretend to be in desktop mode. Plus the regex post from the other day, where somebody from GH replied in the thread, seems to imply they are working on new things. I don't mind improvements, but don't break production, guys...
They also changed the "Group Membership" dialog to be paginated when you add a new person to an organization. We have over 200 groups, so now I have to page through them for every new hire we add. There's not even a search option.
I'm sure the pagination might be better for performance, but it's terrible UI.
They may have missed that one, because at the same time they introduced both pagination and search to the repository membership page. And boy did that help us on one of our repos with a few hundred direct collaborators: by the end we could only manage access through the API, because the page didn't even load most of the time.
I see why web pages need pagination so the server or browser doesn't OOM, but there really ought to be 10000 entries per page, not the 25 that most sites seem to like.
Ctrl+F on a list of 10000 entries is far easier than clicking through 400 ajaxy pages and trying to figure out some custom and buggy filtering system that probably doesn't allow regex.
Past 10000 records most sites probably ought to just let you export in something BigQuery-compatible anyway - Regular Joe isn't going to have more than 10000 of anything, and anyone who does can learn how to use proper data tools.
They really need beta.github.com to let people test changes that are not yet de facto. A UserVoice type of thing, with the ability for people to join. I love to beta test and give feedback. Microsoft has used UserVoice in the past, as have Sulake and other companies I've beta tested for (as a customer).
Edit:
Realized *.github.com takes you to your .github.io sites.
> Now on Mobile you can't see repos in "Desktop Mode" which is unfortunate.
Wait, what? On iOS Safari I can only see repos in desktop mode now (except the issue tracker which is responsive anyway). Which is a good thing. Not sure why you have the exact opposite experience?
(I do vaguely recall being asked if I would prefer desktop mode on my phone a while back, and I said yes.)
Intentions are meaningless. If you're providing a service, (especially charging money for said service), you can't break it because "you're hard at work".
When was the last time BitBucket had an outage? Personally I don't see a lot of difference between the two platforms, or GitLab (my primary now). GitHub probably has the best UI, but GitLab's has gotten a lot better; and there are always self-hosted solutions like Gogs.
I've been doing on-going client work for someone using BitBucket and for weeks it feels like every other day has an outage related to their pipelines (CI) feature (the thing I happen to be working on).
There were constant banners about service disruption. There are a lot of UI outage-related issues too, like the pipelines page starting to show a new build but never updating any of the progress until you reload the page -- which sounds like some type of API outage somewhere. I'm not sure if that gets reported as an outage, but it makes using the platform not fun.
I'm pleased I don't have to deal with BitBucket any more, but a year or two back it felt like it had an outage that impacted work at least once every six months. Sure, that might not sound like much, but it was always a pain.
Plus of course the service was so damn slow that using it was a daily pain.
There's a yellow banner (that you can't even close) shown every few weeks, and it's usually related to Pipelines being down, again. That often results in degraded functionality in other parts of the project too. And it's still slow as molasses. I hope I never have to use Bitbucket again.
Yes, I use it for personal projects. I also use a company-hosted version at work. The built-in CI is great. I can't think of a reason, other than price for companies, to use GitHub over GitLab. Both are great, but GitLab's built-in CI is, I think, easier to use and better integrated.
Not answering this directly, but the paper Meaningful Availability [0] released recently really changed my opinion on how to calculate and visualize availability. There's a discussion on HN as well [1].
That was insightful up to a point. But, of course, the relevant metric is "expected availability" -- before I decide to use the service; therefore, not the same as "customers served". If I have to think about downtime then my experience is degraded; more so if I have to delay and batch planned interactions (all of which will later count as successful!)
[Edit: To the point: a high rate of randomly-timed failures is a kind of degraded experience, but not as critical as blocky patches of downtime. A 1% rate of randomly-timed failures is much preferable to having the service go out three straight days every February.]
Also: uptime is not the same as "customer delight". It's all about time.
> a high rate of randomly-timed failures is a kind of degraded experience, but not as critical as blocky patches of downtime
Do you think that's an accurate generalization for all software and business contexts? I think a novel insight from the paper is that windowed user uptime is able to visualize the differences. (See Figure 20 from the paper.)
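For intuition, here's a rough sketch of the windowed idea (not the paper's exact user-uptime formula; the minute-by-minute event data and the one-hour window are made up): the same outage looks mild as a single aggregate number but terrible in its worst window.

    # Rough sketch, not the paper's exact definition: aggregate availability
    # vs. the worst availability observed in any fixed one-hour window.
    events = [(minute, not (600 <= minute < 660)) for minute in range(1440)]  # one bad hour in a day

    aggregate = sum(ok for _, ok in events) / len(events)

    window = 60  # minutes
    per_window = [
        sum(ok for minute, ok in events if start <= minute < start + window) / window
        for start in range(0, 1440, window)
    ]

    print(f"aggregate availability: {aggregate:.2%}")      # ~95.83%
    print(f"worst 1h window:        {min(per_window):.2%}")  # 0.00%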
It certainly is not - a trader might not care about 99% of the time, but the exact moment they want to make a trade, the system must work.
Whereas if some GitHub request fails and I retry, it's a minor annoyance, and in most cases I won't even know whether the fault was GitHub's, my local system's, or some networking in between.
> Our Uptime calculation is based on the percentage of successful requests we serve through our web, API, and Git client interfaces.
So when a customer finds a broken service it is in their financial best interest to repeatedly hammer the broken service and drive down the uptime calculation to trigger their rebate.
Just an observation, not a suggestion. I’d fire any customer I found doing this.
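To make that concern concrete, here's a toy calculation (entirely hypothetical traffic numbers) showing how retry-hammering one broken endpoint moves a request-success-rate metric:

    # Hypothetical numbers: normal traffic is 99% successful, then one customer
    # hammers the broken endpoint with retries that all fail.
    ok_requests, normal_total = 990_000, 1_000_000
    hammer_failures = 500_000

    before = ok_requests / normal_total
    after = ok_requests / (normal_total + hammer_failures)

    print(f"success-rate uptime before hammering: {before:.2%}")  # 99.00%
    print(f"success-rate uptime after hammering:  {after:.2%}")   # 66.00%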
What do you mean? All modern enterprise analytics/monitoring solutions are going to be able to give you some kind of top-level "request success rate" metric. I assume they mostly just lean into whatever monitoring tooling they have set up. What kind of "calculation" are you imagining here? Like a very specific SRE formula for availability windows or something?
You want to start measuring closest to your users. In most cases that would be some sort of load balancer. I don't think there's much more you can do without going to the client side.
I mean that seems like a question to ask an account rep? I'm sure it's also probably not a hard and fast rule for every single customer, hence the ambiguity in the general language.
They obviously don't have beacons on the client side. I wonder if it's based on statistics: at this time of day on a Tuesday we should be getting x requests, but are only getting x/n.
Client-side beacons would have to be implemented in the git client, making parallel requests. I've seen no evidence of it myself, and I'd think people would freak out if that were found inside git's source.
It’s evolutionary pressure; software is malleable and potential functionality is limitless. Software companies that didn’t subscribe to this philosophy were repeatedly killed by ones that did, until it became the status quo.
I mean, that's rather disputable: the Apollo 1 capsule caught fire, medical mistakes have a toll of 250,000 deaths per year in the US alone, and there are many other serious mistakes in vastly different areas. I think unreliability is, unfortunately, a constant of the human race.
The interesting thing is that Git is entirely non-centralized, so in theory they could simply redirect to servers onto which the data has been mirrored.
Git is, but the APIs and all the services they provide around it aren't.
That said, I think it's a bit weird that they don't store the data of the services around the code itself in git, like they do with e.g. sites. That way you'd have an `issues` branch that you could still access if github is down.
But that would probably pave the way for easy migrations away from Github.
Every project I have been on where we built for availability and resilience has inevitably had at least one single point of failure. Usually it is something deemed non-critical, but it can somehow still bring the infrastructure down. (A single DNS server at one of our production sites did this. We had two more accessible via a VPN tunnel, and it was deemed that if the production DNS went down the other two would still be reachable; too bad the day it happened the tunnel was down too.)
Also you have to deal with sysadmin error. I know us sysadmins are practically perfect in every way, but occasionally we make mistakes....big mistakes. ;)
When Microsoft buys companies, they tend to progressively decay as the original architects leave, the morale of remaining employees grinds down from the stress, and they bring in cheaper contractors to duct-tape the bits together and plug the holes in the levee with their fingers. I've BTDTBTTS. cough LinkExchange, WebTV, Hotmail, Skype, Softricity, Nokia, LinkedIn, Danger/Sidekick cough GH maybe next. ¯\_(ツ)_/¯
That has nothing to do with Microsoft. That is ANY large merger.
1. Nothing is going to change. We bought this company because we love it
2. We need to show a higher profit for this quarter, cut all expenses for every subsidiary by 15% by Friday
3. Cut back on training, R&D, and support teams. They are a huge cost center
4. Bunch of employees leave after retention bonuses, replaced with MUCH cheaper labor
5. Need to show better on our next quarterly filing, slightly increase prices
6. Through attrition, replace more good people with cheap drones, until nobody knows WHY things are the way they are.
7. More price increases, and much more expensive support contracts
8. Wonder why we have lost all this market share. Look at Company X, they are doing great, let's buy them.
Skype was bad before Microsoft, when it was still part of eBay, and stayed terrible after Microsoft. LinkedIn was legendary for its use of dark patterns way before Microsoft.
Nokia... I don’t know about that one. The shift to smartphones hit every “old” phone brand... Ericsson did not survive, Siemens did not survive, Alcatel is just a brand now, even Sony has a hard time... Nokia would probably have died no matter what. All the Maemo/Linux-based OSes (that kept changing names all the time) were nice, but so was Palm’s WebOS...
Well, some other people in the comments disagree. And it hasn't happened yet, but it's that they don't manage / integrate acquisitions very well unless they're wowie complementary products like Visio. Danger dropped off a cliff, and Softricity was absolutely amazing but shelved, so friends of mine basically repeated the theme for VMware View and were acquihired by VMware. Time will tell where GH goes.
People's comments are meaningless; you can look at historical GitHub uptime and see that it hasn't changed meaningfully.
"And it hasn't happened yet"
Ah yes, now you have to backtrack from: it happened! to... no wait I promise it will happen! Based on... what? The fact that some acquisitions don't go well?
This is all pure speculation with no substantiation.
I went back to the time of Microsoft's acquisition, and the status seems heavily underreported. At least when I checked just now, it was all green. That does not reflect my experience.
To be fair, they didn't say that was the cause of the current outage. They made a general observation that Microsoft-acquired companies degrade, which seems like a fairly reasonable observation: The goal of being the best Git service/repo falls by the wayside as other corporate goals push in.
Early Danger adopter here. The hiptop was great for its time, but with the advent of the iPhone it was obsolete. That was before the MS acquisition. Rubin (and presumably much of his team) had left for Android long before.
Danger was already in freefall by the time of the acquisition. You can't blame Microsoft for that.
Do a search on HN for "Github down". It happened a lot before the Microsoft acquisition in mid-2018. Perhaps what you're saying is true, but your comment is entirely irrelevant to this outage.
Have they written any postmortems regarding their last couple of degradations? I tried searching their blog but the only ones that popped up were over a year old.
I rolled my eyes twice, then merged a few PRs manually and moved on with my day. (i.e. the git server itself and all the APIs required to interact with the CI automation _appear_ to work just fine)
And CI integration, which runs automated tests, which must pass to release a package/docker image/whatever, which is required for deploying a new version of your system, etc. The whole thing is as distributed as the business of a guy selling hot dogs on the street.
Almost nothing I work on has a dependency on GitHub. Whether work or personal. We have all dependencies vendored at work and my personal stuff is the same.
There really is no good reason to be so dependent on GitHub.
I had the same guess... migrating from aws to azure & hitting some bumps. Have to assume they won't be very forthcoming about it if that is the reason.
> migrating from aws to azure & hitting some bumps
Could be the IO? I remember colleagues working on getting stuff running on Azure, and they experienced horrible IO latency, as well as very low throughput for lots of small IO (aka unix-style software).
That was a few years back so it might have improved since, but if those things are still non-optimal and GitHub is built with a unix-style vision of tons of small IO accesses…
This specific outage taught me that GitHub apparently stores git repos on-disk, which I was not expecting (because the API complained it could not delete repos until they'd been fully backed up to disk, or something).
% uptime is a terrible metric. Being down for an hour in the middle of the day is only .14% downtime for the month, but is typically regarded as a big deal. If it happened every single workday you'd still be looking at 97% uptime! Sounds wonderful, right? Just work around it for an hour per day!
Number of outages and duration total are much better metrics.
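The arithmetic checks out, assuming a 30-day month and roughly 21 workdays:

    # Quick sanity check of the figures above (30-day month, ~21 workdays assumed).
    month_hours = 30 * 24                # 720 hours
    single_outage = 1 / month_hours      # one 1-hour outage
    daily_outages = 21 / month_hours     # a 1-hour outage every workday

    print(f"one 1h outage:    {single_outage:.2%} downtime")    # ~0.14%
    print(f"1h every workday: {1 - daily_outages:.1%} uptime")  # ~97.1%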
Looking again, the downtime reported under 'API Requests' seems accurate, but none of the earlier outages report downtime under 'Git Operations', which were also broken.
I didn't notice there was a selector on that page for which type of downtime, and that it didn't default to all.
I've been noticing the issue they're describing for a few days now, with errors in GitHub Actions requiring rebuilds and webhooks not seeming to fire, which caused Jira to go out of sync.
> We continue to investigate the issues with GitHub services and will shift to a slower update cadence to provide more meaningful updates going forward.
Posted 18 minutes ago. Feb 27, 2020 - 16:12 UTC
A second mirror doesn't really help - when github goes down, the code should still be available locally on your computer. The things that become truly unavailable when github dies are all the non-git features: issues, PRs, etc...
There are several ways to work around this, but none are really satisfying.
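One of those workarounds is to periodically snapshot issue data through the REST API so at least a read-only copy survives an outage. A minimal sketch (the owner/repo names and output path are placeholders; add a token for private repos or higher rate limits):

    # Minimal sketch: dump a repo's issues (the endpoint also returns PRs) to
    # local JSON so a read-only copy is available if GitHub goes down.
    # "someorg"/"somerepo" are placeholder names.
    import json
    import requests

    def snapshot_issues(owner, repo, out_path):
        issues, page = [], 1
        while True:
            resp = requests.get(
                f"https://api.github.com/repos/{owner}/{repo}/issues",
                params={"state": "all", "per_page": 100, "page": page},
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break
            issues.extend(batch)
            page += 1
        with open(out_path, "w") as f:
            json.dump(issues, f, indent=2)

    snapshot_issues("someorg", "somerepo", "issues-backup.json")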
>We continue to investigate the issues with GitHub services and will shift to a slower update cadence to provide more meaningful updates going forward.
Translation: our shitty software update practices are now affecting Github, not just Windows!
If anyone from Microsoft is reading this, why is your company so incompetent at software updates in the past few years?
I've read it as "we'll stop posting silly updates to this incident until we actually know something", not "we'll stop rolling out updates to our services".
I'd wage that Microsoft practices had nothing to do with this, but I'll wait for RCA.
You honestly believe that a company the size of GitHub has had their software update practices appreciably change since the acquisition? Relax, GitHub had update issues before and they will have them again.