GitHub was down (githubstatus.com)
291 points by arparthasarathi on Feb 27, 2020 | 165 comments



Sincere apologies to all GitHub users for the downtime this morning, and the brief outages last week as well. We take reliability very seriously, and will publish a full RCA in the near future.


> […] downtime this morning, and the brief outages last week as well.

For context, there were four (public) incidents this week:

• Incident on Feb 27, 14:31 until 18:54 UTC - https://www.githubstatus.com/incidents/q07bfjh7jf1t

• Incident on Feb 25, 16:36 until 18:48 UTC - https://www.githubstatus.com/incidents/xp2qc958g4wt

• Incident on Feb 20, 21:31 until 22:16 UTC - https://www.githubstatus.com/incidents/bd29l6zgr43g

• Incident on Feb 19, 15:17 until 16:09 UTC - https://www.githubstatus.com/incidents/fxbbtd7mhz1c


Well Bitbucket lost all of my repositories this morning so you've got a long way to fall


Bitbucket also has frequent downtime: https://bitbucket.status.atlassian.com/


What do you mean they "lost your repositories"?


When I tried to log in I was prompted to "upgrade my account" to a "Bitbucket Cloud" account. After doing so, all of my repositories were gone. It seems that my repositories remained on my "Bitbucket ""Regular""" account but that my email address was no longer associated with it, giving me no way of logging in to it. I emailed support 6 hours ago and have yet to get a response.


Wow Atlassian... That is quite horrifying! Glad I stuck with GitHub through the MS acquisition. Bitbucket was probably my main alternative.


For the record, Azure DevOps did the same thing to me when we switched over to Azure AD. My account and repositories ended up in an entirely corrupt state. Support was eventually able to resolve most of it, but I’m still discovering problems.


I shudder at the thought of GitHub one day trying to integrate with people's Microsoft accounts.


Yep. Luckily I was able to recover the most important one - a website I'd built a couple years ago for a paying client - from Heroku, of all places.


At least your main competitor's uptime metrics are also pretty bad, so fingers crossed.


What did you guys deploy/what scale tipping point did you guys hit that caused the past 3 days of problems?

At my job, if something goes wrong... management just tells us to roll it back. That always fixes the problem, right? :P


Yep, it's the Dunning-Kruger effect.

If rolling back works for a simple system, why wouldn't it work for a complex one?

Because one cannot step into the same river twice. Heraclitus would probably make a good engineering manager.


This is 100% not the Dunning-Kruger effect. How on Earth would it be?

edit - from wikipedia: In the field of psychology, the Dunning–Kruger effect is a cognitive bias in which people assess their cognitive ability as greater than it is.


Maybe not directly but it could still apply.

I interpreted GP as saying the aforementioned management has just enough cursory knowledge to want to apply the same hammer that worked on a simple system to that on a complex system, but not enough knowledge to realize the unknown-unknowns that they aren't even aware of.


I wouldn't say that's a psychological phenomenon though, just ignorance, arrogance or over-confidence.


Thanks Nat! Keep up the good work and thanks for contributing here. Github is still my favorite. =)


Do you guys ever miss meritocracy?


This seems to be the third or so day in the past week I've had issues with GitHub around this time in the morning. They've typically been really good. I'm a bit surprised there hasn't been more talk about it on HN.


They seem to be doing heavy work on it. On mobile you can no longer see repos in "Desktop Mode", which is unfortunate; I have to tell my browser to pretend to be a desktop one. Plus the regex post from the other day, where somebody from GH replied in the thread, seems to imply they are working on new things. I don't mind improvements, but don't break production, guys...


They also changed the "Group Membership" dialog to be paginated when you add a new person to an organization. We have over 200 groups, so now I have to page through them for every new hire we add. There's not even a search option.

I'm sure the pagination might be better for performance, but it's terrible UI.


They may have missed that one, because at the same time they introduced both pagination and search to the repository membership page, and boy did that help us on one of our repos with a few hundred direct collaborators; toward the end we could only manage access through the API because the page didn't even load most of the time.


I see why web pages need pagination so the server or browser doesn't OOM, but there really ought to be 10000 entries per page, not the 25 that most sites seem to like.

Ctrl+F on a list of 10000 entries is far easier than clicking through 400 ajaxy pages and trying to figure out some custom and buggy filtering system that probably doesn't allow regex.

Past 10000 records most sites probably ought to just let you export in something bigquery compatible anyway - Regular Joe isn't going to have more than 10000 of anything, and anyone who does can learn how to use proper data tools.


> there really ought to be 10000 entries per page

Did you miss the part where I noted Github’s lists fail to load (let alone render) long before that point?


They really need beta.github.com to let people test changes that are not yet de facto. A UserVoice type of thing, with the ability for people to join. I love to beta test and give feedback. Microsoft has used UserVoice in the past, as have Sulake and other companies I've beta tested for (as a customer).

Edit:

Realized *.github.com takes you to your .github.io sites.


> Now on Mobile you can't see repos in "Desktop Mode" which is unfortunate.

Wait, what? On iOS Safari I can only see repos in desktop mode now (except the issue tracker which is responsive anyway). Which is a good thing. Not sure why you have the exact opposite experience?

(I do vaguely recall being asked if I would prefer desktop mode on my phone a while back, and I said yes.)


I think it sets a cookie or similar to save this.


[flagged]


Intentions are meaningless. If you're providing a service (especially if you're charging money for said service), you can't break it just because "you're hard at work".


>you can't break it because "you're hard at work"

Apparently you can...because it is broken.

A few days of degraded service is frustrating, but their up-time has bought them a lot of credit in my book - especially considering what they do.

This is the real world. No service is magically infallible.


> Intentions are meaningless

Are you a robot? Have some humanity...


GitHub is not a human being, it's a company.


A lot of people have patience on day one, but this is the third or so day now this has happened. It’s understandable there are ruffled feathers.


Yeah... once every few weeks is one thing. Once a day is getting really annoying.


Yea, I'm surprised I didn't see anything from the outage yesterday.


Maybe they can’t post it because it doesn’t work /s


Still miles better than BitBucket.


When was the last time BitBucket had an outage? Personally I don't see a lot of difference between the two platforms; or GitLab (my primary now). Github probably has the best UI, but Gitlab's has gotten a lot better; and there are always self hosted solutions like Gogs.


> When was the last time BitBucket had an outage?

I've been doing ongoing client work for someone using Bitbucket, and for weeks it has felt like every other day has an outage related to their Pipelines (CI) feature (the thing I happen to be working on).

There were constant banners about service disruption. There are a lot of UI outage-related issues too, like the pipelines page showing a new build but never updating its progress until you reload the page -- which sounds like some type of API outage somewhere. I'm not sure if that gets reported as an outage, but it makes using the platform not fun.


I'm pleased I don't have to deal with Bitbucket any more, but a year or two back it felt like it had an outage that impacted work at least once every six months. Sure, that might not sound like much, but it was always a pain.

Plus of course the service was so damn slow that using it was a daily pain.


There's a yellow banner (that you can't even close) shown every few weeks, and it's usually related to Pipelines being down, again. That often results in degraded functionality in other parts of the product too. And it's still slow as molasses. I hope I never have to use Bitbucket again.


How much data does BitBucket have to process on a daily basis compared to GitHub, or GitLab?

I imagine that there are stability issues that any provider will have to deal with as they scale to account for the masses.


GitHub's diffs are pretty much instantaneous; Bitbucket just gives up: "now that's a lot of code!". No, it was a one-line change, actually.


Github will bail on large diffs, too.


I've also much preferred GitLab's GUI. It seemed to be much cleaner and smoother than GitHub's (not to mention the much better-named Merge Request).


Sure, but not better than Gitlab.


Do you use Gitlab? Everybody on HN loves to love Gitlab because they're the underdog, and the product isn't bad, but it's not that great either.


Yes, I use it for personal projects. I also use a company-hosted version at work. The built-in CI is great. I can't think of a reason, other than price for companies, to use GitHub over GitLab. Both are great, but GitLab's built-in CI is, I think, easier to use and better integrated.


Yeah, you just lose 6 hours of prod data on gitlab. :)


Don't forget to check your SLAs

Enterprise = 99.95% (quarterly)

https://help.github.com/en/github/site-policy/github-enterpr...

They're having a bad February but January was good. We will see what March has in store


> How do we calculate Uptime?

> Our Uptime calculation is based on the percentage of successful requests we serve through our web, API, and Git client interfaces.

Just curious, how do they measure this? What is the actual calculation?


> What is the actual calculation?

Not answering this directly, but the paper Meaningful Availability [0] released recently really changed my opinion on how to calculate and visualize availability. There's a discussion on HN as well [1].

[0]: https://www.usenix.org/system/files/nsdi20spring_hauer_prepu... [1]: https://news.ycombinator.com/item?id=22424173


That was insightful up to a point. But, of course, the relevant metric is "expected availability" -- before I decide to use the service; therefore, not the same as "customers served". If I have to think about downtime then my experience is degraded; more so if I have to delay and batch planned interactions (all of which will later count as successful!)

[Edit: To the point: a high rate of randomly-timed failures is a kind of degraded experience, but not as critical as blocky patches of downtime. A 1% rate of randomly-timed failures is much, much preferable to having the service go out three straight days every February.]

Also: uptime is not the same as "customer delight". It's all about time.


> a high rate of randomly-timed failures is a kind of degraded experience, but not as critical as blocky patches of downtime

Do you think that's an accurate generalization for all software and business contexts? I think a novel insight of the paper is that windowed user uptime is able to visualize these differences. (See Figure 20 from the paper.)


It certainly is not - a trader might not care about 99% of the time, but at the exact moment they want to make a trade the system must work.

Whereas if some GitHub request fails and I retry, it's a minor annoyance; in most cases I won't even know whether the fault was GitHub's, my local system's, or some networking in between.


> Our Uptime calculation is based on the percentage of successful requests we serve through our web, API, and Git client interfaces.

So when a customer finds a broken service it is in their financial best interest to repeatedly hammer the broken service and drive down the uptime calculation to trigger their rebate.

Just an observation, not a suggestion. I’d fire any customer I found doing this.


What do you mean? All modern enterprise analytics/monitoring solutions are going to be able to give you some kind of top-level "request success rate" metric. I assume they mostly just lean into whatever monitoring tooling they have set up. What kind of "calculation" are you imagining here? Like a very specific SRE formula for availability windows or something?
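
For illustration, that definition boils down to a plain success-rate computation over request logs. A minimal sketch (assuming a combined-format access log where field 9 is the HTTP status, and counting anything below 500 as "successful" -- both assumptions on my part, not GitHub's actual method):

  # percentage of non-5xx requests in an access log
  awk '{ total++; if ($9 < 500) ok++ } END { if (total) printf "%.4f%%\n", 100 * ok / total }' access.log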


Depending on what part of the system is down, how do you know you even got a request to mark as failed?


You want to start measuring closest to your users. In most cases that would be some sort of load balancer. I don't think there's much more you can do without going to the client side.


So their SLA is based on non-5xx responses and connection errors?


I mean that seems like a question to ask an account rep? I'm sure it's also probably not a hard and fast rule for every single customer, hence the ambiguity in the general language.


Same here.

They obviously don't have beacons on the client side. I wonder if it's based on statistics: at this time of day on a Tuesday we should be getting x requests but are getting only x/n.


> They obviously don't have beacons on the client side

Says who?


Client-side beacons would have to be implemented in the git client, making parallel requests. I've seen no evidence of that myself, and I'd think people would freak out if they found it inside git's source.


Nothing quite like the “maybe” credit-backed SLA.


What are the consequences of busting 99.95%? What happens if it is 99.94% vs 19.94%?


Mitigate: Local repos.


With their new GitHub Actions, this downtime could stall your entire company's workflows if you fully depend on it.


No matter how many talented engineers you have on staff, your entire service can still go down. Let's pause and reflect on that. ;)


"You can't legislate against failure, but you can focus on fast detection and response"

-- Chris Pinkham


"I will not be harassed in my own private domicile"

-- Jesse Pinkman


"I AM THE ONE WHO KNOCKS"

-- Walter White


"Prevention is ideal, but detection is a must"


It’s amazing how this is accepted in the software world. Move fast and break things, such a different philosophy to other areas.


It’s evolutionary pressure; software is malleable and potential functionality is limitless. Software companies that didn’t subscribe to this philosophy were repeatedly killed by ones that did, until it became the status quo.


I mean, that's rather disputable: Apollo 1 caught fire, medical mistakes take a toll of 250,000 deaths per year in the US alone, and there are many other serious mistakes in vastly different areas. I think unreliability is unfortunately a constant of the human race.


I think the quote from Pinkham is more about dealing with genuinely unavoidable failures, not those due to moving without appropriate due diligence.


The interesting thing is that Git is entirely decentralized, so in theory they could simply redirect to servers onto which the data has been mirrored.


Git is, but the APIs and all the services they provide around it aren't.

That said, I think it's a bit weird that they don't store the data of the services around the code itself in git, like they do with e.g. sites. That way you'd have an `issues` branch that you could still access if github is down.

But that would probably pave the way for easy migrations away from Github.
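
As a rough illustration of the `issues` branch idea (the branch name, default branch, and file layout below are all made up, not anything GitHub actually does):

  # hypothetical: keep issue data in an orphan branch alongside the code
  git checkout --orphan issues
  git rm -rf .                      # start the branch with an empty tree
  echo '{"id": 1, "title": "example", "state": "open"}' > 0001.json
  git add 0001.json && git commit -m "Add issue #1"
  git checkout master               # back to the code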


>"But that would probably pave the way for easy migrations away from Github."

bingo


At least they could have used the concepts in Git's design. But it seems they didn't learn much from the tool they based their service on.


I don't think learned is the right word to use here. Github's centralized design and vendor lock-in is quite intentional.


Every project I have been on where we built for availability and resilience has inevitably had at least one single point of failure. Usually it is something deemed non-critical that can somehow still bring the infrastructure down. (A single DNS server at one of our production sites did this: we had two more accessible via a VPN tunnel, and it was deemed that if the production DNS went down the other two would still be reachable; too bad that the day it happened the tunnel was down too.)

Also you have to deal with sysadmin error. I know we sysadmins are practically perfect in every way, but occasionally we make mistakes... big mistakes. ;)

So redirecting might not always be possible...


Yes, git is.

The issues, comments, PRs, wikis etc... that we all came to depend on aren't.


The wiki is stored as a git repository, although to my knowledge the others are not.


could learn a few things from Netflix and chaos engineering...


Let’s reflect on Amazon’s 99.999999999% (literally their number) durability on S3.


Durability isn't uptime.


Correct. Availability is 99.99. https://aws.amazon.com/s3/storage-classes/


Regardless, the SLA that AWS (& other cloud providers) meet is quite impressive.


But what do they offer if they don't meet it? I mean a discount on your monthly bill somehow doesn't sound like it would cover a potential loss.


When Microsoft buys companies, they tend to progressively decay as the original architects leave, the morale of remaining employees grinds down from the stress, and they bring in cheaper contractors to duct-tape the bits together and plug the holes in the levee with their fingers. I've BTDTBTTS. cough LinkExchange, WebTV, Hotmail, Skype, Softricity, Nokia, LinkedIn, Danger/Sidekick cough GH maybe next. ¯\_(ツ)_/¯


That has nothing to do with Microsoft. That is ANY large merger.

  1. Nothing is going to change.  We bought this company because we love it
  2. We need to show a higher profit for this quarter, cut all expenses for every subsidiary by 15% by Friday
  3. Cut back on training, R&D, and support teams.  They are a huge cost center
  4. Bunch of employees leave after retention bonuses, replaced with MUCH cheaper labor
  5. Need to show better on our next quarterly filing, slightly increase prices
  6. Through attrition, replace more good people with cheap drones, until nobody knows WHY things are the way they are.
  7. More increased prices, and way increased support contracts
  8. Wonder why we have lost all this marketshare.  Look at Company X, they are doing great, let's buy them.


Can you remove the leading spaces causing your comment to appear in a code block? It’s quite hard to read on mobile. Thanks!


Skype was bad before Microsoft, when it was still part of eBay, and stayed terrible after Microsoft. LinkedIn was legendary for dark-pattern usage way before Microsoft.

Nokia... I don’t know about that one. The shift to smartphones hit every “old” phone brand... Ericsson did not survive, Siemens did not survive, Alcatel is just a brand now, even Sony has a hard time... Nokia would probably have died no matter what. All the Maemo/Linux-based OSes (that kept changing names all the time) were nice, but so was Palm’s WebOS...


IME GitHub has not had increased downtime after Microsoft's acquisition.


Well, some other people in the comments disagree. And it hasn't happened yet, but the pattern is that they don't manage / integrate acquisitions very well unless they're wowie complementary products like Visio. Danger dropped off a cliff, and Softricity was absolutely amazing but shelved, so friends of mine basically repeated the theme for VMware View and were acqui-hired by VMware. Time will tell where GH goes.


There are so many logical flaws here.

People's comments are meaningless; you can look at historical GitHub uptime and see that it hasn't changed meaningfully.

"And it hasn't happened yet"

Ah yes, now you have to backtrack from: it happened! to... no wait I promise it will happen! Based on... what? The fact that some acquisitions don't go well?

This is all pure speculation with no substantiation.

I recommend learning about confirmation bias.


I agree, but I just now took a look here: https://www.githubstatus.com/uptime?page=7

I went back to the time of Microsoft's acquisition, and the status seems heavily underreported. When I checked just now, it was all green, green. That does not reflect my experience.


To be fair, they didn't say that was the cause of the current outage. They made a general observation that Microsoft-acquired companies degrade, which seems like a fairly reasonable observation: The goal of being the best Git service/repo falls by the wayside as other corporate goals push in.


>Softricity was absolutely amazing but shelved

Why do you say that? I thought it was just renamed App-V and it's still going strong to this day. https://docs.microsoft.com/en-us/windows/application-managem...


Early Danger adopter here. The hiptop was great for its time, but with the advent of the iPhone it was obsolete. That was before the MS acquisition. Rubin (and presumably much of his team) had left for Android long before.

Danger was already in freefall by the time of the acquisition. You can't blame Microsoft for that.


Microsoft GitHub is currently part of core Microsoft software development infrastructure. The Windows source code even resides on GitHub.


Do a search on HN for "Github down". It happened a lot before the Microsoft acquisition in mid-2018. Perhaps what you're saying is true, but your comment isn't really relevant to this outage.


Have they written any postmortems regarding their last couple of degradations? I tried searching their blog but the only ones that popped up were over a year old.


I would also be really interested in those, but haven't found anything yet. Maybe their new notification system has something to do with it?


By the time I notice GitHub acting weird and check HN to see if it's only me, there's already a 'GitHub downtime' post. How is everybody so fast? :-)


If Github is down, thousands of programmers suddenly have nothing better to do.


Conversely, if Hacker News was down GitHub would suddenly see a spike in traffic :)


I rolled my eyes twice, then merged a few PRs manually and moved on with my day. (i.e. the git server itself and all the APIs required to interact with the CI automation _appear_ to work just fine)


The actual server was broken for me:

  $ git push
  Enumerating objects: 26, done.
  Counting objects: 100% (26/26), done.
  Delta compression using up to 8 threads
  Compressing objects: 100% (15/15), done.
  Writing objects: 100% (15/15), 1.49 KiB | 1.49 MiB/s, done.
  Total 15 (delta 12), reused 0 (delta 0)
  remote: Resolving deltas: 100% (12/12), completed with 10 local objects.
  remote: Internal Server Error
  To git+ssh://github.com/<redacted>/<redacted>
   ! [remote failure]    wip -> wip (remote failed to report status)
  error: failed to push some refs to 'git+ssh://git@github.com/<redacted>/<redacted>'


Fortunately, the first I saw of the outage a few days ago was right around lunchtime.


The tragedy of the cloud.


git is distributed...


git is distributed. The GitHub issue and PR tracker isn't.


Yeah, but people are lazy and GitHub has a user-friendly UI with big green "Merge" buttons.


And CI integration, which runs automated tests, which must pass to release a package/Docker image/whatever, which is required for deploying a new version of your system, etc. The whole thing is as distributed as the business of a guy selling hot dogs on the street.


HN is the next place after Github that a programmer frequently visits ;)


If Github is down, the software developing world comes to a standstill. So people either get a coffee, take a shit or go to YCNews.


Or a combination of all three!


Almost nothing I work on has a dependency on GitHub. Whether work or personal. We have all dependencies vendored at work and my personal stuff is the same.

There really is no good reason to be so dependent on GitHub.


Are they migrating to azure?


I had the same guess... migrating from AWS to Azure and hitting some bumps. I have to assume they won't be very forthcoming about it if that is the reason.


> migrating from aws to azure & hitting some bumps

Could it be the IO? I remember colleagues working on getting stuff running on Azure, and they experienced horrible IO latency, as well as very low throughput for lots of small IO (i.e. Unix-style software).

That was a few years back so it might have improved since, but if those things are still non-optimal and GitHub is built with a Unix-style vision of tons of small IO accesses…

This specific outage did teach me that GitHub apparently stores git repos on disk, which I was not expecting (the API complained it could not delete repos until they'd been fully backed up to disk, or something).


I don't think they mainly run on AWS, but on "bare metal": https://github.com/holman/ama/issues/553


Preparing the infrastructure for Windows open-source event?


Fourth time this month.


It’s quite unfortunate. I’m hoping they post some kind of RCA/analysis.

I know I would if we had a month with only 2 9s of uptime.


they're doing the usual corporate status page uptime lying:

https://i.imgur.com/vy7onDT.png https://www.githubstatus.com/uptime


I don't get it. Their own incidents tab on the same site shows four incidents this month: https://www.githubstatus.com/history

I'm not sure how that results in 99.98% uptime on the other tab.


Turns out it's bad/misleading UI: there's a dropdown for which type of downtime, and it defaults to 'Git Operations'.


Ah, got it. It's still incorrect though. With the incident a couple days ago, I couldn't `git push` for a couple hours.


Yeah, this is just the poor UI of their provider, Atlassian Statuspage.


% uptime is a terrible metric. Being down for an hour in the middle of the day is only 0.14% downtime for the month, but is typically regarded as a big deal. If it happened every single workday you'd still be looking at 97% uptime! Sounds wonderful, right? Just work around it for an hour per day!

Number of outages and total duration are much better metrics.
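
For what it's worth, the arithmetic above checks out. A quick sketch, assuming a 30-day (720-hour) month and ~22 workdays:

  # one 1-hour outage per month vs. one per workday
  awk 'BEGIN { printf "one outage: %.2f%% downtime; daily outages: %.0f%% uptime\n", 100*1/720, 100 - 100*22/720 }'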


There should be enough pixels to show each minute of a day.


I hope they catch up on this. If you guarantee an SLA then you need to be honest about it, or guarantee less.


Looking again, the reporting under 'API Requests' seems accurate, but none of the earlier outages report downtime under 'Git Operations', which was also broken.

I didn't notice there was a selector on that page for which type of downtime, and that it doesn't default to all.


Check the dropdown. Select for example "Issues, PR, Projects". It shows 4 days of issues in Feb.


Time to get on Microsoft's paid plan ...


I've been noticing the issue they're describing for a few days now, with errors in GitHub Actions requiring rebuilds and webhooks not seeming to fire, which caused Jira to go out of sync.


Mssql only once guaranteed delivery in action?


Centralization is great, isn't it?


I think in a few years time we will look back and think

"Waa, that was so archaic the way we used to do things!"


It would be nice if someone wrote a decentralized VCS...


With issues built in like fossil, oh wait...


It gets philosophical as well - if you use a laptop as a lamp in your tent, is that good or bad?


I appreciate the way they keep everyone informed during the downtime. It says a lot about the company.



Well, given the timing of GitHub reliability issues over the last few days, I think we can all agree it has everything to do with dates and timezones?

/s

Appreciate the work.


That explains why my test suite suddenly takes >1h and after I canceled it, the status was green O.o


Looks like they are fine now. Yay!


> We continue to investigate the issues with GitHub services and will shift to a slower update cadence to provide more meaningful updates going forward. Posted 18 minutes ago. Feb 27, 2020 - 16:12 UTC


Statuspage still isn't green https://www.githubstatus.com/


I'm still unable to leave any code reviews, and occasionally I'm getting error pages.


What do you guys recommend as a good way to continue work undisrupted when GitHub goes down? A second remote mirror?


A second mirror doesn't really help - when GitHub goes down, the code is still available locally on your computer. The things that become truly unavailable when GitHub dies are all the non-git features: issues, PRs, etc...

There are several ways to work around this, but none are really satisfying.


Use self-hosted repos like gitlab or fossil, and then mirror the public parts to github.
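
If you do go the mirror route, a minimal sketch of keeping a second remote in sync, in either direction (the remote name and URL are placeholders):

  # add a second remote and push every branch and tag to it
  git remote add mirror git@<other-host>:<user>/<repo>.git
  git push --mirror mirror

Note that --mirror also propagates ref deletions; pushing branches and tags separately (git push mirror --all, then git push mirror --tags) is the gentler option.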


Git or fossil mesh.


>We continue to investigate the issues with GitHub services and will shift to a slower update cadence to provide more meaningful updates going forward.

Translation: our shitty software update practices are now affecting Github, not just Windows!

If anyone from Microsoft is reading this, why is your company so incompetent at software updates in the past few years?


I read it as "we'll stop posting silly updates to this incident until we actually know something", not "we'll stop rolling out updates to our services".

I'd wager that Microsoft practices had nothing to do with this, but I'll wait for the RCA.


You honestly believe that a company the size of GitHub has had their software update practices appreciably change since the acquisition? Relax, GitHub had update issues before and they will have them again.


All of this has happened before and all of this will happen again. So say we all.


Github is part of MS, but as I understand it, is run separately from the rest of Microsoft.


I think that's talking about status update cadence, not software updates.


Is it not ostensibly the same company, just with a new owner...?


Initially, but that is subject to change or more likely evolve over time.



