Incident with GitHub Actions, API requests, Codespaces, Git operations, Issues (githubstatus.com)
267 points by naglis on March 17, 2022 | 118 comments



Whew, glad I decided to scroll HN right now. I've been puzzling over why I'm getting "! [remote rejected] master -> master (Internal Server Error)" as well while trying to push and decided to take a break.


Time to grab some coffee and configure Vim


Don't you guys have other features and stuff to work on locally? What is this "time to take a break when GitHub is down"? I'm saying this a bit tongue in cheek btw :)


All my features are part of 1 PR, the PR contains no code to avoid bugs, the features are in my head.


I can’t see any deficiencies in this approach.


But muh plugins are all on Github...


I hope you have a lot of coffee.


It's been like that for at least 6 hours, randomly appearing. I would take a pause and try again and then it would work, but now it's definitely much more persistent.

Guess it's time to go play some video games....

https://xkcd.com/303/


Here you go:

  $ while ! git push my; do sleep 1; done
Works for me eventually, although commits do not appear in the web interface (they do in the actual repository).


Having been on the receiving end of things like this: please, make the sleep longer. Adding more requests to already malfunctioning system is not a good way to help in fixing it.


Thanks but no thanks - no way am I doing anything to my core app repos when the repo host is fritzing out. This is one of those moments to go for a walk (or bed, depending on your timezone).


-f does not sound like a good idea to me in a script like that.


Also pretty much every usage of -f would be better off being --force-with-lease so you're less likely to accidentally clobber someone else's work. I have git fpush aliased to "push --force-with-lease" and try to spread the gospel when possible :)
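
For anyone who wants to try it, a rough sketch of the setup (the branch name below is just an example):

  # add the alias globally, then use it like a normal push
  git config --global alias.fpush 'push --force-with-lease'
  git fpush origin my-branch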


Yeah, I learned it by using magit and VS Code's magit equivalent, and they both default to --force-with-lease.


Good point. I just copy-pasted it from the terminal, as it made sense in my particular situation. I'll remove it.


Also yesterday depending on where you were in the world.


Yep, same here! Good time to make a new coffee :)


Same here, got rejected on push: ! [remote rejected] HEAD -> main (Internal Server Error)


haha I thought I had finally made one too many git commits (I'm an over-committer).


Never really realized that GitHub had this many technical incidents lol


It's the new Hotmail ;)


same here, I was having internet issues yesterday, and now that my internet is working github isn't, haha.


ditto


I'm finding that pushes do go through eventually. This is probably grossly irresponsible, so I don't recommend its use, but I remembered I had this old alias to "push harder" in my ~/.gitconfig:

    [alias]
    thrust = "!f() { until git push $@; do sleep 0.5; done; }; f"
I've done a few pushes so far, and found that it's going through in <10 tries or so.


  # Retries a command with backoff.
  #
  # The retry count is given by ATTEMPTS (default 100), the
  # initial backoff timeout is given by TIMEOUT in seconds
  # (default 5).
  #
  # Successive backoffs increase the timeout by ~33%.
  #
  # Beware of set -e killing your whole script!
  function try_till_success {
    local max_attempts=${ATTEMPTS-100}
    local timeout=${TIMEOUT-5}
    local attempt=0
    local exitCode=0

    while [[ $attempt -lt $max_attempts ]]
    do
      "$@"
      exitCode=$?

      if [[ $exitCode == 0 ]]
      then
        break
      fi

      echo "Failure! Retrying in $timeout.." 1>&2
      sleep $timeout
      attempt=$(( attempt + 1 ))
      timeout=$(( timeout * 40 / 30 ))
    done

    if [[ $exitCode != 0 ]]
    then
      echo "You've failed me for the last time! ($@)" 1>&2
    fi

    return $exitCode
  }
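Used like this, for example (assuming the function is defined in your shell; the numbers, remote, and branch are just examples):

  ATTEMPTS=20 TIMEOUT=2 try_till_success git push origin main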


Add some kind of exponential backoff to be a good citizen!
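
For example, a variant of the alias above that doubles the delay on each retry (untested sketch):

    [alias]
    thrust = "!f() { d=1; until git push $@; do sleep $d; d=$((d*2)); done; }; f"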


>Service degradation

>Time for some manual DoS


TIL about "until" loops! How neat.


half a second? Jesus dude calm down.


The delay makes me think you should use the German word for thrust


It's fine. Maybe it will force them to finally start paying attention to the quality of their work. If the crap I'm writing for a living were misbehaving that frequently, I'd be sweeping the streets by now (or doing some other work that's actually useful to society).


It's OK to be frustrated since we rely on GitHub so much, but this is unkind. Software is complex. GitHub operates at a scale few of us work at. There are people at the other end doing their best traversing complex internal systems (organization and tech).

I would argue GitHub has done more for societal good than most tech ventures, by the way.


I was pretty pissed off, alright, so my comment probably gave off the wrong vibes. I'm not arguing I could do any better (I probably wouldn't get past their interview process), and they certainly do have the talent (which is obvious from their technical blog posts).

It doesn't change the fact that the company has an absolutely crap dev culture which seems to put features first and foremost, at the expense of everything else. There are products with even more complexity that don't fall over and die almost every single day. It's just not funny anymore. Facebook is pretty complex; it has had major issues like this one, what, once in its entire life?

I don't remember Google Search (or other Google products) ever not answering my queries, and I've been using it for about 18 years.

And so on. I reckon it's because those companies have strong engineering culture (Google certainly does, at least), and this one doesn't.


GitHub actions has been like this for years now. Years. Years!!!!

And the crazy thing is you see people on HN demanding that some one-person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough." Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.


> And the crazy thing is you see people on HN demanding that some one-person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough."

I'm not one of those people. I may have been when I was much more inexperienced.

Software is hard. Full stop. Organizational politics, engineering culture, business / tech alignment are all hard. Distributed systems are hard.

> Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.

I may have also thought this when I was much more inexperienced. This isn't a resource problem. Even at a small startup, when they start having failures due to growth in scale, it's not a money problem. Throwing money at this doesn't make it go away.

By the way, the Twitter fail whale impacted paying customers (advertisers).


That’s because GitHub Actions is Azure DevOps, or if you want to go back further, Team Foundation Server Pipelines.


People tend not to be very kind when any product they pay for goes down.

At the end of the day, our companies also have people who rely on our software working in order to do a lot of societal good.


Sure, but it’s incredibly naive to see gh having problems and go “they must not know what they are doing”


It is probably caused by postmortem culture not being shared in the community.

"Having problems" in this world (any kind, not only due to the github scale!) is something that happens - we are not perfect and we work on an incredible amount of layers of complexity.

It is enough to touch production code on a daily basis to see that it can happen to the best, even with the best observability systems or processes. The key is avoiding blame and iteratively understanding how to fix the underlying problems (faster recovery, shorter detection time, and so on).


Everybody should be refunded $0.05 for the unavailability of the service they paid for.


You should probably look for a new job then, because it's pretty difficult to get fired for underperformance as a software engineer these days. There are plenty of places you can write shit code, or if you prefer Rust, places where you can blog about other people writing shit code.

Anyway, you shouldn't fire someone for causing bugs in production since it indicates a systemic failure of all the checks that should come before the bug is deployed. Even if you can trace the root cause to one person, it would be counterproductive to fire them, because now that they've made the mistake, they probably won't make it again, whereas their replacement doesn't have the same wisdom.


Does anybody else remember when GitHub's outage page used to have little graphs showing downtime?

Eventually they took it down as their outages were just too frequent.

GitHub has _always_ had terrible uptime. It's a great product - wish something would change but it seems cultural at this point.


They had massive problems with their main database cluster (MySQL). If you read through their engineering blog, most of the outages were related to their growth and the main database cluster. They moved workloads for some features to different clusters, but that only buys more time. Eventually they'll do proper sharding (by user or org I guess, not by feature), but that takes time.

Their engineering blog is full of articles about MySQL and the main "mysql1" database cluster, e.g. https://github.blog/2021-09-27-partitioning-githubs-relation...


i've noticed this too .. the real head-scratcher is how a solid chunk of github's db & infra folks left to join a database startup, one of them even becoming its ceo!!

if they had made github db/infra super-stable before this, it would be a vote of confidence in their new company, but instead imho it is the opposite


DB and infra folks are often tasked with shoveling shit uphill, and aren't in total control over how data or schemas get organized.


that's fair. i am just raising an eyebrow to github's apparent lack of sharding, as described in their incident reports -- while these engineers all left to join a db company that focuses specifically on sharding -- it seems like an experience mismatch.

if they were all sharding experts, why wasn't github sharded properly? other large mysql shops have solved this, all the way back to the days of yahoo and flickr and livejournal


Which one are you referring to?


maybe i shouldn't have mentioned it, i don't want to name names and have this come off as an off-topic attack subthread about a different company, sorry! it's a db company that has raised a lot of money and is mentioned on hn a lot, there are only a handful of these


my guess is:

    rot13 cynargfpnyr


I have no idea if this is remotely close to reality but, what if, their culture of breaking things and bad uptime is what allowed them to move fast and build a great product in the first place?


GitHub was founded in 2007. They were acquired by MS years ago. They should be well beyond any startup culture of "move fast at the expense of reliability".


I don't disagree with this, they could/should have transitioned already. But for one, cultures are hard/slow to change. And second, as an example, Facebook had the motto "move fast and break things" until 2014, and by that time they also were beyond the startup phase (*), so this kind of culture is not only for early days.

(*) They were founded in 2004, that's 10 years in. By that time in 2014 they had 800M+ monthly active users and $12 Billion revenue; and they had this culture internally until this point.


Facebook is a social media app that hardly anyone (except for advertisers) pays for.

GitHub is an enterprise product crucial to tons of businesses.

Cultural comparisons between the two really shouldn't apply.


Aren't both companies potentially losing money when their products don't work? The fact that it's crucial to businesses seems to be the client perspective, not the company perspective. It could also be seen as critical for some businesses to advertise on Facebook. This could call for a different culture internally but I'm not convinced this is necessarily the case.


Whew, outage timestamps in UTC.

Now I won't have to know what time it is in California, and whether California currently has PST, PDT, PTSD, etc.


As someone with diagnosed PTSD, I never thought I'd psychologically level with an entire state ;)


To anyone who is reading this and genuinely wants to know: it's PDT, UTC-7.
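
If you ever do need the conversion, GNU date will do it for you (the timestamp below is made up):

  # render a UTC timestamp in California time
  TZ=America/Los_Angeles date -d '2022-03-17 20:00 UTC'
  # or just check the current time in UTC
  date -u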


This is causing Actions jobs to hang after completing, consuming precious minutes. I don't think I've ever seen a refund when this happens, so I recommend everyone check their jobs and cancel them for now.


Two days they have been down now. Github has, by far, the worst uptime of any critical service I've seen going on multiple years now.


The github.com homepage and the API (via `gh`) are not working for me either.


Their status page is reflecting the new outages. Good on GitHub for actually updating that quickly.


We've been experiencing problems here in Asia for almost 12 hours now, and it's been "all green" the whole time.


It's a shame they're not open about the extent. Sign in/out hitting a 500 internal error isn't really "degraded"


> The github.com homepage

Only while logged in, it seems.


These incidents have to hurt Azure's brand value. It's a monster task to run something as big as GitHub; if they ever get it stable, it will lend a lot of credibility to Microsoft's cloud skills.


There's not really all that much pointing to an infrastructure-level failure - it's possible, but it's just as likely it's an application-level failure somewhere in Github's code. The API is returning 500s and not 503s and the failure is relatively quick, so it's not obviously a server outage.


It's yellow lights across the board; literally nothing is green. That's usually indicative of some sort of software infrastructure-level failure or cascade failure, not an application-level failure, which usually manifests as one or two specific services going down (depending on how you define "infrastructure" and "application" - with IaC, arguably the software-defined infrastructure _is_ an application). I doubt it's a physical hardware issue. It's rarely hardware (except when your DC catches on fire).

No red lights, so it's probably not something catastrophic like that Facebook DNS SNAFU, but it definitely smells infrastructure- or deployment-scoped. Like either a small DNS issue, or some load balancers sending traffic to servers which cannot handle it programmatically (schema change?), so they are barfing.


Only a load balancer (as infrastructure) can hit all the lights across the board. Not much else.


Databases, Caches or the authentication service? For me read-only requests are working fine and I've not seen any issues. Submitting new contents (e.g. comments) is where it's failing for me. It might be that their database primary is falling over.


Serious questions:

1) Is GitHub running under Azure's technology stack?

2) Is GitHub under Azure's management (in contrast to Visual Studio's team)?

I'm not sure about (2), but I'm pretty sure that GitHub doesn't run under Azure at all, considering that GitHub has fully separate networking from MSN's/Azure's (and GitHub's machines do answer pings, unlike most of Microsoft's machines, which don't).


The last time I checked, the only meaningful part of GitHub that ran on Azure was/is Actions. Everything else is AWS.


> Everything else is AWS.

Huh? As of at least 2017 GitHub was running their own data centers [1]. Any evidence that’s changed? Microsoft bought them in 2018, I can’t imagine they went to AWS after that.

[1]: https://github.blog/2017-10-12-evolution-of-our-data-centers...


Ah sorry, yeah I wasn't being very accurate. I was checking "what's on Azure" and didn't really follow up to check the breakdown of the rest in detail. I just saw quite a bit on AWS and assumed it all was.


Believe they were multi-cloud, definitely had stuff on AWS


GitHub is pretty stable. What are you talking about? I doubt most GitHub users know it's on Azure.


Github Actions does not have a good track record: https://www.githubstatus.com/history. You don't need a majority of GitHub users to understand it's owned by Microsoft for there to be an impact on brand value.


I don't consider this a reflection on Azure at all. It's really just a reflection on GitHub under Microsoft's leadership.


Eh, I'm no Microsoft fan but it used to have issues before the acquisition too. I can't really remember if it was better or worse.


At least one good thing about GH is that while things break, the status page is updated relatively fast compared to other companies, where all of HN knows about an outage for an hour or more before it's acknowledged.


And of course my developer teammates are still trying to merge PRs.

I don't care that it works "some of the time"! Don't mess with the repos when the repo host is having seemingly random issues.


For example: while Actions is down, branches can be merged without CI tests passing, even for protected branches. This just happened on one of my repos.


One of our systems runs an AWS code repository in parallel to Github, and builds are triggered from there (but not in us-east-1). Time to migrate the rest of our systems to having that fallback.


It's almost the same time as their incident yesterday too. Although today the scope is wider - yesterday it was Webhooks and Actions. Today core git is broken as well as the APIs.


Yep. I hope they post an AWS-style postmortem… this is kinda ridiculous (although I do empathize as an ops person). Webhooks breaking broke all of our PR bots, bringing development to a standstill yesterday; today everything seems f’d.


Looks like the drinking started early at GitHub... good on them!


It’s not DNS

There’s no way it’s DNS

It was DNS


Here we go again. GitHub going completely down at least once a month as I said. [0] So nothing has changed. That is excluding the smaller intermittent issues. Let's see if anyone implemented a self-hosted backup or failsafe just in case.

Oh dear.

[0] https://news.ycombinator.com/item?id=30149071


The entire point of git is that it's decentralized, lol. If I've cloned locally like millions of people do daily, I have a backup.


> The entire point of git is that it's decentralized, lol.

No-one here is criticizing git itself. That is not the point.

It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.

The majority of these projects went all in on GitHub, including using GitHub Actions, npm packages, hosting their whole website, etc., hence as soon as it goes down they can't push or update anything, even when it's very urgent. It has become a giant single point of failure for nearly everything.

There is a reason why the Linux Kernel, Mozilla, Qt, Chromium, GNOME, ReactOS, etc. self-host their own repositories and have fail-safe repositories if GitHub goes down or becomes unreliable.


If you're not building some downtime into your model you're not being realistic. It's easy to point fingers, but the reality is every product and company will experience unexpected downtime. It's an easy business decision for executives/buyers: pay a team of top engineers to home-grow a durable product, assuming it can even be done, at extreme cost now and later, or be okay with a couple of hours of downtime here and there at far less cost.

Every single project you listed uses Github as a mirror, meaning when they go down internally, Github is the backup, which from my perspective is a little ironic.


> Every single project you listed uses Github as a mirror, meaning when they go down internally, Github is the backup, which from my perspective is a little ironic.

And? It is a read-only mirror. It just 'pulls' changes from the self-hosted copy. It can't be used for direct development by the maintainers. If the main official repository were on GitHub and that went down, then everything would be down as well (issues, pull requests, actions, etc.). Then you would be totally reliant on GitHub to fix it.

There is a reason why those same projects do not use GitHub as their main repository and tell you 'We don't accept issues or patches here'. They have control over their issue trackers, review process, and CI, and their projects won't halt due to GitHub's unpredictable and intermittent issues.

For those projects, GitHub is only used as a read-only mirror for cloners, but it is useless for anyone to send patches, track issues, open PRs, etc.; that is done on their self-hosted repositories, and it has been like that for them for years.


It's a remote origin: once I clone and branch, which I can do from a mirror, I can write and commit as much as I want to the repo; where I push the change up to is ultimately my decision, assuming I have access. The point stands: these companies use Github to act as a mirror/backup for their project in the event of something like a disaster (e.g. a datacenter fire).

There is no perfect solution and there never will be. Everything has an associated cost. You're focused on the distribution of devops tooling, but that is only a fraction of the story. Many large companies have moved to SaaS-based products because they realize doing it themselves comes with significant cost. An hour or two of downtime is cheaper than a datacenter, equipment, bandwidth, licensing, and the expertise to manage all of it.

It's a simple cost-benefit analysis. My advice would be to look at this issue through the lens of a business and not just an engineer. Interestingly enough, you can only point to OSS projects, which rarely pay for tooling anyway.


> It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.

server != service

Assuming it's a distributed service vs. one server for a multi-billion-dollar company. Also, a group of humans built this service, so it's not gonna be perfect :shrug:

Companies that use such tools and entrust all their business processes to a provided service, and consider an event like this a blocker, should build in contingency plans or accept that there is no real five-nines of availability; it's more like 90-98%.


> Assuming it's a distributed service vs. one server for a multi-billion-dollar company. Also, a group of humans built this service, so it's not gonna be perfect :shrug:

Regardless of any of that, it has still proven to be unreliable. It is also not an excuse to go all in and risk being fully dependent on GitHub (and their services), tolerate such downtimes, and run to HN to complain about it each month.

> Companies that use such tools and entrust all their business processes to a provided service, and consider an event like this a blocker, should build in contingency plans or accept that there is no real five-nines of availability; it's more like 90-98%.

Then I should see no-one being surprised or complaining about 'GitHub having issues' or 'GitHub is down again' whilst also using it for GitHub Actions, Pages, Issues, or pushing their changes, without paying for GitHub Enterprise or some higher plan; especially serious open source projects like Mozilla, Chromium, etc. That's why they self-host.

Until the next time GitHub goes down again (hopefully that won't be in another month's time).


> Then I should see no-one being surprised or complaining

Oh agree 100%, this is the equivalent of the "reply-all email threads" and people responding to be remove or stop. I find it entertaining overall.

> Until the next time GitHub goes down again

Cheers


Good point! This would have been a bigger issue back in the days of cvs and svn.


At some point the GitHub main page 500'ed for me. The problem is probably somewhere deep in the core, not something isolated.


This is why you should have your code on multiple remotes, e.g. Azure DevOps, GitLab, or a self-hosted Git server.
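
A low-effort way to get that is extra push URLs on one remote, so a single git push updates every host (a sketch; the repo URLs are just examples):

  # origin keeps its fetch URL, but pushes now go to both hosts
  git remote set-url --add --push origin git@github.com:me/repo.git
  git remote set-url --add --push origin git@gitlab.com:me/repo.git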


Can't push changes at the moment.


Pushing to repos is also not working


same here


I downloaded a GitHub repo from Software Heritage [0]. I searched and found the repo was in the archive. Software Heritage saved my day.

[0]: http://archive.softwareheritage.org/


It's intermittent, I was able to get a push through eventually, and am now hung trying to convert a draft PR to ready for review. It took many tries to get to draft.

I'm probably not helping by repeatedly trying, but I don't want to forget this PR.

Yay it finally went through.


I'm able to occasionally push commits, but PRs aren't picking up the update or rerunning CI


Why is GitHub having so many issues recently? Do you think it's due to the recent events?


Do they regularly publish post-mortems after their repeated incidents? Might be interesting...


I think they usually do, especially for the hairy issues.


In Asia I've been having problems for almost 12 hours now (both locally and from our CI/CD which is in a different country). Also had similar problems on Tuesday.


Wow, suddenly staying on-prem with old rusty Jenkins is not so bad. (It has its issues, but at least I had better service levels in the last 12 months.)


You have to use Jenkins though.


Ah, so this is the reason for the mystery failure I encountered with GitHub Actions. My job just failed without emitting a single error message.


They just had a (smaller) outage yesterday. At first I thought it was yesterday's incident finally getting enough points on HN.


Pull review comments and approvals as well


I'm unable to even sign out. It gives me a 500 and then drops me right back at the homepage on a refresh.


ZenHub appears to be having issues as well (can't load tickets at all), due to their GitHub integration I assume.


y’all do know Git is a distributed VCS, right? it’s ok for the remote to be offline.


I can't even comment on issues...


do they publish postmortems? gist.github.com was down too for some time



