Whew, glad I decided to scroll HN right now. I've been puzzling over why I'm getting "! [remote rejected] master -> master (Internal Server Error)" as well while trying to push and decided to take a break.
Don't you guys have other features and stuff to work on locally? What is this "time to take a break when GitHub is down"? I'm saying this a bit tongue in cheek btw :)
It's been like that for at least 6 hours, randomly appearing. I would take a pause and try again and then it would work, but now it's definitely much more persistent.
Having been on the receiving end of things like this: please, make the sleep longer. Adding more requests to an already malfunctioning system is not a good way to help fix it.
Thanks but no thanks - no way am I doing anything to my core app repos when the repo host is fritzing out. This is one of those moments to go for a walk (or bed, depending on your timezone).
Also pretty much every usage of -f would be better off being --force-with-lease so you're less likely to accidentally clobber someone else's work. I have git fpush aliased to "push --force-with-lease" and try to spread the gospel when possible :)
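If anyone wants the same setup, it's a one-liner (the alias name is just what I happen to use; call it whatever you like):

git config --global alias.fpush "push --force-with-lease"

After that, `git fpush origin my-branch` will refuse to push if the remote branch has moved since you last fetched it.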
I'm finding that pushes do go through eventually. This is probably grossly irresponsible, so I don't recommend using it, but I remembered I had this old alias to "push harder" in my ~/.gitconfig:
[alias]
    thrust = "!f() { until git push \"$@\"; do sleep 0.5; done; }; f"
I've done a few pushes so far, and found that it's going through in <10 tries or so.
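For what it's worth, it takes the usual push arguments, so a typical run (remote/branch here are just placeholders) is:

git thrust origin master

and it keeps retrying every half second until the push lands.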
# Retries a command with backoff.
#
# The retry count is given by ATTEMPTS (default 100), the
# initial backoff timeout is given by TIMEOUT in seconds
# (default 5.)
#
# Successive backoffs increase the timeout by ~33%.
#
# Beware of set -e killing your whole script!
function try_till_success {
  local max_attempts=${ATTEMPTS-100}
  local timeout=${TIMEOUT-5}
  local attempt=0
  local exitCode=0
  while [[ $attempt -lt $max_attempts ]]
  do
    "$@"
    exitCode=$?
    if [[ $exitCode == 0 ]]
    then
      break
    fi
    echo "Failure! Retrying in $timeout.." 1>&2
    sleep $timeout
    attempt=$(( attempt + 1 ))
    timeout=$(( timeout * 40 / 30 ))
  done
  if [[ $exitCode != 0 ]]
  then
    echo "You've failed me for the last time! ($@)" 1>&2
  fi
  return $exitCode
}
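Usage is just prefixing the flaky command, e.g. (remote/branch names below are placeholders):

# 20 attempts, starting from a 2-second backoff that grows ~33% per retry
ATTEMPTS=20 TIMEOUT=2 try_till_success git push origin master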
It's fine. Maybe it will force them to finally start paying attention to the quality of their work. If the crap I'm writing for a living were misbehaving that frequently, I'd be sweeping the streets by now (or doing some other work that's actually useful to society).
It's OK to be frustrated since we rely on GitHub so much, but this is unkind. Software is complex. GitHub operates at a scale few of us work at. There are people at the other end doing their best traversing complex internal systems (organization and tech).
I would argue GitHub has done more for societal good than most tech ventures, by the way.
I was pretty pissed off, alright, so my comment probably gave off the wrong vibes. I'm not arguing I could do any better (I probably wouldn't get past their interview process), and they certainly do have the talent (which is obvious from their technical blog posts).
It doesn't change the fact that the company has an absolutely crap dev culture which seems to put features first and foremost, at the expense of everything else. There are products with even more complexity that don't fall over and die almost every single day. It's just not funny anymore. Facebook is pretty complex, and it has had major issues like this one, what, once in its entire life?
I don't remember Google Search (or other Google products) ever not answering my queries, and I've been using it for about 18 years.
And so on. I reckon it's because those companies have strong engineering culture (Google certainly does, at least), and this one doesn't.
GitHub actions has been like this for years now. Years. Years!!!!
And the crazy thing is you see people on HN demanding that some one person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough." Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.
> And the crazy thing is you see people on HN demanding that some one person side project/SaaS has to be at 100% uptime with multiple failovers, automatic scaling, etc. etc. There is such an emphasis on scalability on HN and yet... you just brush that all away because "software is tough."
I'm not one of those people. I may have been when I was much more inexperienced.
Software is hard. Full stop. Organizational politics, engineering culture, business / tech alignment are all hard. Distributed systems are hard.
> Yeah, no shit. Poor Github. They are also Microsoft now. One of the wealthiest corporations in the entire world. And people are paying Github. This isn't Twitter fail whale we're talking about.
I may have also thought this when I was much more inexperienced. This isn't a resource problem. Even for a small startup, when it starts having failures due to scale from growth, it's not a money problem. Throwing money at this doesn't make it go away.
By the way, the Twitter fail whale impacted paying customers (advertisers).
It is probably caused by postmortem culture not being shared in the community.
"Having problems" in this world (any kind, not only due to the github scale!) is something that happens - we are not perfect and we work on an incredible amount of layers of complexity.
It is sufficient to actually touch production code on a daily basis to see that it can happen to the best, even with the best observability systems or processes. The key is avoiding blame and iteratively understanding how to fix the underlying problems (faster recovery, detection time, and so on).
You should probably look for a new job then, because it's pretty difficult to get fired for underperformance as a software engineer these days. There are plenty of places you can write shit code, or if you prefer Rust, places where you can blog about other people writing shit code.
Anyway, you shouldn't fire someone for causing bugs in production, since it indicates a systemic failure of all the checks that should come before the bug is deployed. Even if you can trace the root cause to one person, it would be counterproductive to fire them, because now that they've made the mistake, they probably won't make it again. Whereas their replacement doesn't have the same wisdom.
They had massive problems with their main database cluster (MySQL). If you read through their engineering blog, most of the outages were related to their growth and the main database cluster. They moved workloads for some features to different clusters, but that only buys more time. Eventually they'll do proper sharding (by user or org I guess, not by feature), but that takes time.
i've noticed this too .. the real head-scratcher is how a solid chunk of github's db & infra folks left to join a database startup, one of them even becoming its ceo!!
if they had made github db/infra super-stable before this, it would be a vote of confidence in their new company, but instead imho it is the opposite
that's fair. i am just raising an eyebrow to github's apparent lack of sharding, as described in their incident reports -- while these engineers all left to join a db company that focuses specifically on sharding -- it seems like an experience mismatch.
if they were all sharding experts why wasn't github sharded properly. other large mysql shops have solved this, all the way back to the days of yahoo and flickr and livejournal
maybe i shouldn't have mentioned it, i don't want to name names and have this come off as an off-topic attack subthread about a different company, sorry! it's a db company that has raised a lot of money and is mentioned on hn a lot, there are only a handful of these
I have no idea if this is remotely close to reality but, what if, their culture of breaking things and bad uptime is what allowed them to move fast and build a great product in the first place?
GitHub was founded in 2007. They were acquired by MS years ago. They should be well beyond any startup culture of "move fast at the expense of reliability".
I don't disagree with this, they could/should have transitioned already. But for one, cultures are hard/slow to change. And second, as an example, Facebook had the motto "move fast and break things" until 2014, and by that time they also were beyond the startup phase (*), so this kind of culture is not only for early days.
(*) They were founded in 2004, so that's 10 years in. By 2014 they had 800M+ monthly active users and $12 billion in revenue, and they had this culture internally until that point.
Aren't both companies potentially losing money when their products don't work? The fact that it's crucial to businesses seems to be the client's perspective, not the company's perspective. It could also be seen as critical for some businesses to advertise on Facebook. This could call for a different culture internally, but I'm not convinced this is necessarily the case.
This is causing actions jobs to hang after completing, consuming precious minutes. I don't think I've ever seen a refund when this happens, so I recommend everyone check their jobs and cancel them for now.
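If you have the GitHub CLI handy, something along these lines helps find and kill the hung runs (rough sketch; assumes gh is authenticated, and the run ID below is a placeholder):

# list recent workflow runs for the current repo
gh run list
# cancel a specific hung run by its ID
gh run cancel 1234567890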
These incidents have to hurt Azure's brand value. It's a monster task to run something as big as GitHub, if they ever get it stable it will lend a lot of credibility to Microsoft's cloud skills.
There's not really all that much pointing to an infrastructure level failure - it's possible, but it's just as likely it's an application-level failure somewhere in Github's code. The API is returning 500s and not 503s and the failure is relatively quick, so it's not obviously a server outage.
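Easy enough to spot-check from a terminal; an unauthenticated API call that just prints the status code (the endpoint choice here is arbitrary) shows whether you're getting 500s or 503s:

curl -s -o /dev/null -w '%{http_code}\n' https://api.github.com/zen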
It's yellow lights across the board, literally nothing is green. That's usually indicative of some sort of software infrastructure level failure or cascade failure, not an application-level failure, which usually manifests as one or two specific services going down (depending on how you define "infrastructure" and "application" - with IaC, arguably the software-defined infrastructure _is_ an application). I doubt it's a physical hardware issue. It's rarely hardware (except when your DC catches on fire).
No red lights, so it's probably not something catastrophic like that Facebook DNS SNAFU, but it definitely smells infrastructure- or deployment-scoped. Like either a small DNS issue, or some load balancers are sending traffic to servers which cannot handle it programmatically (schema change?), so they are barfing.
Databases, caches, or the authentication service? For me read-only requests are working fine and I've not seen any issues. Submitting new content (e.g. comments) is where it's failing for me. It might be that their database primary is falling over.
1) Is GitHub running under Azure's technology stack?
2) Is GitHub under Azure's management (in contrast to Visual Studio's team)?
I'm not sure about (2), but I'm pretty sure that GitHub doesn't run under Azure at all, considering that GitHub has fully separate networking from MSN's/Azure's (and GitHub's machines do respond to ping, unlike most of Microsoft's machines, which don't).
Huh? As of at least 2017 GitHub was running their own data centers [1]. Any evidence that’s changed? Microsoft bought them in 2018, I can’t imagine they went to AWS after that.
Ah sorry, yeah I wasn't being very accurate.
I was checking "what's on Azure", I didn't really follow up to check in detail the breakdown of the rest. I just saw quite a bit on AWS and assumed it all was.
Github Actions does not have a good track record. https://www.githubstatus.com/history . You don't need a majority of GitHub users to understand it's owned by Microsoft for there to be an impact on brand value.
At least one good thing about GH is that while things break, the status page is updated relatively fast compared to other companies, where all of HN knows about an outage for 1h+ before it's acknowledged.
For example: while actions are down, branches can be merged without ci tests passing, even for protected branches. This just happened on one of my repos.
One of our systems runs an AWS code repository (CodeCommit) in parallel to Github, and builds are triggered from there (but not in us-east-1). Time to migrate the rest of our systems to having that fallback.
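One low-effort way to keep that kind of fallback in sync is giving origin two push URLs, so every push lands in both places. A sketch with made-up repo URLs (the second assumes a CodeCommit repo over SSH):

# keep pushing to GitHub as before
git remote set-url --add --push origin git@github.com:example-org/example-repo.git
# and also push the same refs to the CodeCommit copy
git remote set-url --add --push origin ssh://git-codecommit.eu-west-1.amazonaws.com/v1/repos/example-repo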
It's almost the same time as their incident yesterday too. Although today the scope is wider - yesterday it was Webhooks and Actions. Today core git is broken as well as the APIs.
Yep. I hope they post an AWS-style postmortem… this is kinda ridiculous (although I do empathize as an ops person). Webhooks breaking broke all of our PR bots, bringing development to a standstill yesterday; today everything seems f’d.
Here we go again. GitHub going completely down at least once a month as I said. [0] So nothing has changed. That is excluding the smaller intermittent issues. Let's see if anyone implemented a self-hosted backup or failsafe just in case.
> The entire point of git is that it's decentralized, lol.
No-one here is criticizing git itself. That is not the point.
It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.
The majority of these projects went all in on GitHub, including using GitHub Actions, npm packages, hosting their whole website, etc., hence as soon as it goes down, they can't push or update anything, especially when it's very urgent. It has become a giant single point of failure for nearly everything.
There is a reason why the Linux Kernel, Mozilla, Qt, Chromium, GNOME, ReactOS, etc. self-host their own repositories and have fail-safe repositories if Github goes down and becomes unreliable.
If you're not building some downtime into your model you're not being realistic. It's easy to point fingers, but the reality is every product and company will experience unexpected downtime. It's an easy business decision for executives/buyers: pay a team of top engineers to home-grow a durable product (assuming it can even be done) at extreme cost now and later, or be okay with a couple of hours of downtime here and there at far less cost.
Every single project you listed uses Github as a mirror, meaning when they go down internally, Github is the backup, which from my perspective is a little ironic.
> Every single project you listed uses Github as a mirror, meaning when they go down internally, Github is the backup, which from my perspective is a little ironic.
And? It is a read-only mirror. It just 'pulls' changes from the self-hosted copy. It can't be used for direct development by the maintainers. If the main official repository were on GitHub and that goes down, then everything will be down as well, including issues, pull requests, Actions, etc. Then you will be totally reliant on GitHub to 'fix it'.
There is a reason why those same projects do not use GitHub as their main repository and tell you 'We don't accept issues or patches here'. They have control over their issue trackers, review process, and CIs, and their projects won't halt due to GitHub's unpredictable and intermittent issues.
For those projects, GitHub is only used as a read-only mirror for cloners, but useless for anyone to send patches, track issues, open PRs, etc.; that is all done on their self-hosted repositories, and it has been like that for them for years.
It's a remote origin: once I clone and branch, which I can do from a mirror, I can write and commit as much as I want to the repo; where I push the change up to is ultimately my decision, assuming I have access. The point stands, these companies use Github to act as a mirror/backup for their project in the event of something like a disaster (e.g. datacenter fire).
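Concretely, that workflow is just (hostnames and branch below are made up):

# clone from the GitHub read-only mirror
git clone https://github.com/example/project.git
cd project
# add the project's canonical self-hosted repo and push the work there
git remote add canonical https://git.example.org/project.git
git push canonical my-fix-branch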
There is no perfect solution and there never will be. Everything has an associated cost. You're focused on the distribution of devops tooling, but that is only a fraction of the story. Many large companies have moved to SaaS-based products because they realize doing it themselves comes with significant cost. An hour or two of downtime is cheaper than a datacenter, equipment, bandwidth, licensing, and the expertise to manage all of it.
It's a simple cost-benefit analysis. You need to look at this issue through the lens of a business and not just an engineer, would be my advice. Interestingly enough, you can only point to OSS projects, which rarely pay for tooling anyway.
> It is GitHub that is defeating the whole point of it all, once their hosted central server goes down.
server != service
assuming it's a distributed service vs one server for a multi-billion$ company
also, a group of humans built this service, so it's not gonna be perfect :shrug:
companies that use such tools and entrust all their business processes to a provided service, and consider an event like this a blocker, should build in contingency plans or accept that there is no real 5-nines of availability, more like 90-98%
> assuming it's a distributed service vs one server for a multi-billion$ company also, a group of humans built this service, so it's not gonna be perfect :shrug:
Regardless of any of that, it has still proven to be unreliable. It is also not an excuse to go all in, risk being fully dependent on GitHub (and their services), tolerate such downtimes, and run to HN to complain about it each month.
> companies that use such tools and entrust all their business processes to a provided service, and consider an event like this a blocker, should build in contingency plans or accept that there is no real 5-nines of availability, more like 90-98%
Then I should see no one being surprised or complaining about 'GitHub having issues' or 'GitHub is down again' whilst also using it for GitHub Actions, Pages, issues, or pushing their changes, without paying for GitHub Enterprise or some higher plan; especially serious open source projects like Mozilla, Chromium, etc. That's why they self-host.
Until the next time GitHub goes down again (hopefully that won't be in another month's time).
It's intermittent: I was able to get a push through eventually, and am now stuck trying to convert a draft PR to ready for review. It took many tries just to get it to draft.
I'm probably not helping by repeatedly trying, but I don't want to forget this PR.
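If the web UI keeps timing out, the GitHub CLI can do the conversion, and it composes with the retry function posted elsewhere in the thread (the PR number is a placeholder; assumes gh is set up):

# mark draft PR #123 as ready for review, retrying until it sticks
ATTEMPTS=30 TIMEOUT=10 try_till_success gh pr ready 123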
In Asia I've been having problems for almost 12 hours now (both locally and from our CI/CD which is in a different country). Also had similar problems on Tuesday.