GitHub Actions has a lot of basic usability issues, none of which are fatal but all of which irritate me on a daily basis.
Let's start with the first and simplest: Why did my build fail? You'd think this should be front and center. Yet, the UX is "click through a couple links, then wade through thousands of lines of log output". In practice this is "download the logs and grep them for text strings like FAILURE".
The only way we've made this livable is by writing action code that sends test failures to a slack channel, and more or less ignoring the GitHub UI.
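For reference, the step we use is nothing fancier than this sketch (the webhook secret name and the message format are just illustrative, not anything GitHub provides):

    - name: Notify Slack on failure
      if: failure()   # only runs when an earlier step in the job failed
      run: |
        curl -sS -X POST -H 'Content-Type: application/json' \
          -d "{\"text\": \"CI failed: ${{ github.repository }} run ${{ github.run_id }}\"}" \
          "${{ secrets.SLACK_WEBHOOK_URL }}"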
Second usability issue: Restarting a build. Happens all the time because of some flakey third-party service. It's pretty much my primary interaction with the GitHub Actions UI.
1. Click on build name
2. Click on "Re-run jobs" which brings down a dropdown menu
3. Click on "Re-run failed jobs" menu item
4. Click on "Re-run jobs" button on the confirm dialog
Seriously, after five times going through this flow I'm screaming for a JUST RERUN THE FUCKING JOBS ALREADY button. What's annoying is that this used to be better - the dropdown and confirm dialog are recent additions. The UX is degenerating.
The irony is that we migrated from CircleCI because they were screwing up their UX, and we figured GitHub would make something better... sigh.
This is why UI is hard. In a parallel universe they made the re-run jobs button work the way you want, and there's some person at the top of the Hacker News comments in that universe complaining that it's too hard to tell what the buttons do, that only the common path of re-running failed builds is easy, and that once you need to do something uncommon it becomes irritatingly complex.
Is it though? I mean, pretty much all mainstream CI/CD services get this right out of the box. GitLab does it so well that it's totally transparent and never an issue.
How did GitHub Actions, which came after all of its alternatives, manage to turn a solved problem into a constant source of headaches?
I dunno. In my experience UI is hard because some PM decided to start switching everything around, citing some engagement metric that seems to go up, even though it's obvious to everybody that it's a bad idea at a human level.
I've started building auto retries into the CI scripts for this type of thing, at least once it annoys me enough. The most glaring and unavoidable offenders that come to mind, across several projects, are external certificate timestamping services.
I generally hate this type of thing: retry mechanisms are a lazy band-aid that can mask real problems. But at the same time, it just isn't possible to get to 100% and retry is a better band-aid than constant human intervention.
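In practice the auto retry is nothing fancier than a shell loop around the flaky step, roughly like this (the timestamping script name is just a stand-in for whatever the flaky call is):

    - name: Sign and timestamp (with retries)
      shell: bash
      run: |
        # retry the flaky external call a few times before giving up
        for attempt in 1 2 3; do
          ./sign-and-timestamp.sh && exit 0   # stand-in for the real command
          echo "attempt $attempt failed, sleeping before retry..."
          sleep $((attempt * 10))
        done
        exit 1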
> I generally hate this type of thing: retry mechanisms are a lazy band-aid that can mask real problems.
I disagree. Retry mechanisms are mandatory for any network request. You cannot expect a message to travel across the globe on a wire without ever failing.
I would say the opposite: you must have a retry mechanism for all requests performed by your application, unless it's acceptable to pass the failure on to the client (who will then retry the request).
> I disagree. Retry mechanisms are mandatory for any network request. You cannot expect a message to travel across the globe on a wire without ever failing.
Actually good point, and I agree. I didn't state that well. What I was thinking of was some other things in builds I've seen retries be the band-aid for: flaky, timing-based unit tests; locked files/directories; or applying database migrations. My usual reaction to adding a retry to a build step is "no, let's fix the actual problem". Network requests to external services are a special case though, and absolutely should always have retries.
> I generally hate this type of thing: retry mechanisms are a lazy band-aid that can mask real problems.
They can, but beyond the obvious old "it broke once it deployed" pattern, nine times out of ten retries only sidestep basic nuisances from transient errors, such as a request timing out.
FWIW, while the UI for re-running jobs used to be better, the result was a LOT worse as it would re-run all the jobs instead of just the failing ones, so I am still MUCH happier with the new flow ;P.
I feel like those are not mutually exclusive. The button could say "Re-run failed jobs" with no confirmation, with a small ▼ next to it that opens the "re-run all jobs" dropdown.
For the majority of my pipelines I re-run from the start. The pipeline itself is idempotent, I don't care about each step, and I'd much rather have a clean run than a fast run.
If you're already putting your own frontend on it in Slack, you could add a 're-run' button to that failure notification that hits the API for that? (I assume there is one.)
There's a weird DSL for buttons/layout in Slack, but it works, and obviously it's pretty easily testable, just fire the examples at your own endpoint and tweak it until it does what you want.
My own main frustration with Actions is just that the docs are bad, making it hard to configure.
My favorite issue is that when the service is down it just doesn't run jobs or notify in any way that they are not being run. No "error" state in the PR view or anything; let's just skip tests.
I've used GitHub Actions quite extensively now, across infrastructure automation, Python CI/CD, and iOS CI/CD, and while not perfect, it's the best platform I've used for this stuff so far.
Compared to Jenkins it needed far less maintenance. Compared to CircleCI it felt much easier to work with and to build reliable pipelines due to the locking primitives it provides, and compared to Semaphore I found it easier to understand how the pieces fit together.
My criticism would mostly be about missing features, but the pace of development has been great and the only one I have left on my list is SSH debugging, which didn't end up being much of a blocker to our adoption anyway.
As for reliability, nothing is perfect, but in my experience it's at-or-above the level that CircleCI provided, and far surpassed our in-house Jenkins server.
We use GitLab CI for internal code and GitHub Actions for public code, and are generally happy with both.
GitHub Actions emphasizes reusability and composability around “actions” as installable packages suitable for a “marketplace” model. This works well, but once you deviate from a standard action, changing its behavior requires learning (or re-learning every six months) the GitHub Actions DSL to make it do what you want. This can be frustrating when it feels like you need to model your task to fit within an abstraction that didn’t help you in the first place.
GitLab CI feels much more “raw” and closer to the metal – there is less emphasis on packaging code into reusable workflows, and more emphasis on the practicalities of operating robust CI pipelines. Yes, you could have a different runner for every task, and “package” CI code into Docker images similarly to how GitHub Actions packages it into workflows. But the service isn’t designed around that idea. And for internal pipelines, that feels like a needless abstraction. It’s nice to keep pipelines simple and robust.
Put another way: there is much less “magic” in GitLab CI than in GitHub Actions. In many ways GitHub Actions feels more like a toy than a tool. This is generally a good thing, IMO, and overall I’d say the tradeoffs are worth it for both services, especially given the different contexts (internal vs. public code) in which we use them.
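To make the contrast concrete, here's roughly the same "install Node and run tests" job in each (versions and commands purely illustrative):

    # GitHub Actions: the behaviour lives inside packaged actions
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 16
      - run: npm ci && npm test

    # GitLab CI: you pick an image and write the commands yourself
    test:
      image: node:16
      script:
        - npm ci
        - npm test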
I forgot to mention GitLab CI, but in my experience, much like Travis CI, it is, as you say, lower level. I do not find that to be a good thing.
Unlike with code, where lower level generally means faster, with CI we generally want to raise the abstraction level so that it's easier to make use of the things that make CI run faster – parallelism in workflows, caching, etc.
As for reliability, I also found GitLab CI lacked the concurrency-control and caching primitives necessary to trust continuous delivery pipelines.
I found you could either write a slow correct pipeline, or a faster but fundamentally unsafe one.
Thanks for the feedback! I'm one of the PMs for GitHub Actions, and I appreciate this. Thinking about Actions as a set of primitives that you can compose is very much how I think about the product (and I think the other PMs as well) so I'm glad that resonates.
We always welcome feedback, and we're continuing to invest in and improve the product, so I'm hopeful that we can address the features that you're missing.
* Setting up GHA is still a lot of "commit and hope for the best". I've resorted to having a sandbox repo just for experimentation/testing so that I don't overly pollute repos that I actually care about. It would be great to get more instrumentation to see what is going on.
* I have a monorepo for Dockerfiles. It's quite annoying that I have to have separate entries for different Dockerfiles in dependabot.yml (a sketch of the current shape follows this list). I should be able to specify /Dockerfile or /Dockerfile* as patterns for detection. The Dependabot entry for GitHub Actions is a single entry, and it would be great to have the same here.
* I quite like Step Security's Harden Runner but it does require more work/invocations to get this set up. Maybe GH can work with them to more closely incorporate said functionality?
* Make the cache bigger? I build a fair number of multi-arch containers and starting all of them at once tends to blow out the cache.
* Given the interest around sigstore and SBOMs, maybe incorporate native capabilities to sign artifacts and generate SBOMs?
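Regarding the dependabot.yml point above, the current shape is one block per directory, something like this (directory paths made up):

    version: 2
    updates:
      - package-ecosystem: "docker"
        directory: "/images/app"        # one entry...
        schedule:
          interval: "weekly"
      - package-ecosystem: "docker"
        directory: "/images/worker"     # ...per Dockerfile directory
        schedule:
          interval: "weekly"
      - package-ecosystem: "github-actions"
        directory: "/"                  # whereas workflows need only this single entry
        schedule:
          interval: "weekly"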
Thanks. The "commit and hope for the best" problem really resonates with me. There are two great projects that might provide some pain relief - nektos/act or rhysd/actionlint. But I agree that commit-to-validate is probably the best strategy at the moment, which is deeply unfortunate. This is an area that I intend to improve in the future.
As for the cache, we doubled it at the end of last year to 10GB. https://github.blog/changelog/2021-11-23-github-actions-cach..., but I can see how multi-arch images would be very large. Have you considered putting images into GitHub Container Registry instead of putting the layers into the cache? I'd love to understand if that is appropriate for your workflow, and if not, what the limitation there is.
Appreciate the rest of the feedback, I'll pass it along to the appropriate teams.
> Setting up GHA is still a lot of "commit and hope for the best". I've resorted to having a sandbox repo just for experimentation/testing so that I don't overly pollute repos that I actually care about. It would be great to get more instrumentation to see what is going on.
There is act[0], which aims to let you run GitHub Actions locally via Docker. It isn't perfect, but it does a decent job of it, and for the most part your pipeline can be run locally.
After MS bought GH, I had hopes that they would build a tool to run actions locally, but nothing yet.
I've had no luck reproducing Actions problems with act, and the rest of the time I seem to have problems in act that I don't have in Actions.
I like the idea and also would like something first-party, but I imagine it's hard and GitHub would want it to be less buggy than act is, and maybe they're trying but it's not there.
Tbh, even if it ran remotely in actual Actions but just didn't show up in the repo UI and logged locally, that would be fine?
From my perspective, GHA is missing two things compared to CircleCI: a way to pause a workflow for approval, and a way to pull artifacts from other workflows. Both of these are _possible_ with an external service but painful to set up. I want to: create a terraform plan, approve it, and then deploy the specifically approved plan. That's not so difficult in CircleCI but is _painful_ in GHA.
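(The usual workaround I've seen is an environment with required reviewers gating the apply job, roughly the sketch below, with made-up environment name and terraform commands, but wiring the plan artifact through and keeping the approval meaningful across PRs is exactly where it gets painful.)

    jobs:
      plan:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - run: terraform plan -out=tfplan
          - uses: actions/upload-artifact@v3
            with:
              name: tfplan
              path: tfplan
      apply:
        needs: plan
        runs-on: ubuntu-latest
        environment: production   # required reviewers on this environment pause the workflow here
        steps:
          - uses: actions/checkout@v3
          - uses: actions/download-artifact@v3
            with:
              name: tfplan
          - run: terraform apply tfplan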
Sure, but that's actually worse than useless for my use case. Imagine this: you have an action that publishes your plan to your PR (#1 - it's a biggish feature). It gets merged and goes to approval. Then people happen. PR #2 is addressing a customer-facing bug, so it gets fast-tracked and rammed through before PR #1. Suddenly PR #1 is silently invalid. It _should_ be rejected at this point, but the whole point of CI/CD is to save time and reduce the surface area for human mistakes.
Specifically for your terraform example, wouldn't it make more sense to have the PR merged only when the apply was successful?
I'm not sure how well that can be represented in GH Actions, but that would surely be the better option?
You'll always risk some kind of race condition there. E.g. Atlantis locks the project while something is planned but not applied, to avoid exactly that from happening. This of course prevents having multiple PRs "ready" at the same time; you'd have to unlock the active PR's lock to be able to work on another one.
This still can't use GHA to enforce any sort of integrity, so it's kinda moot. I have some of my projects set up to deploy with CircleCI... which can give me the build, approve, apply (specifically the thing you approved) chain that I'm looking for (so there's no race condition). "Why not use CircleCI?" Well, I do, but if my company decides to cut costs, it may not survive the chopping block... so I'm looking at other options.
That's very much how I think about it too, which is why it frustrates me that I can't create canned workflows that apply to all my repos of a certain type (language-specific linting and releasing, say).
I know I can create user/organisation templates, but all that does is put them in the UI chooser, which creates a commit to put the workflow in the repo from the web. I want to be able to do something like `include: OJFord/workflows/terraform-provider.yml` or `include: OJFord/workflows/rust.yml`.
Perhaps even better would be if I didn't even have to specify that in the repo, and they just applied automatically to any repo matching a given pattern - named `terraform-provider-*`, or having a `Cargo.toml` file, say - but I realise that's probably too big a deviation from the way Actions works at this point.
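The closest thing I'm aware of today is a reusable workflow called from a per-repo stub, which still means committing a file to every repo, roughly like this (assuming the reusable workflow lives at .github/workflows/rust.yml in that repo and declares `on: workflow_call`):

    # .github/workflows/ci.yml – still has to exist in every repo
    on: [push, pull_request]
    jobs:
      rust:
        uses: OJFord/workflows/.github/workflows/rust.yml@main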
Interesting idea! Even if Actions will follow a submodule, though (I doubt it, tbh; it happens before any actions run, of course, so we have no control over that), you can't point it "at trunk" as far as I know; submodules are always pinned to a specific commit.
(E.g. if you `git submodule foreach git checkout master`, your diff, if you have one, will be updating commit hashes, not `-whatever +master`. This is good for a lot of other reasons but doesn't help here.)
It's definitely more pleasant to use than Jenkins, but Jenkins is as painful as it gets.
90% of the time it's some ancient setup that nobody knows how it works. And the amount of hands-on work required to keep it working is waaaay higher than with other CI pipelines.
Circle CI is also not very nice at all. Things like GitLab CI or Sourcehut Builds are sooo much simpler, lighter and faster in comparison.
Jenkins sucks from a UI perspective, and from a maintenance perspective, but it's very powerful in ways that most other CIs are not.
The classic example for us was concurrency control. We did continuous delivery, and so it's critical that you're not trying to release the `master` branch multiple times concurrently, especially when database migrations might happen, etc.
With Jenkins you could quite easily say "only 1 of this job at a time", but this wasn't (at the time) a feature of CircleCI, GitLab, or Travis. GitHub Actions added it about 2 years ago I think, and Semaphore has had something similar for a while.
While I agree Jenkins is painful, having this feature and the reliability it brought reduced pain significantly. CircleCI not having this feature, for us, caused a lot of pain.
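For anyone who hasn't used it, the Actions version of this is the `concurrency` key, something like the following (group name illustrative):

    concurrency:
      group: deploy-master            # only one run in this group at a time
      cancel-in-progress: false       # queue new runs instead of cancelling the in-flight deploy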
> My criticism would mostly be about missing features,
My only real requested feature is one that's been open since early 2020 (this issue replaced the old 2020 issue) and has the status "Status: Q1 2022 – Jan-Mar" https://github.com/github/roadmap/issues/161
Without this, I'm relying on Azure actions to start/stop+deallocate a self-hosted runner, which adds ~1 minute to the job startup time.
Have you seen https://github.com/marketplace/actions/debugging-with-ssh? (Maybe this only works with public Github - that's all I use). I add an `if: ${{ failure() }}` to only spawn the SSH when my job fails and use that to debug CI issues.
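For the shape of it, a similar SSH-debugging action I know works the same way is the tmate-based one; swap in the slug from the linked marketplace page if you prefer that one:

    - name: Debug over SSH when the job fails
      if: ${{ failure() }}               # only start the session on failure
      uses: mxschmitt/action-tmate@v3    # similar action; substitute the one from the marketplace link
      timeout-minutes: 30                # don't leave the runner hanging indefinitely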
I've been capturing a list of GitHub Actions gotchas:
1. If you reference an environment in a build step, this causes an `on: deployment` event trigger. This is not intuitive.
2. Environments with deployment protection rules are just for branches. Using tagged releases for a production environment means we must allow all branches to be deployed to that environment.
3. Workflows are unable to start other workflows, to prevent cyclic actions. Understandable, but there are exceptions, e.g. (1) above.
4. Pull requests raised by Dependabot will not have the `id-token: write` permission needed to create an OIDC ID token that we use to access external systems (the benefit of then not having to manage secrets). Note: Using `on: pull_request_target` will be granted the permission but then you're executing the `main` branch and not building/testing the Dependabot changes.
Then we have secrets: Organisation, Repository, Environment and Dependabot.
Anything in Organisation, Repository or Dependabot should really carry at most read-only permissions. Environment-based secrets may have write permissions, but be wary of the malicious PR that then references this environment: what protections are in place? Environment approval? Deployment branch protection?
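For context on (4), the permission in question is the per-job block below; on Dependabot-raised PRs the token is downgraded, so the OIDC exchange fails:

    permissions:
      id-token: write   # needed to mint the OIDC token for external systems
      contents: read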
GitHub Actions folks, if you're listening, please address the lack of an "Allow Failure" modifier for job runs. There is a reason every other CI on the planet has this, and no, it's not the same as "Continue on Error": https://github.com/actions/toolkit/issues/399 (currently 800+ upvotes and counting)
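For anyone unfamiliar, the sketch below is the closest you can get today, and it's not the same thing: the run stays green and the failure is effectively hidden, rather than being surfaced as "failed but allowed" (job name and command are made up):

    jobs:
      flaky-integration-tests:
        runs-on: ubuntu-latest
        continue-on-error: true          # workflow stays green, but the failure is hidden
        steps:
          - run: ./run-integration-tests.sh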
Hot take: they don't use their security alerts to manage dependency vulnerabilities.
Sorry, taking this off on a tangent, but the security alerts feature as integrated into the product is WAY less usable than other features, such as Actions.
Just off the top of my head:
* No way to assign alerts to people
* No free-form comments when dismissing alerts?
* Old alerts re-open if a regression reintroduces the change, with no history on the alert to explain why an "opened 6 months ago" alert suddenly reappeared
* Alerts that can't have PR fixes opened because "the vulnerable package is no longer used" (paraphrasing) are still open. Why?
Such a lackluster area of the product that it has me wanting to give Snyk (blarghh) another look, or some other alternatives.
(Disclosure: current PM for Dependabot Alerts, currently working at GitHub)
You're right -- we haven't invested nearly enough in Dependabot Alerts. We're working to change that, starting with some foundational improvements like alert persistence, which shipped in February. (https://github.blog/2022-02-08-improving-developer-experienc...)
After our recent ship, we're in good shape to start addressing some of your concerns, like greater clarity through an alert's lifecycle or comments with dismissal.
Would love to hear any additional feedback. Let me know!
I don't think it's too hot a take, but it is incorrect. Dependabot Alerts (and Updates) are required on all internal services and folks use them (and provide similar feedback).
I know that the security alerts team in general is working on ways of assigning alerts (or opening issues that reference those alerts)--hopefully this is something that will show up soon.
(Disclosure: former Dependabot PM, no longer work at GitHub)
Hate to say it but have you tried turning it on and off again? IIRC sometimes the state of alerts got stuck in a weird place and resulted in the incorrect state. Toggling it might put it back into the desired state.
I'd also check and make sure Updates is turned off (delete the dependabot.yml).
That's how everyone does it today, but given that GitHub has an issues product, it really should be better integrated.
The main issue is that issues are publicly viewable, while security alerts are only available to certain users (e.g. repo admins). Since repo owners of e.g. popular OSS repos might not want public issues around their security vulnerabilities, this isn't something that's been actioned.
Should GitHub offer more granular permissions on issues to resolve this? Probably, but unfortunately that's reaching fairly deep into how GitHub works today.
(Disclosure: former Dependabot PM, no longer at GitHub)
Stack Exchange (and by extension, Stack Overflow) publishes all their data (anonymized) for download and offline usage as part of the Archive.org project. If you wanted, you too could have a copy of it locally to search while the site is down.
Not to sound snarky, but what are you doing to address the 34 incident outages in the past 90 days? I feel like there's been no real update on this slow-rolling disaster. We're not even a week into April and there's already been a 5-hour outage in Codespaces.
This is my biggest concern. The more we centralise on GitHub, the more we centralise the risk on a service which goes down harder and more often than our Jenkins infrastructure did.
I've had multiple meetings with my team in the past month about what our contingency plan is for moving off GH if need be. It's gotten that bad. So far, we've gotten to "well I guess GitLab is available and we can put it behind a VPN" and punted on talking about it any further.
It's weird to me that GitHub doesn't have larger machine types available for Actions yet. I don't want to bother with a self-hosted runner just to get more CPUs. They have much larger machines available for Codespaces - why not Actions? I'm happy to pay for them.
Hey, founder of BuildJet here.
With BuildJet for GitHub Actions, you can get up to 64 vCPUs as a GitHub Actions runner. We plug right into your existing setup and offer significantly higher per-core performance than the native runners.
I wish GitHub Actions worked more like Drone: just run commands in specified Docker containers. None of this weird, probably dangerous in-channel signaling via STDOUT. Completely skip the base OS image with a billion tools people might want preinstalled, which shifts over time; that's just asking for your build to break while you're on vacation. It's a very Microsoft way to do it. It's messy.
I know GitHub Actions has a built-in option to kinda run like that, but I have had a ton of problems with permissions in that mode, and it incurs an unavoidable 30-second overhead setting up Docker every single time, which for a 2-second unit test suite is an eternity.
I loved Drone CI when it was OSS. I contributed to Drone; I loved Drone with my whole heart. Drone stopped loving me; they pulled the rug out from under me, made my contributions non-free, and charged more than my company and I could possibly justify to use my own contributions. I learned the hard way why you don't want to sign a CLA.
I am using GitHub Actions and I am not particularly happy about it. GitHub Actions: “It’s better than Jenkins and about on par with Travis”
I’m trying to love GitHub Actions; however, when builds and tests randomly fail on custom runners that are configured exactly the same, and there’s no option to debug, it becomes painfully frustrating. We’ve tried most of the big CIs - Travis, CircleCI, Codeship - and while GH Actions integrates nicely with GitHub itself, it’s a far inferior product compared to its peers.
If anyone at GHA is reading comments: could you pressure your upstream VM hosts to allow ‘rr’ (Mozilla’s record-and-replay debugger) usage? They need to enable the performance monitoring counters (PMU), but it is game-changing to be able to use CI not only to detect issues, but also to exactly reproduce rare bugs by simply downloading an exact replay of the steps that led to them. We currently use self-hosted Buildkite and Buildbot runners specifically because we can enable the required performance counters, and we temporarily save every run as an rr file so any result can be reproduced locally and investigated.
We switched to buddy.works[0] about a year ago and honestly it’s just been… smooth. The UI is just great, the wealth and breadth of options is ever increasing and all the basics like knowing what went wrong, restarting, debugging, duplicating etc just work as you’d expect. One of the few companies I can recommend.
I also highly recommend Buddy. It's the most intuitive build system I've found: every step in the pipeline is a container image and commands on top. Incredibly flexible and powerful but also easy to maintain.
We have based our entire deployment pipeline on GHA. Works great when it works... and after all the outages of the past month, we're working on an exit strategy.
They do, actually! All of the public help requests and feature suggestions have been in Jira since, like, the early 00s. It's weird to see when they pop up in Google results with ancient screenshots attached.
" Turn weekly team photos into GIFs and upload to README"
Well, as a remote worker who keeps my camera off as much as possible on principle, I'd really hate this. It actually puts me off applying, knowing glimpses of my home would get turned into GIFs for the entire company to see.