Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: validate: don't fail-fast #10291

Merged
merged 1 commit into from
Sep 1, 2024
Merged

CI: validate: don't fail-fast #10291

merged 1 commit into from
Sep 1, 2024

Conversation

ulysses4ever
Copy link
Collaborator

See discussion at #10263

@geekosaur do we want to backport to 3.12?


Template B: This PR does not modify behaviour or interface

E.g. the PR only touches documentation or tests, does refactorings, etc.

Include the following checklist in your PR:

  • Patches conform to the coding conventions.
  • Is this a PR that fixes CI? If so, it will need to be backported to older cabal release branches (ask maintainers for directions).

Copy link
Member

@Mikolaj Mikolaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's turn it on and see how it goes.

@Mikolaj Mikolaj added the merge me Tell Mergify Bot to merge label Aug 29, 2024
@mergify mergify bot added the ready and waiting Mergify is waiting out the cooldown period label Aug 29, 2024
@geekosaur
Copy link
Collaborator

I do have a bit of a worry about this: while odd GitHub Actions failures are common, so are (especially, early in development) actual problems where, if one job fails, the others will as well (possibly after another 45 minutes, as with the Mac jobs).

@mpickering
Copy link
Collaborator

I think this PR implements

If one CI job fails (for example a windows job), then all other jobs are cancelled, which can hide further failures on other platforms or versions.

rather than

Tests are run on CI sequentially, some tests are not run if earlier tests fail (for example, unit tests run before the package tests). In a bad situation you may do multiple rounds if a unit test fails and then a package test for ./Setup and then a package test for cabal-install.

Perhaps the commit can be updated with the description about what the commit is intended to achieve?

@ulysses4ever
Copy link
Collaborator Author

@mpickering

Perhaps the commit can be updated with the description about what the commit is intended to achieve?

Indeed, thank you for catching it. I updated the commit message to say:

CI: validate: the matrix won't fail-fast

Which means that if a Windows job fails, all other jobs in the matrix
will be allowed to finish (other platforms, as well as other compilers on Windows, etc.)

Inspired by the discussion at #10263

Now, the sequential failures need another approach. Quick googling revealed that we'll have to add several

if: success() || failure()

to every step that we want to run irrespective of the success of the previous steps.

@geekosaur I don't understand your concern, could you, perhaps, elaborate? You can always cancel the job manually if you feel like it. Or push an update, and GitHub restarts the whole thing for you. I fail to see how that 45-minutes MacOS job hurts anything. It may be pointless, I agree, but if it doesn't hurt, we have a net gain, because of the other type of failures (spurious Windows failures, as you say).

@mpickering
Copy link
Collaborator

I think the point is that there will be a lot of redundant work performed on each MR if there is a failure on one of the jobs. I don't know if this will cause any issues or not, I don't know how much compute time each project is allowed.

Overall the CI workload for cabal seems quite high to me (compared to GHC). On GHC there are 5 validation jobs (5 different platforms) which run on each MR, then there is a more stringent pipeline which runs on each batch of MRs before they are merged together. I imagine this might be quite difficult to engineer with github.

@ulysses4ever Perhaps you can combine all the different steps together and just have one "test" step, which works by one call to the validation script?

@ulysses4ever
Copy link
Collaborator Author

Perhaps you can combine all the different steps together and just have one "test" step, which works by one call to the validation script?

I don't think it's a good idea. There's a reason they were split. Maybe several ones. For instance, some steps have to be blocked on specific platforms / compiler versions occasionally: this happened several times during my involvement with this (2-3 years) with MacOS and Windows.

Another reason is pure UI. I think it's much easier to see where the problem lies when you have some level of granularity in your test process. We have a bunch of ugly bash to print headers and stuff but I like GitHUb's UI for "steps" much more. Granted, this is subjective.

I don't know how much compute time each project is allowed.

AFAIK we use GitHub cloud and nothing else. I don't know why we should care about their resources. But if you think that this change is not helpful, we can close this PR — after all, you started this discussion, so if this PR doesn't implement what you intended, perhaps, it doesn't achieve anything. The reasons I did it are:

  1. @andreasabel suggested this particular change,
  2. it was very easy to implement, and
  3. seemed like a net gain to me because Windows jobs fail nondeterministically quite often. In that case, other jobs will get canceled for no good reason.

@geekosaur
Copy link
Collaborator

I don't know why we should care about their resources.

If nothing else, because it slows down any other jobs we (and anyone else) have running: GitHub doesn't give unlimited resources to free users.

@ulysses4ever
Copy link
Collaborator Author

@geekosaur can I get a reference to GitHub documentation explaining the alleged showdown?

@geekosaur
Copy link
Collaborator

The basics are here. The rest should be obvious: GitHub does not have an infinite number of machines, therefore the more concurrent actions that are running, the more load is on those machines and the longer it takes for actions to finish.

@fgaz
Copy link
Member

fgaz commented Aug 31, 2024

@geekosaur we were granted a higher limit than that, I think it's enough to sustain this change

@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Sep 1, 2024
Which means that if a Windows job fails, all other jobs in the matrix
will be allowed to finish (other platforms, as well as other compilers on Windows, etc.)

Inspired by the discussion at #10263
@Mikolaj Mikolaj force-pushed the ulysses4ever-patch-1 branch from 44950ea to f0e0985 Compare September 1, 2024 13:08
@mergify mergify bot merged commit dceba0f into master Sep 1, 2024
51 checks passed
@mergify mergify bot deleted the ulysses4ever-patch-1 branch September 1, 2024 15:13
@ulysses4ever
Copy link
Collaborator Author

Oh, I didn't notice that a merge label was applied earlier... I didn't mean to merge it before discussions are fully resolved. Sorry everyone!

@geekosaur
Copy link
Collaborator

I think we determined that Mergify doesn't consider discussion to be changes to the PR, likely because GitHub considers PR comments to be distinct from the PR itself. We need to be careful about that.

@fgaz
Copy link
Member

fgaz commented Sep 3, 2024

Negative reviews will block instead, so don't hesitate to write one if necessary.

@geekosaur
Copy link
Collaborator

@mergify backport 3.12

Copy link
Contributor

mergify bot commented Sep 14, 2024

backport 3.12

✅ Backports have been created

mergify bot added a commit that referenced this pull request Sep 14, 2024
CI: validate: don't fail-fast (backport #10291)
@geekosaur geekosaur mentioned this pull request Nov 3, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
continuous-integration merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days merge me Tell Mergify Bot to merge re: devx Improving the cabal developer experience (internal issue) ready and waiting Mergify is waiting out the cooldown period
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants