Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: use pl self-hosted runners for test #1381

Merged
merged 6 commits into from
Oct 26, 2023
Merged

ci: use pl self-hosted runners for test #1381

merged 6 commits into from
Oct 26, 2023

Conversation

galargh
Copy link
Contributor

@galargh galargh commented Aug 22, 2023

This PR puts test and coverage jobs on PL self-hosted runners.

On c5.2xlarge machines, the workflow finishes in 14m 26s. On c5.4xlarge it's 13m 22s. If you want, I could get you a rough breakdown of the difference in $.

This PR doesn't look into how we could start using caching again to speed up builds. We'll look into that too, as soon as we have some free time on our hands (one can hope).

For comparison, the most recent run on master took 28m 46s.

@codecov-commenter
Copy link

codecov-commenter commented Aug 22, 2023

Codecov Report

Merging #1381 (a536001) into master (8076305) will increase coverage by 0.04%.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1381      +/-   ##
==========================================
+ Coverage   91.06%   91.11%   +0.04%     
==========================================
  Files         145      145              
  Lines       27397    27397              
==========================================
+ Hits        24950    24963      +13     
+ Misses       2447     2434      -13     

see 5 files with indirect coverage changes

@galargh galargh requested review from anorth and Stebalien September 3, 2023 10:33
@galargh galargh marked this pull request as ready for review September 3, 2023 10:34
Copy link
Member

@anorth anorth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what the fromJSON syntax is doing, but the result here looks great. Thank you!

@galargh
Copy link
Contributor Author

galargh commented Sep 4, 2023

I don't understand what the fromJSON syntax is doing

Oh yeah, that's far from obvious. We could potentially simplify it here, too. Let me break it down.

runs-on: ${{ fromJSON(github.repository == 'filecoin-project/builtin-actors' && '["self-hosted", "linux", "x64", "4xlarge"]' || '"ubuntu-latest"') }}

The above means that if this code is evaluated in the context of this repository (as opposed to a fork), we turn ["self-hosted", "linux", "x64", "4xlarge"] string into an array (of labels) that runs-on can accept. Otherwise, we turn the "ubuntu-latest" into ubuntu-latest string, which runs-on can accept too.

An alternative form of writing this, perhaps a clearer one, would be something like this.

runs-on: ${{ github.repository == 'filecoin-project/builtin-actors' && fromJSON('["self-hosted", "linux", "x64", "4xlarge"]') || 'ubuntu-latest' }}

This shows the real reason why we're using fromJSON in the first place i.e. GitHub expressions do support array operations on arrays, but they have no syntax for defining static arrays 🤷

The reason why I used the first version is that I copied it from somewhere where the default (the part after ||) is not a static string but a value sourced from configuration variables (which can either be an array or a string).

@Stebalien
Copy link
Member

I would be interested in a rough cost difference. When we first reported this issue, testing was taking an hour. At the moment, the "important" test jobs (build, test, clippy) run in about 20m, which is "ok" but not "great".

Looking at these runs, the "14m" (c5.2xlarge) is more like "10m" given that we can merge once tests finish (don't have to wait for coverage. So that's much better.

But... given that we're already down to ~20m, I want to make sure this isn't going to cost too much.

@anorth
Copy link
Member

anorth commented Oct 19, 2023

If you want, I could get you a rough breakdown of the difference in $

@galargh I saw elsewhere you suggested the difference was small. Could you quantify that a bit better for us?

@galargh
Copy link
Contributor Author

galargh commented Oct 20, 2023

If you want, I could get you a rough breakdown of the difference in $

@galargh I saw elsewhere you suggested the difference was small. Could you quantify that a bit better for us?

Yes, sorry for the delay. I'll be getting to you with the numbers either today or Monday.

@galargh
Copy link
Contributor Author

galargh commented Oct 25, 2023

Hi, sorry for the delay again, here are the stats I put together.

In the last 30 days, we had:

  • 153 executions of Continuous Integration / test job; each took 20 minutes on average (Grafana View)
  • 149 executions of Continueous Integration / coverage job; each took 25 minutes 24 seconds (Grafana View)

If we transition to 4xlarege self-hosted runners, we can expect:

  • Continuous Integration / test job to take 7 minutes 40 seconds which should incur monthly cost of $26.81 (AWS Estimate)
  • Continuous Integration / coverage job to take 12 minutes 20 seconds which should incur monthly cost of $45.38 (AWS Estimate)

If we transition to 2xlarge self-hosted runners, we can expect:

  • Continuous Integration / test job to take 9 minutes 25 seconds which should incur monthly cost of $21.30 (AWS Estimate)
  • Continuous Integration / coverage job to take 13 minutes 25 seconds which should incur monthly cost of $33.23 (AWS Estimate)

All in all, if we go with 4xlarge, we're looking at $72.19 monthly cost which is going to save us 53 hours 50 minutes of runtime a month overall. With 2xlarge, it's $53.53 for 46 hours 6 minutes worth of savings. Of course, we don't have to stick to the same type of runners for each job either. If you wanted to, we could also explore other runner options (these are the ones we currently have but adding new ones is not a problem).

Assumptions made when creating the estimates:

  • a self-hosted runner can take up to 2 minutes to be brought up
  • there is 15% shared infrastructure cost
  • Continuous Integration / test doesn't do significant outbound data transfer
  • Continuous Integration / coverage uploads coverage ~200M reports to GitHub archive and codecov.io

cc @laurentsenta (you were interested in how self-hosted runners comparison reports look like)

@Stebalien
Copy link
Member

  1. This definitely seems worth it.
  2. It sounds like the 2xlarge runners are sufficient.
  3. IMO, we should just drop test coverage, or make it manual. It's not particularly reliable anyways and not worth the cost.

Thoughts @anorth?

@anorth
Copy link
Member

anorth commented Oct 25, 2023

  1. Yes
  2. I'm inclined to spend the extra $ for lower latency
  3. Yes I'm ok going to manual triggering for test coverage. It's a bit awkward to find in GH UI, but I agree it's of limited utility at the moment

@anorth
Copy link
Member

anorth commented Oct 25, 2023

I've moved coverage out in #1465. I think that means this PR is good to go if we use 4xlarge.

@galargh
Copy link
Contributor Author

galargh commented Oct 26, 2023

That's awesome to hear :) I've rebased the PR now. Let me know if you want to change to 2x runners instead. Personally, I think I'd stay on 4x too since it's only $5 and we cut over a minute off of each build.

@Stebalien
Copy link
Member

Nah, $5 a month is worth it.

@Stebalien
Copy link
Member

@anorth I'll leave this to you to merge.

@anorth anorth added this pull request to the merge queue Oct 26, 2023
Merged via the queue into master with commit 0f202bb Oct 26, 2023
12 checks passed
@anorth anorth deleted the galargh-patch-1 branch October 26, 2023 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants