How do you debug jobs that are hanging? #2931
-
Hi, we have some jobs that stay in this state and never get picked up by a runner.
If I follow the trail in CloudWatch (webhook, then scale-up) for these hanging jobs, I always see a line such as:
So an instance is indeed created, and when I look at that runner's jobs, it does seem to have picked up another job (which I understand can be expected behavior), but no other runner seems to be made available to start the initial job. We end up in a situation where no runners are started in AWS, but some jobs are still waiting for runners. This seems to happen more frequently these days, but we haven't been able to identify a pattern. Re-running the job does fix the issue. We are running v2.1.1 with ephemeral multi-runners. What would you recommend investigating next? Any tips? Thanks,
Replies: 3 comments
-
Hi, any tips on the above? It's pretty annoying to have jobs hanging because no runners are started. It would be awesome to get some clues on where to look to address this. If the issue is not with our environment, I'd be more than happy to contribute back with a PR, but the challenge here is not knowing where to look next. I couldn't find a clue in the CloudWatch logs: I searched for things like "error", "exception", and various other combinations in all log groups without success. Thanks a lot
-
This is still very relevant to us. We end up using a standalone runner to clear the queue, but we definitely still have some jobs hanging for lack of available runners. We're really unsure how to proceed to find the root cause of the issue, so suggestions or tips would really be appreciated.
-
I believe I finally found the cause of the issue, which was a misconfiguration (or incorrect understanding of a setting) on my end. In short, the message to look for is:
When multiple jobs were triggered at the same time, a race condition between the runners was causing some jobs not to see runners being created. You'll find more details about the issue and a suggested improvement in this PR: https://github.com/philips-labs/terraform-aws-github-runner/pull/3046
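To make the symptom easier to reason about, here is a toy model in Python. It is only a sketch under stated assumptions, not the module's actual logic: the `simulate` function, the runner names, and the "skipped scale-up" rule are all hypothetical. It assumes each scale-up event starts one ephemeral runner, and that a runner takes whichever job is first in the queue rather than the job that triggered it.

```python
from collections import deque


def simulate(job_ids, scale_up_skipped_for):
    """Toy model of ephemeral runners (hypothetical, not the module's code).

    Each job triggers one scale-up event. Unless the event is skipped,
    it starts one runner, and that runner takes the job at the head of
    the queue, which may differ from the job that triggered it.
    Returns (runner -> job assignments, jobs still waiting).
    """
    queue = deque(job_ids)
    assignments = {}
    runner_count = 0
    for trigger in job_ids:
        if trigger in scale_up_skipped_for:
            # Assumption: stands in for whatever made a scale-up
            # decide that no new runner was needed for this job.
            continue
        runner_count += 1
        assignments[f"runner-{runner_count}"] = queue.popleft()
    return assignments, list(queue)


assignments, stuck = simulate(["A", "B", "C"], scale_up_skipped_for={"A"})
print(assignments)
print(stuck)
```

In this model the job left waiting is C, even though the scale-up for C did create a runner: that runner took an earlier job. This matches what was observed above, and it suggests comparing the total number of runners created with the total number of queued jobs, rather than only following the trail of the hanging job.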