minimum_running_time_in_minutes #4076

Open
sdarwin opened this issue Aug 16, 2024 · 4 comments
Labels
question Further information is requested

Comments

@sdarwin
Contributor

sdarwin commented Aug 16, 2024

Hi,
I just encountered a situation where some GHA jobs failed because they lacked runners.
These are ubuntu 22.04 machines, without too much customization. The AMI is pre-installed with standard packages.
After debugging I believe that I discovered the problem.
The full boot-up takes around 5 minutes, including the "Runner update in progress, do not shutdown runner." step. However, if it takes 5:15 or so, guess what happens... The scale-down function kills the instance.

  1. I have just set minimum_running_time_in_minutes: 15. Theoretically, this will solve it. What are the reasons not to have a longer minimum_running_time_in_minutes by default? It would avoid this problem. What are the pros and cons? This could happen to others.

  2. What is the meaning of the similarly named variable runner_boot_time_in_minutes? It says "The minimum time for an EC2 runner to boot and register as a runner." This definition might make sense to someone who already understands very well what the variable does. But to me, not knowing that, it does not explain much. Consider this analogy:

"The minimum time you may be at the Starbucks. 20 minutes."

Then, what happens if I go into Starbucks, order a coffee, and leave within 5 minutes?

I have violated the "minimum time". What are the consequences?

"The minimum time for an EC2 runner to boot". What if I boot within 5 minutes? It is less than the "minimum" time of 20 minutes. The explanation should say more. For example: "the minimum time... before the scale-down function will consider this instance for termination." However, if that is the definition, it sounds like minimum_running_time_in_minutes, so why are there two identical variables? So, it must be something else.

Thanks.

@npalm
Collaborator

npalm commented Aug 16, 2024

The scale down will kill instances that are detected as orphans, which means instances that are running in AWS but not registered in GitHub. The variable runner_boot_time_in_minutes allows you to configure the scale down to ignore instances that are still booting. The default is 5 minutes. In your case it seems you have to set this variable to 6 minutes (or so).
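
To make the distinction concrete, here is a minimal Python sketch of the behaviour as it is described in this thread. It is not the module's actual scale-down code: the `Instance` fields, the function name `scale_down_decision`, and the exact order of the checks are assumptions made for illustration, based only on the two variable descriptions and the explanation above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one EC2 runner instance."""
    minutes_since_launch: float
    registered_in_github: bool
    busy: bool = False


def scale_down_decision(
    inst: Instance,
    runner_boot_time_in_minutes: float,
    minimum_running_time_in_minutes: float,
) -> str:
    """Assumed reading of the two timers; not the real implementation."""
    if not inst.registered_in_github:
        # Running in AWS but unknown to GitHub: an orphan candidate.
        # The boot-time window protects instances that are still booting.
        if inst.minutes_since_launch < runner_boot_time_in_minutes:
            return "keep: still booting"
        return "terminate: orphan"
    if inst.busy:
        return "keep: busy"
    # Registered and idle: the minimum running time applies.
    if inst.minutes_since_launch < minimum_running_time_in_minutes:
        return "keep: below minimum running time"
    return "terminate: idle"


# An instance that needs 5:15 to register, checked with a 5-minute boot window.
slow_boot = Instance(minutes_since_launch=5.25, registered_in_github=False)
print(scale_down_decision(slow_boot, 5, 15))
print(scale_down_decision(slow_boot, 15, 15))
```

Under this reading, an instance that has not yet registered is only protected by runner_boot_time_in_minutes, so raising minimum_running_time_in_minutes alone would not save a slow-booting instance.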

@npalm npalm added the question Further information is requested label Aug 16, 2024
@sdarwin
Contributor Author

sdarwin commented Aug 16, 2024

minimum_running_time_in_minutes 'The time an ec2 action runner should be running at minimum before terminated, if not busy.'

runner_boot_time_in_minutes 'The minimum time for an EC2 runner to boot and register as a runner.'

Is the purpose of runner_boot_time_in_minutes regarding termination? The description should mention that. Or, with the wording "scale down". Either way.

The choice of the word "minimum" is funny. Would this not also explain it?

runner_boot_time_in_minutes The maximum allowed time an EC2 runner has to boot and register as a runner, before being considered for termination.

"max" is descriptive. OK, how can it be said with "minimum"?

runner_boot_time_in_minutes 'The minimum amount of time that must have passed while an EC2 runner is booting up and registering as a runner, before being able to be considered for termination'

"you have to set this variable to 6 minutes (or so)."

I have set it to 15 minutes. Actual boot up times are variable and unknown. The current default is really cutting it close, and actually causing a failure.

Why not allow instances plenty of time to boot up?
The issues here are:

  • suggested doc updates, above.
  • change the defaults of both those variables to 10 or 15 minutes. Because... why not. It would be more dependable and reliable.

Of course, you may approve or reject either, or both, of the ideas. :-) OK.


Tangent: If someone has properly configured their environment, an instance is only launched when it is needed. For ephemeral runners it is then really important that the launch succeeds, or else the jobs will be one runner short. There is no fallback or recovery if you have auto-scaled ephemeral runners. Maybe this is another GitHub issue. Suppose you want to avoid always-on dedicated servers and depend on autoscaling instead: if one autoscaled runner fails due to the boot time issue being discussed here, the job will fail for lack of a runner, and it will not recover.

@sdarwin
Contributor Author

sdarwin commented Aug 19, 2024

I had not realized! A "job retries" feature just got added. Awesome!

Still on the topic of the above mentioned variables, another thing that might help is to add a couple sentences in the documentation, unambiguously explaining the difference between minimum_running_time_in_minutes and runner_boot_time_in_minutes, since they look extremely similar to each other. Even to the point that the docs say this:
"So ensure you configure the minimum_running_time_in_minutes to a value that is high enough to get your runner booted and connected to avoid it being terminated before executing a job."

Where is runner_boot_time_in_minutes in that same paragraph?

If there is an "orphan" runner, is that from either of the above variables? Is it possible to distinguish two types of "orphan" runners, or are they the same in the CloudWatch logs? Could the CloudWatch logs distinguish which timer killed the instance, minimum_running_time_in_minutes or runner_boot_time_in_minutes, and mention that in the log?
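
As an illustration of that last request, here is a small Python sketch of what a log entry naming the responsible timer could look like. Everything in it is hypothetical: the function name `termination_log_entry`, the field names, and the instance ID are made up for this example and do not reflect what the scale-down lambda currently writes.

```python
import json
from typing import Optional


def termination_log_entry(
    instance_id: str,
    minutes_since_launch: float,
    registered_in_github: bool,
    busy: bool,
    runner_boot_time_in_minutes: float,
    minimum_running_time_in_minutes: float,
) -> Optional[str]:
    """Return a JSON log line naming the timer that expired, or None if kept."""
    if not registered_in_github:
        if minutes_since_launch < runner_boot_time_in_minutes:
            return None  # still inside the boot window
        timer = "runner_boot_time_in_minutes"
        threshold = runner_boot_time_in_minutes
    elif busy or minutes_since_launch < minimum_running_time_in_minutes:
        return None  # busy, or idle but below the minimum running time
    else:
        timer = "minimum_running_time_in_minutes"
        threshold = minimum_running_time_in_minutes
    return json.dumps(
        {
            "instance_id": instance_id,
            "action": "terminate",
            "expired_timer": timer,
            "threshold_minutes": threshold,
            "minutes_since_launch": minutes_since_launch,
        },
        sort_keys=True,
    )


# Hypothetical instance that took longer than the boot window to register.
print(termination_log_entry("i-0example", 5.25, False, False, 5, 15))
```

A field such as `expired_timer` would make it possible to tell from the log alone which of the two variables needs to be raised.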

@npalm
Collaborator

npalm commented Aug 22, 2024

It would be great if you had a bit of time to improve the docs and explain the difference via a PR. You can find the variable runner_boot_time_in_minutes in the root module and the multi-runner module.

Orphan runners are tagged by scale down before they are deleted. This change was made in one of the latest releases. You can find logs of runners marked as orphans in the scale-down log.
