Evaluate how to cache packages to reduce flakiness #25
Comments
During my tests I was hit by several problems related to the stability of this GitHub action: apt mirror syncs, rosdep network problems, etc. I agree with the approach of caching packages as much as we can. Another measure could be to implement a retry mechanism to avoid problems that disappear within seconds. I think the use of Docker proposed in ros-tooling/action-ros-ci#67 could also fall under this same category of reducing flakiness. I'm available to help with implementations; I'm just not sure which priorities we have for the different solutions.
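For illustration, a retry mechanism along the lines suggested above could be as small as a shell loop inside a workflow step. This is only a sketch: the step name, package, and retry parameters are assumptions, not anything setup-ros currently ships.

```yaml
# Minimal sketch of retrying transient apt failures inside a workflow step.
# Package name, attempt count, and sleep interval are illustrative only.
- name: Install bootstrap dependencies (with retries)
  run: |
    for attempt in 1 2 3; do
      sudo apt-get update && sudo apt-get install -y python3-rosdep && break
      echo "apt failed (attempt ${attempt}), retrying in 15s" >&2
      sleep 15
    done
```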
We have an open issue in our backlog to tackle this: ros-security/aws-roadmap#173. We are always open to reviewing contributions!
At a point where it's stable, or even now, since we're essentially using nightly builds anyway, we should use the official ROS Docker image: https://hub.docker.com/r/osrf/ros2/
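Running a job inside that image would look roughly like the sketch below. The image tag, checkout action version, and build command are assumptions for illustration, not a documented setup.

```yaml
# Illustrative sketch: run the job inside a prebuilt OSRF ROS 2 image instead
# of provisioning the bare runner. The tag is an assumption; pick whichever
# published tag matches the target distribution, and adjust the build step to
# whatever tooling the image actually provides.
jobs:
  build:
    runs-on: ubuntu-latest
    container:
      image: osrf/ros2:nightly
    steps:
      - uses: actions/checkout@v2
      - name: Build
        run: colcon build
```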
I'm not sure this is possible today, since we're using this action in practice to target multiple OSes (Ubuntu, Debian) and architectures (x64, arm32, arm64), which those images don't provide (that I know of). Perhaps it could special-case to use those images for Ubuntu, though. On a different note, regarding the "nightly" part: I wonder whether, during heavy-development periods, the "up-to-23-hours-out-of-date" snapshot you'd be working against would introduce confusing issues for users.
Linking this since it may be relevant.
Seeing this too with apt:
Here's what I've come up with so far: RoverRobotics-forks/rmw_cyclonedds@b6d7de8
Unfortunately, this will be very flaky, because the cache will never be invalidated unless the key changes. I.e., this will keep reusing whatever was cached once, forever, even if APT/PIP/... content changes. You'd need a way to reliably hash APT content, which will be... challenging to write. I've been playing with a simpler alternative: run the CI on a prebuilt Docker image: https://github.com/ros-tooling/setup-ros-docker This removes the need for APT/PIP/... on Linux, at least. It's not ready for prime time, but the PoC works.
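To illustrate the invalidation concern: actions/cache only saves a new entry when the primary key misses, so a fixed key freezes the first snapshot indefinitely. One common mitigation, shown as a sketch below (not what the linked commit or setup-ros-docker does), is to rotate the key on a schedule and fall back to older entries via restore-keys; the cached path and weekly granularity are assumptions.

```yaml
# Sketch of a time-rotated cache key: a fresh entry is saved at most once per
# week, and restore-keys lets newer runs start from the latest prior snapshot.
- name: Compute weekly cache stamp
  id: stamp
  run: echo "::set-output name=week::$(date +%Y-%V)"
- uses: actions/cache@v1
  with:
    path: ~/.cache/pip
    key: pip-${{ runner.os }}-${{ steps.stamp.outputs.week }}
    restore-keys: |
      pip-${{ runner.os }}-
```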
Why is that different from the local cache folder? My intent is that it will reload the cache, and apt/pip/brew/etc. will do their own cache invalidation and update the cache folder as they normally do. At the end of the build, the modified cache gets saved off again. It does seem there are some permission bugs to sort out, though. Edit: the bug I'm seeing seems to be actions/cache#144
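For reference, the approach described here, persisting the package managers' own cache folders and letting the tools handle their usual invalidation, might look roughly like the sketch below. The path and key scheme are assumptions, not what the linked commit uses; a per-run key is one way to make the "Post Cache" step upload a fresh snapshot every time instead of skipping the save on an exact hit.

```yaml
# Rough sketch: restore the most recent snapshot of Homebrew's own cache
# folder, then save a fresh snapshot at the end of every run. Note that
# root-owned locations such as /var/cache/apt run into the permission problem
# referenced above (actions/cache#144).
- uses: actions/cache@v1
  with:
    path: ~/Library/Caches/Homebrew
    key: brew-folder-${{ runner.os }}-${{ github.run_id }}
    restore-keys: |
      brew-folder-${{ runner.os }}-
```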
In this case, you're paying the cost of the cache mechanism plus the previously existing mechanism, so it'd be interesting to actually run a benchmark to see if we shave any time doing things like this.
@thomas-moulard From a performance standpoint, saving and loading the entire cache folders did not pay off. Surprisingly to me, the runtime gains on downloading packages were virtually nil, and the runtime losses from caching were significant. cache: https://github.com/RoverRobotics-forks/rmw_cyclonedds/actions/runs/67620206
Not sure I'm following here; the cached version is taking 1h, and the one w/o cache seems significantly faster, ~30 min?
Yes. But that's the "workflow execution time", which is not really a meaningful performance number: it's longer than the longest build but shorter than the total build time. When not caching, the Mac builds alone totaled more than the entire workflow time. Dig into each job, specifically how long the "Acquire ROS dependencies" (i.e. setup-ros) step and the Cache X / Post Cache X steps take.
Sorry, I still don't see it. I took two OS X runs, and it still doesn't compare favorably:
Seems like we shave off 1 minute by caching, but Homebrew caching adds 3m14s? Edit: maybe I am not associating the runs correctly? Let's look at the sum of all OS X builds to rule out this potential issue: still shorter w/o caching. Maybe the builds just take a varying amount of time? Did you maybe have better luck in a previous run?
My takeaway is that caching in the way I did it isn't worth it. I don't know what you mean by "I still don't see it", but I don't think it's worth looking into much further. In your first link, restoring the cache took 3m34s; you can see this if you expand the Cache Homebrew step:
And then at the end (Post Cache Homebrew) it didn't save off the cache again, so that took no time. Runtime variability can be high! For example, in these two runs, both without caching and both starting:
Here, ros setup took 4m34s = 274s with
Here, ros setup took 11m54s = 714s with
I see, thanks for clarifying!
Performance aside, it seems all the flakiness I'm seeing is from azure.archive.ubuntu.com/ubuntu. That URL fails for both apt-get update and apt-get upgrade. Adding another mirror may be a good idea. https://github.com/ros2/rmw_cyclonedds/runs/551468958?check_suite_focus=true
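One possible workaround, sketched below purely for illustration (setup-ros does not do this), is to rewrite the sources list to use a different mirror before the first apt-get update; the replacement mirror is an assumption.

```yaml
# Illustrative workaround: swap the sometimes-flaky Azure mirror for the main
# Ubuntu archive, then refresh the package lists.
- name: Switch apt mirror
  run: |
    sudo sed -i 's|http://azure.archive.ubuntu.com/ubuntu|http://archive.ubuntu.com/ubuntu|g' /etc/apt/sources.list
    sudo apt-get update
```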
Generally, we're moving away from running those builds on bare metal. Running them in Docker containers on GH Actions would mitigate those issues, as Debian/Canonical infrastructure is not as flaky. It also prevents updates of the host image from breaking the builds.
Okay. Are these Docker containers ready to use today, or do you mean hypothetically? It would be nice to have this mentioned in the README.
It's a WIP, but I have a PoC here, not really ready for prime time yet. Also, OSRF has official Docker images; we've seen other users use them successfully with the action.
Description
As a GH Actions user, my pipelines infrequently fail due to infrastructure connectivity issues.
We should evaluate whether all parts currently requiring internet connectivity could be cached, including:
pip
apt
choco
ccache
(?)
Test Plan
Documentation Plan
Release Plan
Acceptance Criteria
Once all above items are checked, this story can be moved to done.