Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RADICAL executor MPI test leaves threads running at shutdown #3708

Open
benclifford opened this issue Nov 29, 2024 · 6 comments
Open

RADICAL executor MPI test leaves threads running at shutdown #3708

benclifford opened this issue Nov 29, 2024 · 6 comments
Labels

Comments

@benclifford
Copy link
Collaborator

Describe the bug

In PR #3397 I've been pushing slowly but aggressively on making sure all parts of Parsl close down OS-ish resources they have allocated, rather than relying on some later OS cleanup. Right now primarily threads, file descriptors (but also ideally processes). Work from that has slowly trickled into master over time.

Right now I'm getting these threads left behind at the end of the RADICAL-Pilot MPI test. The assert failure is the aggressive thread checking I have added, the threads are I think from inside the radical.pilot codebase.

Is there a way to shut these down at executor shutdown?

$ pytest parsl/tests/test_radical/test_mpi_funcs.py --config local

>           assert threading.active_count() == 1, "test left threads running: " + repr(threading.enumerate())
E           AssertionError: test left threads running: [<_MainThread(MainThread, started 139787084908352)>, <Thread(Thread-1 (_work), started daemon 139786984863424)>, <Thread(Thread-2 (_run_proxy), started daemon 139786740291264)>, <Thread(Thread-3 (_monitor), started daemon 139786731898560)>, <Thread(Thread-4 (_work), started daemon 139786723505856)>, <Thread(Thread-6 (_watch), started daemon 139786312459968)>, <Thread(Thread-7 (_listener), started daemon 139786304067264)>, <Thread(Thread-8 (_listener), started daemon 139785934984896)>, <Thread(Thread-9 (_listener), started daemon 139785926592192)>, <Thread(Thread-10 (_listener), started daemon 139785918199488)>, <Thread(Thread-11 (_listener), started daemon 139785909806784)>, <Thread(Thread-12 (_listener), started daemon 139785901414080)>, <Thread(Thread-14 (_listener), started daemon 139785540724416)>, <Thread(Thread-15 (_listener), started daemon 139785515546304)>, <Thread(Thread-16 (_listener), started daemon 139785062569664)>, <Thread(Thread-18 (_listener), started daemon 139785045784256)>, <Thread(Thread-19 (_listener), started daemon 139784676701888)>, <Thread(Thread-21 (_listener), started daemon 139784643131072)>, <Thread(Thread-22 (_listener), started daemon 139784634738368)>, <Thread(Thread-23 (_bulk_collector), started daemon 139784626345664)>]
E           assert 20 == 1
E            +  where 20 = <function active_count at 0x7f22b6d37ec0>()
E            +    where <function active_count at 0x7f22b6d37ec0> = threading.active_count

cc @AymenFJA @andre-merzky

To Reproduce
Checkout #PR 3397, run the above command

Expected behavior
Assert passes, because threads are cleaned up.

Environment
commit 254d8ba of PR #3397

@andre-merzky
Copy link

Hi @benclifford - that's really good timing you are showing here, as we work on those code paths anyway right now :-D We track the work related to this ticket now in radical-cybertools/radical.pilot#3272.

One thing: we are a bit optimistic about thread termination, in the sense that we set termination flags but don't join the threads. So there is usual a delay of up to a second until the threads are actually gone (the slowest one needs 0.5 seconds max to pick up that signal). Is that something you would worry about, or is it ok for you to add a delay to the test to capture that behavior?

@benclifford
Copy link
Collaborator Author

@andre-merzky its preferable for my work that everything is gone by the time the executor says it is shut down, rather than adding in delays in the test.

@andre-merzky
Copy link

Yeah, I thought so. All right, this might need a bit more time then, but we'll get it done.

github-merge-queue bot pushed a commit that referenced this issue Dec 14, 2024
In addition to the work happening
[here](radical-cybertools/radical.pilot#3269) to
address this issue #3708. This PR
ensures the cleanup of threads generated by RPEX.
- Bug fix (partially #3708)

---------

Co-authored-by: Ben Clifford <benc@hawaga.org.uk>
@benclifford
Copy link
Collaborator Author

#3718 removes one of the 20 leftover threads - the bulk collector one - leaving 18 leftovers (+ 1 expected main thread)

@andre-merzky
Copy link

@benclifford : the remaining threads are now cleanly terminated by radical-cybertools/radical.pilot#3269 - that goes into the next RP release, before the break.

@benclifford
Copy link
Collaborator Author

I've just tried radical pilot 1.90 (as in PR #3725) and I see all the threads shut down now. I'll close this issue when #3725 is merged.

github-merge-queue bot pushed a commit that referenced this issue Dec 17, 2024
This brings in fixes for a few issues that are fixed on the radical side
of things:

#3722 - a race condition on task completion
#3708 - cleaner shutdown handling as part of #3397
#3646 - Python 3.13 support

# Changed Behaviour

whatever has changed in radical-pilot

# Fixes

#3722

## Type of change

- Bug fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants