
Repeated imports #20

Open
luiztauffer opened this issue Aug 27, 2024 · 1 comment

Comments

luiztauffer (Collaborator) commented Aug 27, 2024

We can avoid repeated imports throughout the scripts by moving them to the top of each script.
There's one specific import that I'm concerned about, though: numpy, imported after setting local ENV vars:

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
import numpy as np

This happens multiple times across the project. Is this necessary because pyspark opens multiple threads, or something similar?
Is there any way we could avoid it?

mikarubi (Owner) commented Aug 29, 2024

This is a problem with numpy multithreading rather than with spark. Once numpy has been imported, the number of requested threads can no longer be changed, which causes problems during spark parallelism.

It should work if we move these exact blocks to the top of the script. The key is to always set the environment variables before numpy is imported. We also need to watch out for other modules that import numpy indirectly. So, in general, this change has to be made carefully, and we need to test the performance to confirm that multithreading does not reappear during parallelism.
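A minimal sketch of the pattern described above: the thread-limit environment variables are set at the very top of the entry script, before numpy (or any module that imports numpy) is loaded, so the BLAS/OpenMP thread pools are pinned to one thread per worker. This is an illustration of the import-ordering constraint, not the project's actual entry point.

```python
# Must run before numpy is imported anywhere in the process:
# BLAS/OpenMP backends read these variables once, at import time.
import os

for var in (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
):
    os.environ[var] = "1"

import numpy as np  # safe: thread limits are already in place

result = np.ones(3).sum()
```

As an alternative that avoids the import-order constraint entirely, the third-party `threadpoolctl` package can cap BLAS/OpenMP thread pools after numpy is imported (via `threadpool_limits(limits=1)`), which may be easier to apply consistently across many scripts.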
