
Repeated imports #20

Open
luiztauffer opened this issue Aug 27, 2024 · 1 comment

Comments

luiztauffer (Collaborator) commented Aug 27, 2024

We can avoid repeated imports throughout the scripts by moving them to the top of each script.
There's one specific import that I'm concerned about, though: numpy, imported after setting local ENV vars:

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
import numpy as np

This happens multiple times across the project. Is this necessary because pyspark opens multiple threads, or something similar?
Is there any way we could avoid it?

mikarubi (Owner) commented Aug 29, 2024

This is a problem with numpy multithreading rather than with spark. Once numpy has been imported, the number of requested threads can no longer be changed, which causes problems during spark parallelism.

It should work if we move these exact blocks to the top of the script. The key is to always set the environment variables before numpy is imported. We also need to watch out for other modules that import numpy indirectly. So, in general, this change has to be made carefully, and we need to test the performance to confirm that multithreading does not reappear during parallelism.
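A minimal sketch of the pattern described above: the thread-limit environment variables are set at the very top of the entry script, before numpy (or any module that imports numpy) is loaded, so the BLAS/OpenMP thread pools are pinned to one thread per worker. This is an illustration of the import-ordering constraint, not the project's actual entry point.

```python
# Must run before numpy is imported anywhere in the process:
# BLAS/OpenMP backends read these variables once, at import time.
import os

for var in (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
):
    os.environ[var] = "1"

import numpy as np  # safe: thread limits are already in place

result = np.ones(3).sum()
```

As an alternative that avoids the import-order constraint entirely, the third-party `threadpoolctl` package can cap BLAS/OpenMP thread pools after numpy is imported (via `threadpool_limits(limits=1)`), which may be easier to apply consistently across many scripts.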
