Pipeline class #36
base: master
Conversation
Not sure if this improves on the existing setup but happy to be convinced otherwise! Here are some thoughts:
If I understand correctly, once a module (such as pyspark) is imported, it is cached and will not be imported again, even if the import statement is executed a second time. So, if everything works well, Spark sessions should not be recreated. But I may be wrong about this.
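To make the import-caching point concrete, here is a minimal illustration of standard Python behavior (not voluseg code):

```python
import sys

import pyspark            # first import: the module is loaded and cached
assert "pyspark" in sys.modules

import pyspark            # repeated import: served from sys.modules, not re-executed
```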
Agree we should abstract the parallelism, but also feel like it could be done in a simpler way. At the moment, almost all Spark commands go through https://github.com/mikarubi/voluseg/blob/master/voluseg/_tools/evenly_parallelize.py and this function could be made more general to include dask, etc. (similar to load_volume, save_volume)
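For illustration, here is a rough sketch of what a backend-agnostic evenly_parallelize might look like; the backend argument and both branches are assumptions, not the current voluseg implementation:

```python
# Sketch only: the `backend` argument and branch bodies are illustrative assumptions.
def evenly_parallelize(items, backend="spark"):
    """Distribute `items` across workers using the chosen parallelism backend."""
    if backend == "spark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        # one partition per item keeps the work evenly spread across executors
        return spark.sparkContext.parallelize(items, numSlices=len(items))
    elif backend == "dask":
        import dask.bag as db
        return db.from_sequence(items, npartitions=len(items))
    else:
        raise ValueError(f"Unknown backend: {backend}")
```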
I think this would be cool. In general, Spark logs should be kept separate from the main logs because they are very verbose and will overwhelm voluseg-specific output.
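For example (just a sketch of one possible approach), Spark's own log level can be turned down independently of a voluseg-specific logger:

```python
import logging
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")    # quiet Spark's verbose output

logger = logging.getLogger("voluseg")     # voluseg-specific messages stay visible
logger.setLevel(logging.INFO)
```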
What would be the advantage of this over adding all the function calls into a voluseg.run() script? Since the parameters will be saved in a JSON file, the setup can always be recreated if needed (e.g. if the pipeline stops midway).
Overall, I feel like we should leave this for now and revisit it when other things are up to speed.
That sounds good to me, but I’ll go through some rationale on the points above, just for the record, since I think they will be relevant for the project eventually:
1. Single Spark context and session per pipeline. What I meant was not importing the module itself, but instantiating new Spark sessions and contexts, like this:
   spark = SparkSession.builder.getOrCreate()
   spark_context = spark.sparkContext
   Re-creating these for each step in the pipeline seems unnecessary and potentially inefficient. More importantly, whenever we need to set specific configurations for the sessions, we would have to do so at every single place the instantiation happens, unnecessarily repeating code and making it less maintainable. We might need specific configurations, for example, when dealing with the Spark logs or even the test coverage (see #28).
2. Single logger per pipeline. A Pipeline class also helps organize that. The centralized logger is an important feature before we discuss cloud deployment.
3. Class vs. functional approach. We could have it both ways, but the class approach offers better management of state and shared variables, imo. For example, we could have a class method that automatically detects where the execution stopped last, based on the parameters file, if you think that would be useful (see the sketch after this list).
4. Parallelism backend abstraction. The Spark context is used in other functions as well, and I believe those other functions would also need the Dask option.
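Here is a rough sketch of the kind of Pipeline class I have in mind; all names (Pipeline, parameters_path, resume_step, "last_completed_step", etc.) are illustrative assumptions, not a final API:

```python
# Illustrative sketch only; class and method names are assumptions, not the voluseg API.
import json
import logging

from pyspark.sql import SparkSession


class Pipeline:
    def __init__(self, parameters_path: str, parallel_backend: str = "spark"):
        self.parameters_path = parameters_path
        self.parallel_backend = parallel_backend

        # Point 1: a single Spark session/context shared by every step.
        if parallel_backend == "spark":
            self.spark = SparkSession.builder.appName("voluseg").getOrCreate()
            self.spark_context = self.spark.sparkContext

        # Point 2: a single, centralized logger, kept separate from Spark's own logs.
        self.logger = logging.getLogger("voluseg")

    def _load_parameters(self) -> dict:
        with open(self.parameters_path) as f:
            return json.load(f)

    def resume_step(self) -> str:
        # Point 3: guess where the previous run stopped, based on the saved parameters.
        # The "last_completed_step" key is a hypothetical example.
        parameters = self._load_parameters()
        return parameters.get("last_completed_step", "start")

    def run(self) -> None:
        self.logger.info("Resuming from step: %s", self.resume_step())
        # ... call the individual voluseg steps here, passing self.spark_context
        # (or the chosen backend) to the functions that need the parallelism layer.
```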
Ok, sounds good. Just a quick note about getOrCreate: the question is whether the first call of this function creates a global session (which won't be recreated in future calls):
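Something along these lines could check that (an illustrative snippet, not the one from the original comment):

```python
from pyspark.sql import SparkSession

spark_a = SparkSession.builder.getOrCreate()   # first call: creates the session
spark_b = SparkSession.builder.getOrCreate()   # later calls: return the existing session

# If getOrCreate reuses the global session, both names refer to the same object.
print(spark_a is spark_b)  # expected: True
```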
@mikarubi here's a proposal for a Pipeline class, which should bring some improvements, at least when running Voluseg as an application:
- a parallel_backend option, which would accept Spark, Dask, etc., so that the functions would be agnostic to it

Let me know what you think about this, and then I can put more time into writing tests, improving the logging, etc.
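To give a concrete feel for it, a usage sketch under the same assumptions as the Pipeline sketch above (names and arguments are illustrative, not the final interface):

```python
# Usage sketch; `Pipeline` and its arguments are illustrative assumptions.
pipeline = Pipeline(parameters_path="parameters.json", parallel_backend="spark")
pipeline.run()

# Switching the backend should not require changes to the individual steps:
pipeline_dask = Pipeline(parameters_path="parameters.json", parallel_backend="dask")
pipeline_dask.run()
```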