
Pipeline class #36 (Draft)
wants to merge 3 commits into master

Conversation

luiztauffer (Collaborator)

@mikarubi here's a proposal for a Pipeline class, which should bring some improvements, at least when running Voluseg as an application:

  • a single and centralized Spark context, which facilitates the Spark configuration. From a separation-of-concerns perspective, functions should not be responsible for recreating Spark sessions and contexts
  • eventually this can be further abstracted into a parallel_backend which would accept Spark, Dask, etc... and functions would be agnostic to it
  • a centralized Logger. I already managed to capture all print statements, though not yet the Spark logging, but I believe it should be possible
  • the simplest way possible for Python users to run Voluseg; see below:
import voluseg

# Set initial parameters
parameters = voluseg.get_parameters_dictionary()
parameters['dir_ants'] = '/path_to/ants-2.5.2/bin/'
parameters['dir_input'] = '/path_to/sample_data'
parameters['dir_output'] = '/path_to/output'
parameters['registration'] = 'high'
parameters['diam_cell'] = 5.0
parameters['f_volume'] = 2.0

# Initialize the pipeline
voluseg_pipeline = voluseg.Pipeline(parameters=parameters)

# Run the pipeline
voluseg_pipeline.run()

let me know what you think about this, and then I can put more time into writing tests, improving the logging, etc...

@mikarubi (Owner)

Not sure if this improves on the existing setup but happy to be convinced otherwise! Here are some thoughts:

  • a single and centralized Spark context, which facilitates the Spark configuration. From a separation-of-concerns perspective, functions should not be responsible for recreating Spark sessions and contexts

If I understand, once a module (such as pyspark) is imported then it will be cached and not be imported again, even if the import statement is executed again. So if everything works well, Spark sessions should not be recreated. But I may be wrong about this.

  • eventually this can be further abstracted into a parallel_backend which would accept Spark, Dask, etc... and functions would be agnostic to it

Agree we should abstract the parallelism, but also feel like it could be done in a simpler way. At the moment, almost all Spark commands go through https://github.com/mikarubi/voluseg/blob/master/voluseg/_tools/evenly_parallelize.py and this function could be made more general to include dask, etc. (similar to load_volume, save_volume)
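
As an illustration only (not the current voluseg API): a minimal sketch of what a more general evenly_parallelize might look like, where the backend parameter and the Dask branch are hypothetical additions.

def evenly_parallelize(input_list, backend="spark"):
    """Distribute input_list across workers with the chosen backend (sketch)."""
    if backend == "spark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        # let Spark split the items across its default number of partitions
        return spark.sparkContext.parallelize(input_list)
    elif backend == "dask":
        import dask.bag as db
        # a Dask bag offers a comparable map/reduce interface over the same items
        return db.from_sequence(input_list)
    else:
        raise ValueError(f"unknown backend: {backend}")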

  • a centralized Logger. I already managed to capture all print statements, though not yet the Spark logging, but I believe it should be possible

I think this would be cool. In general, Spark logs should be kept separate from the main logs because they are very verbose and will overwhelm voluseg-specific output.
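
One way to keep them separate, as a minimal sketch assuming a single shared SparkSession and a hypothetical voluseg.log file:

import logging
from pyspark.sql import SparkSession

# quiet the verbose Spark/JVM output; it stays in Spark's own logs
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# voluseg-specific messages go to their own logger and file
logger = logging.getLogger("voluseg")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("voluseg.log"))
logger.info("this line lands in voluseg.log, not in the Spark output")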

  • the simplest way possible for Python users to run Voluseg; see below:
# Initialize the pipeline
voluseg_pipeline = voluseg.Pipeline(parameters=parameters)

# Run the pipeline
voluseg_pipeline.run()

What would be the advantage of this over adding all the function calls into a voluseg.run() script? Since the parameters will be saved in a JSON file, the setup can always be recreated if needed (e.g. if the pipeline stops midway).

let me know what you think about this, and then I can put more time into writing tests, improving the logging, etc...

Overall I feel like we should leave this for now and revisit when other things are up to speed.

@luiztauffer (Collaborator, Author)

@mikarubi

Overall I feel like we should leave this for now and revisit when other things are up to speed.

That sounds good to me, but I’ll go through some rationale on the points above, just for the record, since I think they will be relevant for the project eventually:

1. Single Spark Context and Session per pipeline

If I understand, once a module (such as pyspark) is imported then it will be cached and not be imported again

What I meant was not importing the module itself, but instantiating new spark sessions and contexts, like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext

Re-creating these for each step in the pipeline seems unnecessary and potentially inefficient. But, more importantly, whenever we need to set specific configurations for the sessions, we would have to do that at every single place where the instantiation happens, unnecessarily repeating code and making it less maintainable. We might need to set specific configurations, for example, when dealing with the Spark logs or even the test coverage (see #28).
It would be preferable to do the session and context configuration and instantiation only once, and share these objects between steps. A Pipeline class helps organize that.
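
A minimal sketch of that idea (class and argument names are illustrative, not a final API): configure Spark once in the Pipeline and pass the same session to every step.

from pyspark.sql import SparkSession

class Pipeline:
    def __init__(self, parameters, spark_config=None):
        self.parameters = parameters
        builder = SparkSession.builder.appName("voluseg")
        # all Spark options are set in one place
        for key, value in (spark_config or {}).items():
            builder = builder.config(key, value)
        self.spark = builder.getOrCreate()
        self.steps = []  # the voluseg step functions would be registered here

    def run(self):
        for step in self.steps:
            # every step receives the shared session instead of creating its own
            step(self.parameters, spark=self.spark)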

2. Single Logger per pipeline

A Pipeline class also helps organize that. The centralized logger is an important feature to have in place before we discuss the cloud deployment.
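
For reference, a minimal sketch of how print statements can be captured into a central logger; capturing the Spark (JVM) logs would need a separate mechanism, and the log file name here is hypothetical.

import contextlib
import io
import logging

logging.basicConfig(filename="voluseg.log", level=logging.INFO)
logger = logging.getLogger("voluseg.pipeline")

class _LoggerWriter(io.TextIOBase):
    # file-like object that forwards writes to the logger
    def write(self, message):
        if message.strip():
            logger.info(message.strip())
        return len(message)

with contextlib.redirect_stdout(_LoggerWriter()):
    print("this print statement ends up in voluseg.log")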

3. Class vs functional approach

What would be the advantage of this over adding all the function calls into a voluseg.run() script? Since the parameters will be saved in a JSON file, the setup can always be recreated if needed (e.g. if the pipeline stops midway).

We could have it both ways, but the class approach offers better management of state and shared variables, imo. For example, we could have a class method that automatically detects where the execution last stopped, based on the parameters file (as sketched below), if you think that would be useful.
From an application perspective, both approaches can accomplish what is needed, so it will boil down to personal taste.
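
A rough sketch of what such a resume mechanism could look like; the step names, marker files, and output layout are purely illustrative stand-ins for the actual voluseg steps.

import os

class Pipeline:
    # ordered pipeline steps; the names stand in for the actual voluseg steps
    STEPS = ("process_volumes", "align_volumes", "detect_cells")

    def __init__(self, parameters):
        self.parameters = parameters

    def run(self, resume=True):
        for name in self.STEPS:
            marker = os.path.join(self.parameters["dir_output"], name + ".done")
            if resume and os.path.exists(marker):
                continue  # this step already finished in a previous run
            getattr(self, name)()
            open(marker, "w").close()  # record completion for future resumes

    def process_volumes(self):
        print("processing volumes")  # placeholder step

    def align_volumes(self):
        print("aligning volumes")  # placeholder step

    def detect_cells(self):
        print("detecting cells")  # placeholder step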

4. Parallelism backend abstraction

At the moment, almost all Spark commands go through https://github.com/mikarubi/voluseg/blob/master/voluseg/_tools/evenly_parallelize.py

The spark context is used in other functions as well, and I believe these other functions would also need the Dask option.

@mikarubi (Owner)

Ok, sounds good. Just a quick note about SparkSession.builder.getOrCreate() for our future reference.

This method first checks whether there is a valid global default SparkSession, and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.builder.getOrCreate.html

So the question is whether the first call of this function creates a global session (which won't be recreated in future calls):
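
A quick sanity check of that documented behavior, as a minimal sketch assuming a local pyspark installation:

from pyspark.sql import SparkSession

first = SparkSession.builder.getOrCreate()
second = SparkSession.builder.getOrCreate()
# the second call returns the existing global session rather than a new one
print(first is second)  # True

Note that builder options passed after the first call may not take effect on the already-created SparkContext, which is part of the motivation for configuring the session in a single place.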
