
Pipeline class #36 (Draft)
wants to merge 3 commits into master

Conversation

luiztauffer (Collaborator)

@mikarubi here's a proposal for a Pipeline class, which should bring some improvements, at least when running Voluseg as an application:

  • a single and centralized Spark context, which facilitates the Spark configuration. From a separation-of-concerns perspective, functions should not be responsible for recreating Spark sessions and contexts
  • eventually this can be further abstracted into a parallel_backend which would accept Spark, Dask, etc... and functions would be agnostic to it
  • a centralized Logger. I already managed to capture all print statements, though not yet the Spark logging, but I believe it should be possible
  • the simplest way possible for Python users to run Voluseg; see below:
import voluseg

# Set initial parameters
parameters = voluseg.get_parameters_dictionary()
parameters['dir_ants'] = '/path_to/ants-2.5.2/bin/'
parameters['dir_input'] = '/path_to/sample_data'
parameters['dir_output'] = '/path_to/output'
parameters['registration'] = 'high'
parameters['diam_cell'] = 5.0
parameters['f_volume'] = 2.0

# Initialize the pipeline
voluseg_pipeline = voluseg.Pipeline(parameters=parameters)

# Run the pipeline
voluseg_pipeline.run()

let me know what you think about this, and then I can put more time into writing tests, improving the logging, etc...

@mikarubi (Owner)

Not sure if this improves on the existing setup but happy to be convinced otherwise! Here are some thoughts:

  • a single and centralized Spark context, which facilitates the Spark configuration. From a separation-of-concerns perspective, functions should not be responsible for recreating Spark sessions and contexts

If I understand, once a module (such as pyspark) is imported then it will be cached and not be imported again, even if the import statement is executed again. So if everything works well, Spark sessions should not be recreated. But I may be wrong about this.

  • eventually this can be further abstracted into a parallel_backend which would accept Spark, Dask, etc... and functions would be agnostic to it

Agree we should abstract the parallelism, but also feel like it could be done in a simpler way. At the moment, almost all Spark commands go through https://github.com/mikarubi/voluseg/blob/master/voluseg/_tools/evenly_parallelize.py and this function could be made more general to include dask, etc. (similar to load_volume, save_volume)
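
As an illustration only (not the current voluseg API): a minimal sketch of what a more general evenly_parallelize might look like, where the backend parameter and the Dask branch are hypothetical additions.

def evenly_parallelize(input_list, backend="spark"):
    """Distribute input_list across workers with the chosen backend (sketch)."""
    if backend == "spark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        # let Spark split the items across its default number of partitions
        return spark.sparkContext.parallelize(input_list)
    elif backend == "dask":
        import dask.bag as db
        # a Dask bag offers a comparable map/reduce interface over the same items
        return db.from_sequence(input_list)
    else:
        raise ValueError(f"unknown backend: {backend}")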

  • a centralized Logger. I already managed to capture all print statements, though not yet the Spark logging, but I believe it should be possible

I think this would be cool. In general, Spark logs should be kept separate from the main logs because they are very verbose and will overwhelm voluseg-specific output.
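
One way to keep them separate, as a minimal sketch assuming a single shared SparkSession and a hypothetical voluseg.log file:

import logging
from pyspark.sql import SparkSession

# quiet the verbose Spark/JVM output; it stays in Spark's own logs
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# voluseg-specific messages go to their own logger and file
logger = logging.getLogger("voluseg")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("voluseg.log"))
logger.info("this line lands in voluseg.log, not in the Spark output")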

  • the simplest way possible for Python users to run Voluseg; see below:
# Initialize the pipeline
voluseg_pipeline = voluseg.Pipeline(parameters=parameters)

# Run the pipeline
voluseg_pipeline.run()

What would be the advantage of this over adding all the function calls into a voluseg.run() script? Since the parameters will be saved in a JSON file, the setup can always be recreated if needed (e.g. if the pipeline stops midway).

let me know what you think about this, and then I can put more time into writing tests, improving the logging, etc...

Overall I feel like we should leave this for now and revisit when other things are up to speed.

@luiztauffer (Collaborator, Author)

@mikarubi

Overall I feel like we should leave this for now and revisit when other things are up to speed.

That sounds good to me, but I’ll go through some rationale on the points above, just for the record, since I think they will be relevant for the project eventually:

1. Single Spark Context and Session per pipeline

If I understand, once a module (such as pyspark) is imported then it will be cached and not be imported again

What I meant was not importing the module itself, but instantiating new spark sessions and contexts, like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext

Re-creating these for each step in the pipeline seems unnecessary and potentially inefficient. But, more importantly, whenever we need to set specific configurations for the sessions, we would have to do that at every single place where the instantiation happens, unnecessarily repeating code and making it less maintainable. We might need to set specific configurations, for example, when dealing with the Spark logs or even the test coverage (see #28).
It would be preferable to do the session and context configuration and instantiation only once, and share these objects between steps. A Pipeline class helps organize that.
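
A minimal sketch of that idea (class and argument names are illustrative, not a final API): configure Spark once in the Pipeline and pass the same session to every step.

from pyspark.sql import SparkSession

class Pipeline:
    def __init__(self, parameters, spark_config=None):
        self.parameters = parameters
        builder = SparkSession.builder.appName("voluseg")
        # all Spark options are set in one place
        for key, value in (spark_config or {}).items():
            builder = builder.config(key, value)
        self.spark = builder.getOrCreate()
        self.steps = []  # the voluseg step functions would be registered here

    def run(self):
        for step in self.steps:
            # every step receives the shared session instead of creating its own
            step(self.parameters, spark=self.spark)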

2. Single Logger per pipeline

A Pipeline class also helps organize that. The centralized logger is an important feature to have in place before we discuss the cloud deployment.
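
For reference, a minimal sketch of how print statements can be captured into a central logger; capturing the Spark (JVM) logs would need a separate mechanism, and the log file name here is hypothetical.

import contextlib
import io
import logging

logging.basicConfig(filename="voluseg.log", level=logging.INFO)
logger = logging.getLogger("voluseg.pipeline")

class _LoggerWriter(io.TextIOBase):
    # file-like object that forwards writes to the logger
    def write(self, message):
        if message.strip():
            logger.info(message.strip())
        return len(message)

with contextlib.redirect_stdout(_LoggerWriter()):
    print("this print statement ends up in voluseg.log")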

3. Class vs functional approach

What would be the advantage of this over adding all the function calls into a voluseg.run() script? Since the parameters will be saved in a JSON file, the setup can always be recreated if needed (e.g. if the pipeline stops midway).

We could have it both ways, but the class approach offers better management of state and shared variables, imo. For example, we could have a class method that automatically detects where the execution last stopped, based on the parameters file (as sketched below), if you think that would be useful.
From an application perspective, both approaches can accomplish what is needed, so it will boil down to personal taste.
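
A rough sketch of what such a resume mechanism could look like; the step names, marker files, and output layout are purely illustrative stand-ins for the actual voluseg steps.

import os

class Pipeline:
    # ordered pipeline steps; the names stand in for the actual voluseg steps
    STEPS = ("process_volumes", "align_volumes", "detect_cells")

    def __init__(self, parameters):
        self.parameters = parameters

    def run(self, resume=True):
        for name in self.STEPS:
            marker = os.path.join(self.parameters["dir_output"], name + ".done")
            if resume and os.path.exists(marker):
                continue  # this step already finished in a previous run
            getattr(self, name)()
            open(marker, "w").close()  # record completion for future resumes

    def process_volumes(self):
        print("processing volumes")  # placeholder step

    def align_volumes(self):
        print("aligning volumes")  # placeholder step

    def detect_cells(self):
        print("detecting cells")  # placeholder step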

4. Parallelism backend abstraction

At the moment, almost all Spark commands go through https://github.com/mikarubi/voluseg/blob/master/voluseg/_tools/evenly_parallelize.py

The spark context is used in other functions as well, and I believe these other functions would also need the Dask option.

@mikarubi (Owner)

Ok, sounds good. Just a quick note about SparkSession.builder.getOrCreate() for our future reference.

This method first checks whether there is a valid global default SparkSession, and if yes, return that one. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns the newly created SparkSession as the global default.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.builder.getOrCreate.html

So the question is whether the first call of this function creates a global session (which won't be recreated in future calls):
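
A quick sanity check of that documented behavior, as a minimal sketch assuming a local pyspark installation:

from pyspark.sql import SparkSession

first = SparkSession.builder.getOrCreate()
second = SparkSession.builder.getOrCreate()
# the second call returns the existing global session rather than a new one
print(first is second)  # True

Note that builder options passed after the first call may not take effect on the already-created SparkContext, which is part of the motivation for configuring the session in a single place.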
