
Approach for multiplexing workflows: new parallelization levels #199

Closed
jluethi opened this issue Nov 14, 2022 · 18 comments

jluethi commented Nov 14, 2022

See the first comment below for a better overview than this initial issue.


Tasks should be able to run on many resources. How do we implement this?

Tasks that typically run per cycle ("preprocessing", image-to-image):

  • Parsing
  • Illumination correction
  • Stitching
  • Registration

Tasks that could run per cycle, but mostly run across all cycles (on a subset of channels selected by the user; these are image-to-label or image-to-table tasks => "analysis phase"):

  • Segmentation
  • Measurements
  • napari workflows

Open questions:

  • How do we loop over multiple resources? As an input to the task or handled by the server?
  • How do we handle the inputs to tasks that loop over cycles? Both in the backend (e.g. do we store some cycle index as part of the input) and on the web GUI.
  • How do we let the user pick which cycles a workflow should run on? Could be something like allowing users to pick the resources to run on (e.g. defaulting to all, with the option to change that?)
  • Can users add more cycles (resources) after the processing has started? Would be a nice feature to have, but high complexity. What if the user needs to change parameters? Typically, that would mean that the workflow needs to be rerun for everything (such that we are sure the output is actually generated with the parameters in the pipeline => reproducibility)

Scenario 1: We keep the hard limit of 1 workflow per project
We could limit the flexibility here: users can't add resources later and can't run on subsets of resources => no questions about workflow changes.
To cover use cases like #37, we could add a new version of the parsing task that adds channels to an existing OME-Zarr file. That way, a user could create a new project, add the existing OME-Zarr as an input, and add more channels.
Also, this means that looping over resources may need to become part of the tasks themselves, adding significant complexity to each task.
=> seems like a hack

Scenario 2: For multiplexing projects, we can have both workflows per resource and workflows for the whole dataset.
To keep our workflows & handling somewhat simple, we could introduce a workflow for each resource. That way, it's easy to run them separately (just 2 separate workflows), the tasks don't need to be aware of multiple resources (a task just handles a single resource), and they can be run at different times while maintaining reproducibility.
Separate workflows per resource make sense for the preprocessing phase, less so for the analysis phase => have separate workflows for the two parts, and only run the analysis phase once preprocessing has finished for all parts.
This would add a bit of complexity on the parsing side: how do we avoid race conditions when creating the separate OME-Zarr images for each resource? => we will need predefined cycle/resource indices (see the sketch below). Some race conditions could still be possible when updating the OME-Zarr metadata files per well / plate.
Are workflows assigned to a given resource? How do we represent that?
Complexity of workflow linearity: registration will need access to a given cycle plus a reference cycle (e.g. aligning everything to the first cycle). Registration can only start once both the reference cycle and the current cycle have been processed => we may need to split into 3 parts: preprocessing, registration, analysis. Registration only starts once preprocessing is finished, and also runs per cycle.
This also adds complexity on the web GUI side: how do we represent many workflows in preprocessing & registration? How do we represent status for each? How do we make it easy to run a very similar workflow across all cycles, without the user entering the same options 20 times?
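On the race-condition point above: one way to sidestep collisions during parsing is to fix the cycle/resource indices before any job starts. A minimal sketch (the folder names and the mapping are illustrative, not an actual Fractal API):

```python
# Assign each cycle a fixed acquisition index up front, so that parallel
# parsing jobs write to disjoint image paths inside the same well.
cycle_folders = ["cycle1/", "cycle2/", "cycle3/"]
acquisition_index = {folder: i for i, folder in enumerate(cycle_folders)}

# e.g. "cycle2/" always writes to plate.zarr/B/03/1, regardless of the
# order in which the per-cycle jobs happen to run.
```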


I favor a version of scenario 2, but there are still some questions to be worked out. It may come down to being explicit about splitting the processing into 3 phases with separate workflows, and having a good way to chain those workflows.

jluethi changed the title from "Approach for handling processing of many resources (semi) independently" to "Approach for multiplexing workflows" Nov 17, 2022
jluethi commented Nov 17, 2022

I've been giving this a bit more thought after the Wednesday call. I think I have a more structured overview of it here.

Once data is parsed into the OME-Zarr, the file is structured in the following way:

└── B
    └── 03
        ├── 0   => image 0, all channels for cycle 1 => acquisition 0
        ├── 1   => image 1, all channels for cycle 2 => acquisition 1
        ├── 2   => image 2, all channels for cycle 3 => acquisition 2
        └── 3   => image 3, all channels for cycle 4 => acquisition 3

So far, our tasks that parallelize on the well level (all the processing tasks; the other tasks currently prepare the OME-Zarr files) use a component like 20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/.
[Side note: looking through the JSON submission logs, I noticed that we also define a well as "well": ["20200812-CardiomyocyteDifferentiation14-Cycle1.zarr/B/03/0/"]. But actually, B/03 is the well; the /0 already refers to the first image in that well.]
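For reference, the acquisition information is recorded in the well metadata of the OME-NGFF spec. A minimal sketch of reading it back with zarr (the plate path is a hypothetical local example):

```python
import zarr

# Open well B/03 of the plate and list its images; in a multiplexing plate,
# each image entry also carries an "acquisition" index (OME-NGFF well spec).
well = zarr.open_group("plate.zarr/B/03", mode="r")
for image in well.attrs["well"]["images"]:
    print(image["path"], image.get("acquisition"))
```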

I really like this idea that all of our tasks run on OME-Zarr images, because each acquisition is in itself an OME-Zarr image. This should make our workflows quite flexible and make it easier to handle non-HCS OME-Zarr images in the future.

So far, we worked under the assumption that there is only ever 1 image per well and that we always run on it. Now with multiplexing, I think we need to generalize this:

  • A well can have 1-n images.
  • We may want to run a workflow on:
    • all images across all wells separately (=> each image is a job)
    • a subset of channels that gets loaded from different images within the same well
    • only a given image within the well (e.g. a given acquisition => cycle)

This will require a bit of generalization, but I think we can achieve that.
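A hedged sketch of what that generalization could look like: one component string per OME-Zarr image, so that each image becomes one job (the well-to-images mapping below is illustrative, not Fractal's actual data model):

```python
# well path -> image paths within that well (1-n images per well)
wells = {"B/03": ["0", "1", "2", "3"], "B/05": ["0", "1", "2", "3"]}

components = [
    f"plate.zarr/{well}/{image}/"
    for well, images in wells.items()
    for image in images
]
# => one job per OME-Zarr image, e.g. "plate.zarr/B/03/0/"
```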

The things that I haven't figured out yet:

  1. Do we need to be able to support different parameters across the different images?
    This is certainly the case for parsing (each acquisition will have its own set of channel names). But maybe parsing is a special case.
    Illumination correction may need to map different cycles to different illumination profiles (maybe there is a default per wavelength; see the discussion on specifying channels: Refactor: How do we refer to channels? #211). But I don't think we can always enforce a single parameter that works across all multiplexing cycles.
    In registration, we typically register every cycle against cycle 0 (but then don't need to register cycle 0).

What if we have multiplexing cycles with fewer channels? In that case, the same channel list wouldn't apply to all cycles (e.g. for their illumination correction parameters).

=> Is there such a thing as acquisition-specific parameters? [without making tasks super complex] (see the sketch after this list)

  2. How do we run workflows that process channels from multiple acquisitions, and thus from multiple OME-Zarr images?
    Some tasks should run across a combination of cycles, and thus process data from multiple OME-Zarr images. For example, we may have a nuclear segmentation in cycle 1 and want to measure the nuclear intensities of all channels across all cycles within those nuclei. How do we submit this? And how do we ensure all the channels have been processed to that level?

  3. How do we submit workflows for different OME-Zarr files separately for part of the workflow, but not for the rest?
    Do we need a new parallelization level?
    We will want to run illumination correction for each cycle independently, so it would make sense to submit a separate job per cycle. But for a lot of image analysis, e.g. a napari workflow that measures channels across cycles, we want to submit for all the cycles combined.
    Or at least we need to be able to do so; there are also cases where a user may want to submit the same napari workflow to each cycle independently (=> maybe the first case where the parallelization level actually should be something the user can change).
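On point 1 above, one hypothetical shape for acquisition-specific parameters would be a mapping keyed by acquisition index; the structure and names below are purely illustrative:

```python
# Illumination profiles per acquisition; cycles may have different channel
# sets, so the per-cycle dictionaries need not share the same keys.
illumination_profiles = {
    0: {"A01_C01": "cycle0_DAPI.png", "A01_C02": "cycle0_GFP.png"},
    1: {"A01_C01": "cycle1_DAPI.png"},  # a cycle with fewer channels
}

# A job processing image "1" (acquisition 1) would look up its own entry:
params_for_this_job = illumination_profiles[1]
```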


tl;dr: Do we need a new parallelization level: OME-Zarr image or OME-Zarr collection (= well for our case)?
If we have this new parallelization level, how do we handle parameters for this?

tcompa commented Nov 17, 2022

For the record: the first test of a multiplexing workflow will be parsing+MIP

tcompa commented Nov 24, 2022

#223 in principle enables a workflow that:

  • starts with a set of cycle folders,
  • parses all images into a single zarr,
  • copies the zarr to a new one (optionally with ROI tables projected to 2D),
  • fills the new zarr with MIP projections from the original one

This is tested in test_workflow_multiplexing_MIP (within tests/test_workflows_multiplexing.py), but a real-life example is in order (especially because the replicate_zarr_structure task was heavily refactored).
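Spelled out as an ordered task list, that workflow could look as follows (a sketch: replicate_zarr_structure and maximum_intensity_projection are named in this thread, while the first two task names are assumptions):

```python
# One job sequence matching the bullets above.
workflow = [
    "create_zarr_structure_multiplex",  # assumed name: parse cycle folders into one zarr
    "yokogawa_to_zarr",                 # assumed name: fill in the image data
    "replicate_zarr_structure",         # copy the zarr (+ ROI tables projected to 2D)
    "maximum_intensity_projection",     # fill the new zarr with MIPs
]
```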

jluethi commented Nov 24, 2022

That's great! Very curious to test this.
With "real-life example", you mean running the same test set through this workflow as an actual Fractal workflow? I think that would be a very good basis for the discussions of multiplexing goals, handling cycle-dependent parameters, handling cycle-independent processing etc.
That will make it much more concrete to discuss those topics.

tcompa commented Nov 24, 2022

With "real-life example", you mean running the same test set through this workflow as an actual Fractal workflow?

It could also be via one of our examples in this repo, but then it's worth going the extra mile and testing it directly as part of Fractal. The multiplexing test dataset I'm using is very trivial (it's not an actual multiplexing dataset, but a fake one), and I am not 100% sure that everything is fine with a more realistic case.

Also, because of the refactor, it's always good to re-run a "real-life" example with non-multiplexing datasets, just to confirm that everything still works.

tcompa commented Nov 24, 2022

By the way: this PR did not touch the maximum_intensity_projection task at all. This fits well with our plan that if a task operates on (e.g.) an OME-Zarr image, then it may be irrelevant whether that image is part of a multiplexing or non-multiplexing dataset.
Of course this is a simple case, because MIP does not require additional parameters.

jluethi commented Nov 24, 2022

By the way: this PR did not touch the maximum_intensity_projection task at all. This fits well with our plan that if a task operates on (e.g.) an OME-Zarr image, then it may be irrelevant whether that image is part of a multiplexing or non-multiplexing dataset.

That's great to hear. It argues for a parallelization-level approach that is agnostic to multiplexing, I agree! :)
And good idea to test it with the multiplexing test dataset, which already has some extra complexity in its structure (though in theory, that shouldn't affect how the tasks then run). Let me know if I can help with some of the testing; otherwise I'll just wait curiously for the result.

tcompa commented Nov 24, 2022

This is now part of fractal-tasks-core 0.4.6.

If you have some examples ready or almost ready, it would be great to test:

  1. A non-multiplexing workflow that includes a MIP and whatever other task acting on the 2D zarr.
  2. A multiplexing workflow that goes from parsing (say for the current "testing" multiplexing dataset) to MIP.

(I can also take care of this, but a bit later next week)

jluethi commented Nov 24, 2022

Great, I'll start to have a look at it; I should have tests running by tomorrow at the latest :)

jluethi commented Nov 24, 2022

Wow, this was smoother than expected!

  1. A non-multiplexing workflow that includes a MIP and whatever other task acting on the 2D zarr.

I ran example 8 (2x2 dataset, MIP, segmentation & napari workflows). It all ran as before, processing still works without issues.

A multiplexing workflow that goes from parsing (say for the current "testing" multiplexing dataset) to MIP.

I ran example 2 again, now with replicate zarr & MIP added. It makes the MIPs for all 3 cycles and all the channels. This also tests the reverse order of cycles, which worked fine as well :)

The processed 3rd cycle as a MIP:
[Screenshot: MIP of the processed third cycle]

jluethi commented Nov 24, 2022

I think that's a great sign for this multiplexing approach. Will be interesting to tackle tasks with some parameters.
@tcompa Would you expect that e.g. cellpose_segmentation also works out of the box and is just run for each cycle independently?
And will be interesting to tackle illumination correction & napari workflows afterwards :)

Also, I need to run some larger experiments so I can actually observe the jobs running, see what's happening on slurm. The small multiplexing test was over before I realized it 😅

tcompa commented Nov 24, 2022

I think that's a great sign for this multiplexing approach. Will be interesting to tackle tasks with some parameters. @tcompa Would you expect that e.g. cellpose_segmentation also works out of the box and is just run for each cycle independently?

It's possible, but I've not thought carefully about it to be honest. Feel free to try ;)

Also, I need to run some larger experiments so I can actually observe the jobs running, see what's happening on slurm. The small multiplexing test was over before I realized it 😅

Those tests will get much more informative in a few days, when the jobs are the SLURM-backend ones (quick update). But you can already get a feeling with parsl, of course.

jluethi commented Nov 24, 2022

It's possible, but I've not thought carefully about it to be honest. Feel free to try ;)

Will do :)

Those tests will get much more informative in a few days, when the jobs are the SLURM-backend ones (quick update). But you can already get a feeling with parsl, of course.

Briefly reran it. We do see 6 small slurm jobs for the 2 wells x 3 cycles => that would be the goal! So that looks great.

I'll update/make an overview of all the multiplexing issues & open questions so that we can keep an overview of the multiplexing processing milestone :)

jluethi commented Nov 24, 2022

It's possible, but I've not thought carefully about it to be honest. Feel free to try ;)

Doesn't work yet, see #227. Makes sense though, we anyway know we need to address channel names :)

jluethi changed the title from "Approach for multiplexing workflows" to "Approach for multiplexing workflows: new parallelization levels" Nov 25, 2022
jluethi commented Nov 25, 2022

I'll be making a general overview issue. Let's focus this issue on the parallelization levels approach.

We may even use the parallelization-level approach to allow Fractal processing for multi-FOV plates (=> a different case: it does not go by acquisition, but by many images of the same acquisition). Not something we need to implement now, but let's keep that use case in mind while we generalize the parallelization levels.
Another generalization that could come up, though I think we'll postpone the details, is parallelizing over time points in time-series data. That gets a bit more complicated, because all time points live within the same OME-Zarr file (=> maybe not solved via parallelization levels? Let's see).

What should the parallelization levels be called?

  • Plate level: e.g. creating the OME-Zarr structure, replicating the OME-Zarr structure
  • Well level: Process all the images in a well in one job, e.g. combining channels from different acquisitions (=> only really relevant for multiplexing; in a multi-FOV case, I think one would not want to process FOVs in the same job)
  • Image level: Run over all the images in the well separately. This should probably be the default for tasks. It works well in default settings, works well for many multiplexing settings (e.g. illumination correction or MIPs), and would also work well for multi-FOV setups. (See the sketch after this list.)
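As a sketch of how tasks could declare these levels (the task names, field, and values are illustrative, not a settled schema):

```python
# Each task declares the level it parallelizes over; the server then decides
# how many jobs to submit (one per plate, per well, or per image).
TASK_LEVELS = {
    "create_ome_zarr_structure": "plate",
    "illumination_correction": "image",
    "measure_across_cycles": "well",
}
```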

There's one edge case I can think of: registration will basically run at the image level, but will always also load the reference cycle (typically acquisition 0) to compare against. And we may need to run some consolidation job at the well level to find the overlap area of all cycles and adjust the ROIs (topic TBD in the registration jobs).

A thing that can be slightly confusing is that there is another level of looping happening: over ROIs. This does not affect what runs in a given job, though. One can loop over the well ROI (a single ROI), the FOV ROIs (many per well), or newly created ROIs like organoids (TBD, also many per well). This all happens within a single Fractal job (currently sequentially); see the sketch below.
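A minimal sketch of that inner, within-job ROI loop, assuming ROI tables stored as AnnData tables as in fractal-tasks-core (the table path is an example):

```python
import anndata as ad

# Load the FOV ROI table of one OME-Zarr image and iterate over its ROIs;
# all iterations happen inside the same Fractal job (currently sequentially).
roi_table = ad.read_zarr("plate.zarr/B/03/0/tables/FOV_ROI_table")
for roi_name in roi_table.obs_names:
    ...  # process one region of interest
```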

tcompa commented Nov 25, 2022

* **Plate level:** e.g. creating the OME-Zarr structure, replicating OME-Zarr structure

Quick reminder: the OME-Zarr creation cannot happen at the plate level, because the concept of a plate is not defined before the end of the task. As we discussed recently somewhere else, this really is a global task.

If we want an OME-Zarr-creation task working at the plate level, then we should:

  1. Drop support for multi-plate parsing (never tested, but required - see Test multi-plates support #49).
  2. Provide a list of plates already upon adding an image dataset, before running the OME-Zarr-creation task.

jluethi commented Nov 25, 2022

the OME-zarr creation cannot happen at the plate level, because the concept of plate is not defined before the end of the task. As we discussed recently somewhere else, this really is a global task.

Happy to have this as a global level for the time being, then.

We will most likely need to support multi-plate setups, at least for the case where the plates are processed in the same way (typical use case: screening). For use cases where users want to process different plates differently, I think we'd go with separate projects.

Provide a list of plates already upon adding an image datasets, before running the ome-zarr-creation task.

Once we tackle multi-plate, we can work out the details of point 2 above. I could see this being the series of different input datasets, specified before the OME-Zarr-creation task is run.

jluethi commented Sep 27, 2023

Multiplexing registration tasks run as expected at the well level or image level (depending on which part of the calculation is done). See further discussions on how to handle subsets here: fractal-analytics-platform/fractal-server#792
Closing this issue.
