Approach for multiplexing workflows: new parallelization levels #199
I've been giving this a bit more thought after the Wednesday call, and I think I have a more structured overview of it now. Once data is parsed into the OME-Zarr, the file is structured in the following way:
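A minimal sketch of that layout, assuming the standard HCS OME-Zarr hierarchy with one image per acquisition (well B/03 and the three cycles are made-up values):

```
plate.zarr/          # HCS plate
└── B/               # row
    └── 03/          # column => one well
        ├── 0/       # image 0 = acquisition/cycle 0
        ├── 1/       # image 1 = acquisition/cycle 1
        └── 2/       # image 2 = acquisition/cycle 2
```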
So far, our tasks that parallelize on the well level (all the processing tasks; the other tasks prepare the OME-Zarr files) run once per well. I really like the idea that all of our tasks run on OME-Zarr images, because each acquisition is itself an OME-Zarr image. This should make our workflows quite flexible and make it easier to handle non-HCS OME-Zarr images in the future. So far, we had the assumption that there is only ever one image per well and we always run on that. Now with multiplexing, I think we need to generalize this:
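As a minimal sketch of what that generalization could look like (assuming OME-NGFF v0.4 well metadata; the helper name is made up), a task could enumerate all images of a well instead of hard-coding image "0":

```python
import json
from pathlib import Path

def list_well_images(well_path: str) -> list[str]:
    """Return the relative paths of all images (= acquisitions) in a well."""
    attrs = json.loads((Path(well_path) / ".zattrs").read_text())
    # Per the OME-NGFF well spec, each entry has a "path"; multiplexed plates
    # additionally carry an "acquisition" id per image.
    return [img["path"] for img in attrs["well"]["images"]]

# Instead of always processing f"{well}/0", a task would loop over
# f"{well}/{img}" for img in list_well_images(well).
```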
This will require a bit of generalization, but I think we can achieve that. The things that I haven't figured out yet:
What if we have multiplexing cycles with fewer channels? In that case, the same channel list wouldn't apply to all cycles (e.g. for their illumination correction parameters) => Is there such a thing as acquisition-specific parameters? [without making tasks super complex]
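One conceivable shape for such parameters (purely illustrative; the variable and file names are assumptions) is a mapping keyed by acquisition index, so a cycle with fewer channels simply lists fewer entries:

```python
# Hypothetical acquisition-specific illumination-correction parameters:
illumination_profiles = {
    0: {"DAPI": "cycle0_dapi.png", "GFP": "cycle0_gfp.png"},
    1: {"DAPI": "cycle1_dapi.png"},  # this cycle was acquired with fewer channels
}
```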
tl;dr: Do we need a new parallelization level: OME-Zarr image or OME-Zarr collection (= well for our case)?

For the record: the first test of a multiplexing workflow will be parsing+MIP
#223 in principle enables a workflow that:
This is tested in
That's great! Very curious to test this.
It could also be via one of our examples. Also, because of the refactor, it's always good to re-run a "real-life" example with non-multiplexing datasets, just to confirm that everything still works.
By the way: this PR did not touch the tasks.
That's great to hear. Argues for the parallelization level approach that is agnostic to multiplexing, I agree! :)
This is now part of fractal-tasks-core 0.4.6. If you have some examples ready or almost ready, it would be great to test:
(I can also take care of this, but a bit later next week)
Great, I'll start to have a look at it; I should have tests running by tomorrow at the latest :)
Wow, this was smoother than expected!
I ran example 8 (2x2 dataset, MIP, segmentation & napari workflows). It all ran as before; processing still works without issues.
I ran example 2 again, now with replicate Zarr & MIP added. It makes the MIPs for all 3 cycles and all the channels. It also tests the reverse order of cycles, which worked fine as well :)
I think that's a great sign for this multiplexing approach. It will be interesting to tackle tasks that take some parameters. Also, I need to run some larger experiments so I can actually observe the jobs running and see what's happening on SLURM. The small multiplexing test was over before I realized it 😅
It's possible, but I've not thought carefully about it to be honest. Feel free to try ;)
Those tests will get much more informative in a few days, when the jobs are the SLURM-backend ones (quick update). But you can already get a feeling with parsl, of course.
Will do :)
Briefly reran it. We do see 6 small SLURM jobs for the 2 wells x 3 cycles => that would be the goal! So that looks great. I'll update/make an overview of all the multiplexing issues & open questions so that we can keep track of the multiplexing processing milestone :)
Doesn't work yet, see #227. Makes sense though; we already know we need to address channel names :)
I'll be making a general overview issue; let's focus this issue on the parallelization levels approach. We may even use the parallelization level approach to allow Fractal processing for multi-FOV plates (⇒ a different case: it does not go by acquisition, but has many images of the same acquisition). Not something we need to implement now, but let's keep that use case in mind while we generalize the parallelization levels. What should the parallelization levels be called?
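One naming sketch for the levels under discussion (an assumption, not a decided API):

```python
from enum import Enum

class ParallelizationLevel(Enum):
    PLATE = "plate"  # global tasks, e.g. OME-Zarr creation
    WELL = "well"    # = OME-Zarr collection: all acquisitions in a well
    IMAGE = "image"  # = a single OME-Zarr image, i.e. one acquisition/cycle
```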
There's one edge case I can think of: registration will basically run on the image level, but it will always also load the reference cycle (typically acquisition 0) to compare against. And we may need to run some consolidation job on the well level to find the overlap area of all cycles and adjust the ROIs (topic TBD in the registration jobs).

A thing that can be slightly confusing is that there is another level of looping happening: over ROIs. This does not affect what runs in a given job, though. One can loop over the well ROI (a single ROI), the FOV ROIs (many per well), or newly created ROIs like organoids (TBD, also many per well). This all happens within a single Fractal job, though (currently sequentially).
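A rough sketch of that edge case, assuming acquisition 0 is the reference and using a plain phase correlation as a stand-in for the actual registration (the function and URL layout are hypothetical):

```python
import dask.array as da
import numpy as np
from skimage.registration import phase_cross_correlation

def register_acquisition(well_url: str, acquisition: int, ref_acquisition: int = 0):
    """Register one acquisition of a well against the reference acquisition."""
    # The task parallelizes per image, but still loads the reference cycle:
    ref = da.from_zarr(f"{well_url}/{ref_acquisition}/0")  # highest-resolution level
    mov = da.from_zarr(f"{well_url}/{acquisition}/0")
    shift, _, _ = phase_cross_correlation(np.asarray(ref), np.asarray(mov))
    return shift  # e.g. to be stored later as adjusted ROI coordinates
```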
Quick reminder: the OME-Zarr creation cannot happen at the plate level, because the concept of a plate is not defined before the end of the task. As we discussed recently somewhere else, this really is a global task. If we want an OME-Zarr creation task working at the plate level, then we should:
Happy to have this as a global level for the time being then. We will most likely need to support multi-plate setups, at least for the case where the plates are processed in the same way (typical use case: screening). For use cases where users want to process different plates differently, I think we'd go with separate projects.
Once we tackle multi-plate, we can then work out the details of point 2 above. I could see this being a series of different input datasets, which are specified before the OME-Zarr creation task is run.
Multiplexing registration tasks run as expected at the well level or image level (depending on which part of the calculation is done). See further discussions on how to handle subsets here: fractal-analytics-platform/fractal-server#792
See first comment for a better overview than the initial issue
Tasks should be able to run on many resources. How do we implement this? (One possible task-to-level mapping is sketched after the lists below.)
Tasks that typically run per cycle ("preprocessing", image-to-image tasks):
- Parsing
- Illumination correction
- Stitching
- Registration
Tasks that could run per cycle but mostly run across all cycles (on a subset of channels selected by the user; these are image-to-label or image-to-table tasks => "analysis phase"):
- Segmentation
- Measurements
- napari workflows
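To make the two lists above concrete, here is one possible task-to-level mapping (illustrative only, not actual Fractal configuration):

```python
# Hypothetical mapping from tasks to the parallelization level they run at:
PARALLELIZATION_LEVEL = {
    "create_ome_zarr": "plate",          # global preparation step
    "illumination_correction": "image",  # per acquisition/cycle
    "stitching": "image",
    "registration": "image",             # but also reads the reference cycle
    "segmentation": "well",              # across all cycles of a well
    "measurements": "well",
    "napari_workflows": "well",
}
```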
Open questions:
Scenario 1: We keep the hard limit of 1 workflow per project
- We could limit the flexibility here: users can't add resources later and can't run on subsets of resources => no questions about workflow changes.
- To cover use cases like #37, we could add a new version of the parsing task that adds channels to an existing OME-Zarr file. That way, a user could create a new project, add the OME-Zarr as an input, and add more channels.
- Also, this means the looping over resources may need to become part of the tasks, adding significant complexity for each task.
=> Seems like a hack.
Scenario 2: For multiplexing projects, we can have both workflows per resource and workflows for the whole dataset.
To keep our workflows & handling somewhat simple, we could introduce a workflow for each resource. That way, it's easy to run them separately (just 2 separate workflows), the tasks don't need to be aware of them (a task just handles a single resource), and they can be run at different times while maintaining reproducibility.
- Separate workflows per resource make sense for the preprocessing phase, less so for the analysis phase => have separate workflows for the two parts. The analysis phase can then only run once preprocessing has finished for all parts.
- This would add a bit of complexity on the parsing side: how do we avoid race conditions when creating the separate OME-Zarr images for each resource? => We will need predefined cycle/resource indices (see the sketch after this list). Some race conditions could still be possible when updating the OME-Zarr metadata files per well / plate.
- Are workflows assigned to a given resource? How do we represent that?
- Complexity of workflow linearity: registration will need access to a given cycle + a reference cycle (e.g. aligning everything to the first cycle). Registration can only start when both the reference cycle and the current cycle have been processed => we may need to split into 3 parts: preprocessing, registration, analysis. Registration only starts when preprocessing is finished, and it also runs per cycle.
- This also adds complexity on the web GUI side: how do we represent many workflows in preprocessing & registration? How do we represent status for each? How do we make it easy to run a very similar workflow across all cycles, without the user entering the same options 20 times?
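For the race-condition point above, a minimal sketch of the predefined-index idea (all names are assumptions): acquisition indices are fixed once, before any per-cycle workflow starts, so each parsing job writes only to its own image subgroup; the shared well/plate metadata updates would still need a single writer or a lock.

```python
# Acquisition indices fixed up front, before any per-cycle parsing job runs:
ACQUISITIONS = {"cycle_A": 0, "cycle_B": 1, "cycle_C": 2}

def image_group_for(well: str, cycle: str) -> str:
    """Target group for one parsing job; no two jobs share a target."""
    return f"{well}/{ACQUISITIONS[cycle]}"  # e.g. "B/03/1" for cycle_B
```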
I favor a version of scenario 2, but there are still some questions to be worked out. It may come down to explicitly splitting the processing into 3 phases with separate workflows, and having a good way to then chain those workflows.