Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce import-ome-zarr task #521

Closed
tcompa opened this issue Sep 15, 2023 · 3 comments · Fixed by #557
Closed

Introduce import-ome-zarr task #521

tcompa opened this issue Sep 15, 2023 · 3 comments · Fixed by #557
Labels
flexibility Support more workflow-execution use cases High Priority Current Priorities & Blocking Issues

Comments

@tcompa
Copy link
Collaborator

tcompa commented Sep 15, 2023

This non-parallel task (or parallel at the plate level, provided we implement fractal-analytics-platform/fractal-server#802) would take a single ome-zarr plate.zarr as an input.

Its goals (sorted by increasing complexity) are to:

  1. Generate the appropriate component lists (plate/image/well), and include them in its metadata_update output - to be then parsed by fractal-server and placed in Dataset.meta (soon to become something like Dataset.components - TBD).
    NOTE: This functionality requires browsing through groups according to the OME-NGFF structure, and constructing the lists based on the zarr attributes. Ideally we should never use glob or anything similar, but only proceed through zarr groups/subgroups/attributes.
  2. Generate a ROI table (with a single entry, analogous to the current well_ROI_table) for each ome-ngff image in plate.zarr, by combining the pixel sizes and array size into a physical-unit ROI.
    NOTE 1: This functionality needs to happen per-image (even if it belongs to a higher-level task).
    NOTE 2: The well_ROI_table name would not be fully appropriate, because it assumes the well=image correspondence (which is true for Fractal, but not in general). Maybe we can find a more generic name (image_ROI_table, full_ROI_table, main_ROI_table, ...).
  3. Generate a second ROI table, which splits the main ROI into several others ROIs. The default here could be to generate a grid of ROIs (possibly with user-provided size, e.g. 7x9). The goal is e.g. to allow 3D processing of smaller dataset regions, e.g. in the cellpose task.

Side-note: we currently have a fixture like:

@pytest.fixture(scope="session")
def zenodo_zarr_metadata(testdata_path):
    metadata_3D = {
        "plate": ["plate.zarr"],
        "well": ["plate.zarr/B/03"],
        "image": ["plate.zarr/B/03/0/"],
        "num_levels": 6,
        "coarsening_xy": 2,
        "original_paths": [str(testdata_path / "10_5281_zenodo_7059515/")],
        "image_extension": "png",
    }

    metadata_2D = ...

for our tests. Together with #212, the current issue should aim at making it fully obsolete, since the relevant information would be gathered elsewhere.

@jluethi
Copy link
Collaborator

jluethi commented Sep 16, 2023

Great summary of the new copy task!

or parallel at the plate level

True and running on the plate level is sensible. But we couldn't use plate level in the current server implementation (because it would look for the plate key in metadata/components). Thus, this may run without a parallelization level and e.g. take the name of the Zarr file (that is in the input_paths folder) as an argument).

NOTE 1: This functionality needs to happen per-image (even if it belongs to a higher-level task).

True. But I think everything can be done based on zarr metadata (both the pixel sizes as well as the dimensions in pixels are contained in the metadata). Thus, I would say a task that runs on the plate level and just loops over all images is fine.

NOTE 2: The well_ROI_table name would not be fully appropriate, because it assumes the well=image correspondence (which is true for Fractal, but not in general). Maybe we can find a more generic name (image_ROI_table, full_ROI_table, main_ROI_table, ...).

Agreed! Let's go with image_ROI_table.

@jluethi jluethi added the Priority Important, but not the highest priority label Sep 27, 2023
@tcompa
Copy link
Collaborator Author

tcompa commented Oct 9, 2023

When acting on an existing OME-Zarr, is the import-ome-zarr task expected to modify it? That is, is it OK to add ROI tables to an existing OME-Zarr?

@jluethi
Copy link
Collaborator

jluethi commented Oct 9, 2023

it OK to add ROI tables to an existing OME-Zarr?

I would say yes, at least for that first implementation. ROI table are core to the further processing in Fractal, so there needs to be a good way to add them.

Should this be optional?
Maybe, but default to on => create OME-Zarr tables


Thinking through some scenarios:
In what scenarios would this not work?
If there is a read-only OME-Zarr somewhere. But then: Where would the Fractal workflow save any output (as it needs to go into the OME-Zarr)
=> there may be workflows that require some type of Copy OME-Zarr (or subset of it) before it's on storage that Fractal can write to.

In that case:
Import OME-Zarr without writing ROI tables. Then copy OME-Zarr (subset?).
Then add ROI tables (e.g. through import OME-Zarr again, but this time importing the newly created copy?)
Or have the copy task gain that functionality?


Are there other reasons to modify an OME-Zarr during import?

  1. Modifying the image data => that should be the job of tasks afterwards
  2. Modifying the label data => that should be the job of tasks afterwards
  3. Modifying metadata: Maybe. But we could eventually do e.g. a Pydantic NGFF model check on the OME-Zarr we're importing. And even further down the road, maybe there's a set of things where Fractal pushes beyond the current spec => import would add that relevant metadata if it's required.

Modifying metadata may actually be something we'd want to allow, e.g. to add the wavelength_id parameter to the Omero tables. Not sure whether we'll work well with OME-Zarrs without that one.


My conclusion:

  1. The "Create OME-Zarr ROI table" functionality should probably be a library function that could be used in different places
  2. The Import OME-Zarr task by default should create OME-Zarr ROI tables
  3. There may be additional metadata modifications that could become necessary in the future.

@tcompa tcompa linked a pull request Oct 9, 2023 that will close this issue
1 task
@tcompa tcompa added High Priority Current Priorities & Blocking Issues and removed Priority Important, but not the highest priority labels Oct 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flexibility Support more workflow-execution use cases High Priority Current Priorities & Blocking Issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants