Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: How do we refer to channels? #211

Closed
jluethi opened this issue Nov 17, 2022 · 8 comments · Fixed by #239
Closed

Refactor: How do we refer to channels? #211

jluethi opened this issue Nov 17, 2022 · 8 comments · Fixed by #239
Labels
High Priority Current Priorities & Blocking Issues

Comments

@jluethi
Copy link
Collaborator

jluethi commented Nov 17, 2022

We currently refer to channels via index or via the string we use in parsing (i.e. A01_C01). There are two issues that aren't covered well by this:

  1. Multiplexing: We can have A01_C01 in each cycle => we need further specification to know what we are processing
  2. Parsing different microscopes: Other microscopes like e.g. the MD have a totally different file name structure, don't contain e.g. an A01_C01.

Thus, the broader question: How do we refer to a given channel in a way that works for multiplexing as well?
Channels have the following so far: An identifier string (e.g. A01_C01), an index (e.g. 0, only within an OME-Zarr image) and a name (e.g. DAPI).
There is additional metadata we currently don't have, but which may be important in some use-cases: acquisition number (hard-coded to 0 atm) and some microscope metadata (e.g. wavelength, laser used and such)

The identifier string isn't really relevant to the users, but we could get them used to it. For most use-cases, e.g. segmentation or napari workflows, users would probably related to the channel by its name. For illumination correction, the wavelength is the most important criteria. For registration, it's typically current cycle vs. cycle 0 on a specific wavelength or identifier.

Conclusion: We should have different options to refer to channels.
But we can start with 1-2 for the moment, as long as they are general enough to handle multiplexing. Selection by name is reasonable to me (if we make them unique, see below). Alternatively, selection by channel identifier (A01_C01) and acquisition number is also unique.
=> Can we define a channel selection function that can take different identifiers and as long as they can be mapped to a unique channel?

Question:
Should we enforce channel names to be unique? Would simplify selection of channels by name only and be useful when we make measurements on channel names (because we could use e.g. the channel name as a suffix to identify which intensity was measured) => I'd argue for unique channel names

@jluethi
Copy link
Collaborator Author

jluethi commented Nov 25, 2022

We should enforce unique channel names during parsing (e.g. have DAPI_1, DAPI_2, DAPI_3, not just 3x DAPI). Then we can use channel names as a unique identifier. Let's expand on it here: #231

Ways of addressing channels

Then, there are 3 valid ways to specify a channel:

  1. Directly by its index in the image (everything eventually maps back to this)
  2. By its wavelength identifier and, in multiplexing, its acquisition (optional, our default experiments don't have an acquisition metadata, but then the wavelength needs to be unique [this is enforced by the way we parse the data]).
  3. By its channel name: the label in the omero metadata

On 2: let's use the A01_C01 as a wavelength identifier for Yokogawa for the time being. If we add more microscopes, maybe we generalize this to more to a string that is more relevant to the user, but it should always be some identifier string


Where is the channel information stored?

We may want to have a cached place within Fractal (e.g. within the dataset or the metadata) to cache this information for easy access. But the ground truth should be in the OME-Zarr file. If it's not necessary to cache the information for Fractal, even better! :) [also, if we see the data within metadata/dataset as a cache of what's in the OME-Zarr file, it makes it easier to accept an external OME-Zarr file, we may just have to set/refresh some caches on "import"]
The whole omero metadata we are currently using has been labeled "transitional", but it's not clear yet where it will go. For simplicity, I suggest we introduce a new key-value pair into the "omero" > "channels" list and treat this list as the place where we find relevant channel info. We can then start the conversation with the OME-NGFF community on where such channel info should go. It will be easier when we have everything we want in one place (=> all in the "omero" > "channels" list that concerns channel selection).

  1. Is natively stored as part of the OME-Zarr file
  2. Let's store it as an additional key-value pair name wavelength_id in the omero metadata (see below)
  3. Already part of the omero metadata (see below)
Example omero metadata
"omero": {
        "channels": [
            {
                "active": true,
                "coefficient": 1,
                "color": "00FFFF",
                "family": "linear",
                "inverted": false,
                "label": "DAPI",
                "wavelength_id": "A01_C01",
                "window": {
                    "end": 700,
                    "max": 65535,
                    "min": 0,
                    "start": 50
                }
            },
            {
                "active": true,
                "coefficient": 1,
                "color": "FF00FF",
                "family": "linear",
                "inverted": false,
                "label": "nanog",
                "wavelength_id": "A01_C02",
                "window": {
                    "end": 200,
                    "max": 65535,
                    "min": 0,
                    "start": 20
                }
            },
            {
                "active": true,
                "coefficient": 1,
                "color": "FFFF00",
                "family": "linear",
                "inverted": false,
                "label": "Lamin B1",
                "wavelength_id": "A02_C03",
                "window": {
                    "end": 1500,
                    "max": 65535,
                    "min": 0,
                    "start": 50
                }
            }
        ],
        "id": 1,
        "name": "TBD",
        "version": "0.4"
    }

I tested this new omero metadata on the 2x2 example manually and the napari viewer doesn't have an issue with us adding such additional key-value pairs. Thus, the omero metadata can be the centralized place where we put all the metadata for each image.


Edge case handling

Because we use the transitional omero metadata, which is optional, we don't know whether someone who brings their own OME-Zarr file will have this metadata. We can enforce it within Fractal for our standard workflows, but we should be able to handle the case where they are not available and fail gracefully otherwise.

Specifically: Option 1 is always possible, options 2 & 3 are dependent on what metadata is available.

Also, this check should allow us to catch non-unique channel names or missing channel names

If we are able to do input inference (see #200), then this would become very relevant. Even if we only do it at runtime (i.e. each task checks whether the label selection its getting is valid for the OME-Zarr file it should process before it starts running), this would be useful and allow us to gracefully catch channel selection that is either not available or non-unique.

@jluethi jluethi added the High Priority Current Priorities & Blocking Issues label Nov 25, 2022
@tcompa
Copy link
Collaborator

tcompa commented Nov 25, 2022

Preliminary to-do list for the next PR(s)

OME-zarr-creation tasks (multiplexing or not)

  • Upon creating the OMERO metadata for each image, always add all channels to the OMERO metadata.
  • Always include, for each channel, a wavelength_id attribute (e.g. A01_C01).
  • Always include, for each channel, a label attribute. If it is provided to the task, use it. If not, default to something related to wavelength_id (e.g A01_C01 or 0_A01_C01).
  • Explicitly verify that label attributes are globally (AKA per-well) unique, fail otherwise.
  • EDIT: Explicitly verify that wavelength_id attributes are unique for each image, fail otherwise.
  • Add tests

Helper functions:

  • Create a helper function that (in its first version)
    • Yakes either label or wavelength_id as an input (but not both) and a list of channels
    • Extracts the tuple (label, wavelength_id, index) for that channel.
  • Create a helper function that (in its first version) returns the list of channels for a given instance of OMERO attributes.
  • EDIT: Create a helper function that (in its first version)
    • Takes the path of a well,
    • Identifies the set of OMERO metadata attributes corresponding to each image
    • Check that the well only includes unique channel labels.
  • NOT INCLUDED IN THE FIRST VERSION: Create a helper function that (in its first version)
    • takes the path of a well,
    • identifies the set of OMERO metadata attributes corresponding to each image
    • filters all the available channels by wavelength_id,
    • returns a list of channels made of items like (image, label, wavelength_id, index)

Tasks:

  • For each task that loops over channels (e.g. illumination_correction), use the helper functions to correctly assign channel parameters to the right array index.
  • NOT INCLUDED IN THE FIRST VERSION For each task that only acts on a given image (e.g. cellpose_segmentation or the current version of napari_workflows) and need to be channel-aware:
    • Expose at least the wavelength_id (EDIT: this is in-place, but we should rename the argument).
    • As for the previous point: provide a meaningful name
    • Also expose the label field, when appropriate, and use helper function to get the matching channel.

Broader:

@jluethi, anything I missed from our discussion?

@jluethi
Copy link
Collaborator Author

jluethi commented Nov 25, 2022

Looks good!

Upon creating the OMERO metadata for each image, always add all channels to the OMERO metadata.

Aren't we already doing this? Also, we just add the channels contained in that image, not channels from other acquisitions, right?

Less urgent: Create a second helper function that (in its first version)

That's the one to load channels by name in the multiplexing case, correct?

For each task that only acts on a given image (e.g. cellpose_segmentation or the current version of napari_workflows) expose one or both of the label and wavelength_id arguments and use the helper functions to map this to a given array index.

If we build the "run-per-image" functionality first, then using label will not work in the multiplexing case (because labels are unique per plate, so a label will not be available in all the cycles. But it's a good thing to select in the non-multiplexed case. We can think of making this fail for multiplexing with a relevant error message while keeping the general functionality.

tcompa added a commit that referenced this issue Dec 1, 2022
@tcompa tcompa linked a pull request Dec 1, 2022 that will close this issue
@tcompa
Copy link
Collaborator

tcompa commented Dec 1, 2022

I updated the check-list above with the current status of #239

@tcompa
Copy link
Collaborator

tcompa commented Dec 1, 2022

Still missing:

  • Explicitly verify that wavelength_id attributes are unique for each image, fail otherwise.

@jluethi
Copy link
Collaborator Author

jluethi commented Dec 1, 2022

Explicitly verify that wavelength_id attributes are unique for each image, fail otherwise.

Let's discuss tomorrow, not sure we should have this as such a hard condition

@jluethi
Copy link
Collaborator Author

jluethi commented Dec 2, 2022

Explicitly verify that wavelength_id attributes are unique for each image, fail otherwise.

I agree now on this, just the between images in a well case to be discussed :)

@tcompa
Copy link
Collaborator

tcompa commented Dec 2, 2022

TO-DOs:

  • At the beginning of create-ome-zarr tasks, check validity of allowed_channels (that is, uniqueness of wavelength_id).
  • In cellpose_segmentation expose both wavelength_id and channel_label (for what is now labeling_channel).
  • Do not allow both of them simultaneously.
  • Rename arguments for cellpose task.
  • Fix docstring parameters.
  • In cellpose_segmentation, capture failures for channel not found in image. This enables usecase X below.
  • In napari_workflows_wrapper (which is a per-image task) expose both wavelength_id and channel_label (this is part of the input_specs definition).
  • Fix docstring with examples of input_specs / output_specs.

  • In the future, support Usecase X with some other solution which does not create 39 useless jobs.
  • In the future, include napari-workflows wrapper task at the well level, with its own logic to select channels (from the same cycle or from different cycles).

Usecase X:

  • Run cellpose task with channel_label="DAPI_1" in a multiplexing well with 40 cycles.
  • 40 jobs are created (one per image).
  • 39 jobs fail gracefully, doing nothing.
  • 1 job acts on DAPI channel of cycle 0.

Repository owner moved this from TODO to Done in Fractal Project Management Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High Priority Current Priorities & Blocking Issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants