Provenance Tags

William Silversmith edited this page Jun 29, 2017 · 6 revisions

Imagine working in a chemistry lab without labels on your test tubes or a lab notebook detailing each experiment. Crazy, right? Unfortunately, as our dataset count increases and our experiments proliferate, that is increasingly the case for our ultrastructural imaging sets.

Fortunately, we now have a method for dealing with this problem: provenance files. Similar to a Neuroglancer info file, which describes the data format and bounding box of all the mip levels of a data layer, a provenance file describes the layer's purpose, who owns it, and which manipulations have been applied to it and by whom. Simply place the provenance file in the same directory as the data layer.

Here's an example provenance tag associated with gs://neuroglancer/pinky40_v11/image:

```json
{
    "sources": [],
    "owners": ["tmacrina@princeton.edu", "sseung@princeton.edu", "ws9@princeton.edu", "jingpeng@princeton.edu"],
    "processing": [
        {"method": "ingestion via ChunkFlow.jl", "by": "jingpeng@princeton.edu"},
        {"method": "downsample w/ averaging", "by": "ws9@princeton.edu"}
    ],
    "description": "Channel images of pinky40, aligned by Tommy Macrina (tmacrina@princeton.edu) and Dodam Ih (dih@princeton.edu)"
}
```

Here `sources` is empty because this is raw imagery; if the layer were e.g. a segmentation, it would refer to the source image layer.
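The tag above can be generated programmatically. A minimal sketch, assuming the file is named `provenance` and lives alongside the layer (the filename is an assumption; the wiki only specifies the directory):

```python
import json
import os

def write_provenance(layer_dir, sources, owners, processing, description):
    """Serialize a provenance tag as JSON next to the data layer.

    The filename "provenance" is an assumed convention, not one
    confirmed by this wiki page.
    """
    tag = {
        "sources": sources,          # e.g. source image layers for a segmentation
        "owners": owners,            # email addresses of responsible parties
        "processing": processing,    # list of {"method": ..., "by": ...} steps
        "description": description,  # free-text summary of the layer
    }
    path = os.path.join(layer_dir, "provenance")
    with open(path, "w") as f:
        json.dump(tag, f, indent=4)
    return path
```

Because the file is plain JSON in the layer directory, it travels with the data wherever the layer is copied.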

From time to time, we'll run a script, neuroglancer/pipeline/scripts/validate_provenance.py, that scans all of our datasets and flags missing provenance files and invalid formats. Once we have enough datasets labeled, we'll run the script via a cron job and report failures to Slack.
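The core check such a script performs might look like the sketch below. This is a hedged illustration, not the actual contents of validate_provenance.py; the required fields and expected types are assumptions drawn from the example tag above:

```python
import json

# Assumed schema, inferred from the example provenance tag.
REQUIRED_FIELDS = {
    "sources": list,
    "owners": list,
    "processing": list,
    "description": str,
}

def validate_provenance(text):
    """Return a list of problems found in a provenance file's contents.

    An empty list means the file parsed and matched the assumed schema.
    """
    try:
        tag = json.loads(text)
    except ValueError as err:
        return ["invalid JSON: %s" % err]

    problems = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in tag:
            problems.append("missing field: " + field)
        elif not isinstance(tag[field], ftype):
            problems.append("wrong type for field: " + field)
    return problems
```

A cron wrapper would then run this over every layer directory and post any non-empty problem lists to Slack.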

Why not add these fields to an info file?

There are two reasons. First, the info file drives many processes, so it's likely to get mangled by a manual or automatic process (and then you'd lose all your carefully marked-up metadata). Second, it's simpler to see by visual inspection which datasets haven't been labeled yet.

What about dataset level tags? These are just data layers.

You're a sharp cookie! We have a format defined for the dataset level, but we're still getting our software tools together for reading and writing it. The fields we envision for the dataset level are the following: "dataset_name", "dataset_description", "organism", "imaged_date", "imaged_by", and "owners".
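Under those envisioned fields, a dataset-level provenance file might look like this sketch. All values below are hypothetical placeholders, since the format's tooling isn't finished:

```json
{
    "dataset_name": "example_dataset",
    "dataset_description": "Placeholder description of the imaged volume",
    "organism": "example organism",
    "imaged_date": "YYYY-MM-DD",
    "imaged_by": "imager@example.edu",
    "owners": ["owner@example.edu"]
}
```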