This repository has been archived by the owner on Jun 26, 2021. It is now read-only.

Dataset Transition (v0.3.1 to v0.3.2)

Justus Schock edited this page Jun 27, 2019 · 1 revision

Dataset Guide (incl. Integration to new API)

Author: Michael Baumgartner (@mibaumgartner)

With Delira v0.3.2, a new dataset API was introduced to allow for more flexibility and to add some features. This notebook shows the differences between the old and the new API and provides some examples of the newly added features.

Overview Old API

The old dataset API was based on the assumption that the underlying structure of the data can be described as follows:

  • root
    • sample1
      • img1
      • img2
      • label
    • sample2
      • img1
      • img2
      • label
    • ...

A single sample was constructed from multiple images which are all located in the same subdirectory. The corresponding signature of the AbstractDataset was given by data_path, load_fn, img_extensions, gt_extensions. While most datasets need a load_fn to load a single sample and a data_path to the root directory, img_extensions and gt_extensions were often unused. As a consequence, a new dataset had to be created which initialised the unused variables with arbitrary values.

Overview New API

The new dataset API was refactored to a more general approach where only a data_path to the root directory and a load_fn for a single sample need to be provided. A simple loading function (load_fn) to generate random data independent from the given path might be realized as below.

import numpy as np


def load_random_data(path: str) -> dict:
    """Load random data

    Parameters
    ----------
    path : str
        path to sample (not used in this example)

    Returns
    -------
    dict
        return data inside a dict
    """
    return {
        'data': np.random.rand(3, 512, 512),
        'label': np.random.randint(0, 10),
        'path': path,
    }

When used with the provided BaseDatasets, the return value of the load function is not limited to dictionaries and might be of any type which can be added to a list with the append method.
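As a minimal sketch of that point, the following uses a plain Python list as a stand-in for a dataset's internal sample storage (an assumption for illustration, not Delira code). The hypothetical `load_random_tuple` returns a tuple instead of a dictionary, and appending it works just the same.

```python
import numpy as np


def load_random_tuple(path):
    """Hypothetical load function returning a (data, label, path) tuple."""
    data = np.random.rand(3, 64, 64)
    label = int(np.random.randint(0, 10))
    return data, label, path


# Plain list standing in for the internal sample list of a dataset
# (illustrative assumption, not the library's implementation).
samples = []
for p in ["a", "b", "c"]:
    samples.append(load_random_tuple(p))

# Tuples are unpacked by position instead of being indexed by key.
data, label, path = samples[1]
print(data.shape, label, path)
```

The trade-off is that downstream code has to rely on positions rather than on descriptive keys, which is why the dictionary form is usually easier to work with.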

New Datasets

Some basic datasets are already implemented inside Delira and should be suitable for most cases. The BaseCacheDataset keeps all samples in RAM and can therefore only be used if the whole dataset fits into memory. BaseLazyDataset loads the individual samples only when they are needed, which saves memory but might lead to slower training due to the additional loading time.
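To make the trade-off concrete, here is a simplified sketch of the two strategies. `TinyCacheDataset` and `TinyLazyDataset` are hypothetical stand-ins written for this illustration and are not the Delira classes; the sketch only counts how often the load function runs.

```python
class TinyCacheDataset:
    """Illustrative only: loads every sample once, when the dataset is built."""

    def __init__(self, paths, load_fn):
        self.data = [load_fn(p) for p in paths]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class TinyLazyDataset:
    """Illustrative only: stores the paths and loads a sample on each access."""

    def __init__(self, paths, load_fn):
        self.paths = list(paths)
        self.load_fn = load_fn

    def __getitem__(self, index):
        return self.load_fn(self.paths[index])

    def __len__(self):
        return len(self.paths)


load_calls = []


def counting_load(path):
    """Hypothetical load function that records every call."""
    load_calls.append(path)
    return {"path": path}


cached = TinyCacheDataset(range(3), counting_load)
calls_after_cache_init = len(load_calls)   # all three samples loaded up front

lazy = TinyLazyDataset(range(3), counting_load)
calls_after_lazy_init = len(load_calls)    # nothing loaded yet

_ = lazy[0]
_ = lazy[0]
calls_after_lazy_access = len(load_calls)  # each access triggers a load

_ = cached[0]
calls_after_cache_access = len(load_calls)  # served from memory, no new load
```

In this sketch the cached variant pays the full loading cost once and holds everything in memory, while the lazy variant pays it on every access.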

from delira.data_loading import BaseCacheDataset, BaseLazyDataset


# because `load_random_data` does not read anything from the given path,
# the paths can have arbitrary values in this example
paths = list(range(10))

# create cached dataset
cached_set = BaseCacheDataset(paths, load_random_data)

# create lazy dataset
lazy_set = BaseLazyDataset(paths, load_random_data)

# print cached data
print(cached_set[0].keys())

# print lazy data
print(lazy_set[0].keys())

In the above example, a list of multiple paths is used as the data_path. load_fn is called for every element of the provided list (which can be any iterable). If data_path is a single string, it is assumed to be the path to the root directory. In this case, load_fn is called for every element inside the root directory.
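The two cases can be sketched as follows. `resolve_paths` is a hypothetical helper written for this illustration; it is not part of Delira, and the sorted directory listing is an assumption made here to keep the output deterministic.

```python
import os
import tempfile


def resolve_paths(data_path):
    """Hypothetical helper: turn data_path into the list passed to load_fn."""
    if isinstance(data_path, str):
        # A single string is treated as the root directory; every entry
        # inside it becomes one path (sorting is an assumption of this sketch).
        return [os.path.join(data_path, name)
                for name in sorted(os.listdir(data_path))]
    # Any other iterable is used element by element.
    return list(data_path)


# Case 1: a root directory containing one subdirectory per sample.
with tempfile.TemporaryDirectory() as root:
    for name in ("sample1", "sample2"):
        os.mkdir(os.path.join(root, name))
    from_root = [os.path.basename(p) for p in resolve_paths(root)]

# Case 2: an explicit list of paths.
from_list = resolve_paths(["a", "b"])

print(from_root)
print(from_list)
```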

Sometimes, a single file or folder contains multiple samples. BaseExtendCacheDataset uses the extend function to add elements to the internal list, so load_fn is expected to return an iterable object in which each item represents a single sample. AbstractDataset is now iterable and can be used directly in combination with for loops:

for cs in cached_set:
    print(cs["path"])
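The difference between adding samples with append and with extend can be illustrated with plain lists. `load_two_samples` is a hypothetical load function for a file that contains two samples; the lists stand in for a dataset's internal storage.

```python
def load_two_samples(path):
    """Hypothetical load function: one file yields two samples."""
    return [{"path": path, "index": i} for i in range(2)]


paths = ["file_a", "file_b"]

appended = []
extended = []
for p in paths:
    # append stores the returned list as ONE element
    appended.append(load_two_samples(p))
    # extend stores each item of the returned list as its own element
    extended.extend(load_two_samples(p))

print(len(appended))
print(len(extended))
```

With append, indexing the storage would return a list of samples per file; with extend, every index corresponds to exactly one sample.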

New Utility Function (Integration to new API)

The behavior of the old API can be replicated with the LoadSample and LoadSampleLabel functions. LoadSample assumes that all needed images and the label (for a single sample) are located in one directory. Both functions return a dictionary containing the loaded data. sample_ext maps keys to iterables. Each iterable defines the names of the images which should be loaded from the directory. sample_fn is used to load the images, which are then stacked into a single array.

from delira.data_loading import LoadSample, LoadSampleLabel


def load_random_array(path: str):
    """Return random data

    Parameters
    ----------
    path : str
        path to image

    Returns
    -------
    np.ndarray
        loaded data
    """
    return np.random.rand(128, 128)


# define the function to load a single sample from a directory
load_fn = LoadSample(
    sample_ext={
        # load 3 data channels
        'data': ['red.png', 'green.png', 'blue.png'],
        # load a single segmentation channel
        'seg': ['seg.png']
    },
    sample_fn=load_random_array,
    # optionally: assign individual keys a datatype
    dtype={"data": "float", "seg": "uint8"},
    # optionally: normalize individual samples
    normalize=["data"])

# Note: in general the function should be called with the path of the
# directory where the images are located
sample0 = load_fn(".")

print("data shape: {}".format(sample0["data"].shape))
print("segmentation shape: {}".format(sample0["seg"].shape))
print("data type: {}".format(sample0["data"].dtype))
print("segmentation type: {}".format(sample0["seg"].dtype))
print("data min value: {}".format(sample0["data"].min()))
print("data max value: {}".format(sample0["data"].max()))

By default the range is normalized to (-1, 1), but norm_fn can be changed to achieve other normalization schemes. Some examples are included in delira.data_loading.load_utils.
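As an illustration of what such a normalization function could look like, the sketch below rescales an array linearly to a target range. `scale_to_range` is a hypothetical function written for this example; min-max scaling is an assumption about what "normalized to (-1, 1)" means here, and the functions shipped in delira.data_loading.load_utils may behave differently.

```python
import numpy as np


def scale_to_range(array, low=-1.0, high=1.0):
    """Hypothetical norm function: linear min-max scaling to [low, high]."""
    array = np.asarray(array, dtype=float)
    min_val, max_val = array.min(), array.max()
    if max_val == min_val:
        # Constant input: avoid division by zero, return the range midpoint.
        return np.full_like(array, (low + high) / 2.0)
    unit = (array - min_val) / (max_val - min_val)  # now in [0, 1]
    return unit * (high - low) + low


image = np.array([[0.0, 5.0], [10.0, 2.5]])
scaled = scale_to_range(image)
print(scaled)
```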

LoadSampleLabel takes an additional argument for the label and a function to load the label. These functions can be used in combination with the provided BaseDatasets to replicate (and extend) the old API.
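The general pattern can be sketched without Delira. The function below is a hypothetical re-implementation for illustration only: its name, its argument names (`label_ext`, `label_fn`), the `"label"` key, and the channel-first stacking are assumptions of this sketch and not the confirmed signature or behavior of LoadSampleLabel.

```python
import os

import numpy as np


def load_sample_with_label(path, sample_ext, sample_fn, label_ext, label_fn):
    """Hypothetical sketch: load stacked images plus a label from one directory."""
    sample = {}
    for key, names in sample_ext.items():
        # Load every image listed for this key and stack them along a new
        # first axis (channel-first layout is an assumption of this sketch).
        images = [sample_fn(os.path.join(path, name)) for name in names]
        sample[key] = np.stack(images)
    # The label is loaded by its own function from its own file.
    sample["label"] = label_fn(os.path.join(path, label_ext))
    return sample


def fake_image(path):
    """Hypothetical image loader that ignores the path."""
    return np.zeros((128, 128))


def fake_label(path):
    """Hypothetical label loader that ignores the path."""
    return 1


sample = load_sample_with_label(
    ".",
    sample_ext={"data": ["red.png", "green.png", "blue.png"],
                "seg": ["seg.png"]},
    sample_fn=fake_image,
    label_ext="label.txt",  # hypothetical file name
    label_fn=fake_label,
)

print(sample["data"].shape)
print(sample["seg"].shape)
print(sample["label"])
```

A callable of this shape takes a single directory path and returns one sample, so it fits the load_fn role described for the new datasets.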

If there are any questions left, feel free to contact us. The best way to do so is via our Slack community or by opening an issue in this repo.
