Dataset Transition (v0.3.1 to v0.3.2)
Author: Michael Baumgartner (@mibaumgartner)
With Delira v0.3.2, a new dataset API was introduced to allow for more flexibility and to add some features. This notebook shows the differences between the old and the new API and provides some examples for the newly added features.
The old dataset API was based on the assumption that the underlying structure of the data can be described as follows:
- root
    - sample1
        - img1
        - img2
        - label
    - sample2
        - img1
        - img2
        - label
    - ...
A single sample was constructed from multiple images, all located in the same subdirectory. The corresponding signature of `AbstractDataset` was `data_path, load_fn, img_extensions, gt_extensions`.
While most datasets need a `load_fn` to load a single sample and a `data_path` to the root directory, `img_extensions` and `gt_extensions` were often unused. As a consequence, a new dataset had to be created which initialised the unused variables with arbitrary values.
The new dataset API was refactored to a more general approach where only a `data_path` to the root directory and a `load_fn` for a single sample need to be provided.
A simple loading function (`load_fn`) that generates random data independent of the given path might be implemented as below.
import numpy as np


def load_random_data(path: str) -> dict:
    """Load random data

    Parameters
    ----------
    path : str
        path to sample (not used in this example)

    Returns
    -------
    dict
        return data inside a dict
    """
    return {
        'data': np.random.rand(3, 512, 512),
        'label': np.random.randint(0, 10),
        'path': path,
    }
When used with the provided BaseDatasets, the return value of the load function is not limited to dictionaries and may be of any type that can be added to a list with the `append` method.
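To illustrate the point about return types, here is a minimal sketch in which a plain list stands in for the dataset's internal storage. The function `load_as_tuple` and its placeholder values are hypothetical and not part of Delira; the only assumption taken from the text is that each return value is stored via `append`.

```python
def load_as_tuple(path):
    """Hypothetical load function returning a (data, label, path) tuple."""
    data = [0.0, 1.0, 2.0]  # placeholder standing in for image data
    label = 1               # placeholder label
    return data, label, path


# A plain list stands in for the internal storage of an append-based dataset:
# every return value is stored unchanged, whatever its type.
samples = []
for p in ["a", "b", "c"]:
    samples.append(load_as_tuple(p))

data, label, path = samples[1]
print(label, path)
```

Each stored item is exactly what the load function returned, so downstream code has to know the chosen layout (here a tuple of three elements).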
Some basic datasets are already implemented in Delira and should be suitable for most cases. `BaseCacheDataset` keeps all samples in RAM and can therefore only be used if the whole dataset fits into memory. `BaseLazyDataset` loads the individual samples on demand when they are needed, which might slow down training due to the additional loading time.
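The trade-off between the two strategies can be sketched without Delira. The classes `EagerSketch` and `LazySketch` below are hypothetical stand-ins, not the library's implementation; they only show when the load function gets called in each case.

```python
class EagerSketch:
    """Stand-in for a cached dataset: loads every sample at construction."""

    def __init__(self, paths, load_fn):
        self.data = [load_fn(p) for p in paths]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class LazySketch:
    """Stand-in for a lazy dataset: stores paths and loads on each access."""

    def __init__(self, paths, load_fn):
        self.paths = list(paths)
        self.load_fn = load_fn

    def __getitem__(self, index):
        return self.load_fn(self.paths[index])

    def __len__(self):
        return len(self.paths)


calls = []


def counting_load(path):
    """Hypothetical load function that records every call."""
    calls.append(path)
    return {"path": path}


eager = EagerSketch(range(5), counting_load)
calls_after_eager_init = len(calls)   # all samples loaded up front

calls.clear()
lazy = LazySketch(range(5), counting_load)
calls_after_lazy_init = len(calls)    # nothing loaded yet
_ = lazy[2]
_ = lazy[2]
calls_after_lazy_access = len(calls)  # the same sample was loaded twice
```

In this sketch the eager variant pays the full loading cost once and holds everything in memory, while the lazy variant holds only the paths and pays the loading cost on every access.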
from delira.data_loading import BaseCacheDataset, BaseLazyDataset

# because `load_random_data` does not use the path argument, the paths can
# have arbitrary values in this example
paths = list(range(10))

# create cached dataset
cached_set = BaseCacheDataset(paths, load_random_data)

# create lazy dataset
lazy_set = BaseLazyDataset(paths, load_random_data)

# print the keys of a cached sample
print(cached_set[0].keys())

# print the keys of a lazily loaded sample
print(lazy_set[0].keys())
In the above example, a list of multiple paths is used as the `data_path`. `load_fn` is called for every element inside the provided list (which can be any iterator). If `data_path` is a single string, it is assumed to be the path to the root directory. In this case, `load_fn` is called for every element inside the root directory.
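The two ways of passing `data_path` can be sketched as follows. The helper `resolve_sample_paths` is hypothetical and only mirrors the behavior described above; sorting the directory entries is an assumption made here for reproducibility, not a statement about Delira.

```python
import os
import tempfile


def resolve_sample_paths(data_path):
    """Hypothetical helper: turn `data_path` into a list of sample paths.

    A single string is treated as a root directory whose entries are the
    samples; anything else is treated as an iterable of paths.
    """
    if isinstance(data_path, str):
        names = sorted(os.listdir(data_path))  # sorting is an assumption
        return [os.path.join(data_path, name) for name in names]
    return list(data_path)


# Case 1: a root directory containing one subdirectory per sample
with tempfile.TemporaryDirectory() as root:
    for name in ("sample1", "sample2"):
        os.mkdir(os.path.join(root, name))
    from_root = [os.path.basename(p) for p in resolve_sample_paths(root)]

# Case 2: an explicit list of paths is used as given
from_list = resolve_sample_paths(["x", "y"])

print(from_root, from_list)
```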
Sometimes, a single file or folder contains multiple samples. `BaseExtendCacheDataset` uses the `extend` method to add elements to the internal list. It is therefore assumed that `load_fn` returns an iterable object in which each item represents a single sample.
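The difference between `append` and `extend` is the core of this variant. The sketch below uses plain lists and a hypothetical `load_multiple` function that returns three samples per file; it does not use Delira itself.

```python
def load_multiple(path):
    """Hypothetical load function: one file yields three samples."""
    return [{"path": path, "index": i} for i in range(3)]


appended, extended = [], []
for p in ["file_a", "file_b"]:
    appended.append(load_multiple(p))  # one list entry per file
    extended.extend(load_multiple(p))  # one list entry per sample

print(len(appended), len(extended))
```

With `append`, the list ends up with one entry per file, each holding a nested list of samples. With `extend`, the samples are flattened into the list so that each index addresses a single sample.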
`AbstractDataset` is now iterable and can be used directly in `for` loops:
for cs in cached_set:
    print(cs["path"])
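One common way to make an indexable container iterable is to define `__iter__` in terms of `__len__` and `__getitem__`. The class `IterableSketch` below is a hypothetical illustration of that pattern; whether `AbstractDataset` is implemented this way is an assumption.

```python
class IterableSketch:
    """Hypothetical indexable dataset that also supports iteration."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        return self._samples[index]

    def __iter__(self):
        # walk over all valid indices and yield one sample at a time
        for index in range(len(self)):
            yield self[index]


dataset = IterableSketch({"path": p} for p in range(3))
paths_seen = [sample["path"] for sample in dataset]
print(paths_seen)
```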
The behavior of the old API can be replicated with the `LoadSample` and `LoadSampleLabel` functions. `LoadSample` assumes that all needed images and the label (for a single sample) are located in one directory. Both functions return a dictionary containing the loaded data.
`sample_ext` maps keys to iterables. Each iterable defines the names of the images which should be loaded from the directory. `sample_fn` is used to load the images, which are then stacked into a single array.
from delira.data_loading import LoadSample, LoadSampleLabel


def load_random_array(path: str):
    """Return random data

    Parameters
    ----------
    path : str
        path to image

    Returns
    -------
    np.ndarray
        loaded data
    """
    return np.random.rand(128, 128)


# define the function to load a single sample from a directory
load_fn = LoadSample(
    sample_ext={
        # load 3 data channels
        'data': ['red.png', 'green.png', 'blue.png'],
        # load a single segmentation channel
        'seg': ['seg.png']
    },
    sample_fn=load_random_array,
    # optionally: assign individual keys a datatype
    dtype={"data": "float", "seg": "uint8"},
    # optionally: normalize individual samples
    normalize=["data"])

# Note: in general the function should be called with the path of the
# directory where the images are located
sample0 = load_fn(".")

print("data shape: {}".format(sample0["data"].shape))
print("segmentation shape: {}".format(sample0["seg"].shape))
print("data type: {}".format(sample0["data"].dtype))
print("segmentation type: {}".format(sample0["seg"].dtype))
print("data min value: {}".format(sample0["data"].min()))
print("data max value: {}".format(sample0["data"].max()))
By default, the range is normalized to (-1, 1), but `norm_fn` can be changed to achieve other normalization schemes. Some examples are included in `delira.data_loading.load_utils`.
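The stacking and normalization steps can be sketched with NumPy alone. The helpers `stack_channels`, `scale_to_minus_one_one` and `fake_image` are hypothetical and not taken from Delira; stacking along a new first axis and min-max scaling are assumptions about one plausible way to obtain a multi-channel array in the range (-1, 1).

```python
import os

import numpy as np


def stack_channels(directory, file_names, sample_fn):
    """Load each named image and stack the results along a new first axis."""
    images = [sample_fn(os.path.join(directory, name)) for name in file_names]
    return np.stack(images, axis=0)


def scale_to_minus_one_one(data):
    """Min-max scale an array so that its values span [-1, 1]."""
    data = np.asarray(data, dtype=float)
    span = data.max() - data.min()
    if span == 0:
        # constant input has no range to scale; return zeros instead
        return np.zeros_like(data)
    return 2.0 * (data - data.min()) / span - 1.0


rng = np.random.default_rng(0)


def fake_image(path):
    """Hypothetical sample_fn: ignores the path, returns a random image."""
    return rng.random((8, 8)) * 255


stacked = stack_channels(".", ["red.png", "green.png", "blue.png"], fake_image)
normalized = scale_to_minus_one_one(stacked)
print(stacked.shape, normalized.min(), normalized.max())
```

A function with the same call shape as `scale_to_minus_one_one` (array in, array out) is the kind of callable one would expect to pass as `norm_fn`, although the exact contract should be checked against `delira.data_loading.load_utils`.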
`LoadSampleLabel` takes an additional argument for the label and a function to load a label. These functions can be used in combination with the provided BaseDatasets to replicate (and extend) the old API.
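Conceptually, loading a sample together with its label amounts to wrapping a sample loader with one extra step. The sketch below is a hypothetical stand-in and does not reproduce the signature of `LoadSampleLabel`; the names `make_sample_label_loader`, `fake_sample_loader`, `fake_label_fn` and the file name `label.txt` are all invented for illustration.

```python
import os


def make_sample_label_loader(sample_loader, label_name, label_fn):
    """Hypothetical wrapper: add a 'label' entry to a loaded sample dict."""

    def load(directory):
        sample = dict(sample_loader(directory))  # copy the loaded sample
        sample["label"] = label_fn(os.path.join(directory, label_name))
        return sample

    return load


label_paths_seen = []


def fake_sample_loader(directory):
    """Hypothetical sample loader returning placeholder image data."""
    return {"data": [[0.0, 1.0], [1.0, 0.0]]}


def fake_label_fn(path):
    """Hypothetical label loader: records the path, returns a fixed class."""
    label_paths_seen.append(path)
    return 1


load_with_label = make_sample_label_loader(
    fake_sample_loader, "label.txt", fake_label_fn)
sample = load_with_label("sample1")
print(sorted(sample.keys()), sample["label"])
```

The resulting callable takes a directory and returns a dictionary, so it has the shape of a `load_fn` that a dataset could call once per sample directory.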
If there are any questions left, feel free to contact us. The best way to do so is via our Slack community or by opening an issue in this repo.