Datasets usually suffer from a variety of data quality problems, or data errors, and many error detection strategies exist to detect different kinds of them. To detect the errors in a new dataset effectively, a user has to deploy and test multiple strategies. However, evaluating each strategy on the new dataset requires tedious manual evaluation. Estimating the performance of each strategy upfront is therefore desirable, because it allows a more effective strategy selection.
In this project, we propose a new approach to estimate the performance of error detection strategies. The intuition is that error detection strategies will perform similarly on similarly dirty datasets. Therefore, we introduce the novel concept of dirtiness profiles, which make datasets comparable with respect to their dirtiness. Based on the similarity of dirtiness profiles, we estimate the expected performance of the available error detection strategies on the new dataset.
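The intuition above can be illustrated with a minimal Python sketch. It is not the REDS implementation: it assumes that a dirtiness profile can be represented as a plain numeric vector, that profiles are compared with cosine similarity, and that the estimate is a similarity-weighted average of the scores observed on the `k` most similar historical datasets. The function names, the strategy names, the profile features, and the F1 values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def estimate_performance(new_profile, historical, k=2):
    """Estimate per-strategy scores for a new dataset.

    historical: list of (profile, scores) pairs, where scores maps a strategy
    name to the score measured on that historical dataset. This sketch assumes
    every historical dataset has a score for the same set of strategies.
    """
    # Pair each historical dataset with its similarity to the new profile.
    scored = [(cosine_similarity(new_profile, profile), scores)
              for profile, scores in historical]
    # Keep the k most similar datasets.
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

    weights = [similarity for similarity, _ in neighbors]
    if sum(weights) == 0.0:
        # No neighbor is similar at all: fall back to an unweighted mean.
        weights = [1.0] * len(neighbors)
    total = sum(weights)

    strategies = neighbors[0][1].keys()
    return {
        name: sum(w * scores[name] for w, (_, scores) in zip(weights, neighbors)) / total
        for name in strategies
    }


# Hypothetical profiles (for example, error-type indicators) and F1 scores.
historical = [
    ([0.9, 0.1, 0.0], {"strategy_a": 0.80, "strategy_b": 0.30}),
    ([0.1, 0.8, 0.2], {"strategy_a": 0.20, "strategy_b": 0.70}),
    ([0.0, 0.2, 0.9], {"strategy_a": 0.10, "strategy_b": 0.40}),
]
estimates = estimate_performance([0.8, 0.2, 0.1], historical, k=2)
print(estimates)
```

A nearest-neighbor average is used here only because it is the simplest way to show how profile similarity can drive an estimate; any other regression over the profiles would fit the same intuition.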
This project is implemented on top of an abstraction layer.
This folder contains input datasets.
This folder contains the underlying data cleaning tools.
This folder contains the output results.
This file contains the implementation of the dataset class.
This file contains the implementation of the data cleaning tool class.
This file contains the implementation of the main application.
You can find more information about this project and the authors here.
You can also use the following BibTeX entry to cite this project/paper.
@inproceedings{reds2019mahdavi,
title={REDS: Estimating the performance of error detection strategies based on dirtiness profiles},
author={Mahdavi, Mohammad and Abedjan, Ziawasch},
booktitle={SSDBM},
year={2019}
}