docs update
György Kovács committed Oct 17, 2023
1 parent f748246 commit e063bb4
Showing 11 changed files with 72 additions and 80 deletions.
4 changes: 2 additions & 2 deletions README.rst
@@ -218,8 +218,8 @@ A dataset and a folding constitute an *evaluation*, and many of the test functio
"folding": {"n_folds": 5, "n_repeats": 1,
"strategy": "stratified_sklearn"}}
Checking the consistency of performance scores
----------------------------------------------
Testing the consistency of performance scores
---------------------------------------------

Numerous experimental setups are supported by the package. In this section we go through them one by one giving some examples of possible use cases.

28 changes: 15 additions & 13 deletions docs/00_introduction.rst
@@ -1,21 +1,19 @@
Introduction
=============

In a nutshell
-------------
The purpose
-----------

One comes across some performance scores of binary classification reported for a dataset and finds them suspicious (typo, unorthodox evaluation methodology, etc.). With the tools implemented in the ``mlscorecheck`` package one can test if the scores are consistent with each other and the assumptions on the experimental setup.
Performance scores for binary classification are reported on a dataset and look suspicious (exceptionally high scores possibly due to typo, uncommon evaluation methodology, data leakage in preparation, incorrect use of statistics, etc.). With the tools implemented in the package ``mlscorecheck``, one can test if the reported performance scores are consistent with each other and the assumptions on the experimental setup up to the numerical uncertainty due to rounding/truncation/ceiling.

The consistency tests are numerical and **not** statistical: if inconsistencies are identified, it means that either the assumptions on the evaluation protocol or the reported scores are incorrect.

In more detail
--------------

Binary classification is one of the most basic tasks in machine learning. The evaluation of the performance of binary classification techniques (whether it is original theoretical development or application to a specific field) is driven by performance scores (https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). Despite performance scores provide the basis to estimate the value of reasearch, the reported scores usually suffer from methodological problems, typos and the insufficient description of experimental settings, contributing to the replication crisis (https://en.wikipedia.org/wiki/Replication_crisis) and the skewing of entire fields ([RV]_, [EHG]_), as the reported but usually incomparable scores are usually rank approaches and ideas.
Binary classification is one of the most fundamental tasks in machine learning. The evaluation of the performance of binary classification techniques, whether for original theoretical advancements or applications in specific fields, relies heavily on performance scores (https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). Although reported performance scores are employed as primary indicators of research value, they often suffer from methodological problems, typos, and insufficient descriptions of experimental settings. These issues contribute to the replication crisis (https://en.wikipedia.org/wiki/Replication_crisis) and ultimately the skewing of entire fields of research ([RV]_, [EHG]_). Even systematic reviews can suffer from using incomparable performance scores for ranking research papers [RV]_.

Most of the performance scores are functions of the values of the binary confusion matrix (with four entries: true positives (``tp``), true negatives (``tn``), false positives (``fp``), false negatives (``fn``)). Consequently, when multiple performance scores are reported for some experiment, they cannot take arbitrary values independently of each other.
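For example, with the class totals ``p`` and ``n`` fixed, accuracy is already determined by sensitivity and specificity, so the reported values must satisfy an exact relation (a minimal illustration with made-up counts, independent of the package API):

.. code-block:: Python

    # a hypothetical confusion matrix for p = 10 positives and n = 20 negatives
    tp, fn = 8, 2    # tp + fn = p
    tn, fp = 17, 3   # tn + fp = n

    acc = (tp + tn) / (tp + tn + fp + fn)   # 0.8333...
    sens = tp / (tp + fn)                   # 0.8
    spec = tn / (tn + fp)                   # 0.85

    # accuracy is a weighted average of sensitivity and specificity,
    # hence the three scores cannot vary independently of each other
    assert abs(acc - (sens * 10 + spec * 20) / (10 + 20)) < 1e-10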

Depending on the experimental setup, one can develop techniques to check if the performance scores reported for a dataset are consistent. This package implements such consistency tests for some common scenarios. We highlight that the developed tests cannot guarantee that the scores are surely calculated by some standards. However, if the tests fail and inconsistencies are detected, it means that the scores are not calculated by the presumed protocols with certainty. In this sense, the specificity of the test is 1.0, the inconsistencies being detected are inevitable.
The majority of performance scores are calculated from the binary confusion matrix, or from multiple confusion matrices aggregated across folds and/or datasets. For many commonly used experimental setups one can develop numerical techniques to test if there exists any confusion matrix (or matrices) compatible with the experiment and leading to the reported performance scores. This package implements such consistency tests for some common scenarios. We highlight that the developed tests cannot guarantee that the scores were indeed calculated according to some standard or presumed evaluation protocol. However, *if the tests fail and inconsistencies are detected, it is certain that the scores were not calculated by the presumed protocol*. In this sense, the specificity of the test is 1.0: any inconsistency that is detected is a real one.
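The spirit of such a test can be illustrated with a brute-force sketch for the simplest setup, a single testset with known ``p`` and ``n``: one checks whether any confusion matrix reproduces all reported scores within the numerical uncertainty. The counts, scores and uncertainty below are made-up values, and the exhaustive enumeration is for illustration only, not the technique implemented in the package:

.. code-block:: Python

    def consistent(p, n, scores, eps):
        """Check whether any confusion matrix over p positives and n negatives
        reproduces the reported acc/sens/spec scores within eps."""
        for tp in range(p + 1):
            for tn in range(n + 1):
                acc = (tp + tn) / (p + n)
                sens = tp / p
                spec = tn / n
                if (abs(acc - scores["acc"]) <= eps
                        and abs(sens - scores["sens"]) <= eps
                        and abs(spec - scores["spec"]) <= eps):
                    return True      # a compatible confusion matrix exists
        return False                 # no confusion matrix matches: inconsistency

    # made-up reported scores; eps = 10**-4 corresponds to 4 decimal places, ceiled/floored
    print(consistent(100, 200, {"acc": 0.9533, "sens": 0.9300, "spec": 0.9650}, 1e-4))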

For further information, see the preprint:

@@ -35,7 +33,7 @@ If you use the package, please consider citing the following paper:
Latest news
===========

* the 0.0.1 version of the package is released
* the 0.1.0 version of the package is released
* the paper describing the implemented techniques is available as a preprint at:

Installation
@@ -44,12 +42,16 @@ Installation
Requirements
------------

The package has only basic requirements when used for consistency checking.
The package has only basic requirements when used for consistency testing.

* ``numpy``
* ``pulp``

In order to execute the tests, one also needs ``scikit-learn``, in order to test the computer algebra components or reproduce the solutions, either ``sympy`` or ``sage`` needs to be installed. The installation of ``sympy`` can be done in the usual way. In order to install ``sage`` into a conda environment one needs adding the ``conda-forge`` channel first:
.. code-block:: bash
> pip install numpy pulp
In order to execute the tests, one also needs ``scikit-learn``; to test the computer algebra components or reproduce the algebraic solutions, either ``sympy`` or ``sage`` needs to be installed. The installation of ``sympy`` can be done in the usual way. To install ``sage`` in a ``conda`` environment, one needs to add the ``conda-forge`` channel first:

.. code-block:: bash
@@ -59,7 +61,7 @@ In order to execute the tests, one also needs ``scikit-learn``, in order to test
Installing the package
----------------------

In order to use for consistency testing, the package can be installed from the PyPI repository as:
For consistency testing, the package can be installed from the PyPI repository as:

.. code-block:: bash
@@ -71,14 +73,14 @@ For development purposes, one can clone the source code from the repository as
> git clone git@github.com:gykovacs/mlscorecheck.git
And install the source code into the actual virtual environment as
and install the source code into the actual virtual environment as

.. code-block:: bash
> cd mlscorecheck
> pip install -e .
In order to use and test all functionalities (including the symbolic computing part), please install the ``requirements.txt``:
In order to use and test all functionalities (including the algebraic and symbolic computing parts), please install the ``requirements.txt``:

.. code-block:: bash
8 changes: 4 additions & 4 deletions docs/01a_requirements.rst
@@ -3,8 +3,8 @@ Requirements

In general, there are three inputs to the consistency testing functions:

* the specification of the dataset(s) involved;
* the collection of available performance scores. When aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported. When cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses):
* **the specification of the experiment**;
* **the collection of available (reported) performance scores**: when aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported; when cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses):

* accuracy (``acc``),
* sensitivity (``sens``),
@@ -26,7 +26,7 @@ In general, there are three inputs to the consistency testing functions:
* bookmaker informedness (``bm``),
* prevalence threshold (``pt``),
* diagnostic odds ratio (``dor``),
* jaccard index (``ji``),
* Jaccard index (``ji``),
* Cohen's kappa (``kappa``);

* the estimated numerical uncertainty: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference of the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored. In this case the numerical uncertainty becomes 0.0001 (10^(-4)).
* **the estimated numerical uncertainty**: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference of the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored. In this case, the numerical uncertainty becomes 0.0001 (10^(-4)).
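For example, the uncertainty in the case above follows directly from the number of reported decimal places (a small sketch of the arithmetic):

.. code-block:: Python

    reported_acc = 0.9489        # accuracy reported with k = 4 decimal places
    k = 4

    eps_rounding = 10**(-k) / 2  # 0.00005, if the score was rounded
    eps_truncation = 10**(-k)    # 0.0001, if the score was ceiled or floored

    # the true score is then assumed to lie in
    # [reported_acc - eps, reported_acc + eps]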
32 changes: 16 additions & 16 deletions docs/01b_specifying_setup.rst
@@ -1,34 +1,34 @@
Specifying the experimental setup
---------------------------------
Specification of the experimental setup
---------------------------------------

In this subsection we illustrate the various ways the experimental setup can be specified.
In this subsection, we illustrate the various ways the experimental setup can be specified.

Specifying one testset or dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Specification of one testset or dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple ways to specify datasets and entire experiments consisting of multiple datasets evaluated with different cross-validation schemes.

A simple binary classification test-set consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as
A simple binary classification testset consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as

.. code-block:: Python
testset = {"p": 10, "n": 20}
One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` statistics of the datasets from its internal registry (based on the representations in the ``common-datasets`` package):
One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the datasets from its internal registry (based on the representations in the ``common-datasets`` package):

.. code-block:: Python
dataset = {"dataset_name": "common_datasets.ADA"}
To see the list of supported datasets and corresponding statistics, issue
To see the list of supported datasets and corresponding counts, issue

.. code-block:: Python
from mlscorecheck.experiments import dataset_statistics
print(dataset_statistics)
Specifying a folding
^^^^^^^^^^^^^^^^^^^^
Specification of a folding
^^^^^^^^^^^^^^^^^^^^^^^^^^

The specification of foldings is needed when the scores are computed in cross-validation scenarios. We distinguish two main cases: in the first case, the numbers of positive and negative samples in the folds are known, or can be derived from the attributes of the dataset (for example, by stratification); in the second case, the statistics of the folds are not known, but the number of folds and potential repetitions are known.

Expand All @@ -48,20 +48,20 @@ Knowing that the folding is derived by some standard stratification techniques,
folding = {"n_folds": 3, "n_repeats": 1, "strategy": "stratified_sklearn"}
In this specification it is prescribed that the samples are distributed to the folds according to the ``sklearn`` stratification implementation.
In this specification, it is assumed that the samples are distributed into the folds according to the ``sklearn`` stratification implementation.
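To see what this stratification implies for the fold-level ``p`` and ``n`` counts, one can inspect the test folds produced by ``scikit-learn`` directly (a sketch with made-up sample counts, independent of the package API):

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    p, n = 10, 20                    # assumed dataset statistics
    y = np.array([1] * p + [0] * n)  # only the labels matter for the counts

    for _, test_idx in StratifiedKFold(n_splits=3).split(np.zeros((p + n, 1)), y):
        positives = int(y[test_idx].sum())
        print({"p": positives, "n": len(test_idx) - positives})
    # prints fold-level counts such as {'p': 4, 'n': 7}, {'p': 3, 'n': 7}, {'p': 3, 'n': 6}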

Finally, if nor the folds or the folding strategy is known, one can simply specify the folding with its parameters:
Finally, if neither the folds nor the folding strategy is known, one can simply specify the folding with its parameters (assuming a repeated k-fold scheme):

.. code-block:: Python
folding = {"n_folds": 3, "n_repeats": 2}
Note, that not all consistency testing functions support the latter case (not knowing the folds).
Note that not all consistency testing functions support the latter case (not knowing the exact structure of the folds).

Specifying an evaluation
^^^^^^^^^^^^^^^^^^^^^^^^
Specification of an evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A dataset and a folding constitutes an *evaluation*, many of the test functions take evaluations as parameters describing the scenario:
A dataset and a folding constitute an *evaluation*, and many of the test functions take evaluations as parameters describing the scenario:

.. code-block:: Python
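    # a sketch of an evaluation specification: the "folding" entry follows the
    # README excerpt above, while the "dataset" entry and the concrete values
    # are illustrative assumptions
    evaluation = {"dataset": {"p": 10, "n": 20},
                  "folding": {"n_folds": 5, "n_repeats": 1,
                              "strategy": "stratified_sklearn"}}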