From e063bb464452f34ae2c6d5733bb5461bc96ed759 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gy=C3=B6rgy=20Kov=C3=A1cs?= Date: Tue, 17 Oct 2023 07:47:40 +0200 Subject: [PATCH] docs update --- README.rst | 4 +-- docs/00_introduction.rst | 28 +++++++++-------- docs/01a_requirements.rst | 8 ++--- docs/01b_specifying_setup.rst | 32 ++++++++++---------- docs/01c_consistency_checking.rst | 50 +++++++++++++++---------------- docs/02c_ehg.rst | 4 +-- docs/02d_contribution.rst | 6 ++-- docs/03_api.rst | 6 ++-- docs/04_contribution.rst | 4 --- docs/05_references.rst | 2 +- docs/index.rst | 8 ++--- 11 files changed, 72 insertions(+), 80 deletions(-) delete mode 100644 docs/04_contribution.rst diff --git a/README.rst b/README.rst index bd9227e..f8c9058 100644 --- a/README.rst +++ b/README.rst @@ -218,8 +218,8 @@ A dataset and a folding constitute an *evaluation*, and many of the test functio "folding": {"n_folds": 5, "n_repeats": 1, "strategy": "stratified_sklearn"}} -Checking the consistency of performance scores ----------------------------------------------- +Testing the consistency of performance scores +--------------------------------------------- Numerous experimental setups are supported by the package. In this section we go through them one by one giving some examples of possible use cases. diff --git a/docs/00_introduction.rst b/docs/00_introduction.rst index b85e8c4..73ab087 100644 --- a/docs/00_introduction.rst +++ b/docs/00_introduction.rst @@ -1,21 +1,19 @@ Introduction ============= -In a nutshell -------------- +The purpose +----------- -One comes across some performance scores of binary classification reported for a dataset and finds them suspicious (typo, unorthodox evaluation methodology, etc.). With the tools implemented in the ``mlscorecheck`` package one can test if the scores are consistent with each other and the assumptions on the experimental setup. +Performance scores for binary classification are reported on a dataset and look suspicious (exceptionally high scores possibly due to typo, uncommon evaluation methodology, data leakage in preparation, incorrect use of statistics, etc.). With the tools implemented in the package ``mlscorecheck``, one can test if the reported performance scores are consistent with each other and the assumptions on the experimental setup up to the numerical uncertainty due to rounding/truncation/ceiling. The consistency tests are numerical and **not** statistical: if inconsistencies are identified, it means that either the assumptions on the evaluation protocol or the reported scores are incorrect. In more detail -------------- -Binary classification is one of the most basic tasks in machine learning. The evaluation of the performance of binary classification techniques (whether it is original theoretical development or application to a specific field) is driven by performance scores (https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). Despite performance scores provide the basis to estimate the value of reasearch, the reported scores usually suffer from methodological problems, typos and the insufficient description of experimental settings, contributing to the replication crisis (https://en.wikipedia.org/wiki/Replication_crisis) and the skewing of entire fields ([RV]_, [EHG]_), as the reported but usually incomparable scores are usually rank approaches and ideas. +Binary classification is one of the most fundamental tasks in machine learning. 
The evaluation of the performance of binary classification techniques, whether for original theoretical advancements or applications in specific fields, relies heavily on performance scores (https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). Although reported performance scores are employed as primary indicators of research value, they often suffer from methodological problems, typos, and insufficient descriptions of experimental settings. These issues contribute to the replication crisis (https://en.wikipedia.org/wiki/Replication_crisis) and can ultimately skew entire fields of research ([RV]_, [EHG]_). Even systematic reviews can suffer from using incomparable performance scores for ranking research papers [RV]_. -Most of the performance score are functions of the values of the binary confusion matrix (with four entries: true positives (``tp``), true negatives (``tn``), false positives (``fp``), false negatives (``fn``)). Consequently, when multiple performance scores are reported for some experiment, they cannot take any values independently. - -Depending on the experimental setup, one can develop techniques to check if the performance scores reported for a dataset are consistent. This package implements such consistency tests for some common scenarios. We highlight that the developed tests cannot guarantee that the scores are surely calculated by some standards. However, if the tests fail and inconsistencies are detected, it means that the scores are not calculated by the presumed protocols with certainty. In this sense, the specificity of the test is 1.0, the inconsistencies being detected are inevitable. +The majority of performance scores are calculated from the binary confusion matrix, or multiple confusion matrices aggregated across folds and/or datasets. For many commonly used experimental setups, one can develop numerical techniques to test if there exists any confusion matrix (or matrices) compatible with the experiment and leading to the reported performance scores. This package implements such consistency tests for some common scenarios. We highlight that the developed tests cannot guarantee that the scores are surely calculated by some standards or a presumed evaluation protocol. However, *if the tests fail and inconsistencies are detected, it means that the scores are not calculated by the presumed protocols with certainty*. In this sense, the specificity of the test is 1.0: any inconsistency it detects is a true inconsistency. For further information, see the preprint: @@ -35,7 +33,7 @@ If you use the package, please consider citing the following paper: Latest news =========== -* the 0.0.1 version of the package is released +* the 0.1.0 version of the package is released * the paper describing the implemented techniques is available as a preprint at: Installation @@ -44,12 +42,16 @@ Installation Requirements ------------ -The package has only basic requirements when used for consistency checking. +The package has only basic requirements when used for consistency testing. * ``numpy`` * ``pulp`` -In order to execute the tests, one also needs ``scikit-learn``, in order to test the computer algebra components or reproduce the solutions, either ``sympy`` or ``sage`` needs to be installed. The installation of ``sympy`` can be done in the usual way. In order to install ``sage`` into a conda environment one needs adding the ``conda-forge`` channel first: +..
code-block:: bash + + > pip install numpy pulp +
In order to execute the tests, one also needs ``scikit-learn``. In order to test the computer algebra components or reproduce the algebraic solutions, either ``sympy`` or ``sage`` needs to be installed. The installation of ``sympy`` can be done in the usual way. To install ``sage`` in a ``conda`` environment, one needs to add the ``conda-forge`` channel first: .. code-block:: bash @@ -59,7 +61,7 @@ In order to execute the tests, one also needs ``scikit-learn``, in order to test Installing the package ---------------------- -In order to use for consistency testing, the package can be installed from the PyPI repository as: +For consistency testing, the package can be installed from the PyPI repository as: .. code-block:: bash @@ -71,14 +73,14 @@ For develompent purposes, one can clone the source code from the repository as -And install the source code into the actual virtual environment as +and install the source code into the actual virtual environment as .. code-block:: bash > cd mlscorecheck > pip install -e . -In order to use and test all functionalities (including the symbolic computing part), please install the ``requirements.txt``: +In order to use and test all functionalities (including the algebraic and symbolic computing parts), please install the ``requirements.txt``: .. code-block:: bash diff --git a/docs/01a_requirements.rst b/docs/01a_requirements.rst index f9217eb..ca9e518 100644 --- a/docs/01a_requirements.rst +++ b/docs/01a_requirements.rst @@ -3,8 +3,8 @@ Requirements In general, there are three inputs to the consistency testing functions: -* the specification of the dataset(s) involved; -* the collection of available performance scores. When aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported. When cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses): +* **the specification of the experiment**; +* **the collection of available (reported) performance scores**: when aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported; when cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses): * accuracy (``acc``), * sensitivity (``sens``), @@ -26,7 +26,7 @@ In general, there are three inputs to the consistency testing functions: * bookmaker informedness (``bm``), * prevalence threshold (``pt``), * diagnostic odds ratio (``dor``), - * jaccard index (``ji``), + * Jaccard index (``ji``), * Cohen's kappa (``kappa``); -* the estimated numerical uncertainty: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference of the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored. In this case the numerical uncertainty becomes 0.0001 (10^(-4)).
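As a minimal sketch of how this uncertainty estimate is derived from the number of published decimal places (reusing the accuracy value 0.9489 quoted above), one may compute:

.. code-block:: Python

    # the reported score and the number of decimal places it was published with
    reported_acc = 0.9489
    k = 4

    # assuming rounding, the true value lies within half a unit in the last place
    eps_rounded = 10**(-k) / 2     # 0.00005
    # assuming ceiling or flooring, a full unit has to be allowed for
    eps_conservative = 10**(-k)    # 0.0001

    # the conservative interval certainly containing the true accuracy
    print(reported_acc - eps_conservative, reported_acc + eps_conservative)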
+* **the estimated numerical uncertainty**: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference of the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored. In this case, the numerical uncertainty becomes 0.0001 (10^(-4)). diff --git a/docs/01b_specifying_setup.rst b/docs/01b_specifying_setup.rst index ea7d786..a107d8a 100644 --- a/docs/01b_specifying_setup.rst +++ b/docs/01b_specifying_setup.rst @@ -1,34 +1,34 @@ -Specifying the experimental setup ---------------------------------- +Specification of the experimental setup +--------------------------------------- -In this subsection we illustrate the various ways the experimental setup can be specified. +In this subsection, we illustrate the various ways the experimental setup can be specified. -Specifying one testset or dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Specification of one testset or dataset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ There are multiple ways to specify datasets and entire experiments consisting of multiple datasets evaluated in differing ways of cross-validations. -A simple binary classification test-set consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as +A simple binary classification testset consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as .. code-block:: Python testset = {"p": 10, "n": 20} -One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` statistics of the datasets from its internal registry (based on the representations in the ``common-datasets`` package): +One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the datasets from its internal registry (based on the representations in the ``common-datasets`` package): .. code-block:: Python dataset = {"dataset_name": "common_datasets.ADA"} -To see the list of supported datasets and corresponding statistics, issue +To see the list of supported datasets and corresponding counts, issue .. code-block:: Python from mlscorecheck.experiments import dataset_statistics print(dataset_statistics) -Specifying a folding -^^^^^^^^^^^^^^^^^^^^ +Specification of a folding +^^^^^^^^^^^^^^^^^^^^^^^^^^ The specification of foldings is needed when the scores are computed in cross-validation scenarios. We distinguish two main cases: in the first case, the number of positive and negative samples in the folds are known, or can be derived from the attributes of the dataset (for example, by stratification); in the second case, the statistics of the folds are not known, but the number of folds and potential repetitions are known. @@ -48,20 +48,20 @@ Knowing that the folding is derived by some standard stratification techniques, folding = {"n_folds": 3, "n_repeats": 1, "strategy": "stratified_sklearn"} -In this specification it is prescribed that the samples are distributed to the folds according to the ``sklearn`` stratification implementation. 
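To illustrate what the stratified strategy implies for the fold-level class counts, the following sketch derives them with ``scikit-learn`` (listed among the test requirements). The counts ``p=10``, ``n=20`` and the 3 folds are example figures only, and this is an illustration of the stratification idea rather than the package's internal implementation:

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    p, n = 10, 20                      # example class counts
    y = np.array([1] * p + [0] * n)    # only the labels matter here

    # stratification keeps the class proportions similar across the folds
    for _, test_idx in StratifiedKFold(n_splits=3).split(np.zeros((p + n, 1)), y):
        print({"p": int((y[test_idx] == 1).sum()), "n": int((y[test_idx] == 0).sum())})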
+In this specification, it is assumed that the samples are distributed into the folds according to the ``sklearn`` stratification implementation. -Finally, if nor the folds or the folding strategy is known, one can simply specify the folding with its parameters: +Finally, if neither the folds nor the folding strategy is known, one can simply specify the folding with its parameters (assuming a repeated k-fold scheme): .. code-block:: Python folding = {"n_folds": 3, "n_repeats": 2} -Note, that not all consistency testing functions support the latter case (not knowing the folds). +Note that not all consistency testing functions support the latter case (not knowing the exact structure of the folds). -Specifying an evaluation -^^^^^^^^^^^^^^^^^^^^^^^^ +Specification of an evaluation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -A dataset and a folding constitutes an *evaluation*, many of the test functions take evaluations as parameters describing the scenario: +A dataset and a folding constitute an *evaluation*, and many of the test functions take evaluations as parameters describing the scenario: .. code-block:: Python diff --git a/docs/01c_consistency_checking.rst b/docs/01c_consistency_checking.rst index 7d1a17d..ea76242 100644 --- a/docs/01c_consistency_checking.rst +++ b/docs/01c_consistency_checking.rst @@ -1,28 +1,28 @@ -Checking the consistency of performance scores ----------------------------------------------- +Testing the consistency of performance scores +--------------------------------------------- -Numerous evaluation setups are supported by the package. In this section we go through them one by one giving some examples of possible use cases. +Numerous experimental setups are supported by the package. In this section, we go through them one by one, giving some examples of possible use cases. -We highlight again, that the tests detect inconsistencies. If the resulting ``inconsistency`` flag is ``False``, the scores can still be calculated in non-standard ways, however, if the ``inconsistency`` flag is ``True``, that is, inconsistencies are detected, then the reported scores are inconsistent with the assumptions with certainty. +We emphasize again that the tests are designed to detect inconsistencies. If the resulting ``inconsistency`` flag is ``False``, the scores can still be calculated in non-standard ways. However, if the resulting ``inconsistency`` flag is ``True``, inconsistencies have been detected, and **the reported scores could not be the outcome of the presumed experiment**. -A note on the Ratio-of-Means and Mean-of-Ratios aggregations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +A note on the *Score of Means* and *Mean of Scores* aggregations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Most of the performance scores are some sorts of ratios. When it comes to the aggregation of scores (either over multiple folds or multiple datasets or both), there are two approaches in the literature, both having advantages and disadvantages. In the Mean-of-Ratios (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the Ratio-of-Means (SoM) approach, first the overall confusion matrix (tp, tn, fp, fn) is determined, and then the scores are calculated based on these total figures.
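As a toy numerical illustration of the difference between the two aggregations (the fold-level figures are made up for this example), consider the sensitivity of a two-fold experiment:

.. code-block:: Python

    # two made-up folds, given by their true positive and false negative counts
    folds = [{"tp": 9, "fn": 1},     # 10 positive samples
             {"tp": 20, "fn": 70}]   # 90 positive samples

    # Mean of Scores (MoS): average the fold-level sensitivities
    mos = sum(f["tp"] / (f["tp"] + f["fn"]) for f in folds) / len(folds)

    # Score of Means (SoM): pool the counts first, then compute the score once
    tp = sum(f["tp"] for f in folds)
    fn = sum(f["fn"] for f in folds)
    som = tp / (tp + fn)

    print(mos, som)  # roughly 0.56 vs 0.29: the two aggregations can differ substantially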
The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores, however, its disadvantage is that the average of non-linear scores might be distorted. +When it comes to the aggregation of scores (either over multiple folds, multiple datasets or both), there are two approaches in the literature. In the *Mean of Scores* (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the *Score of Means* (SoM) approach, first the overall confusion matrix is determined, and then the scores are calculated based on these total figures. The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores; however, its disadvantage is that the average of non-linear scores might be distorted and some scores might become undefined when the folds are extremely small (typically in the case of small and imbalanced data). The two types of tests ^^^^^^^^^^^^^^^^^^^^^^ -Having one single testset, or a SoM type of aggregation (leading to one confusion matrix), one can iterate through all potential pairs of ``tp`` and ``tn`` values and check if any of them can produce the reported scores with the given numerical uncertainty. The test is sped up by using interval arithmetic to prevent the evaluation of all possible pairs. This test supports the performance scores ``acc``, ``sens``, ``spec``, ``ppv``, ``npv``, ``bacc``, ``f1``, ``f1n``, ``fbp``, ``fbn``, ``fm``, ``upm``, ``gm``, ``mk``, ``lrp``, ``lrn``, ``mcc``, ``bm``, ``pt``, ``dor``, ``ji``, ``kappa``. Note that when the f-beta positive or f-beta negative scores are used, one also needs to specify the ``beta_positive`` or ``beta_negative`` values. +In the context of a single testset or a Score of Means (SoM) aggregation, which results in one confusion matrix, one can systematically iterate through all potential confusion matrices to assess whether any of them can generate the reported scores within the specified numerical uncertainty. To expedite this process, the test leverages interval arithmetic. The test supports the performance scores ``acc``, ``sens``, ``spec``, ``ppv``, ``npv``, ``bacc``, ``f1``, ``f1n``, ``fbp``, ``fbn``, ``fm``, ``upm``, ``gm``, ``mk``, ``lrp``, ``lrn``, ``mcc``, ``bm``, ``pt``, ``dor``, ``ji``, ``kappa``. Note that when the f-beta positive or f-beta negative scores are used, one also needs to specify the ``beta_positive`` or ``beta_negative`` parameters. -With a MoS type of aggregation, only the averages of scores over folds or datasets are available. In this case the reconstruction of fold level or dataset level confusion matrices is possible only for the linear scores ``acc``, ``sens``, ``spec`` and ``bacc`` using linear programming. Based on the reported scores and the folding structures, these tests formulate a linear (integer) program of all confusion matrix entries and checks if the program is feasible to result in the reported values with the estimated numerical uncertainties. +With a MoS type of aggregation, only the averages of scores over folds or datasets are available. In this case, it is feasible to reconstruct fold-level or dataset-level confusion matrices for the linear scores ``acc``, ``sens``, ``spec`` and ``bacc`` using linear integer programming.
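For intuition, the following sketch formulates such a feasibility problem directly with ``pulp`` (this is only an illustration, not the package's internal implementation; the fold structure, the reported mean accuracy and the uncertainty are made-up figures):

.. code-block:: Python

    import pulp

    folds = [{"p": 10, "n": 20}, {"p": 10, "n": 20}, {"p": 10, "n": 20}]
    acc_mean = 0.9333   # reported MoS accuracy, 4 decimal places
    eps = 1e-4 / 2      # estimated numerical uncertainty, assuming rounding

    prob = pulp.LpProblem("consistency", pulp.LpMinimize)
    prob += 0  # feasibility problem: any constant objective will do

    tps = [pulp.LpVariable(f"tp_{i}", 0, f["p"], cat="Integer") for i, f in enumerate(folds)]
    tns = [pulp.LpVariable(f"tn_{i}", 0, f["n"], cat="Integer") for i, f in enumerate(folds)]

    # the mean of the fold-level accuracies must match the reported value
    mean_acc = pulp.lpSum((tps[i] + tns[i]) * (1.0 / (f["p"] + f["n"]))
                          for i, f in enumerate(folds)) / len(folds)
    prob += mean_acc >= acc_mean - eps
    prob += mean_acc <= acc_mean + eps

    prob.solve()
    print(pulp.LpStatus[prob.status])  # 'Infeasible' would indicate inconsistent scores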
These tests formulate a linear integer program based on the reported scores and the experimental setup, and check if the program is feasible to produce the reported values within the estimated numerical uncertainties. -1 testset with no kfold -^^^^^^^^^^^^^^^^^^^^^^^ +1 testset with no k-fold +^^^^^^^^^^^^^^^^^^^^^^^^ -A scenario like this is having one single test set to which classification is applied and the scores are computed from the resulting confusion matrix. For example, given a test image, which is segmented and the scores of the segmentation are calculated and reported. +In this scenario, there is a single test set to which classification is applied, and the scores are computed from the resulting confusion matrix. For example, a test image is segmented and the scores of the segmentation (as a binary classification of pixels) are calculated and reported. -In the example below, the scores values are generated to be consistent, and accordingly, the test did not identify inconsistencies at the ``1e-2`` level of numerical uncertainty. +In the example below, the scores are artificially generated to be consistent, and accordingly, the test did not identify inconsistencies at the ``1e-2`` level of numerical uncertainty. .. code-block:: Python @@ -54,10 +54,10 @@ If one of the scores is altered, like accuracy is changed to 0.92, the configura As the ``inconsistency`` flag shows, here inconsistencies were identified, there are no such ``tp`` and ``tn`` combinations which would end up with the reported scores. Either the assumption on the properties of the dataset, or the scores are incorrect. -1 dataset with kfold mean-of-ratios (MoS) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +1 dataset with k-fold, mean-of-scores (MoS) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -This scenario is the most common in the literature. A classification technique is executed to each fold in a (repeated) k-fold scenario, the scores are calculated for each fold, and the average of the scores is reported with some numerical uncertainty due to rounding/ceiling/flooring. Because of the averaging, this test supports only the linear scores (``acc``, ``sens``, ``spec``, ``bacc``) which usually are among the most commonly reported scores. The test constructs a linear integer program describing the scenario with the ``tp`` and ``tn`` parameters of all folds and checks its feasibility. +This scenario is the most common in the literature. A classification technique is executed on each fold in a (repeated) k-fold scheme, the scores are calculated for each fold, and the average of the scores is reported with some numerical uncertainty due to rounding/ceiling/flooring. Because of the averaging, this test supports only the linear scores (``acc``, ``sens``, ``spec``, ``bacc``), which are among the most commonly reported scores. The test constructs a linear integer program describing the scenario with the true positive and true negative parameters of all folds and checks its feasibility. In the example below, a consistent set of figures is tested: .. code-block:: Python @@ -77,7 +77,7 @@ In the example below, a consistent set of figures is tested: >>> result['inconsistency'] # False -As one can from the output flag, there are no inconsistencies identified. The ``result`` dict contains some further details of the test.
Most importantly, under the key ``lp_status`` one can find the status of the linear programming solver, and under the key ``lp_configuration``, one can find the values of all ``tp`` and ``tn`` variables in all folds at the time of the termination of the solver, and additionally, all scores are calculated for the folds and the entire dataset, too. +As indicated by the output flag, no inconsistencies were identified. The ``result`` dictionary contains some further details of the test. Most notably, under the ``lp_status`` key, one can find the status of the linear programming solver. Additionally, under the ``lp_configuration`` key, one can find the values of all true positive and true negative variables in all folds at the time of the termination of the solver. Furthermore, all scores are calculated for the individual folds and the entire dataset, as well. If one of the scores is adjusted, for example, sensitivity is changed to 0.568, the configuration becomes infeasible: @@ -91,8 +91,8 @@ If one of the scores is adjusted, for example, sensitivity is changed to 0.568, >>> result['inconsistency'] # True -Finally, we mention that if there are hints for bounds on the scores in the folds (for example, the minimum and maximum scores across the folds are reported), one can add these figures to strengthen the test. In the next example, score bounds on the accuracy have been added to each fold, that means the test checks if the reported scores can be satisfied -with a ``tp`` and ``tn`` configuration under given these lower and upper bounds: +Finally, we mention that if there are hints for bounds on the scores in the folds (for example, when the minimum and maximum scores across the folds are reported), one can add these figures to strengthen the test. In the next example, score bounds on accuracy have been added to each fold. This means the test checks if the reported scores can be achieved +with a true positive and true negative configuration with the specified lower and upper bounds for the scores in the individual folds: .. code-block:: Python @@ -112,10 +112,10 @@ with a ``tp`` and ``tn`` configuration under given these lower and upper bounds: Note that in this example, although ``f1`` is provided, it is completely ignored as the aggregated tests work only for the four linear scores. -1 dataset with kfold ratio-of-means (SoM) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +1 dataset with k-folds, score-of-means (SoM) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -When the scores are calculated in the Ratio-of-Means (SoM) manner in a k-fold scenario, it means that the total confusion matrix (``tp`` and ``tn`` values) of all folds is calculated first, and then the score formulas are applied to it. The only difference compared to the "1 testset no kfold" scenario is that the number of repetitions of the k-fold multiples the ``p`` and ``n`` statistics of the dataset, but the actual structure of the folds is irrelevant. The result of the analysis is structured similarly to the "1 testset no kfold" case. +When the scores are calculated in the Score-of-Means (SoM) manner in a k-fold scenario, it means that the total confusion matrix of all folds is calculated first, and then the score formulas are applied to it. The only difference compared to the "1 testset no kfold" scenario is that the number of repetitions of the k-fold scheme multiples the ``p`` and ``n`` statistics of the dataset, but the actual structure of the folds is irrelevant. 
The result of the analysis is structured similarly to the "1 testset no kfold" case. For example, testing a consistent scenario: @@ -213,7 +213,7 @@ This scenario is about performance scores calculated for each dataset individual >>> result['inconsistency'] # False -However, if one of the scores is adjusted a little (accuracy changed to 0.412 and the score bounds also changed), the configuration becomes unfeasible: +However, if one of the scores is adjusted a little (accuracy changed to 0.412 and the score bounds also changed), the configuration becomes infeasible: .. code-block:: Python @@ -265,12 +265,12 @@ Again, the details of the analysis are accessible under the ``lp_status`` and `` If there are hints on the minimum and maximum scores across the datasets, one can add those bounds through the ``dataset_score_bounds`` parameter to strengthen the test. Not knowing the mode of aggregation ------------------------------------ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The biggest challenge with aggregated scores is that the ways of aggregation at the dataset and experiment level are rarely disclosed explicitly. Even in this case the tools presented in the previous section can be used since there are hardly any further ways of meaningful averaging than (MoS on folds, MoS on datasets), (SoM on folds, MoS on datasets), (SoM on folds, SoM on datasets), hence, if a certain set of scores is inconsistent with each of these possibilities, one can safely say that the results do not satisfy the reasonable expectations. Not knowing the k-folding scheme --------------------------------- +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In many cases, it is not stated explicitly if stratification was applied or not, only the use of k-fold is phrased in papers. Not knowing the folding structure, the MoS aggregated tests cannot be used. However, if the cardinality of the minority class is not too big (a couple of dozens), then all potential k-fold configurations can be generated, and the MoS tests can be applied to each. If the scores are inconsistent with each, it means that no k-fold could result the scores. There are two functions supporting these exhaustive tests, one for the dataset level, and one for the experiment level. diff --git a/docs/02c_ehg.rst b/docs/02c_ehg.rst index 6bb115f..2d1618f 100644 --- a/docs/02c_ehg.rst +++ b/docs/02c_ehg.rst @@ -1,5 +1,5 @@ -EHG ---- +Preterm delivery prediction from electrohysterogram (EHG) signals +----------------------------------------------------------------- Electrohysterogram classification for the prediction of preterm delivery in pregnancy became a popular area for the applications of minority oversampling, however, it turned out that there were overly optimistic classification results reported due to systematic data leakage in the data preparation process [EHG]_. In [EHG]_, the implementations were replicated and it was shown that there is a decent gap in terms of performance when the data is prepared properly. However, data leakage changes the statistics of the dataset being cross-validated. Hence, the problematic scores could be identified with the tests implemented in the ``mlscorecheck`` package. In order to facilitate the use of the tools for this purpose, some functionalities have been prepared with the dataset already pre-populated. diff --git a/docs/02d_contribution.rst b/docs/02d_contribution.rst index 228229a..dc3b9d0 100644 --- a/docs/02d_contribution.rst +++ b/docs/02d_contribution.rst @@ -1,4 +1,4 @@ -Add a test bundles suitable for your field! 
-------------------------------------------- +Add a test bundle suitable for your field! +------------------------------------------ -Experts and scientists are invited to prepare test bundles (specify the usual experimental setup and its alternatives) to aid the fast consistency testing of performance scores in their field of expertise. +We kindly encourage any experts to provide further, field specific dataset and experiment specifications and test bundles to facilitate the reporting of clean and reproducible results in any field related to binary classification! \ No newline at end of file diff --git a/docs/03_api.rst b/docs/03_api.rst index 78733cb..a177868 100644 --- a/docs/03_api.rst +++ b/docs/03_api.rst @@ -32,8 +32,8 @@ Retinal Vessel Segmentation .. autofunction:: mlscorecheck.bundles.drive_image_all_pixels -EHG ---- +Preterm delivery prediction by EHG signals +------------------------------------------ The test bundle dedicated to the testing of electrohsyterogram data. @@ -82,7 +82,6 @@ Score functions (``scores``) .. autofunction:: mlscorecheck.scores.jaccard_index .. autofunction:: mlscorecheck.scores.balanced_accuracy .. autofunction:: mlscorecheck.scores.cohens_kappa -.. autofunction:: mlscorecheck.scores.p4 .. autofunction:: mlscorecheck.scores.accuracy_standardized .. autofunction:: mlscorecheck.scores.error_rate_standardized @@ -111,7 +110,6 @@ Score functions (``scores``) .. autofunction:: mlscorecheck.scores.jaccard_index_standardized .. autofunction:: mlscorecheck.scores.balanced_accuracy_standardized .. autofunction:: mlscorecheck.scores.cohens_kappa_standardized -.. autofunction:: mlscorecheck.scores.p4_standardized Testing logic for individual scores (``individual``) diff --git a/docs/04_contribution.rst b/docs/04_contribution.rst deleted file mode 100644 index 0f33712..0000000 --- a/docs/04_contribution.rst +++ /dev/null @@ -1,4 +0,0 @@ -Contribution -************ - -To contribute, please start a discussion in the GitHub repository at https://github.com/gykovacs/mlscorecheck. \ No newline at end of file diff --git a/docs/05_references.rst b/docs/05_references.rst index 3e03d19..969ea45 100644 --- a/docs/05_references.rst +++ b/docs/05_references.rst @@ -1,5 +1,5 @@ References -========== +---------- .. [RV] Kovács, G. and Fazekas, A.: "A new baseline for retinal vessel segmentation: Numerical identification and correction of methodological inconsistencies affecting 100+ papers", Medical Image Analysis, 2022(1), pp. 102300 diff --git a/docs/index.rst b/docs/index.rst index 9032d2d..6a8f4dd 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -27,6 +27,8 @@ Welcome to mlscorecheck's documentation! 02a_bundle_intro 02b_retina 02c_ehg + 02d_contribution + 05_references .. toctree:: :maxdepth: 2 @@ -34,12 +36,6 @@ Welcome to mlscorecheck's documentation! 03_api -.. toctree:: - :maxdepth: 1 - :caption: References - - 05_references - Indices and tables ==================