1 parent 62f42f0, commit b7103e6
Showing 12 changed files with 228 additions and 58 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,11 @@ | ||
# SplitClusterTest.jl Documentation | ||
|
||
> Wang, L., Lin, Y., & Zhao, H. (2024). False Discovery Rate Control via Data Splitting for Testing-after-Clustering (arXiv:2410.06451). arXiv. https://doi.org/10.48550/arXiv.2410.06451 | ||
> Wang, L., Lin, Y., & Zhao, H. (2024). False Discovery Rate Control via Data Splitting for Testing-after-Clustering (arXiv:2410.06451). arXiv. <https://doi.org/10.48550/arXiv.2410.06451> | ||
> | ||
|
||
Testing for differences in features between clusters in various applications often leads to inflated false positives when practitioners use the same dataset to identify clusters and then test features, an issue commonly known as “double dipping”. | ||
|
||
To address this challenge, inspired by data-splitting strategies for controlling the false discovery rate (FDR) in regressions ([Dai et al., 2023](https://www.tandfonline.com/doi/abs/10.1080/01621459.2022.2060113)), we present a novel method that applies data-splitting to control FDR while maintaining high power in unsupervised clustering. | ||
|
||
We first divide the dataset into two halves, then apply the conventional testing-after-clustering procedure to each half separately and combine the resulting test statistics to form a new statistic for each feature. The new statistic can help control the FDR due to its property of having a sampling distribution that is symmetric around zero for any null feature. |
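The combination step described above can be illustrated with a minimal sketch. It assumes one common mirror-statistic form from the data-splitting literature, sign(T1·T2)·(|T1| + |T2|); whether SplitClusterTest.jl uses exactly this form is an assumption, and `mirror_stat` is a hypothetical helper, not a package function. The two vectors of test statistics are simulated directly rather than obtained from clustering, so the alignment of cluster labels between the two halves is not addressed here.

```julia
using Random

# Hypothetical helper (not part of SplitClusterTest.jl): combine the per-feature
# test statistics from the two halves into mirror statistics.
# t1[j] and t2[j] are the statistics of feature j on the first and second half.
mirror_stat(t1::AbstractVector, t2::AbstractVector) =
    sign.(t1 .* t2) .* (abs.(t1) .+ abs.(t2))

Random.seed!(1)
p = 1000                      # number of features
signal = zeros(p)
signal[1:100] .= 4.0          # assumption: the first 100 features are relevant

# Toy stand-ins for the two halves' test statistics (independent Gaussian noise).
t1 = signal .+ randn(p)
t2 = signal .+ randn(p)

m = mirror_stat(t1, t2)

# Null features: about half of the mirror statistics are negative.
null_neg_frac = count(<(0), m[101:end]) / 900
# Relevant features: the mirror statistics are mostly positive.
rel_pos_frac = count(>(0), m[1:100]) / 100
println((null_neg_frac, rel_pos_frac))
```

For a null feature the two statistics agree or disagree in sign with equal probability, which is what makes the combined statistic symmetric about zero in this toy setting.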
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# This section demonstrates the data splitting procedure for selecting relevant features when a latent linear pseudotime exists under the Poisson setting. | ||
|
||
using SplitClusterTest | ||
using Plots | ||
|
||
|
||
x, cl = gen_data_pois(1000, 2000, 0.5, prop_imp=0.1, type = "continuous") | ||
|
||
# Plot the first two PCs of X, and color each point by the pseudotime variable `cl` | ||
pc1, pc2 = first_two_PCs(x) | ||
scatter(pc1, pc2, marker_z = cl, label = "") | ||
|
||
# Adopt the data splitting procedure to select the relevant features. | ||
ms = ds(x, ret_ms = true, type = "continuous"); | ||
τ = calc_τ(ms) | ||
|
||
# The mirror statistics of relevant features tend to be large and positive, away from those of the null features, | ||
# while the null features still exhibit a distribution that is symmetric about zero. | ||
# We can then choose the cutoff that controls the FDR, shown by the red vertical line. | ||
histogram(ms, label = "") | ||
Plots.vline!([τ], label = "", lw = 3) | ||
|
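Because the data in this example are simulated, a selection can be scored against the known relevant features. The sketch below is a hypothetical helper, not the package's `calc_acc` (whose signature is not shown in this diff); the index vectors are toy values, and how the generated data mark their relevant features is not assumed here.

```julia
# Hypothetical helper (not `calc_acc` from SplitClusterTest.jl): false discovery
# proportion and power of a selection, given the truly relevant feature indices.
# Both arguments are assumed to contain unique indices.
function fdp_and_power(selected::AbstractVector{<:Integer},
                       relevant::AbstractVector{<:Integer})
    true_pos = length(intersect(selected, relevant))
    fdp = (length(selected) - true_pos) / max(length(selected), 1)
    power = true_pos / max(length(relevant), 1)
    return fdp, power
end

selected = [1, 2, 3, 4, 250]   # toy selection, e.g. indices with mirror statistic above the cutoff
relevant = collect(1:5)        # toy ground truth
fdp, power = fdp_and_power(selected, relevant)
```

In a workflow like the one above, `selected` would typically be something like `findall(ms .> τ)`.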
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# This section demonstrates the data splitting procedure for selecting relevant features with or without cluster structure under the Gaussian setting. | ||
|
||
using SplitClusterTest | ||
using Plots | ||
|
||
|
||
# ## Without cluster structure | ||
x, cl = gen_data_normal(1000, 2000, 0.0, prop_imp=0.1); | ||
|
||
# Plot the first two PCs of X | ||
pc1, pc2 = first_two_PCs(x) | ||
scatter(pc1[cl .== 1], pc2[cl .== 1]) | ||
scatter!(pc1[cl .== 2], pc2[cl .== 2]) | ||
|
||
# perform the data splitting procedure for selecting relevant features | ||
ms = ds(x, ret_ms = true); | ||
|
||
# the mirror statistics are symmetric about zero since all features are null features. | ||
histogram(ms, label = "") | ||
|
||
|
||
# ## With cluster structure | ||
|
||
x, cl = gen_data_normal(1000, 2000, 0.5, prop_imp=0.1); | ||
|
||
# Plot the first two PCs of X | ||
pc1, pc2 = first_two_PCs(x) | ||
scatter(pc1[cl .== 1], pc2[cl .== 1]) | ||
scatter!(pc1[cl .== 2], pc2[cl .== 2]) | ||
|
||
|
||
# The mirror statistics of relevant features tend to be large and positive, away from those of the null features, | ||
# while the null features still exhibit a distribution that is symmetric about zero. | ||
ms = ds(x, ret_ms = true); | ||
|
||
# We can then choose the cutoff that controls the FDR, shown by the red vertical line. | ||
τ = calc_τ(ms) | ||
histogram(ms, label = "") | ||
Plots.vline!([τ], label = "", lw = 3) | ||
|
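The cutoff step in the example above can be sketched as follows. This is a hypothetical re-implementation of the usual data-splitting rule (the smallest positive threshold at which the estimated false discovery proportion falls below a target level `q`); the package's `calc_τ` may differ in details such as the default level or tie handling. The name `cutoff` and the toy vector `ms_toy` are illustrative only.

```julia
# Hypothetical sketch of a cutoff rule; `calc_τ` in SplitClusterTest.jl may differ.
# The FDP at threshold t is estimated by #{ms < -t} / max(#{ms > t}, 1), which
# relies on null mirror statistics being symmetric about zero.
function cutoff(ms::AbstractVector{<:Real}, q::Real = 0.05)
    for t in sort(abs.(ms))
        t > 0 || continue
        fdp_hat = count(<(-t), ms) / max(count(>(t), ms), 1)
        fdp_hat <= q && return t
    end
    return Inf   # only reached when `ms` has no nonzero entry
end

# Toy mirror statistics: 20 clearly relevant features plus a few symmetric nulls.
ms_toy = vcat(fill(5.0, 20), [-1.0, 1.0, -0.5, 0.5, 0.2, -0.2])
τ_toy = cutoff(ms_toy, 0.05)
selected_toy = findall(ms_toy .> τ_toy)
```

Features whose mirror statistic exceeds the cutoff are selected; a looser level `q` gives a smaller cutoff and a larger selection.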
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# This section demonstrates the data splitting procedure for selecting relevant features when there exists cluster structure under the Poisson setting. | ||
|
||
using SplitClusterTest | ||
using Plots | ||
|
||
|
||
x, cl = gen_data_pois(1000, 2000, 0.5, prop_imp=0.1, type = "discrete") | ||
|
||
# Plot the first two PCs of X | ||
pc1, pc2 = first_two_PCs(x) | ||
scatter(pc1[cl .== 0], pc2[cl .== 0]) | ||
scatter!(pc1[cl .== 1], pc2[cl .== 1]) | ||
|
||
# Adopt the data splitting procedure to select the relevant features. | ||
ms = ds(x, ret_ms = true); | ||
τ = calc_τ(ms) | ||
|
||
# The mirror statistics of relevant features tend to be large and positive, away from those of the null features, | ||
# while the null features still exhibit a distribution that is symmetric about zero. | ||
# We can then choose the cutoff that controls the FDR, shown by the red vertical line. | ||
histogram(ms, label = "") | ||
Plots.vline!([τ], label = "", lw = 3) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ export gen_data_normal, | |
ds, | ||
mds, | ||
calc_τ, | ||
calc_acc | ||
calc_acc, | ||
first_two_PCs | ||
|
||
end # module SplitClusterTest |
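The implementation of the newly exported `first_two_PCs` is not visible in this diff. As a rough illustration of what such a helper might compute, the sketch below derives the first two principal component scores from an SVD of the column-centered data. The name `two_pcs` is hypothetical, and the assumption that rows are observations and columns are features is not confirmed by the source.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical stand-in for `first_two_PCs` (the package's version may differ).
# Assumes rows of `x` are observations and columns are features.
function two_pcs(x::AbstractMatrix{<:Real})
    xc = x .- mean(x, dims = 1)            # center each column
    F = svd(xc)
    scores = F.U[:, 1:2] .* F.S[1:2]'      # n × 2 matrix of PC scores
    return scores[:, 1], scores[:, 2]
end

Random.seed!(1)
x_toy = randn(50, 10)
pc1_toy, pc2_toy = two_pcs(x_toy)
```

The two returned vectors can be passed directly to `scatter`, as in the example scripts in this commit.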