-
Notifications
You must be signed in to change notification settings - Fork 0
/
11-Structural_causal_models.Rmd
77 lines (54 loc) · 3.09 KB
/
11-Structural_causal_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# Structural causal models {#scm}
[COLLIDER BIAS EXAMPLE]
```{r}
library(tidyverse)
```
Colliders are particularly dangerous due to a phenomenon called *collider bias*. We return to the basketball example to illustrate.
Suppose there is no association in the population between the height and the shooting accuracy of basketball players. (That may or not be true, but let's assume it for the time being.) Here is some simulated data of shooting percentages and height in inches:
```{r}
set.seed(1)
height <- rnorm(500, mean = 68, sd = 8)
shooting <- rnorm(500, mean = 0.5, sd = 0.2)
fake_basketball_data <- tibble(height, shooting)
ggplot(fake_basketball_data, aes(y = shooting, x = height)) +
geom_point()
```
There is no correlation between these two variables:
```{r}
cor(fake_basketball_data$height, fake_basketball_data$shooting)
```
But now we'll figure out who makes it into the professional league:
```{r}
fake_basketball_data <- fake_basketball_data %>%
mutate(pro = ifelse((shooting > 0.75 & height > 72) |
(shooting > 0.50 & height > 80),
"Pro", "Non-pro"))
```
The condition here for making it into the pros is that *either* the player is a very good shooter (and is reasonably tall), or is very tall (and is a reasonably good shooter).
Here is the plot, with the pros colored with red points.
```{r}
ggplot(fake_basketball_data,
aes(y = shooting, x = height, color = pro)) +
geom_point() +
scale_color_manual(values = c("black", "red"))
```
We'll isolate those pro players in the graph:
```{r}
fake_basketball_data_pros <- fake_basketball_data %>%
filter(pro == "Pro")
ggplot(fake_basketball_data_pros,
aes(y = shooting, x = height)) +
geom_point(color = "red")
```
Now there is a pretty large negative correlation.
```{r}
cor(fake_basketball_data_pros$shooting,
fake_basketball_data_pros$height)
```
So incorporating information into the model about the collider (the probability of going pro) induces an association that wasn't present in the population.
::: {.rmdnote}
Here's another way to think about this. If you know someone's height, do you know anything about their shooting ability? Or vice versa? In the population at large, probably not much.
But what if I tell you the person is a professional basketball player? And what if I tell you this person is quite short (at least for a pro player). What information does that provide about their shooting ability?
And what if I tell you that a professional basketball player is not a very good shooter. What information does that provide about their height?
:::
Collider bias is often called *selection bias* because it happens most often when you control for a variable that ends up selecting for certain traits in the population. In this example, if you try to "control" for the pro league variable $Y_{1}$, you end up looking at a set of players who have the same probability of going pro. In that set, players will vary in their height and shooting ability, but shorter players are forced to be better shooters, and players who don't shoot well may fare better if they are tall.