ch12_multilevel-models.Rmd

---
title: "Chapter 12. Multilevel Models"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 500)

library(glue)
library(mustashe)
library(magrittr)
library(broom)
library(patchwork)
library(MASS)
library(rethinking)
library(tidyverse)
library(conflicted)


conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("rename", "dplyr")

# To be able to use `map2stan()`:
conflict_prefer("collapse", "dplyr")
conflict_prefer("extract", "tidyr")
conflict_prefer("rstudent", "rethinking")
conflict_prefer("lag", "dplyr")
conflict_prefer("map", "purrr")
conflict_prefer("Position", "ggplot2")

# To be able to use 'MASS':
conflict_prefer("area", "patchwork")

theme_set(theme_minimal())
source_scripts()
set.seed(0)

# For 'rethinking'.
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

purple <- "#8067bf"
blue <- "#5896d1"
light_grey <- "grey80"
grey <- "grey50"
dark_grey <- "grey25"
```

- multi-level models remember features of each cluster in the data as they learn about all of the clusters
    * depending on the variation across clusters, the model pools information across clusters
    * *the pooling improves estimates about each cluster*
- benefits of the multilevel approach:
    1. improved estimates for repeat sampling
    2. improved estimates for imbalance in sampling
    3. estimates of variation
    4. avoid averaging and retain variation
- multilevel regression should be the default approach
- this chapter starts with the foundations and the following two are more advanced types of multilevel models

## 12.1 Example: Multilivel tadpoles

- example: Reed frog tadpole mortality
    * `surv`: number or survivors
    * `count`: initial number

```{r}
data("reedfrogs")
d <- as_tibble(reedfrogs)
skimr::skim(d)
```

- there is a lot of variation in the data
    * some from experimental treatment, other sources do exist
    * each row is a fish tank that is the experimental environment
    * each tank is a cluster variable and there are repeated measures from each
    * each tank may have a different baseline level of survival, but don't want to treat them as completely unrelated
        - a dummy variable for each tank would be the wrong solution
- *varying intercepts model*: a multilevel model that estimates an intercept for each tank and the variation among tanks
    * for each cluster in the data, use a unique intercept parameter, adaptively learning the prior common to all of the intercepts
    * what is learned about each cluster informs all the other clusters
- model for predicting tadpole mortality in each tank (nothing new)

$$
s_i \sim \text{Binomial}(n_i, p_i) \\
\text{logit}(p_i) = \alpha_{\text{tank}[i]} \\
\alpha_{\text{tank}} \sim \text{Normal}(0, 5) \\
$$

```{r}
d$tank <- 1:nrow(d)

stash("m12_1", {
    m12_1 <- map2stan(
        alist(
            surv ~ dbinom(density, p),
            logit(p) <- a_tank[tank],
            a_tank[tank] ~ dnorm(0, 5)
        ),
        data = d
    )
})


print(m12_1)
precis(m12_1, depth = 2)
```

- can get expected mortality for each tank by taking the logistic of the coefficients

```{r}
logistic(coef(m12_1)) %>%
    enframe() %>%
    mutate(name = str_remove_all(name, "a_tank\\[|\\]"),
           name = as.numeric(name)) %>%
    ggplot(aes(x = name, y = value)) +
    geom_col() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
    labs(x = "tank",
         y = "estimated probability survival",
         title = "Single-level categorical model estimates of tadpole survival")
```

- fit a multilevel model by adding a prior for the `a_tank` parameters as a function of its own parameters
    * now the priors have prior distributions, creating two *levels* of priors

$$
s_i \sim \text{Binomial}(n_i, p_i) \\
\text{logit}(p_i) = \alpha_{\text{tank}[i]} \\
\alpha_{\text{tank}} \sim \text{Normal}(\alpha, \sigma) \\
\alpha \sim \text{Normal}(0, 1) \\
\sigma \sim \text{HalfCauchy}(0, 1)
$$

```{r}
stash("m12_2", {
    m12_2 <- map2stan(
        alist(
            surv ~ dbinom(density, p),
            logit(p) <- a_tank[tank],
            a_tank[tank] ~ dnorm(a, sigma),
            a ~ dnorm(0, 1),
            sigma ~ dcauchy(0, 1)
        ),
        data = d,
        iter = 4000,
        chains = 4,
        cores = 1
    )
})


print(m12_2)
precis(m12_2, depth = 2)
```
- interpretation:
    * $\alpha$: one overall sample intercept
    * $\sigma$: variance among tanks
    * 48 per-tank intercepts

```{r}
compare(m12_1, m12_2)
```

- from the comparison, see that the multilevel model only has ~38 effective parameters
    * 12 fewer than the single-level model because the prior assigned to each intercept shrinks them all towards the mean $\alpha$
        - *$\alpha$ is acting like a regularizing prior, but it has been learned from the data*
- plot and compare the posterior medians from both models

```{r}
post <- extract.samples(m12_2)

d %>%
    mutate(propsurv_estimate = logistic(apply(post$a_tank, 2, median)),
           pop_size = case_when(
               density == 10 ~ "small tank",
               density == 25 ~ "medium tank",
               density == 35 ~ "large tank"
           ),
           pop_size = fct_reorder(pop_size, density)) %>%
    ggplot(aes(tank)) +
    facet_wrap(~ pop_size, nrow = 1, scales = "free_x") +
    geom_hline(yintercept = logistic(median(post$a)), 
               lty = 2, color = dark_grey) +
    geom_linerange(aes(x = tank, ymin = propsurv, ymax = propsurv_estimate), 
                   color = light_grey, size = 1) +
    geom_point(aes(y = propsurv), 
               color = grey, size = 1) +
    geom_point(aes(y = propsurv_estimate), 
               color = purple, size = 1) +
    labs(x = "tank",
         y = "proportion surivival",
         title = "Propotion of survival among tadpoles from different tanks.")
```

- comments on above plot:
    * note that all of the purple points $\alpha_\text{tank}$ are skewed towards to the dashed line $\alpha$
        - this is often called *shrinkage* and comes from regularization
    * note that the smaller tanks have shifted more than in the larger tanks
        - there are fewer starting tadpoles, so the shrinkage has a stronger effect
    * the shift of the purple points is large the further the empirical value (grey points) are from the dashed line $\alpha$
- sample from the posterior distributions:
    * first plot 100 Gaussian distributions from samples of the posteriors for $\alpha$ and $\sigma$
    * then sample 8000 new log-odds of survival for individual tanks

```{r}
x <- seq(-3, 5, length.out = 300)
log_odds_gaussian_samples <- map_dfr(1:100, function(i) {
    tibble(i, x, prob = dnorm(x, post$a[i], post$sigma[i]))
})

p1 <- log_odds_gaussian_samples %>%
    ggplot(aes(x, prob, group = factor(i))) +
    geom_line(alpha = 0.5, size = 0.1) +
    labs(x = "log-odds survival",
         y = "density",
         title = "Sampled probability density curves")

p2 <- tibble(sim_tanks = logistic(rnorm(8000, post$a, post$sigma))) %>%
    ggplot(aes(sim_tanks)) +
    geom_density(size = 1, fill = grey, alpha = 0.5) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
    labs(x = "probability survive",
         y = "density",
         title = "Simulated survival proportions")

p1 | p2
```

- there is uncertainty about both the location $\alpha$ and scale $\sigma$ of the population distribution of log-odds of survival
    * this uncertainty is propagated into the simulated probabilities of survival

## 12.2 Varying effects and the underfitting/overfitting trade-off

- *"Varying intercepts are just regularized estimates, but adaptivelyy regulraized by estimating how diverse the cluster are while estimating the features of each cluster."*
    * varying effect estimates are more accurate estimates of the individual cluster intercepts
- partial pooling helps prevent overfitting and underfitting
    * pooling all of the tanks into a single intercept would make an underfit model
    * having completely separate intercepts for each tank would overfit
- demonstration: simulate tadpole data so we know the true per-pond survival probabilities
    * this is also a demonstration of the important skill of simulation and model validation

### 12.2.1 The model

- we will use the same multilevel binomial model as before (using "ponds" instead of "tanks")

$$
s_i \sim \text{Binomial}(n_i, p_i) \\
\text{logit}(p_i) = \alpha_{\text{pond[i]}} \\
\alpha_\text{pond} \sim \text{Normal}(\alpha, \sigma) \\
\alpha \sim \text{Normal}(0, 1) \\
\sigma \sim \text{HalfCauchy}(0, 1)
$$
- need to assign values for:
    * $\alpha$: the average log-odds of survival for all of the ponds
    * $\sigma$: the standard deviation of the distribution of log-odds of survival among ponds
    * $\alpha_\text{pond}$: the individual pond intercepts
    * $n_i$: the number of tadpoles per pond

### 12.2.2 Assign values to the parameters

- steps in code:
    1. initialize $\alpha$, $\sigma$, number of ponds, number of tadpoles per ponds
    2. use these parameters to generate $\alpha_\text{pond}$
    3. put data into a data frame

```{r}
set.seed(0)

# 1. Initialize top level parameters.
a <- 1.4
sigma <- 1.5
nponds <- 60
ni <- as.integer(rep(c(5, 10, 25, 35), each = 15))

# 2. Sample second level parameters for each pond.
a_pond <- rnorm(nponds, mean = a, sd = sigma)

# 3. Organize into a data frame.
dsim <- tibble(pond = seq(1, nponds), 
               ni = ni,
               true_a = a_pond)
dsim
```
### 12.2.3 Simulate survivors

- simulate the binomial survival process
    * each pond $i$ has $n_i$ potential survivors with probability of survival $p_i$
    * from the model definition (using the logit link function), $p_i$ is:

$$
p_i = \frac{\exp(\alpha_i)}{1 + \exp(\alpha_i)}
$$

```{r}
dsim$si <- rbinom(nponds, 
                  prob = logistic(dsim$true_a), 
                  size = dsim$ni)
```

### 12.2.4 Compute the no-pooling estiamtes

- the estimates from not pooling information across ponds is the same as calculating the proportion of survivors in each pond
    * would get same values if used a dummy variable for each pond and weak priors
- calculate these value and keep on the probability scale

```{r}
dsim$p_nopool <- dsim$si / dsim$ni
```

### 12.2.5 Compute the partial-pooling estimates

- now fit the multilevel model

```{r}
stash("m12_3", {
    m12_3 <- map2stan(
        alist(
            si ~ dbinom(ni, p),
            logit(p) <- a_pond[pond],
            a_pond[pond] ~ dnorm(a, sigma),
            a ~ dnorm(0, 1),
            sigma ~ dcauchy(0, 1)
        ),
        data = dsim,
        iter = 1e4,
        warmup = 1000
    )
})

precis(m12_3, depth = 2)
```

- compute the predicted survival proportions

```{r}
estimated_a_pond <- as.numeric(coef(m12_3)[1:nponds])
dsim$p_partpool <- logistic(estimated_a_pond)
```

- compute known survival proportions from the real $\alpha_\text{pond}$ values

```{r}
dsim$p_true <- logistic(dsim$true_a)
```

- plot the results and compute error between the estimated and true varying effects

```{r}
dsim %>%
    transmute(nopool_error = abs(p_nopool - p_true),
              partpool_error = abs(p_partpool - p_true),
              pond, ni) %>%
    pivot_longer(-c(pond, ni),
                 names_to = "model_type", values_to = "absolute_error") %>%
    group_by(ni, model_type) %>%
    mutate(avg_error = mean(absolute_error)) %>%
    ungroup() %>%
    ggplot(aes(x = pond, y = absolute_error)) +
    facet_wrap(~ ni, scales = "free_x", nrow = 1) +
    geom_line(aes(y = avg_error, color = model_type, group = model_type), size = 1.5, alpha = 0.7) +
    geom_line(aes(group = factor(pond)), color = light_grey, size = 0.8) +
    geom_point(aes(color = model_type)) +
    scale_color_brewer(palette = "Dark2") +
    theme(legend.position = c(0.9, 0.7)) +
    labs(x = "pond number",
         y = "absolute error",
         color = "model type",
         title = "Comparing the error between estimates from amnesiac and multilevel models")
```

- interpretation:
    * both models perform better with larger ponds becasue more data
    * the partial pooling model performs better, on average, than the no pooling model

## 12.3 More than one type of cluster

- often are multiple clusters of data in the same model
- example: chimpanzee data
    * one block for each chimp
    * one block for each day of testing
    
### 12.3.1 Multilevel chimpanzees

- similar model as before
    * add varying intercepts for actor
    * put both the $\alpha$ and $\alpha_\text{actor}$ in the linear model
        - it is to allow for adding other varying effects
        - instead of having $\alpha$ as the mean for $\alpha_\text{actor}$, the mean for $\alpha_\text{actor} = 0$ and the mean $\alpha$ is in the linear model instead 

$$
L_i \sim \text{Binomial}(1, p_i) \\
\text{logit}(p_i) = \alpha + \alpha_{\text{actor}[i]} + (\beta_P + \beta_{PC} C_i) P_i \\
\alpha_\text{actor} \sim \text{Normal}(0, \sigma_\text{actor}) \\
\alpha \sim \text{Normal}(0, 10) \\
\beta_P \sim \text{Normal}(0, 10) \\
\beta_{PC} \sim \text{Normal}(0, 10) \\
\alpha_\text{actor} \sim \text{HalfCauchy}(0, 1) \\
$$

```{r}
data("chimpanzees")
d <- as_tibble(chimpanzees) %>%
    select(-recipient)

stash("m12_4", {
    m12_4 <- map2stan(
        alist(
            pulled_left ~ dbinom(1, p),
            logit(p) <- a + a_actor[actor] + (bp + bpc*condition)*prosoc_left,
            a_actor[actor] ~ dnorm(0, sigma_actor),
            a ~ dnorm(0, 10),
            bp ~ dnorm(0, 10),
            bpc ~ dnorm(0, 10),
            sigma_actor ~ dcauchy(0, 1)
        ),
        data = d,
        warmup = 1e3,
        iter = 5e3,
        chains = 4
    )
})

print(m12_4)
precis(m12_4, depth = 2)
```

- note that the mean population of actors $\alpha$ and the individual deviations from that mean $\alpha_\text{actor}$ must be summed to calculate the entrie intercept: $\alpha + \alpha_\text{actor}$

```{r}
post <- extract.samples(m12_4)
total_a_actor <- map(1:7, ~ post$a + post$a_actor[, .x])
round(map_dbl(total_a_actor, mean), 2)
```

### 12.3.2 Two types of cluster

- add a second cluster on `block`
    * replicate the structure for `actor`
    * keep only a single global mean parameter $\alpha$ and have the varying intercepts with a mean of 0

$$
L_i \sim \text{Binomial}(1, p_i) \\
\text{logit}(p_i) = \alpha + \alpha_{\text{actor}[i]} + \alpha_{\text{block}[i]} + (\beta_P + \beta_{PC} C_i) P_i \\
\alpha_\text{actor} \sim \text{Normal}(0, \sigma_\text{actor}) \\
\alpha_\text{block} \sim \text{Normal}(0, \sigma_\text{block}) \\
\alpha \sim \text{Normal}(0, 10) \\
\beta_P \sim \text{Normal}(0, 10) \\
\beta_{PC} \sim \text{Normal}(0, 10) \\
\alpha_\text{actor} \sim \text{HalfCauchy}(0, 1) \\
\alpha_\text{block} \sim \text{HalfCauchy}(0, 1) \\
$$

```{r}

d$block_id <- d$block  # 'block' is a reserved name in Stan.

stash("m12_5", {
    m12_5 <- map2stan(
        alist(
            pulled_left ~ dbinom(1, p),
            logit(p) <- a + a_actor[actor] + a_block[block_id] + (bp + bpc*condition)*prosoc_left,
            a_actor[actor] ~ dnorm(0, sigma_actor),
            a_block[block_id] ~ dnorm(0, sigma_block),
            a ~ dnorm(0, 10),
            bp ~ dnorm(0, 10),
            bpc ~ dnorm(0, 10),
            sigma_actor ~ dcauchy(0, 1),
            sigma_block ~ dcauchy(0, 1)
        ),
        data = d,
        warmup = 1e3,
        iter = 6e3,
        chains = 4
    )
})

print(m12_5)
precis(m12_5, depth = 2)
```

- there was a warning message, though it can be safely ignored:

> There were 11 divergent iterations during sampling.
> Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.

- interpretation:
    * normal to have variance of `n_eff` across parameters of these more complex models
    * $\sigma_\text{block}$ is much smaller than $\sigma_\text{actor}$ so there is more variation between actors
        - therefore, adding `block` hasnt added much overfitting risk

```{r}
post <- extract.samples(m12_5)
enframe(post) %>%
    filter(name %in% c("sigma_actor", "sigma_block")) %>%
    unnest(value) %>%
    ggplot(aes(value)) +
    geom_density(aes(color = name, fill = name), size = 1.4, alpha = 0.4) +
    scale_x_continuous(limits = c(0, 4),
                       expand = c(0, 0)) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
    scale_color_brewer(palette = "Dark2") +
    scale_fill_brewer(palette = "Dark2") +
    theme(legend.title = element_blank(),
          legend.position = c(0.8, 0.5)) +
    labs(x = "posterior sample",
         y = "probability density",
         title = "Posterior distribitions for cluster variances")    
```

```{r}
compare(m12_4, m12_5)
```

- there are 7 more parameters in `m12_5` than `m12_4`, but the `pWAIC` (effective number of parameters) shows there are only about 2 more effective parameters
    * because the variance from `block` is so low
- the models have very close WAIC values because they make very similar predictions
    * `block` had very little influence on the model
    * keeping and reporting on both models is important to demonstrate this fact

### 12.3.3 Even more clusters

- MCMC can handle thousands of varying effects
- need not be shy to include a varying effect if there is theoretical reason it would introduce variance
    * overfitting risk is low as $\sigma$ for the parameters will shrink
    * indicates the importance of the cluster

## 12.4 Multilevel posterior predictions

- *model checking*: a robust way to check the fit of a model is to compare the sample to the posterior predictions
    * *information criteria are also useful indicators of model flexibility and risk of overfitting
- for a multilevel model:       
    * should not expect to "retrodict" the sample because shrinkage will distort some predictions
    * will want predictions for existing clusters of data and new clusters of data

### 12.4.1 Posterior prediction for same clusters

- example uing `chimpanzees` dataset and model `12_4`
    * each `actor` is a cluster of the data

```{r}
# A data frame of all possible conditions
d_conditions <- tibble(
    prosoc_left = c(0, 1, 0, 1),
    condition = c(0, 0, 1, 1)
)

# A data frame of all possible conditions for each actor (chimp)
d_pred <- tibble(actor = 1:7) %>%
    mutate(data = rep(list(d_conditions), 7)) %>%
    unnest(data)

# make predictions
link_m12_4 <- link(m12_4, data = d_pred)

d_pred %>%
    mutate(post_pred_mean = apply(link_m12_4, 2, mean)) %>%
    bind_cols(apply(link_m12_4, 2, PI) %>% pi_to_df()) %>%
    mutate(x = paste(prosoc_left, condition, sep = ", ")) %>%
    ggplot(aes(x = x, y = post_pred_mean, color = factor(actor))) +
    geom_linerange(aes(ymin = x5_percent, ymax = x94_percent), alpha = 0.2) +
    geom_point(alpha = 0.8) +
    geom_line(aes(group = factor(actor)), alpha = 0.5) +
    scale_color_brewer(palette = "Dark2") +
    labs(x = "prosoc_left, condition",
         y = "probability of pulling left lever",
         title = "Multi-level model posterior predictions",
         color = "actor")
```

### 12.4.2 Posterior prediction for new clusters

- often we do not care about the individual clusters in the data
    * we don't necessarily want predictions for the 7 chimps in the data, but for all of the species
- first attempt: construct a posterior prediciton for the *average* actor using $\alpha$
    * however, does not show the variation among actors

```{r}
d_pred$actor <- 1  # A non-zero placeholder
a_actor_zeros = matrix(0, nrow = 1e3, ncol = 7)

link_m12_4 <- link(m12_4, n = 1e3, data = d_pred, 
                   replace = list(a_actor = a_actor_zeros))

d_pred %>%
    mutate(x = paste(prosoc_left, condition, sep = ", "),
           pred_p_mean = apply(link_m12_4, 2, mean)) %>%
    bind_cols(apply(link_m12_4, 2, PI, prob = 0.8) %>% pi_to_df()) %>%
    ggplot(aes(x = x, y = pred_p_mean)) +
    geom_linerange(aes(ymin = x10_percent, ymax = x90_percent), color = grey) +
    geom_line(aes(group = factor(actor)), color = grey) +
    geom_point(color = dark_grey) +
    scale_y_continuous(limits = c(0, 1), expand = c(0, 0)) +
    labs(x = "prosoc_left, condition",
         y = "probability of pulling left lever",
         title = "Multi-level model posterior predictions for average actor")
```

- second attempt: show variation amongst actors by including the `sigma_actor` in the calculation

```{r}
post <- extract.samples(m12_4)
a_actor_sims <- rnorm(7e3, 0, post$sigma_actor)
a_actor_sims <- matrix(a_actor_sims, nrow = 1e3, ncol = 7)

link_m12_4 <- link(m12_4, n = 1e3, data = d_pred, 
                   replace = list(a_actor = a_actor_sims))

d_pred %>%
    mutate(x = paste(prosoc_left, condition, sep = ", "),
           pred_p_mean = apply(link_m12_4, 2, mean)) %>%
    bind_cols(apply(link_m12_4, 2, PI, prob = 0.8) %>% pi_to_df()) %>%
    ggplot(aes(x = x, y = pred_p_mean)) +
    geom_linerange(aes(ymin = x10_percent, ymax = x90_percent), color = grey) +
    geom_line(aes(group = factor(actor)), color = grey) +
    geom_point(color = dark_grey) +
    scale_y_continuous(limits = c(0, 1), expand = c(0, 0)) +
    labs(x = "prosoc_left, condition",
         y = "probability of pulling left lever",
         title = "Multi-level model posterior predictions marginal of actor")
```

- choosing which plot to use/present depends on the context and what you are trying to learn
    * the average actor plot shows the effect of treatment
    * the marginal of actor plot shows how variable actors can be
- another option is to try and show both by showing the results for a bunch of new simulated actors

```{r}
post <- extract.samples(m12_4, n = 1e2)
sim_actor <- function(i) {
    sim_a_actor <- rnorm(1, 0, post$sigma_actor[i])
    P <- c(0, 1, 0, 1)
    C <- c(0, 0, 1, 1)
    p <- logistic(
        post$a[i] + sim_a_actor + (post$bp[i] + post$bpc[i] * C) * P
    )
    return(
        tibble(i = i, prosoc_left = P, condition = C, pred = p)
    )
}

map_df(1:100, sim_actor) %>%
    mutate(x = paste(prosoc_left, condition, sep = ", ")) %>%
    ggplot(aes(x = x, y = pred)) +
    geom_line(aes(group = factor(i)), alpha = 0.3) +
    scale_y_continuous(limits = c(0, 1), expand = c(0, 0)) +
    labs(x = "prosoc_left, condition",
         y = "probability of pulling left lever",
         title = "Multi-level model posterior predictions of simulated actors")
```

### 12.4.3 Focus and multilevel prediction

- can use varying effects to model *over-dispersion*
    * example: with Oceanic societies data with an intercept for each society
        - $T$ is the `total_tools`, $P$ is population, $i$ indexes each society
        - $\sigma_\text{society}$ is the estimate of over-dispersion among societies

$$
T_i \sim \text{Poisson}(\mu_i) \\
\log(\mu_i) = \alpha + \alpha_{\text{society}_{[i]}} + \beta_P \log P_i \\
\alpha \sim \text{Normal}(0, 10) \\
\beta_P \sim \text{Normal}(0, 1) \\
\alpha_\text{society} \sim \text{Normal}(0, \sigma_\text{society}) \\
\sigma_\text{society} \sim \text{HalfCauchy}(0, 1)
$$

```{r}
data("Kline")
d <- as_tibble(Kline) %>%
    janitor::clean_names() %>%
    mutate(logpop = log(population),
           society = row_number())

stash("m12_6", {
    m12_6 <- map2stan(
        alist(
            total_tools ~ dpois(mu),
            log(mu) <- a + a_society[society] + bp * logpop,
            a ~ dnorm(0, 10),
            bp ~ dnorm(0, 1),
            a_society[society] ~ dnorm(0, sigma_society),
            sigma_society ~ dcauchy(0, 1)
        ),
        data = d,
        iter = 4e3,
        chains = 3
    )
})

precis(m12_6, depth = 2)
```

- plot posterior predictions that visualize the over-dispersion
    * the `postcheck()` function uses the `a_society` values directly, not the hyperparameters `a` and `sigma_society` that describe the dispersion
    * instead need to simulate counterfactual societies using these hyperparameters $\alpha$ and $\sigma_\text{society}$

```{r}
post <- extract.samples(m12_6)

d_pred <- tibble(
    logpop = seq(6, 14, length.out = 100),
    society = rep(1, 100)
)

# Sample possible alpha society values.
a_society_sims <- rnorm(2e4, mean = 0, post$sigma_society)
a_society_sims <- matrix(a_society_sims, nrow = 2e3, ncol = 10)

# Make predictions using the simulated a_society values.
link_m12_6 <- link(m12_6, n = 2e3, data = d_pred,
                   replace = list(a_society = a_society_sims))

d_pred_res <- d_pred %>%
    mutate(mu_median = apply(link_m12_6, 2, median)) %>%
    bind_cols(
        apply(link_m12_6, 2, PI, prob = 0.67) %>% pi_to_df(),
        apply(link_m12_6, 2, PI, prob = 0.89) %>% pi_to_df(),
        apply(link_m12_6, 2, PI, prob = 0.97) %>% pi_to_df()
    )

d_pred_res %>%
    mutate(x84_percent = scales::squish(x84_percent, range = c(0, 72)),
           x94_percent = scales::squish(x94_percent, range = c(0, 72)),
           x98_percent = scales::squish(x98_percent, range = c(0, 72))) %>%
    ggplot(aes(x = logpop)) +
    geom_ribbon(aes(ymin = x2_percent, ymax = x98_percent),
                alpha = 0.15) +
    geom_ribbon(aes(ymin = x5_percent, ymax = x94_percent),
                alpha = 0.15) +
    geom_ribbon(aes(ymin = x16_percent, ymax = x84_percent),
                alpha = 0.15) +
    geom_point(aes(y = total_tools),
               data = d) +
    geom_line(aes(y = mu_median)) +
    scale_x_continuous(limits = c(7, 13), expand = c(0, 0)) +
    scale_y_continuous(limits = c(5, 72), expand = c(0, 0)) +
    labs(x = "log population",
         y = "total tools",
         title = "Posterior predictions for the over-dispersed Poisson island model",
         subtitle = "Shaded regions indicate 67%, 89%, and 97% intervals of the expected mean.")
```

## 12.6 Practice

### Easy

**12E2. Make the following model into a multilevel model.**

$$
y_i \sim \text{Binomial}(1, p_i) \\
\text{logit}(p_i) = \alpha + \alpha_{\text{group}[i]} + \beta x_i \\
\alpha \sim \text{Normal}(0, 10) \\
\beta \sim \text{Normal}(0, 1) \\
\alpha_\text{group} \sim \text{Normal}(0, \sigma_\text{group}) \\ 
\sigma_\text{group} \sim \text{HalfCauchy(0, 1)}
$$