-
Notifications
You must be signed in to change notification settings - Fork 41
/
Copy path99-25-many_models.Rmd
580 lines (436 loc) · 16.9 KB
/
99-25-many_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
# Many models {-}
**Learning objectives:**
- Create nested data frames to organize data by groups.
- Create list-columns to generate new data in an organized structure.
- Simplify list-columns to manipulate the data they contain.
## Introduction
- The main purpose of this section.....is to tidy your data and how model summaries can help us pick out outliers and unusual trends in our data
- As your experience grows with Exploratory Data Analysis (EDA) you will find your models grow as well
> This chapter is somewhat aspirational: if this book is your first introduction to R, this chapter is likely to be a struggle. It requires you to have deeply internalised ideas about modelling, data structures, and iteration. So don't worry if you don't get it --- just put this chapter aside for a few months, and come back when you want to stretch your brain.
- We are not going to discuss Model Building (even though it is the name of the Chapter)
- We are going to push through to the end....(but please feel welcome to ask questions!)
### Prerequisites
We will use both `tidyverse` and `modelr` from here.
```{r 99-25-Load Packages, eval = FALSE}
library(modelr)
library(tidyverse)
```
## gapminder
Where do I begin to express the awesomness of Hans Rosling (27 July 1948 -- 7 February 2017). I HIGHLY encourage you to watch the youtube video!
[Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four](https://www.youtube.com/watch?v=jbkSRLYSojo%22)
Furthermore, we can thank Jenny Bryan for authoring the gapminder package!
```{r 99-25-Load Gapminder}
library(gapminder)
```
To gain some insight to the data, we ask the question: "How does life expectancy (lifeExp) change over time (year) for each country (country)?"
```{r 99-25-Plot Gapmider}
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line(alpha = 1/3)
```
At first glance, it appears life expectancy is increasing...but not for all countries.
To make it easier to view, we fit a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.
```{r 99-25-Linear Trend}
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
ggtitle("Full data = ")
nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>%
add_predictions(nz_mod) %>%
ggplot(aes(year, pred)) +
geom_line() +
ggtitle("Linear trend + ")
nz %>%
add_residuals(nz_mod) %>%
ggplot(aes(year, resid)) +
geom_hline(yintercept = 0, colour = "white", linewidth = 3) +
geom_line() +
ggtitle("Remaining pattern")
```
This is good....but how do we do this for every country?
### Nested data
To make life easier for us (and ensure we are not copying and pasting forever), we use the `map` function from the `purrr` package.
Here, we are going to make a **nested data frame**.
```{r 99-25-Nested Data Frame}
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country
```
This creates a data frame that has one row per group (per country), and a rather unusual column:`data`. `data` is a list of data frames (or tibbles, to be precise).
>Note: Don't use the Structure function `str()` as it will be difficult to view. Instead, just view a single line of your nested dataframe.
```{r 99-25-View the Nested Dataframe}
by_country$data[[1]]
```
>Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group.
### List-columns
Now, lets fit some models!
```{r 99-25-Fit some Models}
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
models <- map(by_country$data, country_model)
```
Yet, this may be too costly, so instead, lets modify (be more elegant) with our data. Storing related objects in columns is a key part of the value of data frames.
```{r 99-25-Mutate}
by_country <- by_country %>%
mutate(model = map(data, country_model))
by_country
```
This is a huge advantage: because all the related objects are stored together, you don’t need to manually keep them in sync when you filter or arrange.
```{r 99-25-Filter by Country}
by_country %>%
filter(continent == "Europe")
by_country %>%
arrange(continent, country)
```
***If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!***
### Unnesting
Previously, we were working with only one country. Now, lets compute the residuals of ALL countries.
```{r 99-25-Compute Residuals}
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
by_country
```
By now, you should ask yourself: "How can I plot a bunch of dataframes?". This isn't required. We have to *un-nest*.
```{r 99-25-Un-Nest}
resids <- unnest(by_country, resids)
resids
```
Now, we can plot ALL the residuals.
```{r 99-25-Plot Residuals}
resids %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
```
We can now plot each country as a facet.
```{r 99-25-Facet by Country}
resids %>%
ggplot(aes(year, resid, group = country)) +
geom_line(alpha = 1 / 3) +
facet_wrap(~continent)
```
We can note that we still have very large residuals, namely in Africa suggesting our model isn't fitting very well.
### Model quality
Instead of looking at the residuals from the model, we could look at some general measurements of model quality.
We’ll use `broom::glance()` to extract some model quality metrics.
```{r 99-25-Broom::Glance}
broom::glance(nz_mod)
```
We can use mutate() and unnest() to create a data frame with a row for each country.
```{r Broom:Glance Un-Nested}
by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance)
```
This isn’t quite the output we want, because it still includes all the list columns. This is default behaviour when `unnest()` works on single row data frames. To suppress these columns we use `.drop = TRUE`
>Note, `.drop=TRUE` has been depricated.
```{r 99-25-"Un-Nest w/ .drop=TRUE"}
glance <- by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE)
glance
```
Now, we can look for countries that don't fit our model well.
```{r 99-25-Sort by Residuals}
glance %>%
arrange(r.squared)
```
The worst fitting models seem to be in Africa. We'll add `geom_jitter()` to make it more apparent.
We can also plot the particular bad $$R^2$$ and plot the data.
```{r 99-25-Geom_Jitter plus bad R2}
glance %>%
ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5)
bad_fit <- filter(glance, r.squared < 0.25)
gapminder %>%
semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
```
The relation of this visual is the tragidies of HIV/AIDs epidemic and the Rwanda genocide.
## List-columns
- List-Columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors.
- Base-R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns.
```{r 99-25-List of Columns}
data.frame(x = list(1:3, 3:5))
```
- You can prevent `data.frame()` from treating a lists of lists by adding `I()` to the argument. However, this doesn't print well.
`I()` stands for **Inhibit Interpretation/Conversion of Objects**: Change the class of an object to indicate that is should be treated *as is*.
```{r 99-25-using I}
data.frame(
x = I(list(1:3, 3:5)),
y = c("1, 2", "3, 4, 5")
)
```
- Tibble alleviates this problem by being lazier (`tibble()` doesn’t modify its inputs) and by providing a better print method.
>Note where the quotes are placed
```{r 99-25-List with Tibble}
tibble(
x = list(1:3, 3:5),
y = c("1, 2", "3, 4, 5")
)
```
- `tribble()` can automatically work out that you need a list.
```{r 99-25-Tribble Example}
tribble(
~x, ~y,
1:3, "1, 2",
3:5, "3, 4, 5"
)
```
- List-columns are often most useful as intermediate data structure.
- Advantage of keeping related items together in a data frame is worth a little hassle.
There are three parts of an effective list-column pipeline:
1. You create the list-column using one of: `nest()`, `summarise() + list()`, or `mutate() + a map function`
2. You create other intermediate list-columns by transforming existing list columns with `map()`, `map2()`, or `pmap()`.
3. You simplify the list-column back down to a data frame or atomic vector.
## Creating list-columns
Typically, you won’t create list-columns with `tibble()`. Instead, you’ll create them from regular columns, using one of three methods:
1. With `tidyr::nest()` to convert a grouped data frame into a nested data frame where you have list-column of data frames.
2. With `mutate()` and vectorised functions that return a list.
3. With `summarise()` and summary functions that return multiple results.
>Alternatively, you might create them from a named list, using `tibble::enframe()`
When creating list-columns, make sure they are homogeneous.
### With nesting
- `nest()` creates a nested data frame, meaning, each row is a *meta-observation*.
- When applied to a group data frame, `nest()` keeps the grouping columns *as is*.
```{r 99-25-Gapminder Grouped}
gapminder %>%
group_by(country, continent) %>%
nest()
```
- You can also use it on an un-grouped data frame, specifying which columns you want to nest.
```{r 99-25-Specify Columns}
gapminder %>%
nest(data = c(year:gdpPercap))
```
### From vectorised functions
- If you use `stringr::str_split() + mutate()` you get a list-column.
> Again, note where the quotes are placed.
```{r 99-25-Vectorized Functions}
df <- tribble(
~x1,
"a,b,c",
"d,e,f,g"
)
df %>%
mutate(x2 = stringr::str_split(x1, ","))
```
- And now `unnest()` knows how to handle these list of vectors.
```{r 99-25-Un-Nested list of vectors}
df %>%
mutate(x2 = stringr::str_split(x1, ",")) %>%
unnest(x2)
```
>If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern.
Another example uses `map(), `map2()`, and pmap()`. We could re-write [Invoking different functions](https://r4ds.had.co.nz/iteration.html#invoking-different-functions) and rewrite it to use `mutate()`.
Previous Example Code:
```{r 99-25-Chapter 21 Code Example}
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sim = invoke_map(f, params, n = 10))
```
Refactored Code using `mutate()`
```{r 99-25-Refactored using Mutate}
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sims = invoke_map(f, params, n = 10))
```
> I donm't understand what is being expressed here.....the two code snippets are identical, except in chapter 25, the name is `sims` instead. Thoughts?
### From multivalued summaries
One restriction of `summarise()` is it only works with summary functions that return a single value. Implying, you can't use it with functions like `quantile()` that return a vector of arbitrary length.
```{r 99-25-mtcars with summarise}
mtcars %>%
group_by(cyl) %>%
summarise(q = quantile(mpg))
```
You can however, wrap the result in a list! This obeys the contract of summarise(), because each summary is now a list (a vector) of length 1.
```{r 99-25-Summarise with list}
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg)))
```
To make useful results with `unnest90`, you'll also ned to capture probabilities.
```{r 99-25-Capture Probabilities}
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>%
group_by(cyl) %>%
summarise(p = list(probs), q = list(quantile(mpg, probs))) %>%
unnest(c(p, q))
```
### From a named list
What do you do if you want to iterate over both the contents of a list and its elements?
- Make a data frame with one column containing the elements and another column containing the list!
You can use `tibble::enframe()`.
```{r 99-25-Tibble Enframe}
x <- list(
a = 1:5,
b = 3:4,
c = 5:6
)
df <- enframe(x)
df
```
Now, if we want to iterate over names and values in parrallel, we can use `map2()`.
```{r 99-25-map2}
df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
```
## Simplifying list-columns
To simplify the list-column back to a regular column (an atomic vector) or set of columns:
1. If you want a single value, use `mutate()` with `map_lgl()`, `map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.
2. If you want many values, use `unnest()` to convert list-columns back to regular columns, repeating the rows as many times as necessary.
### List to vector
You can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have.
```{r 99-25-List Column}
df <- tribble(
~x,
letters[1:5],
1:3,
runif(5)
)
df %>% mutate(
type = map_chr(x, typeof),
length = map_int(x, length)
)
```
Although being the same basic information you get from the default `tbl` print method, you can now use if for filtering.
Don’t forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in apple for each element of x.
Use the `.null` argument to provide a value to use if the element is missing (instead of returning NULL).
```{r 99-25-List with Null}
df <- tribble(
~x,
list(a = 1, b = 2),
list(a = 2, c = 4)
)
df %>% mutate(
a = map_dbl(x, "a"),
b = map_dbl(x, "b", .null = NA_real_)
)
```
### Unnesting
`unnest()` works by repeating the regular columns once for each element of the list-column.
```{r 99-25-Unnest with Tibble}
tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
```
You cannot simultaneously unnest two columns that contain different number of elements.
```{r 99-25-Unnest Error}
# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
~x, ~y, ~z,
1, c("a", "b"), 1:2,
2, "c", 3
)
df1
df1 %>% unnest(c(y, z))
# Doesn't work because y and z have different number of elements
df2 <- tribble(
~x, ~y, ~z,
1, "a", 1:2,
2, c("b", "c"), 3
)
df2
df2 %>% unnest(c(y, z))
```
## Making tidy data with broom
The broom package provides three general tools for turning models into tidy data frames:
1. `broom::glance(model)` returns a row for each model. Each column gives a model summary: either a measure of model quality, or complexity, or a combination of the two.
2. `broom::tidy(model)` returns a row for each coefficient in the model. Each column gives information about the estimate or its variability.
3. `broom::augment(model, data)` returns a row for each row in data, adding extra values like residuals, and influence statistics.
***Any Questions?***
## Meeting Videos
### Cohort 5
`r knitr::include_url("https://www.youtube.com/embed/lO_qtXM8OH8")`
<details>
<summary> Meeting chat log </summary>
```
00:09:25 Jon Harmon (jonthegeek): dslc.io/bookclubber
01:07:14 Jon Harmon (jonthegeek): ggrepel
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/d5FPrn4vTzc")`
<details>
<summary> Meeting chat log </summary>
```
00:10:13 Njoki Njuki Lucy: It will be nice to see familiar faces :)
00:36:21 Jon Harmon (jonthegeek): sim <- tribble(
~params,
list(min = -1, max = 1),
list(min = -1, max = 2),
list(min = -1, max = 3)
)
set.seed(424242)
use_purrr <- sim %>%
mutate(
sims = map(params, runif)
)
set.seed(424242)
use_rowwise <- sim %>%
rowwise() %>%
mutate(
sims = list(runif(params))
) %>%
ungroup()
identical(use_purrr, use_rowwise)
00:37:40 Jon Harmon (jonthegeek): > use_purrr
# A tibble: 3 x 2
params sims
<list> <list>
1 <named list [2]> <dbl [2]>
2 <named list [2]> <dbl [2]>
3 <named list [2]> <dbl [2]>
00:52:42 Jon Harmon (jonthegeek): > tibble(x = 1:2, y = list(1:4, 1))
# A tibble: 2 x 2
x y
<int> <list>
1 1 <int [4]>
2 2 <dbl [1]>
00:57:04 Jon Harmon (jonthegeek): > df2 %>% unnest(c(y))
# A tibble: 3 x 3
x y z
<dbl> <chr> <list>
1 1 a <int [2]>
2 2 b <dbl [1]>
3 2 c <dbl [1]>
00:57:23 Jon Harmon (jonthegeek): > df2 %>% unnest(y) %>% unnest(z)
# A tibble: 4 x 3
x y z
<dbl> <chr> <dbl>
1 1 a 1
2 1 a 2
3 2 b 3
4 2 c 3
01:11:05 Njoki Njuki Lucy: that's what I have been doing - doing the surgery😄
01:11:33 Njoki Njuki Lucy: :)
```
</details>
### Cohort 6
`r knitr::include_url("https://www.youtube.com/embed/v1OawROf1ZQ")`
`r knitr::include_url("https://www.youtube.com/embed/PX7y2jlRYuU")`
<details>
<summary> Meeting chat log </summary>
```
00:14:37 Marielena Soilemezidi: they're called quotes/quote marks!
00:14:43 Marielena Soilemezidi: I also got a brain freeze xD
00:42:28 Daniel: https://github.com/tidyverse/broom
```
</details>