---
title: "Open Case Studies: Predicting Annual Air Pollution"
css: style.css
output:
  html_document:
    includes:
      in_header:
      - header.html
      - GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---
<style>
#TOC {
background: url("https://opencasestudies.github.io/img/icon-bahi.png");
background-size: contain;
padding-top: 240px !important;
background-repeat: no-repeat;
}
</style>
<!-- Open all links in new tab-->
<base target="_blank"/>
```{r setup, include=FALSE}
library(knitr)
library(here)
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = FALSE,
fig.align = "center", out.width = '90%')
```
#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "main_plot_maps.png"))
```
####
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). Predicting Annual Air Pollution (Version v1.0.0). [https://github.com/opencasestudies/ocs-bp-air-pollution](https://github.com/opencasestudies/ocs-bp-air-pollution/).
####
To access the GitHub Repository for this case study see here: https://github.com/opencasestudies/ocs-bp-air-pollution.
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
# **Motivation**
***
A variety of different sources contribute different types of pollutants to what we call air pollution.
Some sources are natural while others are anthropogenic (human derived):
<p align="center">
<img width="600" src="https://www.nps.gov/subjects/air/images/Sources_Graphic_Huge.jpg?maxwidth=1200&maxheight=1200&autorotate=false">
</p>
##### [[source]](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.nps.gov%2Fsubjects%2Fair%2Fsources.htm&psig=AOvVaw2v7AVxSF8ZSAPEhNudVtbN&ust=1585770966217000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPDN66q_xegCFQAAAAAdAAAAABAD){target="_blank"}
### Major types of air pollutants
1) **Gaseous** - Carbon Monoxide (CO), Ozone (O~3~), Nitrogen Oxides (NO, NO~2~), Sulfur Dioxide (SO~2~)
2) **Particulate** - small liquids and solids suspended in the air (includes lead; can include certain types of dust)
3) **Dust** - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
4) **Biological** - pollen, bacteria, viruses, mold spores
See [here](http://www.redlogenv.com/worker-safety/part-1-dust-and-particulate-matter) for more detail on the types of pollutants in the air.
### Particulate pollution
Air pollution particulates are generally described by their **size**.
There are 3 major categories:
1) **Large Coarse** Particulate Matter - has diameter of >10 micrometers (10 µm)
2) **Coarse** Particulate Matter (called **PM~10-2.5~**) - has diameter of between 2.5 µm and 10 µm
3) **Fine** Particulate Matter (called **PM~2.5~**) - has diameter of < 2.5 µm
**PM~10~** includes any particulate matter <10 µm (both coarse and fine particulate matter)
Here you can see how these sizes compare with a human hair:
```{r, echo = FALSE, out.width= "600 px"}
knitr::include_graphics(here::here("img", "pm2.5_scale_graphic-color_2.jpg"))
```
##### [[source]](https://www.epa.gov/pm-pollution/particulate-matter-pm-basics){target="_blank"}
<!-- <p align="center"> -->
<!-- <img width="500" src="https://www.sensirion.com/images/sensirion-specialist-article-figure-1-cdd70.jpg"> -->
<!-- </p> -->
<u>The following plot shows the relative sizes of these different pollutants in micrometers (µm):</u>
```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "particulate-size-chart.png"))
```
##### [[source]](https://en.wikipedia.org/wiki/Particulates){target="_blank"}
<u>This table shows how deeply some of the smaller fine particles can penetrate within the human body:</u>
```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "sizes.jpg"))
```
##### [[source]](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full){target="_blank"}
### Negative impact of particulate exposure on health
Exposure to air pollution is associated with higher rates of [mortality](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5783186/){target="_blank"} in older adults and is known to be a risk factor for many diseases and conditions including but not limited to:
1) [Asthma](https://www.ncbi.nlm.nih.gov/pubmed/29243937){target="_blank"} - fine particle exposure (**PM~2.5~**) was found to be associated with higher rates of asthma in children
2) [Inflammation in type 1 diabetes](https://www.ncbi.nlm.nih.gov/pubmed/31419765){target="_blank"} - fine particle exposure (**PM~2.5~**) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
3) [Lung function and emphysema](https://www.ncbi.nlm.nih.gov/pubmed/31408135){target="_blank"} - higher concentrations of ozone (O~3~), nitrogen oxides (NO~x~), black carbon, and fine particle exposure **PM~2.5~** , at study baseline were significantly associated with greater increases in percent emphysema per 10 years
4) [Low birthweight](https://www.ncbi.nlm.nih.gov/pubmed/31386643){target="_blank"} - fine particle exposure (**PM~2.5~**) was associated with lower birth weight in full-term live births
5) [Viral Infection](https://www.tandfonline.com/doi/full/10.1080/08958370701665434){target="_blank"} - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (**PM~2.5~**)
See this [review article](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full){target="_blank"} for more information about sources of air pollution and the influence of air pollution on health.
### Sparse monitoring is problematic for Public Health
Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country.
However, as can be seen in the following figure, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources. We will see later, when we evaluate the data, that even some relatively large cities have only one monitor!
Furthermore, dramatic differences in pollution levels can be seen even within the same city. In fact, the term micro-environments describes environments within cities or counties that may vary greatly from one block to another.
```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "map_of_monitors.jpg"))
```
##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}
This lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.
### Machine learning offers a solution
An [article](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"} published in the *Environmental Health* journal addressed this issue by using data on population density, road density, and other features to model (or predict) air pollution levels at a more localized scale using machine learning (ML) methods.
```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "thepaper.png"))
```
##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}
#### {.reference_block}
Yanosky, J. D. et al. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. *Environ Health* 13, 63 (2014).
####
The authors of this article state that:
> "Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations."
##### [[source]](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"}
The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.
We will use similar methods to predict annual air pollution levels spatially within the US.
# **Main Question**
***
#### {.main_question_block}
<b><u> Our main question: </u></b>
1) Can we predict annual average air pollution concentrations at the granularity of zip code regions using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?
####
# **Learning Objectives**
***
In this case study, we will walk you through importing data from CSV files and performing machine learning methods to predict our outcome variable of interest (in this case annual fine particle air pollution estimates).
We will especially focus on using packages and functions from the [`tidyverse`](https://www.tidyverse.org/){target="_blank"}, and more specifically the [`tidymodels`](https://cran.r-project.org/web/packages/tidymodels/tidymodels.pdf){target="_blank"} package/ecosystem primarily developed and maintained by [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} and [Davis Vaughan](https://resources.rstudio.com/authors/davis-vaughan){target="_blank"}.
This package loads a set of modeling-related packages, including `rsample`, `recipes`, `parsnip`, `yardstick`, `workflows`, and `tune`.
The tidyverse is a collection of packages created by RStudio.
Even for those familiar with other approaches to programming in R, these packages make data science in R especially legible and intuitive.
```{r, echo = FALSE, fig.show = "hold", out.width = "20%", fig.align = "default"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
include_graphics("https://pbs.twimg.com/media/DkBFpSsW4AIyyIN.png")
```
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Familiarity with the tidymodels ecosystem
2. Ability to evaluate correlation among predictor variables (`corrplot` and `GGally`)
3. Ability to implement tidymodels packages such as `rsample` to split the data into training and testing sets, as well as cross validation sets
4. Ability to use the `recipes`, `parsnip`, and `workflows` packages to train and test a linear regression model and a random forest model
5. Demonstrate how to visualize geo-spatial data using `ggplot2`
<u>**Statistical Learning Objectives:**</u>
1. Basic understanding of the utility of machine learning for prediction and classification
2. Understanding of the need for training and test sets
3. Understanding of the utility of cross validation
4. Understanding of random forest
5. How to interpret root mean squared error (RMSE) to assess prediction performance
We will begin by loading the packages that we will need:
```{r}
# Load packages for data import and data wrangling
library(here)
library(readr)
library(dplyr)
library(skimr)
library(summarytools)
library(magrittr)
# Load packages for making correlation plots
library(corrplot)
library(RColorBrewer)
library(GGally)
library(tidymodels)
# Load packages for building machine learning algorithm
library(workflows)
library(vip)
library(tune)
library(randomForest)
library(doParallel)
# Load packages for data visualization/creating map
library(ggplot2)
library(stringr)
library(tidyr)
library(lwgeom)
library(sf)
library(maps)
library(rnaturalearth)
library(rgeos)
library(patchwork)
```
<u>**Packages used in this case study:** </u>
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"} | to import CSV files
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to view/arrange/filter/select/compare specific subsets of data
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"} | to get an overview of data
[summarytools](https://cran.r-project.org/web/packages/summarytools/index.html){target="_blank"} | to get an overview of data in a different style
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"} | to use the `%<>%` piping operator
[corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html){target="_blank"} | to make large correlation plots
[GGally](https://cran.r-project.org/web/packages/GGally/GGally.pdf){target="_blank"} | to make smaller correlation plots
[tidymodels](https://www.tidymodels.org){target="_blank"} | to load in a set of packages (broom, dials, infer, parsnip, purrr, recipes, rsample, tibble, yardstick)
[rsample](https://tidymodels.github.io/rsample/articles/Basics.html){target="_blank"} | to split the data into testing and training sets; to split the training set for cross-validation
[recipes](https://tidymodels.github.io/recipes/){target="_blank"} | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are `recipe()`, `prep()` and various transformation `step_*()` functions, as well as `bake` which extracts pre-processed training data (used to require `juice()`) and applies recipe preprocessing steps to testing data). See [here](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} for more info.
[parsnip](https://tidymodels.github.io/parsnip/){target="_blank"} | an interface to create models (major functions are `fit()`, `set_engine()`)
[yardstick](https://tidymodels.github.io/yardstick/){target="_blank"} | to evaluate the performance of models
[broom](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/){target="_blank"} | to get tidy output for our model fit and performance
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations with multiple layers
[dials](https://www.tidyverse.org/blog/2019/10/dials-0-0-3/){target="_blank"} | to specify hyper-parameter tuning
[tune](https://tune.tidymodels.org/){target="_blank"} | to perform cross validation, tune hyper-parameters, and get performance metrics
[workflows](https://www.rdocumentation.org/packages/workflows/versions/0.1.1){target="_blank"}| to create modeling workflow to streamline the modeling process
[vip](https://cran.r-project.org/web/packages/vip/vip.pdf){target="_blank"} | to create variable importance plots
[randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf){target="_blank"} | to perform the random forest analysis
[doParallel](https://cran.r-project.org/web/packages/doParallel/doParallel.pdf) | to fit cross validation samples in parallel
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the text in the map data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to separate data within a column into multiple columns
[rnaturalearth](https://cran.r-project.org/web/packages/rnaturalearth/README.html){target="_blank"} | to get the geometry data for the earth to plot the US
[maps](https://cran.r-project.org/web/packages/maps/maps.pdf){target="_blank"} | to get map database data about counties to draw them on our US map
[sf](https://r-spatial.github.io/sf/){target="_blank"} | to convert the map data into a data frame
[lwgeom](https://cran.r-project.org/web/packages/lwgeom/lwgeom.pdf){target="_blank"} | to use the `sf` function to convert map geographical data
[rgeos](https://cran.r-project.org/web/packages/rgeos/rgeos.pdf){target="_blank"} | to use geometry data
[patchwork](https://cran.r-project.org/web/packages/patchwork/patchwork.pdf){target="_blank"} | to allow plots to be combined
___
The first time we use a function, we will use the `::` notation to indicate which package it comes from.
Unless there are overlapping function names, this is not strictly necessary, but we include it here to be clear about where each function comes from.
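For example, here is a small sketch (using the built-in `mtcars` data) showing that the two notations call the same function once the package has been loaded:
```{r, eval = FALSE}
# Calling a function with an explicit namespace...
dplyr::glimpse(mtcars)
# ...is equivalent to calling it directly once dplyr is loaded with library(dplyr)
glimpse(mtcars)
```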
# **Context**
***
The [State of Global Air](https://www.stateofglobalair.org/){target="_blank"} is a report released every year to communicate the impact of air pollution on public health.
The [State of Global Air 2019 report](https://www.stateofglobalair.org/sites/default/files/soga_2019_report.pdf){target="_blank"}
which uses data from 2017 stated that:
> Air pollution is the **fifth** leading risk factor for mortality worldwide. It is responsible for more
deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity.
Each year, **more** people die from air pollution–related disease than from road **traffic injuries** or **malaria**.
<p align="center">
<img width="600" src="https://www.healtheffects.org/sites/default/files/SoGA-Figures-01.jpg">
</p>
##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_report.pdf){target="_blank"}
The report also stated that:
> In 2017, air pollution is estimated to have contributed to close to 5 million
deaths globally — nearly **1 in every 10 deaths**.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017deaths.png"))
```
##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf){target="_blank"}
The [State of Global Air 2018 report](https://www.stateofglobalair.org/sites/default/files/soga-2018-report.pdf){target="_blank"}, which used data from 2016 and separated out different types of air pollution, found that **particulate pollution was particularly associated with mortality**.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017mortality.png"))
```
##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga-2018-report.pdf){target="_blank"}
The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:
> More than **90%** of people worldwide live in areas **exceeding** the World Health Organization (WHO) **Guideline** for healthy air. More than half live in areas that do not even meet WHO's least-stringent air quality target.
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","PMworld.png"))
```
##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf){target="_blank"}
Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 [*Our Nation's Air*](https://gispub.epa.gov/air/trendsreport/2019/#home){target="_blank"} report from the US Environmental Protection Agency (EPA):
```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "US.png"))
```
##### [[source]](https://gispub.epa.gov/air/trendsreport/2019/documentation/AirTrends_Flyer.pdf){target="_blank"}
However, air pollution **continues to contribute to health risk for Americans**, particularly in **regions with higher than national average rates** of pollution, which at times exceed the WHO's recommended levels.
Thus, it is important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.
You can see the current air quality conditions at this [website](https://aqicn.org/city/usa/){target="_blank"}, and you will notice variation across different cities.
For example, here are the conditions in Topeka, Kansas at the time this case study was created:
```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "Kansas.png"))
```
##### [[source]](https://aqicn.org/city/usa/){target="_blank"}
It reports particulate values using what is called the [Air Quality Index](https://www.airnow.gov/index.cfm?action=aqibasics.aqi){target="_blank"} (AQI).
This [calculator](https://airnow.gov/index.cfm?action=airnow.calculator){target="_blank"} indicates that 114 AQI is equivalent to 40.7 ug/m^3^ and is considered unhealthy for sensitive individuals.
Thus, some areas exceed the WHO annual exposure guideline (10 ug/m^3^) and this may adversely affect the health of people living in these locations.
Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines.
Also, it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution.
For example, see this [article](https://www.nejm.org/doi/full/10.1056/NEJMoa1702747){target="_blank"} for more details.
The monitor data that we will use in this case study come from a system of monitors in which roughly 90% are located within cities.
Hence, there is an **equity issue** in terms of capturing the air pollution levels of more rural areas.
To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in **areas with little to no monitoring**.
Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:
<p align="center">
<img width="600" src="https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/SAWOEGBXMVGQ7AS5PZ6UUOX6FY.png">
</p>
##### [[source]](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.washingtonpost.com%2Fbusiness%2F2019%2F10%2F23%2Fair-pollution-is-getting-worse-data-show-more-people-are-dying%2F&psig=AOvVaw3v-ZDTBPnLP2MYtKf3Undj&ust=1585784479068000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPCyn9fxxegCFQAAAAAdAAAAABAd){target="_blank"}
This is what we aim to achieve in this case study.
# **Limitations**
***
There are some important considerations regarding the data analysis in this case study to keep in mind:
1. The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.
2. Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution levels on the personal level.
3. Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day, and even hour. There are data sources with finer temporal resolution; however, we are interested in long-term exposures, as these appear to be the most influential for health outcomes, so we chose to use annual level data.
# **What are the data?** {#whatarethedata}
***
When using machine learning for prediction, there are two main types of data of interest:
1. A **continuous** outcome variable that we want to predict
2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable
The **outcome variable** is what we are trying to **predict**.
To build (or train) our model, we use both the outcome and features.
The goal is to identify informative features that can explain a large amount of variation in our outcome variable.
Using this model, we can then predict the outcome for new observations that have the same features but for which we have not observed the outcome.
As a simple example, imagine that we have data about the sales and characteristics of cars from last year and we want to predict which cars might sell well this year.
We do not have the sales data yet for this year, but we do know the characteristics of our cars for this year.
We can build a model of the characteristics that explained sales last year to estimate what cars might sell well this year.
In this case, our outcome variable is the sales of cars, while the different characteristics of the cars make up our features.
### **Start with a question**
***
This is the most commonly missed step when developing a machine learning algorithm.
Machine learning can very easily be turned into an engineering problem.
Just dump the outcome and the features into a black box algorithm and voila!
But this kind of thinking can lead to major problems. In general, good machine learning questions:
1. Have a plausible explanation for why the features predict the outcome.
2. Consider potential variation in both the features and the outcome over time
3. Are consistently re-evaluated on criteria 1 and 2 over time.
In this case study, we want to **predict** air pollution levels.
To build this machine learning algorithm, our **outcome variable** is fine particulate matter (PM~2.5~) captured from air pollution monitors in the contiguous US from 2008.
Our **features** (or predictor variables) include data about population density, road density, urbanization levels, and NASA satellite data.
All of our data was previously collected by a [researcher](http://www.biostat.jhsph.edu/~rpeng/) at the [Johns Hopkins School of Public Health](https://www.jhsph.edu/) who studies air pollution and climate change.
### **Our outcome variable**
***
The monitor data that we will be using comes from **[gravimetric monitors](https://publiclab.org/wiki/filter-pm){target="_blank"}** (see picture below) operated by the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}.
```{r, echo = FALSE, out.width="100px"}
knitr::include_graphics(here::here("img","monitor.png"))
```
##### [image courtesy of [Kirsten Koehler](https://www.jhsph.edu/faculty/directory/profile/2928/kirsten-koehler)]
These monitors use a filtration system to specifically capture fine particulate matter.
```{r, echo = FALSE, out.width="150px"}
knitr::include_graphics(here::here("img","filter.png"))
```
##### [[source]](https://publiclab.org/wiki/filter-pm){target="_blank"}
The weight of this particulate matter is manually measured daily or weekly.
For the EPA standard operating procedure for PM gravimetric analysis in 2008, we refer the reader to [here](https://www3.epa.gov/ttnamti1/files/ambient/pm25/spec/RTIGravMassSOPFINAL.pdf){target="_blank"}.
<details><summary>For more on Gravimetric analysis, you can expand here </summary>
Gravimetric analysis is also used for [emission testing](https://www.mt.com/us/en/home/applications/Laboratory_weighing/emissions-testing-particulate-matter.html){target="_blank"}.
The same idea applies: a fresh filter is applied and the desired amount of time passes, then the filter is removed and weighed.
There are [other monitoring systems](https://www.sensirion.com/en/about-us/newsroom/sensirion-specialist-articles/particulate-matter-sensing-for-air-quality-measurements/){target="_blank"} that can provide hourly measurements, but we will not be using data from these monitors in our analysis.
Gravimetric analysis is considered to be among the most accurate methods for measuring particulate matter.
</details>
In our data set, the `value` column indicates the PM~2.5~ monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors.
The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m^3^).
Recall the WHO exposure guideline is < 10 ug/m^3^ on average annually for PM~2.5~.
### **Our features (predictor variables)**
***
There are 48 features with values for each of the 876 monitors (observations).
The data comes from the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}, the [National Aeronautics and Space Administration (NASA)](https://www.nasa.gov/){target="_blank"}, the US [Census](https://www.census.gov/about/what/census-at-a-glance.html){target="_blank"}, and the [National Center for Health Statistics (NCHS)](https://www.cdc.gov/nchs/about/index.htm){target="_blank"}.
<details><summary> Click here to see a table about the set of features </summary>
Variable | Details
---------- |-------------
**id** | Monitor number <br> -- the county number is indicated before the decimal <br> -- the monitor number is indicated after the decimal <br> **Example**: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors
**fips** | Federal information processing standard number for the county where the monitor is located <br> -- 5 digit id code for counties (zero is often the first value and sometimes is not shown) <br> -- the first 2 numbers indicate the state <br> -- the last three numbers indicate the county <br> **Example**: Alabama's state code is 01 because it is first alphabetically <br> (note: Alaska and Hawaii are not included because they are not part of the contiguous US)
**Lat** | Latitude of the monitor in degrees
**Lon** | Longitude of the monitor in degrees
**state** | State where the monitor is located
**county** | County where the monitor is located
**city** | City where the monitor is located
**CMAQ** | Estimated values of air pollution from a computational model called [**Community Multiscale Air Quality (CMAQ)**](https://www.epa.gov/cmaq){target="_blank"} <br> -- A monitoring system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution <br> -- ***Does not use any of the PM~2.5~ gravimetric monitoring data.*** (There is a version that does use the gravimetric monitoring data, but not this one!) <br> -- Data from the EPA
**zcta** | [Zip Code Tabulation Area](https://www2.census.gov/geo/pdfs/education/brochures/ZCTAs.pdf){target="_blank"} where the monitor is located <br> -- Postal Zip codes are converted into "generalized areal representations" that are non-overlapping <br> -- Data from the 2010 Census
**zcta_area** | Land area of the zip code area in meters squared <br> -- Data from the 2010 Census
**zcta_pop** | Population in the zip code area <br> -- Data from the 2010 Census
**imp_a500** | Impervious surface measure <br> -- Within a circle with a radius of 500 meters around the monitor <br> -- Impervious surfaces include roads, concrete, parking lots, and buildings <br> -- This is a measure of development
**imp_a1000** | Impervious surface measure <br> -- Within a circle with a radius of 1000 meters around the monitor
**imp_a5000** | Impervious surface measure <br> -- Within a circle with a radius of 5000 meters around the monitor
**imp_a10000** | Impervious surface measure <br> -- Within a circle with a radius of 10000 meters around the monitor
**imp_a15000** | Impervious surface measure <br> -- Within a circle with a radius of 15000 meters around the monitor
**county_area** | Land area of the county of the monitor in meters squared
**county_pop** | Population of the county of the monitor
**Log_dist_to_prisec** | Log (Natural log) distance to a primary or secondary road from the monitor <br> -- Highway or major road
**log_pri_length_5000** | Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) <br> -- Highways only
**log_pri_length_10000** | Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) <br> -- Highways only
**log_pri_length_15000** | Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) <br> -- Highways only
**log_pri_length_25000** | Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) <br> -- Highways only
**log_prisec_length_500** | Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_prisec_length_1000** | Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_prisec_length_5000** | Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_prisec_length_10000** | Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_prisec_length_15000** | Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_prisec_length_25000** | Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) <br> -- Highway and secondary roads
**log_nei_2008_pm25_sum_10000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
**log_nei_2008_pm25_sum_15000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
**log_nei_2008_pm25_sum_25000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_10000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_15000**| Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_25000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
**popdens_county** | Population density (number of people per kilometer squared area of the county)
**popdens_zcta** | Population density (number of people per kilometer squared area of zcta)
**nohs** | Percentage of people in the zcta area where the monitor is located that **do not have a high school degree** <br> -- Data from the Census
**somehs** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was **some high school education** <br> -- Data from the Census
**hs** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a **high school degree** <br> -- Data from the Census
**somecollege** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing **some college education** <br> -- Data from the Census
**associate** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an **associate degree** <br> -- Data from the Census
**bachelor** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **bachelor's degree** <br> -- Data from the Census
**grad** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **graduate degree** <br> -- Data from the Census
**pov** | Percentage of people in the zcta area where the monitor is located that lived in [**poverty**](https://aspe.hhs.gov/2008-hhs-poverty-guidelines) in 2008 <br> -- Data from the Census
**hs_orless** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **high school degree or less** (sum of nohs, somehs, and hs)
**urc2013** | [2013 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban 6 is completely rural <br> -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}
**urc2006** | [2006 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_154.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban 6 is completely rural <br> -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}
**aod** | Aerosol Optical Depth measurement from a NASA satellite <br> -- based on the diffraction of a laser <br> -- used as a proxy of particulate pollution <br> -- unit-less - higher value indicates more pollution <br> -- Data from NASA
</details>
Many of these features have to do with the circular area around the monitor called the "buffer". These are illustrated in the following figure:
```{r, echo = FALSE, out.width = "800px",}
knitr::include_graphics(here::here("img", "regression.png"))
```
##### [[source]](https://www.ncbi.nlm.nih.gov/pubmed/15292906){target="_blank"}
# **Data Import**
***
All of our data was previously collected by a [researcher](http://www.biostat.jhsph.edu/~rpeng/) at the [Johns Hopkins School of Public Health](https://www.jhsph.edu/) who studies air pollution and climate change.
We have one CSV file that contains both our single **outcome variable** and all of our **features** (or predictor variables).
Next, we import our data into R so that we can explore it further.
We will call our data object `pm` for particulate matter.
We import the data using the `read_csv()` function from the `readr` package.
```{r}
pm <- readr::read_csv(here("docs", "pm25_data.csv"))
```
# **Data Exploration and Wrangling**
***
The first step in performing any data analysis is to explore the data.
For example, we might want to better understand the variables included in the data, as we may learn about important details about the data that we should keep in mind as we try to predict our outcome variable.
First, let's just get a general sense of our data.
We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).
We will also use the `%>%` pipe, which can be used to define the input for later sequential steps.
This will make more sense when we have multiple sequential steps using the same data object.
To use the pipe notation we need to install and load `dplyr` as well.
For example, here we start with the `pm` data object and "pipe" it into the `glimpse()` function as input.
The output is an overview of what is in the `pm` object, such as the number of rows and columns, all the column names, the data types for each column, and the first few values in each column.
The output below is scrollable so you can see everything from the `glimpse()` function.
#### {.scrollable }
```{r}
# Scroll through the output!
pm %>%
dplyr::glimpse()
```
####
We can see that there are 876 monitors (rows) and that we have 50 total variables (columns) - one of which is the outcome variable. In this case, the outcome variable is called `value`.
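As a quick aside (a sketch that is not part of the original analysis), we could already count how many of these monitors exceed the WHO annual guideline of 10 ug/m^3^ mentioned earlier:
```{r, eval = FALSE}
# Hypothetical quick check: total monitors and how many exceed 10 ug/m^3 on average
pm %>%
  dplyr::summarize(n_monitors = dplyr::n(),
                   n_above_who_guideline = sum(value > 10))
```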
Notice that some of the variables that we would think of as factors (or categorical data) are currently of class character as indicated by the `<chr>` just to the right of the column names/variable names in the `glimpse()` output. This means that the variable values are character strings, such as words or phrases.
The other variables are of class `<dbl>`, which stands for double precision. This indicates that they are numeric and can have decimal values. In contrast, integer values would not allow for decimal numbers. Here is a [link](https://en.wikipedia.org/wiki/Double-precision_floating-point_format){target="_blank"} for more information on double precision numeric values.
Another common data class is factor, which is abbreviated like this: `<fct>`. A factor is something that has unique levels but no appreciable order to the levels. For example, we can have a numeric value that is just an ID, which we want to be interpreted simply as a unique level and not as the number that it would typically indicate. This would be useful for several of our variables:
1. the monitor ID (`id`)
2. the Federal Information Processing Standard number for the county where the monitor was located (`fips`)
3. the zip code tabulation area (`zcta`)
None of the values actually have any real numeric meaning, so we want to make sure that R does not interpret them as if they do.
So let's convert these variables into factors.
We can do this using the `across()` function of the `dplyr` package and the `as.factor()` base function.
The `across()` function has two main arguments: (i) the columns you want to operate on and (ii) the function or list of functions to apply to each column.
In this case, we are also using the `magrittr` assignment pipe (or double pipe), which looks like this: `%<>%`.
This allows us to use the `pm` data as input while also reassigning the output to the same data object name.
#### {.scrollable }
```{r}
# Scroll through the output!
pm %<>%
mutate(across(c(id, fips, zcta), as.factor))
glimpse(pm)
```
####
Great! Now we can see that these variables are now factors as indicated by `<fct>` after the variable name.
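Note that the assignment pipe used above is just shorthand; an equivalent version without `%<>%` would be:
```{r, eval = FALSE}
# Same conversion written with the standard pipe and an explicit reassignment
pm <- pm %>%
  mutate(across(c(id, fips, zcta), as.factor))
```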
## **The `skimr` package**
***
The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.
By design, it provides summary statistics about variables in the data set.
#### {.scrollable }
```{r}
# Scroll through the output!
skimr::skim(pm)
```
####
Notice how there is a column called `n_missing` that reports the number of missing values for each variable.
This is also reflected in the `complete_rate` column (the proportion of values that are not missing).
In our data set, it looks like we do not have any missing data.
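If we wanted to double-check this directly, a minimal sketch with base R would be:
```{r, eval = FALSE}
# Total number of missing values across the entire data set (0 means no missing data)
sum(is.na(pm))
```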
Also notice how the function provides separate tables of summary statistics for each data type: character, factor and numeric.
Next, the `n_unique` column shows us the number of unique values for each of our columns.
We can see that there are 49 states represented in the data.
We can see that for many variables there are many low values as the distribution shows two peaks, one near zero and another with a higher value.
This is true for the `imp` variables (measures of development), the `nei` variables (measures of emission sources) and the road density variables.
We can also see that the range of some of the variables is very large, in particular the area and population related variables.
Let's take a look to see which states are included using the `distinct()` function of the `dplyr` package:
```{r, eval = FALSE}
pm %>%
dplyr::distinct(state)
```
Scroll through the output:
#### {.scrollable }
```{r, echo = FALSE}
# Scroll through the output!
pm %>%
distinct(state) %>%
# this allows us to show the full output in the rendered rmarkdown
print(n = 1e3)
```
####
It looks like "District of Columbia" is being included as a state.
We can see that Alaska and Hawaii are not included in the data.
Let's also take a look to see how many monitors there are in a few cities. We can use the `filter()` function of the `dplyr` package to do so. For example, let's look at Albuquerque, New Mexico.
```{r}
pm %>% dplyr::filter(city == "Albuquerque")
```
We can see that there were only two monitors in the city of Albuquerque in 2008. Let's compare this with Baltimore.
```{r}
pm %>% dplyr::filter(city == "Baltimore")
```
There were, in contrast, five monitors for the city of Baltimore, despite the fact that, if we take a look at the land area and population of the counties for Baltimore City and Albuquerque, we can see that they had very similar land areas and populations.
```{r}
pm %>%
dplyr::filter(city == "Baltimore") %>%
select(county_area:county_pop)
pm %>%
dplyr::filter(city == "Albuquerque") %>%
select(county_area:county_pop)
```
In fact, the county containing Albuquerque had a larger population. Thus, the monitoring coverage for Albuquerque was not as thorough as it was for Baltimore.
This may be due to the fact that the monitor values were lower in Albuquerque. It is interesting to note that the CMAQ values are quite similar for both cities.
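More generally, a one-line sketch like this (not part of the original analysis) shows the monitor counts for every city at once:
```{r, eval = FALSE}
# Number of monitors per city, from most to fewest
pm %>%
  dplyr::count(city, sort = TRUE)
```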
## **Evaluate correlation**
***
In prediction analyses, it is also useful to evaluate if any of the variables are correlated. Why should we care about this?
If we are using a linear regression to model our data, then we might run into a problem called multicollinearity which can lead us to misinterpret what is really predictive of our outcome variable. This phenomenon occurs when the predictor variables actually predict one another. See [this case study](https://opencasestudies.github.io/ocs-bp-RTC-analysis/) for a deeper explanation about this.
Another reason we should look out for correlation is that we don't want to include redundant variables. This can add unnecessary noise to our algorithm causing a reduction in prediction accuracy and it can cause our algorithm to be unnecessarily slower. Finally, it can also make it difficult to interpret what variables are actually predictive.
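To see why this matters, here is a small simulated sketch (made-up data, not from this case study) in which two nearly identical predictors make the individual coefficient estimates unstable, even though the overall fit is fine:
```{r, eval = FALSE}
# Simulate two almost perfectly correlated predictors
set.seed(123)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # nearly a copy of x1
y  <- 2 * x1 + rnorm(100)

# The coefficients for x1 and x2 have very large standard errors because the
# model cannot tell the two predictors apart
summary(lm(y ~ x1 + x2))
```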
Intuitively we can expect some of our variables to be correlated.
Let's first take a look at all of our numeric variables with the `corrplot` package.
The `corrplot` package is one option for looking at correlations among possible predictors, and it is particularly useful when we have many predictors.
First, we calculate the Pearson correlation coefficients between all features pairwise using the `cor()` function of the `stats` package (which is loaded automatically). Then we use the `corrplot::corrplot()` function.
```{r}
PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)
```
The `tl.cex = 0.5` argument controls the size of the text label.
We can also plot the absolute value of the Pearson correlation coefficients using the `abs()` function from base R and change the order of the columns.
```{r}
corrplot(abs(PM_cor), order = "hclust", tl.cex = 0.5, cl.lim = c(0, 1))
```
There are several options for ordering the variables. See [here](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) for more options. Here we will use the "hclust" option for ordering by [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) - which will order the variables by how similar they are to one another.
The `cl.lim = c(0, 1)` argument limits the color label to be between 0 and 1.
We can see that the development variables (`imp`) variables are correlated with each other as we might expect.
We also see that the road density variables seem to be correlated with each other, and the emission variables seem to be correlated with each other.
Also notice that none of the predictors are highly correlated with our outcome variable (`value`).
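If we want to see this more directly, a quick sketch using the `PM_cor` matrix computed above lists each numeric variable's correlation with `value`, sorted by absolute value:
```{r, eval = FALSE}
# Correlation of each numeric variable with the outcome, strongest first
# (the outcome's correlation with itself, 1, will appear at the top)
sort(abs(PM_cor[, "value"]), decreasing = TRUE)
```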
We can also take a closer look using the `ggcorr()` and `ggpairs()` functions of the `GGally` package.
To select our variables of interest we can use the `select()` function with the `contains()` selection helper (a `tidyselect` helper available through `dplyr`).
First let's look at the `imp`/development variables.
We can change the default color palette (`palette = "RdBu"`) and add on
correlation coefficients to the plot (`label = TRUE`).
```{r, out.width = "400px"}
select(pm, contains("imp")) %>%
ggcorr(palette = "RdBu", label = TRUE)
select(pm, contains("imp")) %>%
ggpairs()
```
Indeed, we can see that `imp_a1000` and `imp_a500` are highly correlated, as are `imp_a10000` and `imp_a15000`.
Next, let's take a look at the road density data:
```{r, fig.width=12}
select(pm, contains("pri")) %>%
ggcorr(palette = "RdBu", hjust = .85, size = 3,
layout.exp=2, label = TRUE)
```
We can see that many of the road density variables are highly correlated with one another, while others are less so.
Finally let's look at the emission variables.
```{r}
select(pm, contains("nei")) %>%
ggcorr(palette = "RdBu", hjust = .85, size = 3,
layout.exp=2, label = TRUE)
select(pm, contains("nei")) %>%
ggpairs()
```
We would also expect that the population density data might correlate with some of these variables.
Let's take a look.
```{r}
pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county,
log_pri_length_10000, imp_a10000) %>%
ggcorr(palette = "RdBu", hjust = .85, size = 3,
layout.exp=2, label = TRUE)
pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county,
log_pri_length_10000, imp_a10000, county_pop) %>%
ggpairs()
```
Interesting! These variables don't appear to be highly correlated, so we might need variables from each of the categories to predict our monitor PM~2.5~ pollution values.
Because some variables in our data have extreme values, it might be good to take a log transformation. This can affect our estimates of correlation.
```{r}
pm %>%
mutate(log_popdens_county = log(popdens_county)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county,
log_pri_length_10000, imp_a10000) %>%
ggcorr(palette = "RdBu", hjust = .85, size = 3,
layout.exp=2, label = TRUE)
pm %>%
mutate(log_popdens_county = log(popdens_county)) %>%
mutate(log_pop_county = log(county_pop)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county,
log_pri_length_10000, imp_a10000, log_pop_county) %>%
ggpairs()
```
Indeed, this increased the correlation, but variables from each of these categories may still prove to be useful for prediction.
Now that we have a sense of what our data are, we can get started with building a machine learning model to predict air pollution.
## **Exercise**
***
<!---AP_DEW_Quiz-->
<iframe style="margin:0 auto; min-width: 100%;" id="AP_DEW_QuizIframe" class="interactive" src="https://rsconnect.biostat.jhsph.edu/OCS_AP_DEW_Quiz/" scrolling="no" frameborder="no"></iframe>
<!---------------->
Suppose we have a dataframe called `mydata`. There are five variables in this dataframe: `var1`, `var2`, `var3`, `var4`, and `var5`. Try to come up with the code for creating the plot called `myplot`. This plot visualizes correlations between variables. (Hint: start from `mydata` and use a function from the GGally package)
<!---AP_DEW_Exercise1-->
<iframe style="margin:0 auto; min-width: 100%;" id="AP_DEW_Exercise1Iframe" class="interactive" src="https://rsconnect.biostat.jhsph.edu/OCS_AP_DEW_Exercise/" scrolling="no" frameborder="no"></iframe>
<!---------------->
# **What is machine learning?** {#whatisml}
***
You may have learned about the central dogma of statistics: that you sample from a population.
![](img/cdi1.png)
Then you use the sample to try to guess what is happening in the population.
![](img/cdi2.png)
For prediction, we have a similar sampling problem:
![](img/cdp1.png)
But now we are trying to build a rule that can be used to predict a single observation's value of some characteristic using characteristics of the other observations.
![](img/cdp2.png)
Let's make this more concrete.
If you recall from the [What are the data?](#whatarethedata) section above, when we are using machine learning for prediction, our data consists of:
1. A **continuous** outcome variable that we want to predict
2. A set of feature(s) (or predictor variables) that we use to predict the outcome variable
We will use $Y$ to denote the outcome variable and $X = (X_1, \dots, X_p)$ to denote $p$ different features (or predictor variables).
Because our outcome variable is **continuous** (as opposed to categorical), we are interested in a particular type of machine learning algorithm.
Our goal is to build a machine learning algorithm that uses the features $X$ as input and predicts an outcome variable (or air pollution levels) in the situation where we do not know the outcome variable.
The way we do this is to use data where we have both the features $(X_1=x_1, \dots X_p=x_p)$ and the actual outcome $Y$ data to _train_ a machine learning algorithm to predict the outcome, which we call $\hat{Y}$.
When we say train a machine learning algorithm we mean that we estimate a function $f$ that uses the predictor variables $X$ as input or $\hat{Y} = f(X)$.
## **ML as an optimization problem**
If we are doing a good job, then our predicted outcome $\hat{Y}$ should closely match our actual outcome $Y$ that we observed.
In this way, we can think of machine learning (ML) as an optimization problem that tries to minimize the distance between $\hat{Y} = f(X)$ and $Y$.
$$d(Y - f(X))$$
The choice of distance metric $d(\cdot)$ can be the mean of the absolute or squared difference or something more complicated.
Much of the work in the fields of statistics and computer science is focused on defining $f$ and $d$.
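As a concrete sketch (with simulated values, not our air pollution data), two common choices of $d(\cdot)$ are the root mean squared error (RMSE), which we will use later in this case study, and the mean absolute error (MAE):
```{r, eval = FALSE}
# Simulated observed outcomes Y and predictions Y-hat = f(X)
set.seed(123)
y     <- rnorm(100, mean = 10)
y_hat <- y + rnorm(100, sd = 2)

sqrt(mean((y - y_hat)^2))  # root mean squared error (RMSE)
mean(abs(y - y_hat))       # mean absolute error (MAE)
```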
## **The parts of an ML problem**
To set up a machine learning (ML) problem, we need a few components.
To solve a (standard) machine learning problem you need:
1. A data set to train from.
2. An algorithm or set of algorithms you can use to try values of $f$
3. A distance metric $d$ for measuring how close $Y$ is to $\hat{Y}$
4. A definition of what a "good" distance is
While each of these components is a _technical_ problem, there has been a ton of work addressing those technical details. The most pressing open issue in machine learning is realizing that though these are _technical_ steps they are not _objective_ steps. In other words, how you choose the data, algorithm, metric, and definition of "good" says what you value and can dramatically change the results. A couple of cases where this was a big deal are:
1. [Machine learning for recidivism](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) - people built ML models to predict who would re-commit a crime. But these predictions were based on historically biased data which led to biased predictions about who would commit new crimes.
2. [Deciding how self driving cars should act](https://www.nature.com/articles/d41586-018-07135-0) - self driving cars will have to make decisions about how to drive, who they might injure, and how to avoid accidents. Depending on our choices for $f$ and $d$ these might lead to wildly different kinds of self driving cars. Try out the [moralmachine](http://moralmachine.mit.edu/) to see how this looks in practice.
Now that we know a bit more about machine learning, let's build a model to predict air pollution levels using the `tidymodels` framework.
# **Machine learning with `tidymodels`**
***
The goal is to build a machine learning algorithm that uses the features as input and predicts an outcome variable (in our case, air pollution levels) in situations where we do not know the outcome variable.
The way we do this is to use data where we have both the input and output data to _train_ a machine learning algorithm.
To train a machine learning algorithm, we will use the `tidymodels` package ecosystem.
## **Overview**
***
### **The `tidymodels` ecosystem**
***
To perform our analysis we will be using the `tidymodels` suite of packages.
You may be familiar with the older packages `caret` or `mlr` which are also for machine learning and modeling but are not a part of the `tidyverse`.
[Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} describes `tidymodels` like this:
> "Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret.
The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do."
There are many R packages in the `tidymodels` ecosystem, which assist with various steps in the process of building a machine learning algorithm. These are the main packages, but there are others.
```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","simpletidymodels.png"))
```
This is a schematic of how these packages work together to build a machine learning algorithm:
```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","MachineLearning.png"))
```
### **Benefits of `tidymodels`**
***
The two major benefits of `tidymodels` are:
1. Standardized workflow/format/notation across different types of machine learning algorithms
Different notations are required for different algorithms as the algorithms have been developed by different people. This would require the painstaking process of reformatting the data to be compatible with each algorithm if multiple algorithms were tested.
2. Can easily modify pre-processing, algorithm choice, and hyper-parameter tuning making optimization easy
Modifying a piece of the overall process is now easier than before because many of the steps are specified using the `tidymodels` packages in a convenient manner. Thus the entire process can be rerun after a simple change to pre-processing without much difficulty.
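As a small illustration of this modularity (a sketch, not the exact specifications we will use later), swapping the model type or the computational engine in `parsnip` is a one-line change while the rest of the workflow stays the same:
```{r, eval = FALSE}
# A linear regression specification using the "lm" engine
lm_spec <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

# Switching to a random forest only changes the model function and engine
rf_spec <- parsnip::rand_forest() %>%
  parsnip::set_engine("randomForest") %>%
  parsnip::set_mode("regression")
```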
## **`tidymodels` Steps**
***
```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","tidymodelsBasics.png"))
```
## **Splitting the data**
***
The first step after data exploration in machine learning analysis is to [split the data](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7){target="_blank"} into **training** and **testing** data sets.
The training data set will be used to build and tune our model.
This is the data that the model "learns" on.
The testing data set will be used to evaluate the performance of our model in a more generalizable way. What do we mean by "generalizable"?
Remember that our main goal is to use our model to be able to predict air pollution levels in areas where there are no gravimetric monitors.
Therefore, if our model is really good at predicting air pollution with the data that we use to build it, it might not do the best job for the areas where there are few to no monitors.
We would see really good prediction accuracy on the data used to build the model and might assume that we would do a good job estimating air pollution any time we use our model, but in fact this would likely not be the case.
This situation is what we call **[overfitting](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6){target="_blank"} **.
Overfitting happens when we end up modeling not only the major relationships in our data but also the noise within our data.
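To make this concrete, here is a small simulated sketch (made-up data, not the case study data): a very flexible model achieves a lower error on the data it was trained on than a simpler model, yet performs worse on held-out data.
```{r, eval = FALSE}
# Simulate a nonlinear relationship with noise and split into training and test sets
set.seed(42)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train_idx <- sample(n, 40)
train <- data.frame(x = x[train_idx], y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

# A simpler model and a very flexible (likely overfit) model
fit_simple  <- lm(y ~ poly(x, 3),  data = train)
fit_complex <- lm(y ~ poly(x, 15), data = train)

# Root mean squared error on training vs. test data
rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))
c(train_simple  = rmse(fit_simple, train),  test_simple  = rmse(fit_simple, test),
  train_complex = rmse(fit_complex, train), test_complex = rmse(fit_complex, test))
```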