---
title: |
  ![](https://opencasestudies.github.io/img/icon-bahi.png){width=120px align=left style="padding-right: 20px"}
  Exploring global patterns of dietary behaviors associated with health risk
css: www/style.css
output:
  html_document:
    includes:
      in_header: www/GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
runtime: shiny_prerendered
---
<!-- Open all links in new tab-->
<base target="_blank"/>
<div align="left" id="google_translate_element"></div>
<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>
<script type="text/javascript">
function googleTranslateElementInit() {
new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(learnr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```
#### {.outline }
```{r, echo = FALSE}
knitr::include_graphics("www/img/mainplot.png")
```
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://www.opencasestudies.org){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-diet](https://github.com/opencasestudies/ocs-bp-diet). Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).
####
To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-diet.
You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:
```{r, eval=FALSE}
install.packages("OCSdata")
```
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
***
The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below:
```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```
***
**Readability Score: **
A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.
```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>% # reading table into dataframe, selecting relevant factors
tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
age = round(as.numeric(age))
)
knitr::kable(df, format="markdown")
```
***
Please help us by filling out our survey.
<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>
## **Motivation**
***
An [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8){target="_blank"} recently published in The
Lancet evaluated global dietary trends and the relationship of dietary factors with mortality and fertility.
```{r, echo = FALSE}
knitr::include_graphics("www/img/thepaper.png")
```
#### {.reference_block}
GBD 2017 Diet Collaborators. Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. *The Lancet* 393, 1958–1972 (2019).
####
This article evaluated food consumption patterns in 195 countries for 15 different dietary risk factors that have probable associations with non-communicable disease (NCD). For example, over-consumption of sodium is associated with high blood pressure. These consumption levels were then used to estimate levels of mortality and morbidity due to NCD, as well as disability-adjusted life-years (DALYs) attributable to sub-optimal consumption of foods related to these dietary risk factors. The authors found that:
> "High intake of sodium ..., low intake of whole grains ..., and low intake of fruits ... were the leading dietary risk factors for deaths and DALYs globally and in many countries."
This figure from the paper's supplementary materials shows the ranking of the 15 dietary risk factors based on the estimated number of attributable deaths. Here, the numbers and colors of the small squares indicate the ranking of the risk factors (rows) within each region (columns). Red indicates risk factors associated with a larger number of attributable deaths. The column on the right shows the overall global data. As you can see, the top 3 risk factors are issues for many different regions of the world.
```{r, echo = FALSE, out.width = "700px"}
knitr::include_graphics("www/img/deaths.png")
```
This case study will evaluate the data reported in this article to explore regional, age, and gender specific differences in dietary consumption patterns around the world in 2017.
## **Main Questions**
***
#### {.main_question_block}
<b><u> Our main questions are: </u></b>
1) What are the global trends for potentially harmful diets?
2) How do males and females compare?
3) How do different age groups compare for these dietary factors?
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?
####
## **Learning Objectives**
***
In this case study, we will walk you through importing data from PDF files and CSV files, cleaning data, wrangling data, comparing data, joining data, visualizing data, and <b> comparing two or more groups </b> using well-established and commonly used packages, including `stringr`, `tidyr`, `dplyr`, `purrr`, and `ggplot2`. We will especially focus on using packages and functions from the [Tidyverse](https://www.tidyverse.org/){target="_blank"}. The tidyverse is a collection of packages created by RStudio. Even if you are already familiar with other R packages, these packages make data science in R especially legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Importing/extracting data from PDF (`dplyr`, `stringr`)
2. How to reshape data by pivoting between "long" and "wide" formats (`tidyr`)
3. Perform functions on all columns of a tibble (`purrr`)
4. Data cleaning with regular expressions (`stringr`)
5. Specific data value reassignment
6. Separate data within a column into multiple columns (`tidyr`)
7. Methods to Compare data (`dplyr`)
8. Combining data from two sources (`dplyr`)
9. Make interactive plots (`ggiraph`)
10. Make a zoomed facet in a plot (`ggforce`)
11. Combine plots together (`cowplot`)
<u>**Statistical Learning Objectives:**</u>
1. Understanding of how the *t*-test and the ANOVA are specialized
regressions
2. Basic understanding of the utility of a regression analysis
3. How to implement a linear regression analysis in R
4. How to interpret regression coefficients
5. Awareness of *t*-test assumptions
6. Awareness of linear regression assumptions
7. How to use Q-Q plots to check for normality
8. Difference between fixed effects and random effects
9. How to perform paired *t*-test
10. How to perform a linear mixed effects regression
```{r, out.width = "20%", echo = FALSE, fig.align = "center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```
***
We will begin by loading the packages that we will need:
```{r}
library(here)
library(readr)
library(dplyr)
library(skimr)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(forcats)
library(lme4)
library(lmerTest)
library(car)
library(ggiraph)
library(ggforce)
library(viridis)
library(cowplot)
library(OCSdata)
```
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"} | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to arrange/filter/select/compare specific subsets of the data
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"} | to get an overview of data
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"} | to read a PDF into R
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the text within the PDF of the data
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"} | to use the `%<>%` piping operator
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to perform functions on all columns of a tibble
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to separate data within a column into multiple columns
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations with multiple layers
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"} | to easily add regression line equations to plots
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to change details about factors (categorical variables)
[lme4](https://cran.r-project.org/web/packages/lme4/lme4.pdf)| to fit a linear mixed effects model
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in a color palette that is easily interpreted by colorblind individuals
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
___
The first time we use a function, we will use `::` to indicate which package it comes from. Unless there are overlapping function names, this is not necessary, but we will include it to be informative about where the functions we use come from.
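To see what this looks like in practice, here is a quick illustration using the built-in `mtcars` data set (which is not part of this case study; it just keeps the example self-contained). With `dplyr` attached, the two calls are equivalent:

```r
library(dplyr)

# Explicit namespace: makes the source package clear, and works
# even in a script where dplyr was not attached with library()
dplyr::glimpse(mtcars)

# Same function, relying on dplyr being attached
glimpse(mtcars)
```

The explicit form is also a useful habit when two attached packages export functions with the same name.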
## **Context**
***
Here is an excerpt from the article itself about the context of the work:
```{r, echo = FALSE}
knitr::include_graphics("www/img/context.png")
```
Many dietary factors have well-established associations with health risk. The authors who generated this data set identified 15 dietary factors with probable health risks based on a literature search.
Here you can see a table of the sources for the health risks associated with the dietary factors. The first column shows the risk factors and the second column shows the health outcomes. This table is part of "Supplemental Table 1. Epidemiological evidence supporting causality between dietary risk factors and disease endpoints" from the paper’s [supplementary materials](https://www.thelancet.com/cms/10.1016/S0140-6736(19)30041-8/attachment/3d4c0258-c2ea-405f-8d11-e9ae65e6f996/mmc1.pdf){target="_blank"}.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/dietaryrisk.png")
```
In the article, the authors found that most of the mortality associated with each factor is related to cardiovascular disease.
```{r, echo = FALSE, out.width = "500px"}
knitr::include_graphics("www/img/cardiorisk.png")
```
## **Limitations**
***
There are some important limitations to keep in mind regarding the data from this article. The definition of certain dietary factors varied across some of the collection sources. Intakes of certain healthy foods, like vegetables and fruits, are likely positively correlated with each other and negatively correlated with intakes of unhealthy foods. Much of the data was collected with 24-hour recall surveys, which are prone to issues due to inaccurate memory recall and other biases, such as a tendency for some people to report healthier behaviors. The guidelines in the PDF are not specified by gender, even though it is known that optimal dietary requirements for certain nutrients differ by gender. The article discusses some limitations about accounting for overall food consumption when calculating consumption of particular foods:
> "To remove the effect of energy intake as a potential confounder and address measurement error in dietary assessment tools, most cohorts have adjusted for total energy intake in their statistical models. This energy adjustment means that diet components are defined as risks in terms of the share of diet and not as absolute levels of exposure. In other words, an increase in intake of foods and macronutrients should be compensated by a decrease in intake of other dietary factors to hold total energy intake constant. Thus, the relative risk of change in each component of diet depends on the other components for which it is substituted. However, the relative risks estimated from meta-analyses of cohort studies do not generally specify the type of substitution."
There are also important nuances to keep in mind regarding some of the dietary factors. For example, calcium consumption was calculated based on consumption of dairy products, although calcium can be acquired from other sources, including plant-based sources. However, in these data, the influence of plant-based calcium consumption was not accounted for, nor was supplementation through vitamin sources.
Also, while [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contains information for groups of individuals described as male or female.
## **What are the data?**
***
We will be using data that we requested from the [Global Burden of Disease (GBD)](http://www.healthdata.org/gbd){target="_blank"} of the [Institute for Health Metrics and Evaluation (IHME)](http://www.healthdata.org/about) about dietary intake, as well as the guideline data about optimal consumption amounts for different foods contained within the PDF of the article. We have two CSV files, dietary_risk_exposure_all_ages_2017.csv and dietary_risk_exposure_sep_ages_2017.csv. The first one includes consumption levels at the global level and for different countries for all ages combined.
Looking at the CSV file in Excel:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/csv.png")
```
Here you can see that the data contains mean consumption values for both men and women in various countries at the national level in 2017, for various foods that may be problematic for health. The units vary by food. For example, the `mean` column in a row that says "Diet low in fiber" indicates the average fiber consumption, in grams per day, per person of that gender in that region.
The second CSV file has similar data, but consumption levels for different age groups are separated.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/age_sep3.png")
```
The authors of this article obtained the data from a variety of sources, including household budget surveys, nutritional surveys using 24-hour recall of food consumption, and 24-hour urinary sodium analyses. Data were also derived from sales data from [Euromonitor](https://www.euromonitor.com/){target="_blank"}, and from estimates of the national availability of specific nutrients from the [United Nations Food and Agriculture Organization (FAO)](http://www.fao.org/home/en/){target="_blank"} and the [United States Department of Agriculture](https://www.usda.gov/){target="_blank"}'s [National Nutrition Database](https://data.nal.usda.gov/dataset/usda-national-nutrient-database-standard-reference-legacy-release){target="_blank"}.
## **Data Import**
***
If you have trouble accessing the [GitHub Repository](https://github.com/opencasestudies/ocs-bp-diet), the data can be downloaded from [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_all_ages_2017.csv) and [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_sep_ages_2017.csv).
Let's import our data into R now so that we can explore the data further.
In our case, we downloaded this data and put it within a "raw" subdirectory of a "data" directory for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for importing this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file, which is created when you start a new RStudio project. We can specify that we want to look for the files within the "raw" subdirectory of the "data" directory where our `.Rproj` file is located by separating the name of the "data" directory, the "raw" subdirectory, and the file name with commas.
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects.
</details>
***
```{r, eval=FALSE}
diet_data <- readr::read_csv(here("data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
```{r, echo=FALSE}
diet_data <- readr::read_csv(here("www", "data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("www", "data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
You may also use the `OCSdata` package to download the raw data:
```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-diet", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" sub-folder
# in your current working directory
```
If you used the `OCSdata` package to download the raw data, you can import the data into R like so:
```{r, eval=FALSE}
diet_data <- readr::read_csv(here("OCSdata", "data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("OCSdata", "data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
First let's just get a general sense of our data. We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).
```{r}
dplyr::glimpse(diet_data)
```
```{r}
glimpse(sep_age_diet_data)
```
Here we can tell that the `sep_age_diet_data` is much larger than the `diet_data`. The `diet_data` has only 5,880 rows while the `sep_age_diet_data` has 88,200 rows!
However, both files appear to have the same column structure with 11 variables each.
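If you would rather check these dimensions directly than read them off the `glimpse()` output, base R's `nrow()` and `ncol()` report them (this assumes the two data frames were loaded as above):

```r
# Dimensions reported by glimpse(), checked directly
nrow(diet_data)          # 5880 rows
nrow(sep_age_diet_data)  # 88200 rows
ncol(diet_data)          # 11 variables, same in both files
```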
The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.
```{r}
skim(diet_data)
```
Notice how there is a column providing the number of missing observations for each variable. It looks like our data is very complete and we do not have any missing data.
We also get a sense about the size of our data.
The `n_unique` column shows us the number of unique values for each of our columns.
Let's take a look at `sep_age_diet_data`.
```{r}
skim(sep_age_diet_data)
```
We can see that there are many more rows in this data set.
Let's change the variable name `rei_name` to `dietary_risk` so that it makes more sense. We can use the `rename()` function from the `dplyr` package.
```{r}
diet_data <- dplyr::rename(diet_data, dietary_risk = rei_name)
sep_age_diet_data <- dplyr::rename(sep_age_diet_data, dietary_risk = rei_name)
glimpse(diet_data)
glimpse(sep_age_diet_data)
```
Looks good!
We will then take a look at the different dietary risk factors considered.
To do this we will use the `distinct()` function of the `dplyr` package.
This function grabs only the distinct or unique rows from a given variable (`dietary_risk`, in our case) of a given data frame (`diet_data`, in our case).
```{r}
dplyr::distinct(diet_data, dietary_risk)
```
Both over- and under-consumption can be health problems!
We will be using the `%>%` pipe for sequential steps in our code later on.
This will make more sense when we have multiple sequential steps using the same data object.
We could do the same code as above using this notation. For example we first grab the `diet_data`, then we select the distinct values of the `dietary_risk` variable.
```{r}
diet_data %>%
  distinct(dietary_risk)
```
OK, so that gives us an idea of what dietary factors we can explore, and we can see that there are 15 of them.
Let's see if the `location_name` values are the same between both CSV files. To do this we will use the `setequal()` function of `dplyr`.
```{r}
dplyr::setequal(
  distinct(diet_data, location_name),
  distinct(sep_age_diet_data, location_name))
```
OK, we got the value of TRUE, so it looks like the same locations are in both files.
Note: In this case we're comparing two different objects, so using the pipe is not as useful.
Let's take a look at the locations included in the data.
#### {.scrollable }
```{r}
# scroll through the output!
sep_age_diet_data %>%
  distinct(location_name) %>%
  pull()
```
####
OK, so there are global values, as well as values for 195 countries.
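As a quick check on that count, `dplyr::n_distinct()` tallies the unique location names; with one global value plus 195 countries, we would expect 196:

```r
# Number of unique locations: 195 countries plus a global entry
dplyr::n_distinct(sep_age_diet_data$location_name)
```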
Let's take a look at the data when we order it by the mean consumption rate column. We can do so using the `arrange()` function of the `dplyr` package.
```{r}
diet_data %>%
  dplyr::arrange(mean) %>%
  glimpse()
```
OK, so it looks like people in Lebanon don't eat very many trans fatty acids.
Let's also figure out how many values there are in each age group of the data that is separated by age. We will use the `count()` function of the `dplyr` package to do this.
```{r}
sep_age_diet_data %>%
  dplyr::count(age_group_name)
```
That's a lot of values!
Let's look a bit deeper to try to understand why.
We can use the `count()` function again but get the number of values for each category within `sex`, `age_group_name` and `location_name` of the data.
```{r}
sep_age_diet_data %>%
  count(sex, age_group_name, location_name)
```
OK, so it looks like these are probably the consumption values for each of the different dietary factors (since there were 15 different factors) for each age group and gender combination within each country.
We can confirm this by filtering the data to one of the age groups, for a single gender, and for a single location. To do this we can use the `filter()` function of the `dplyr` package. Notice that we need to use two equal signs `==` to specify what values we would like for each variable.
```{r}
sep_age_diet_data %>%
  dplyr::filter(sex == "Female",
                age_group_name == "25 to 29",
                location_name == "Afghanistan")
```
This confirms that for each of the 15 dietary factors, our unit of observation is a combination of gender, age and country.
However, before we proceed with our analysis, we will want to perform some additional data wrangling. To do this, we will introduce the `pdftools` package, which will allow us to pull additional data from the manuscript itself.
While all of the mean consumption values are reported in grams, each dietary factor has a different amount that is considered optimal to consume. To make the consumption values more comparable across factors, let's also get some data from the PDF of the paper so that we can calculate consumption of these dietary factors as percentages of the optimal daily amount.
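To preview the idea with made-up numbers before we extract the real guideline values: the column names `mean` and `optimal_g` below are placeholders for illustration, and both values are invented, not taken from the data.

```r
library(dplyr)
library(tibble)

# Toy illustration of expressing consumption as a percentage of an
# optimal daily amount; all numbers here are invented
toy <- tibble(dietary_risk = "Diet low in fiber",
              mean = 40,       # hypothetical grams consumed per day
              optimal_g = 250) # hypothetical optimal grams per day

toy %>%
  mutate(percent_optimum = 100 * mean / optimal_g)  # 40/250 = 16%
```

The real calculation later will use the guideline amounts extracted from the paper's PDF in place of `optimal_g`.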
We are interested in this table on page 3:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Table.png")
```
First let's import the PDF using the `pdf_text()` function of the `pdftools` package.
You can find this file [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/Afshin_et_al_2019.pdf).
```{r, eval=FALSE}
paper <- pdftools::pdf_text(here("data", "raw",
                                 "Afshin_et_al_2019.pdf"))
```
```{r, echo=FALSE}
paper <- pdftools::pdf_text(here("www", "data", "raw",
                                 "Afshin_et_al_2019.pdf"))
```
We can save our imported data as an rda file (stands for R data file) using the `save()` function.
```{r, eval=FALSE}
save(diet_data, sep_age_diet_data, paper, file = here::here("data", "imported", "imported_data.rda"))
```
## **Data Wrangling**
***
If you have been following along but stopped, we could load our imported data like so:
```{r, eval=FALSE}
load(here::here("data", "imported", "imported_data.rda"))
```
```{r, echo=FALSE}
load(here::here("www", "data", "imported", "imported_data.rda"))
```
***
<details> <summary> If you skipped the data import section click here. </summary>
First you need to install and load the `OCSdata` package:
```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```
Then, you may load the imported data using the following code:
```{r, eval=FALSE}
imported_data("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))
```
If the package does not work for you, an RDA file (R data file) of the data can alternatively be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/imported/imported_data.rda). Download this file and place it in a subdirectory called "imported" within a subdirectory called "data" in your current working directory to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.
```{r, eval=FALSE}
load(here::here("data", "imported", "imported_data.rda"))
```
```{r, echo=FALSE}
load(here::here("www", "data", "imported", "imported_data.rda"))
```
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.
</details>
***
</details>
***
Let's take a look at our manuscript data.
We can use the `base` `summary()` function to get a sense of what the data looks like. By `base` we mean that these functions are part of the `base` package and are loaded automatically on startup of R. Thus, `library(base)` is not required.
```{r}
summary(paper)
```
We can see that we have 15 different character strings, one per page: each string contains the full text of one of the 15 pages of the PDF.
Again, the table we are interested in is on the third page, so let's grab just that portion of the PDF. The top of this page looks like:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/page3.png")
```
```{r}
# Here we will select the 3rd value in the paper object
pdf_table <- paper[3]
summary(pdf_table)
# specifying nchar.max truncates the output
glimpse(pdf_table, nchar.max = 800)
```
Here we can see that the `pdf_table` object now contains the text from the 3rd page as a **single large character string**. However the text is difficult to read because of the column structure in the PDF. Now let's try to grab just the text in the table.
One way to approach this is to split the string by some pattern that we notice in the table.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Table.png")
```
All the rows of interest of the table appear to start with the word `"Diet"`. Moreover, only the capitalized form of the word `"Diet"` appears to be within the table, and it is not present in the preceding text (although `"diet"` is).
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Diet_on_page3.png")
```
Let's use the `str_split()` function of the `stringr` package to split the data within the object called `pdf_table` by the word `"Diet"`. This splits the text of page 3 into a separate piece at every occurrence of `"Diet"` (but not `"diet"`, as this function is case-sensitive), removing the matched word itself.
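As a quick toy check (using a made-up string, not the manuscript text), we can confirm that `str_split()` only matches the capitalized form:

```{r}
library(stringr)

# only the capitalized "Diet" marks a split point;
# the lowercase "diet" is left inside its piece
str_split("a healthy diet Diet low in fruits Diet low in nuts",
          pattern = "Diet")
# [[1]]
# [1] "a healthy diet "  " low in fruits "  " low in nuts"
```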
In this case we are also using the magrittr assignment pipe, or double pipe, which looks like this: `%<>%` (from the `magrittr` package). This allows us to use the `pdf_table` data as input to the later steps and also reassign the output back to the same object name.
```{r}
pdf_table %<>%
stringr::str_split(pattern = 'Diet')
```
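As a toy illustration (invented string, not the table data), the assignment pipe does the piping and the reassignment in a single step:

```{r}
library(magrittr)
library(stringr)

x <- "Diet low in fruits"
# equivalent to: x <- x %>% str_split(pattern = "Diet")
x %<>% str_split(pattern = "Diet")
x
# [[1]]
# [1] ""               " low in fruits"
```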
Using the `base::summary()` and `dplyr::glimpse()` functions, we can see that we created a list of the pieces of the table that were separated by the word `"Diet"`. We can see that we start with the row that contains `"low in fruits"`.
```{r}
pdf_table %>%
summary()
```
```{r}
pdf_table %>%
glimpse()
```
In order to extract the values that we want from these character strings, we will use some additional functions from the `stringr` package. RStudio creates really helpful cheat sheets like this one which shows you all the major functions in the `stringr` package. You can download others [here](https://rstudio.com/resources/cheatsheets/){target="_blank"}.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/strings-1_str_split.png")
```
You can see that we could have also used the `str_split_fixed()` function, which separates the substrings into the columns of a matrix; however, we would need to know in advance how many substrings, or pieces, we want returned.
For more information about `str_split()` see [here](http://rfunction.com/archives/1499){target="_blank"}.
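For instance, here is a small sketch (toy input, not the manuscript text) of how `str_split_fixed()` behaves when we specify `n`, the number of pieces:

```{r}
library(stringr)

# n = 3 forces exactly three matrix columns;
# the remaining delimiters stay inside the last piece
str_split_fixed("a,b,c,d", pattern = ",", n = 3)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c,d"
```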
Let's separate the values within the list using the base `unlist()` function. This will allow us to easily select the different substrings within the object called `pdf_table`.
```{r}
pdf_table %<>%
unlist()
```
It's important to realize that the text before the first occurrence of `"Diet"` becomes the first value in the output. (This is why there are 17 elements in `pdf_table` rather than 15, the number of rows in the table.) We could use the `first()` function of the `dplyr` package to look at this value. However, we will suppress the output, as it is quite large.
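A toy example (made-up string) makes this concrete: splitting on two occurrences of a word yields three pieces, and the text before the first match is piece one:

```{r}
library(stringr)

pieces <- unlist(str_split("preceding text Diet low in fruits Diet low in nuts",
                           pattern = "Diet"))
length(pieces)
# [1] 3
pieces[1]
# [1] "preceding text "
```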
```{r, eval = FALSE}
dplyr::first(pdf_table)
```
Instead, we can take a look at the second element of the list using the `nth()` function of `dplyr`.
```{r}
nth(pdf_table, 2)
```
Indeed this looks like the first row of interest in our table:
```{r,echo = FALSE,out.width= "800px"}
knitr::include_graphics("www/img/firstrow.png")
```
Using the `last()` and the `nth()` functions of the `dplyr` package we can take a look at the last values of the list.
```{r}
# to see the second to last value we can use nth()
# the -2 specifies that we want the second-to-last value
# -3 would be third-to-last and -1 would be the last value
dplyr::nth(pdf_table, -2)
# to see the very last value we can use last()
dplyr::last(pdf_table)
```
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/end_of_table.png")
```
We don't need this part of the table or the text before the table if we just want the consumption recommendations.
So we will select the second through the second-to-last of the substrings. Since we have seventeen substrings, we will select the second through the sixteenth. However, rather than selecting by index, a better approach is to select based on phrases unique to the table text that we want. We will use the `str_subset()` function of the `stringr` package to select the table rows with consumption guidelines. Most of the rows contain the phrase "Mean daily consumption"; however, some rows use other phrases, including "Mean daily intake" and "24 h sodium". So we will subset for each of these phrases.
```{r}
# one could subset the pdf_table like this:
# pdf_table <- pdf_table[2:16]
pdf_table %<>%
str_subset(pattern = "Mean daily consumption|Mean daily intake|24 h")
```
Notice that we separate the different patterns to look for using the vertical bar character `"|"`, and that all of the patterns are within quotation marks together.
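As a small sketch with invented strings (not the real table rows), the alternation keeps any string matching at least one of the patterns:

```{r}
library(stringr)

rows <- c("Mean daily consumption of fruits",
          "Mean daily intake of calcium",
          "24 h sodium measurement",
          "Some footnote text we do not want")
str_subset(rows, pattern = "Mean daily consumption|Mean daily intake|24 h")
# [1] "Mean daily consumption of fruits" "Mean daily intake of calcium"
# [3] "24 h sodium measurement"
```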
#### {.think_question_block}
<u>Question opportunity:</u>
1) What other string patterns could you use to subset the rows of the table that we want?
2) Why might it be better to subset based on the text rather than the index?
####
Now the first row is what we want:
```{r}
first(pdf_table)
```
And the last row is what we want:
```{r}
last(pdf_table)
```
At this point, having a better look at the current representation of the table data in R, we might notice something that will need to be fixed. In the string above, the decimal points from the PDF have been read in as something called an interpunct instead of a period or decimal point. An interpunct is a centered dot, as opposed to a period, which is aligned to the bottom of the line.
The interpunct was previously used to separate words in certain languages, like ancient Latin.
<p align="center">
<img width="400" src="https://www.yourdictionary.com/image/articles/3417.Latin.jpg">
</p>
###### [[source](https://www.yourdictionary.com/image/articles/3417.Latin.jpg)]
You can produce an interpunct on a Mac like this:
<p align="center">
<img width="400" src="https://www.shorttutorials.com/mac-os-special-characters-shortcuts/images/middle-dot.png">
</p>
###### [[source](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html)]
It is important to replace these for later when we want these values to be converted from character strings to numeric. We will again use the `stringr` package. This time we will use the `str_replace_all()` function which replaces all instances of a pattern in an individual string. In this case we want to replace all instances of the interpunct with a decimal point.
```{r}
pdf_table %<>%
stringr::str_replace_all(pattern = "·",
replacement = ".")
last(pdf_table)
```
Looks good!
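To see why `str_replace_all()` matters here rather than `str_replace()`, compare the two on a toy string (invented, not from the paper):

```{r}
library(stringr)

x <- "3·0 (2·5 to 3·5)"
str_replace(x, pattern = "·", replacement = ".")      # only the first match
# [1] "3.0 (2·5 to 3·5)"
str_replace_all(x, pattern = "·", replacement = ".")  # every match
# [1] "3.0 (2.5 to 3.5)"
```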
Now we will try to split the string for each row based on the presence of two or more spaces to create the columns of the table, as there appears to be more than one space between the columns. In the printed output, each resulting substring will appear within its own pair of quotes.
For additional details, the second page of the `stringr` cheat sheet has more information about using "Special Characters" in `stringr`. For example `\\s` is interpreted as a space as the `\\` indicates that the `s` should be interpreted as a special character and not simply the letter s. The `{2,}` indicates two or more spaces, while `{2}` would indicate exactly two spaces.
```{r, echo = FALSE,out.width = "800px"}
knitr::include_graphics("www/img/strings-2_highlight.png")
```
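Before applying the pattern to the real rows, here is a toy check (invented string) of how `\\s{2,}` treats single versus multiple spaces:

```{r}
library(stringr)

# single spaces survive inside a piece;
# runs of two or more spaces mark the column boundaries
str_split("low in fruits  Mean daily consumption  250 g", pattern = "\\s{2,}")
# [[1]]
# [1] "low in fruits"          "Mean daily consumption" "250 g"
```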
#### {.scrollable }
```{r}
table_split <- str_split(string = pdf_table,
pattern = "\\s{2,}")
glimpse(table_split) #scroll the output!
```
####
Now we can see that each of our 15 strings has been split into pieces, but unfortunately, the split was not completely consistent across dietary factors. Why did this happen? If we look closely, we can see that the sugar-sweetened beverage and seafood categories had only one space between the first and second columns: the column with the dietary category and the one that describes in more detail what the consumption suggestion is about.
The values for these two columns appear to be together still in the same substring for these two categories. We can see this because there are no quotation marks adjacent to the word `"Mean"`.
Here you can see how the next substring should have started with the word `"Mean"` by the new inclusion of a quotation mark `"`. The red rectangles indicate the problematic substrings, while the green rectangles show examples where the split worked correctly.
```{r, echo = FALSE, out.width = "700px"}
knitr::include_graphics("www/img/substring_sep.png")
```
We can add an extra space in front of the word `"Mean"` for these particular categories and then try splitting again.
Since we originally split based on two or more spaces, we can add a single space in front of the word "Mean" for these cases and then try splitting again. We can use the `str_which()` function of the `stringr` package to find the indices of these particular cases.
```{r}
pdf_table %>%
str_which(pattern = "seafood|sugar")
```
Here we can use the `str_subset()` function of the `stringr` package to see just the strings that match these patterns within `pdf_table`:
```{r}
pdf_table %>%
str_subset(pattern = "seafood|sugar")
```
This is equivalent to using the `str_which()` function with `[]`:
```{r, eval = FALSE}
pdf_table[str_which(pdf_table, pattern = "seafood|sugar")]
```
Now we can replace these values within the pdf_table object after adding a space in front of "Mean":
```{r, eval=FALSE}
idx <- str_which(pdf_table, pattern = "seafood|sugar")
pdf_table[idx] <- str_replace(string = pdf_table[idx],
                              pattern = "Mean",
                              replacement = " Mean")
```
And now we can try splitting again by two or more spaces:
```{r, eval=FALSE}
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```
We could also just add a space in front of all the values of "Mean" in `pdf_table` since the split was performed based on two or more spaces. Thus the other elements in `pdf_table` would also be split just as before despite the additional space. Try this out yourself!
```{r, echo=FALSE}
save(pdf_table, file = here::here("www", "exercise", "dw_code1.rda"))
```
```{r DW_Code1-setup}
library(tidyverse)
library(magrittr)
load(here::here("www", "exercise", "dw_code1.rda"))
```
```{r DW_Code1, exercise=TRUE, eval=FALSE}
# fill in the blanks
pdf_table <- pdf_table %>%
stringr::___________(_______ = "Mean",
___________= " Mean")
table_split <- str______(pdf_table, pattern = "_______")
glimpse(table_split) # compare your output with the one below
```
```{r DW_Code1-hint-1}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str______(pdf_table, pattern = "_______")
glimpse(table_split) # compare your output with the one below
```
```{r DW_Code1-solution}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
glimpse(table_split) # compare your output with the one below
```
```{r, echo = FALSE}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```
***
<details> <summary> Click here to see desired output </summary>
```{r}
#scroll the output!
glimpse(table_split)
```
Looks better!
</details>
***
We want just the first (the food **category**) and third column (the optimal consumption **amount** suggested) for each row in the table. However, the table is currently stored as a list of character vectors, so it is not quite so simple to extract these values.
We can use the `map` function of the `purrr` package to accomplish this.
The `map()` function allows us to perform the same action on each element within an object, in this case, a list.
The following will allow us to select the first or third substring from each element of the `table_split` object.
```{r}
category <- map(table_split, 1)
amount <- map(table_split, 3)
head(category)
head(amount)
```
Now we will create a `tibble` using this data. However, currently both `category` and `amount` are of class `list`. To create a `tibble` we need to unlist the data to create vectors.
```{r}
class(category)
category %<>% unlist()
amount %<>% unlist()
class(category)
```
#### {.scrollable }
```{r}
category
amount
```
####
We could have done all of this in one command for each column like this:
```{r, eval = FALSE}
category <- unlist(map(table_split, 1))
amount <- unlist(map(table_split, 3))
```
Now we will create a `tibble`, the central data frame structure of the tidyverse, which allows us to use the other tidyverse packages with our data.
We will name our `tibble` columns as we create the `tibble` using the `tibble()` function of the `tibble` package (also re-exported by other tidyverse packages), as names are required in tibbles.
```{r}
guidelines <- tibble::tibble(
category = category,
amount = amount
)
guidelines
```
Looking pretty good!
### **Separating values within a variable**
***
Recall that the main goal of this data wrangling is to extract the optimal intake level for each dietary factor. So while we have managed to pull and organize the data from the pdf table, we need to further process the results to isolate this numeric value.
To do this, we want to separate the different numbers within the `amount` column to isolate the optimal amount and the optimal range, and eventually convert them to numeric values.
Recall what the original table looked like:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/firstrow.png")
```
We can use the `tidyr::separate()` function to separate the data within the amount column into three new columns based on the optimal level and the optimal range. We can separate the values based on the open parentheses `"("` and the long dash `"–"` characters. Again we will use the bar `"|"` to indicate that we want to separate by either character.
```{r}
# The first column will be called optimal
# It will contain the 1st part of the amount column data before the "("
# The 2nd column will be called lower
# It will contain the data after the "("
# The 3rd column will be called upper
# It will contain the 2nd part of the data based on the "–"
# The "\\" are necessary - we will explain very soon
guidelines %<>%
tidyr::separate(amount,