---
output: md_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# OpenCaseStudies
### Important Links
- Static version: https://www.opencasestudies.org/ocs-bp-diet
- Interactive version: https://rsconnect.biostat.jhsph.edu/ocs-bp-diet-interactive/
- GitHub: https://github.com/opencasestudies/ocs-bp-diet
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
### Disclaimer
The purpose of the [Open Case
Studies](https://www.opencasestudies.org) project is **to demonstrate
the use of various data science methods, tools, and software in the
context of messy, real-world data**. A given case study does not cover
all aspects of the research process, is not claiming to be the most
appropriate way to analyze a given dataset, and should not be used in
the context of making policy decisions without external consultation
from scientific experts.
### License
This case study is part of the [OpenCaseStudies](https://www.opencasestudies.org) project.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 ([CC BY-NC 3.0](https://www.opencasestudies.org/ocs-bp-diet/)) United States License.
### Citation
To cite this case study please use:
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-diet](https://github.com/opencasestudies/ocs-bp-diet). Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).
### Acknowledgments
We would like to acknowledge [Jessica Fanzo](https://bioethics.jhu.edu/people/profile/jessica-fanzo/) for assisting in framing the major direction of the case study, as well as [Ashkan Afshin](https://globalhealth.washington.edu/faculty/ashkan-afshin) and [Erin Mullany](http://www.healthdata.org/about/erin-mullany) for giving us access to the data.
We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and for developing the `OCSdata` package.
We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work.
### Reading Metrics
The total reading time for this case study was calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **~ 100 minutes**
The Flesch-Kincaid Readability Index was also calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **Grade 10, Age 15**
### Title
Exploring global patterns of dietary behaviors associated with health risk
### Motivation
According to this [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8) that evaluated food consumption patterns in 185 countries for 15 dietary risk factors with probable associations with non-communicable disease:
> "High intake of sodium …, low intake of whole grains …, and low intake of fruits … were the leading dietary risk factors for deaths and DALYs globally and in many countries."
In this case study we evaluate the data used in this article to explore regional, age, and gender-specific differences in dietary consumption patterns around the world in 2017. We particularly focus on dietary consumption patterns within the United States (US) and how these compare to those of other countries.
### Motivating questions
<b><u> Our main questions: </u></b>
1) What are the global trends for potentially harmful diets?
2) How do males and females compare?
3) How do different age groups compare for these dietary factors?
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?
### Data
In this case study we will be using data that we requested from the [Global Burden of Disease (GBD)](http://www.healthdata.org/gbd) about consumption of dietary factors associated with health risks.
We will also be using data from a PDF of an article about the optimal consumption guidelines for these dietary factors.
Their methods for identifying and authenticating incidents are outlined [here](https://www.chds.us/ssdb/methods/).
Previously according to their website:
*"The database compiles information from more than 25 different sources including peer-reviewed studies, government reports, mainstream media, non-profits, private websites, blogs, and crowd-sourced lists that have been analyzed, filtered, deconflicted, and cross-referenced. **All of the information is based on open-source information and 3rd party reporting... and may include reporting errors.**"*
#### Learning Objectives
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Importing/extracting data from PDF (`dplyr`, `stringr`)
2. How to reshape data by pivoting between "long" and "wide" formats (`tidyr`)
3. Perform functions on all columns of a tibble (`purrr`)
4. Data cleaning with regular expressions (`stringr`)
5. Specific data value reassignment
6. Separate data within a column into multiple columns (`tidyr`)
7. Methods to compare data (`dplyr`)
8. Combining data from two sources (`dplyr`)
9. Make interactive plots (`ggiraph`)
10. Make a zoom facet for plot (`ggforce`)
11. Combine plots together (`cowplot`)
<u>**Statistical Learning Objectives:**</u>
1. Understanding of how the *t*-test and the ANOVA are specialized
regressions
2. Basic understanding of the utility of a regression analysis
3. How to implement a linear regression analysis in R
4. How to interpret regression coefficients
5. Awareness of *t*-test assumptions
6. Awareness of linear regression assumptions
7. How to use Q-Q plots to check for normality
8. Difference between fixed effects and random effects
9. How to perform a paired *t*-test
10. How to perform a linear mixed effects regression
#### Data import
In this case study we demonstrate how to import data from a CSV file and from a PDF.
#### Data wrangling
This case study also covers many of the `stringr` functions to manipulate character strings, including `str_split()`, `str_subset()`, `str_replace()`, `str_replace_all()`, `str_which()`, `str_count()`, `str_remove_all()`, and `str_trim()`.
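
As a rough illustration of how several of these `stringr` functions fit together, the sketch below cleans a small made-up string shaped like text extracted from a PDF table. The string, the object names, and the `" | "` separator are assumptions for illustration only and are not taken from the case study data.

```r
library(stringr)

# Hypothetical text, loosely shaped like lines extracted from a PDF table
pdf_text_raw <- "Dietary factor   Optimal intake\nFruit   250 g per day\nSodium   3 g per day\n"

# Split the single string into one element per line
lines <- str_split(pdf_text_raw, "\n")[[1]]

# Find which line mentions sodium
sodium_line <- str_which(lines, "Sodium")

# Keep only the lines that state a daily amount
amounts <- str_subset(lines, "per day")

# Replace runs of two or more spaces with a visible separator, then trim
amounts <- str_trim(str_replace_all(amounts, " {2,}", " | "))

# Count the digits in each line and strip everything that is not a digit
digit_counts <- str_count(amounts, "[0-9]")
digits_only <- str_remove_all(amounts, "[^0-9]")
```

Splitting first and filtering second keeps each step small, which makes it easier to inspect intermediate results when the extracted text is messy.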
This case study also covers how to use the `tidyr` functions such as `pivot_wider()` and `pivot_longer()` for reshaping data and the `separate()` function for creating new columns from an existing column. In addition, the case study covers how to replace `NA` values with a specific value using the `replace_na()` function.
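
The following is a minimal sketch of these `tidyr` steps on a toy tibble. The column names (`sex_age`, `intake`) and values are hypothetical, and replacing `NA` with `0` is only for demonstration; the appropriate replacement value depends on the data.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per country and combined sex/age label
diet <- tibble(
  country = c("A", "A", "B"),
  sex_age = c("Male_25-29", "Female_25-29", "Male_25-29"),
  intake  = c(10, 12, NA)
)

# Split the combined label into two columns
diet_sep <- separate(diet, sex_age, into = c("sex", "age"), sep = "_")

# Replace missing intake values with a chosen value (0 here, for illustration)
diet_clean <- replace_na(diet_sep, list(intake = 0))

# Long to wide: one column per sex
diet_wide <- pivot_wider(diet_clean, names_from = sex, values_from = intake)

# Wide back to long: one row per country, age, and sex
diet_long <- pivot_longer(
  diet_wide,
  cols = c(Male, Female),
  names_to = "sex",
  values_to = "intake"
)
```

Note that pivoting to wide and back can introduce explicit `NA` rows for combinations that were absent in the original long data, such as the female row for country B here.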
This case study also goes over how to use many of the `dplyr` functions to modify, select, and filter data, such as `rename()`, `mutate()`, `arrange()`, `select()`, and `filter()`. It covers functions to compare data, such as `setequal()`, `all_equal()`, and `setdiff()`, and the related `intersect()` function for finding overlapping values, and it describes the differences between these functions. We also introduce how to recode data using the `if_else()` and `case_when()` functions and how to join data using the `full_join()` function.
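
To make the comparison and joining functions concrete, here is a small sketch using two made-up tables whose food names only partly overlap. The table names, columns, and values are assumptions for illustration and do not come from the GBD data.

```r
library(dplyr)

# Hypothetical tables from two sources with partly overlapping food names
gbd_foods   <- tibble(food = c("fruits", "sodium", "whole grains"),
                      intake = c(90, 6, 30))
guide_foods <- tibble(food = c("fruits", "sodium", "legumes"),
                      optimal = c(250, 3, 60))

# Do both sources contain exactly the same set of names?
same_names <- setequal(gbd_foods$food, guide_foods$food)

# Names present in both sources
in_both <- intersect(gbd_foods$food, guide_foods$food)

# Names present only in the first source
only_in_gbd <- setdiff(gbd_foods$food, guide_foods$food)

# Keep every row from both tables, filling unmatched values with NA
combined <- full_join(gbd_foods, guide_foods, by = "food")

# Recode each row according to how intake compares with the guideline
combined <- mutate(
  combined,
  status = case_when(
    is.na(intake) | is.na(optimal) ~ "unmatched",
    intake >= optimal              ~ "at or above optimal",
    TRUE                           ~ "below optimal"
  )
)
```

Checking `setdiff()` in both directions before joining is a quick way to spot names that need cleaning so that rows match as intended.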
We also cover how to use the `purrr` package `map()` function to apply the same function to multiple columns in a tibble.
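
A minimal sketch of this pattern is shown below: every column of a toy tibble stored as text is converted to numeric with a single `map()` call. The column names and values are hypothetical.

```r
library(purrr)
library(tibble)

# Hypothetical tibble where numbers were read in as character strings
estimates <- tibble(
  mean_intake = c("1.5", "2.0"),
  lower       = c("1.0", "1.5")
)

# map() applies as.numeric() to each column and returns a named list
numeric_list <- map(estimates, as.numeric)

# Convert the list back into a tibble
estimates_num <- as_tibble(numeric_list)
```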
#### Data Visualization
In this case study we show how to make faceted plots, as well as plots with a facet that is zoomed in using the `facet_zoom()` function of the `ggforce` package. We cover how to highlight specific data points, as well as how to add annotations and horizontal lines to make the plot more interpretable.
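
The sketch below shows one way these pieces could be combined on made-up data: a highlighted point, a dashed horizontal line with an annotation, and a zoomed facet. The data frame, the guideline value of 25, and the zoom range are assumptions chosen only for illustration.

```r
library(ggplot2)
library(ggforce)

# Hypothetical data with a cluster of small values and a few large ones
df <- data.frame(
  intake    = c(1, 2, 3, 50, 60),
  pct       = c(10, 20, 30, 40, 50),
  highlight = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

p <- ggplot(df, aes(x = intake, y = pct)) +
  geom_point(aes(color = highlight), size = 3) +
  # Emphasize one point and grey out the rest
  scale_color_manual(values = c(`FALSE` = "grey50", `TRUE` = "red")) +
  # Horizontal reference line with a text label
  geom_hline(yintercept = 25, linetype = "dashed") +
  annotate("text", x = 40, y = 27, label = "hypothetical guideline") +
  # Add a second panel zoomed in on the small x values
  facet_zoom(xlim = c(0, 5))
```

Printing `p` draws the full plot alongside the zoomed panel, so crowded regions can be read without losing the overall context.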
We also demonstrate how to make interactive plots where the data points link you to other websites using the `ggiraph` package. Finally, we demonstrate how to combine plots using the `cowplot` package.
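
As a hedged sketch of how interactive points and combined plots might be set up, the example below links each point to a placeholder URL and places two plots side by side. The data, the `example.org` URLs, and the object names are hypothetical.

```r
library(ggplot2)
library(ggiraph)
library(cowplot)

# Hypothetical data with a placeholder link for each country
df <- data.frame(
  country = c("A", "B"),
  intake  = c(10, 20),
  url     = c("https://example.org/a", "https://example.org/b")
)

# JavaScript that runs when a point is clicked
df$onclick <- sprintf('window.open("%s")', df$url)

# Interactive scatter plot: hover shows the country, click opens the link
p_points <- ggplot(df, aes(x = country, y = intake)) +
  geom_point_interactive(aes(tooltip = country, onclick = onclick), size = 3)

# A second, static plot of the same data
p_bars <- ggplot(df, aes(x = country, y = intake)) +
  geom_col()

# Combine the two plots and render them as an interactive widget
combined <- plot_grid(p_points, p_bars, labels = c("A", "B"))
widget <- girafe(ggobj = combined)
```

The widget displays in the RStudio viewer or in a knitted HTML document; the interactivity is not available in static output formats.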
We also cover how to use the `viridis` package to make plots that are more interpretable for those who are colorblind.
### Analysis
This case study has a particularly thorough analysis section, which examines the data with progressively more complex methods. We describe how the $t$-test and the ANOVA are specialized forms of regression analysis.
We provide an introduction to regression analysis.
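
The equivalence between the two-sample $t$-test, the one-way ANOVA, and linear regression can be checked numerically. The sketch below uses simulated data (not the case study data) and assumes equal variances so that the three approaches match exactly.

```r
set.seed(1)

# Simulated intake values for two groups of 20
intake <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))
sex <- factor(rep(c("Female", "Male"), each = 20))

# Classic two-sample t-test with pooled variance
tt <- t.test(intake ~ sex, var.equal = TRUE)

# The same comparison expressed as a linear regression
fit <- lm(intake ~ sex)
t_from_lm <- summary(fit)$coefficients["sexMale", "t value"]
p_from_lm <- summary(fit)$coefficients["sexMale", "Pr(>|t|)"]

# One-way ANOVA on the same model: F equals the squared t statistic
anova_fit <- anova(fit)
f_from_anova <- anova_fit[["F value"]][1]
```

The sign of the $t$ statistic can differ between `t.test()` and `lm()` because they subtract the group means in opposite order, but the magnitude and the p-value agree.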
We also describe paired data and how to interpret this using both a paired $t$-test and a linear model with fixed effects or a linear model with mixed effects. We also describe the difference between random and fixed effects.
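
For paired data, a similar check shows that a paired $t$-test matches a linear model that includes the pairing unit as a fixed effect. The sketch below simulates paired male and female values per country; the data and names are hypothetical, and only base R is used.

```r
set.seed(2)
n <- 15

# Simulated country-level baseline shared by both members of each pair
country_effect <- rnorm(n, sd = 3)
male   <- 10 + country_effect + rnorm(n)
female <- 11 + country_effect + rnorm(n)

# Paired t-test on the within-country differences
paired_tt <- t.test(female, male, paired = TRUE)

# Same data in long format, one row per country and sex
long <- data.frame(
  intake  = c(male, female),
  sex     = factor(rep(c("Male", "Female"), each = n),
                   levels = c("Male", "Female")),
  country = factor(rep(seq_len(n), times = 2))
)

# Linear model with country as a fixed effect
fixed_fit <- lm(intake ~ sex + country, data = long)
t_fixed <- summary(fixed_fit)$coefficients["sexFemale", "t value"]
est_fixed <- unname(coef(fixed_fit)["sexFemale"])
```

A mixed effects version would replace the country fixed effect with a random intercept for country, for example with `lmer()` from the `lmerTest` package; that step is left out of this sketch.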
See [this other case study](https://opencasestudies.github.io/ocs-bp-rural-and-urban-obesity/) for more introductory material about comparing groups, hypothesis testing, probability, distributions, normality, paired data, and the paired $t$-test.
### Other notes and resources
[RStudio](https://rstudio.com/products/rstudio/features/){target="_blank"}
[Cheatsheet on RStudio IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf){target="_blank"}
[Other RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}
[RStudio projects](https://r4ds.had.co.nz/workflow-projects.html)
[Tidyverse](https://www.tidyverse.org/){target="_blank"}
[Piping in R](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}
[String manipulation cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"}
[Table formats](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"}
### Helpful Links
<u>Terms and concepts covered:</u>
[Interpunct](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html){target="_blank"}
[Regular expressions](https://www.r-bloggers.com/regular-expressions-every-r-programmer-should-know/){target="_blank"}
[Inference](https://www.britannica.com/science/inference-statistics){target="_blank"}
[Regression](https://lindeloev.github.io/tests-as-linear/){target="_blank"}
[Different types of regression](https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/){target="_blank"}
[Ordinary least squares method](http://setosa.io/ev/ordinary-least-squares-regression/){target="_blank"}
[Residual](https://www.statisticshowto.datasciencecentral.com/residual/){target="_blank"}
[$t$-tests](https://stattrek.com/statistics/dictionary.aspx?definition=two-sample%20t-test){target="_blank"}
[ANOVA](http://onlinestatbook.com/2/analysis_of_variance/intro.html){target="_blank"}
[$t$-tests and ANOVA are equivalent to regression](https://scientificallysound.org/2017/06/08/t-test-as-linear-models-r/){target="_blank"} also see [here](https://towardsdatascience.com/everything-is-just-a-regression-5a3bf22c459c){target="_blank"} and [here](https://lindeloev.github.io/tests-as-linear/){target="_blank"} about how many commonly known statistical tests are specialized forms of regression
[Normal Distribution](https://www.physiology.org/doi/full/10.1152/advan.00064.2017){target="_blank"}
[Q-Q plot](http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html){target="_blank"}
[Guide to residual diagnostic plots](https://data.library.virginia.edu/diagnostic-plots/) and [Examples](http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/){target="_blank"}
[Residual vs fitted plot](https://online.stat.psu.edu/stat462/node/118/){target="_blank"}
[Scale-location plot](https://boostedml.com/2019/03/linear-regression-plots-scale-location-plot.html){target="_blank"}
[Homoscedasticity](https://www.statisticssolutions.com/homoscedasticity/){target="_blank"}
[Heteroscedasticity](https://statisticsbyjim.com/regression/heteroscedasticity-regression/){target="_blank"}
[Interpreting `lm()` output](https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R){target="_blank"}
[Coefficients](https://www.theanalysisfactor.com/interpreting-regression-coefficients/){target="_blank"}
[Linear mixed effects regression](https://ourcodingclub.github.io/tutorials/mixed-models/){target="_blank"}
[Satterthwaite formula](https://www.statisticshowto.datasciencecentral.com/satterthwaite-formula/){target="_blank"}
[Mood's Two-Sample Scale Test](https://files.eric.ed.gov/fulltext/ED065559.pdf){target="_blank"}
[Standard deviation](https://www.statsdirect.com/help/basic_descriptive_statistics/standard_deviation.htm){target="_blank"}
[Homogeneity of Variances assumption](https://uc-r.github.io/assumptions_homogeneity){target="_blank"}
[polyunsaturated fatty acids](https://en.wikipedia.org/wiki/Polyunsaturated_fat){target="_blank"}
<u>Tests of Homogeneity of Variance for 3 or more groups:</u>
[Bartlett's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm){target="_blank"}
[Fligner-Killeen](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_NonParam_VarIndep){target="_blank"}
[Levene's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm){target="_blank"}
<u>Other helpful links:</u>
[Long and Wide Data Formats](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}
[Distributions](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}
[Skewed Distributions](http://onlinestatbook.com/2/glossary/skew.html){target="_blank"}
[Bimodal Distribution](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}
[ggplot2](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}
[Shapiro-Wilk Test](http://www.statistics4u.info/fundstat_eng/ee_shapiro_wilk_test.html){target="_blank"}
[Paired Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5579465/){target="_blank"}
[Welch's $t$-test](https://www.statisticshowto.datasciencecentral.com/welchs-test-for-unequal-variances/){target="_blank"}
[Parametric and Nonparametric Methods](https://www.mayo.edu/research/documents/parametric-and-nonparametric-demystifying-the-terms/doc-20408960){target="_blank"}
[Variance](https://stattrek.com/statistics/dictionary.aspx?definition=variance){target="_blank"}
[Balanced Study Design](https://www.statisticshowto.datasciencecentral.com/balanced-and-unbalanced-designs/){target="_blank"}
[Independent Observations](https://www.stat.cmu.edu/~cshalizi/36-220/lecture-5.pdf){target="_blank"}
[Transformation](https://www.statisticshowto.datasciencecentral.com/transformation-statistics/){target="_blank"}
[Permutation/Resampling Methods](https://jhu-advdatasci.github.io/2019/lectures/21-resampling-techniques.html){target="_blank"}
[Central Limit Theorem](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/){target="_blank"}
[Wilcoxon Signed Rank Test](http://www.biostathandbook.com/wilcoxonsignedrank.html)
[Wilcoxon Rank Sum Test](http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html){target="_blank"}
[Two-sample Kolmogorov-Smirnov Test](https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ks2samp.htm){target="_blank"}
[Type 1 Error](https://web.ma.utexas.edu/users/mks/statmistakes/errortypes.html){target="_blank"}
[p-value](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8){target="_blank"}
[Multiple Testing](https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf){target="_blank"}
[Bonferroni Method of Multiple Testing Correction](http://mathworld.wolfram.com/BonferroniCorrection.html){target="_blank"}
<u>Packages used in this case study: </u>
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"} | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to arrange/filter/select/compare specific subsets of the data
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"} | to get an overview of data
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"} | to read a PDF into R
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the text within the PDF of the data
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"} | to use the `%<>%` piping operator
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to perform functions on all columns of a tibble
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to separate data within a column into multiple columns
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations with multiple layers
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"} | to easily add regression line equations to plots
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to change details about factors (categorical variables)
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in color palette
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
#### For users
There is a [`Makefile`](Makefile) in this folder that allows you to type `make` to knit the case study contained in the `index.Rmd` to `index.html` and it will also knit the [`README.Rmd`](README.Rmd) to a markdown file (`README.md`).
Users can skip the Data Import and Data Wrangling sections to start with the Data Analysis and Visualization section if they wish.
#### For instructors
Instructors can skip the Data Import and Data Wrangling sections and start with either the Data Exploration, Data Analysis, or Data Visualization sections if they wish.
#### Target audience
This case study is appropriate for those new to R programming. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some introductory knowledge of R programming, particularly for creating visualizations.
#### Suggested homework
Students can evaluate consumption estimates of another dietary factor besides red meat.
#### Estimate of RMarkdown Compilation Time:
~ 85 - 95 seconds
This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems.