---
output: md_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

# OpenCaseStudies

### Important Links

- Static version: https://www.opencasestudies.org/ocs-bp-diet
- Interactive version: https://rsconnect.biostat.jhsph.edu/ocs-bp-diet-interactive/
- GitHub: https://github.com/opencasestudies/ocs-bp-diet
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies


### Disclaimer 

The purpose of the [Open Case
Studies](https://www.opencasestudies.org) project is **to demonstrate
the use of various data science methods, tools, and software in the
context of messy, real-world data**. A given case study does not cover
all aspects of the research process, is not claiming to be the most
appropriate way to analyze a given dataset, and should not be used in
the context of making policy decisions without external consultation
from scientific experts.

### License 

This case study is part of the [OpenCaseStudies](https://www.opencasestudies.org) project. 
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 ([CC BY-NC 3.0](https://www.opencasestudies.org/ocs-bp-diet/)) United States License.

### Citation 

To cite this case study please use:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-diet](https://github.com/opencasestudies/ocs-bp-diet). Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).

### Acknowledgments

We would like to acknowledge [Jessica Fanzo](https://bioethics.jhu.edu/people/profile/jessica-fanzo/) for assisting in framing the major direction of the case study, as well as [Ashkan Afshin](https://globalhealth.washington.edu/faculty/ashkan-afshin) and [Erin Mullany](http://www.healthdata.org/about/erin-mullany) for giving us access to the data. 

We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and for developing the `OCSdata` package.

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 

### Reading Metrics

The total reading time for this case study was calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **~ 100 minutes**

The Flesch-Kincaid Readability Index was also calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **Grade 10, Age 15**

### Title 

Exploring global patterns of dietary behaviors associated with health risk

### Motivation 

According to this [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8) that evaluated food consumption patterns in 185 countries for 15 dietary risk factors with probable associations with non-communicable disease:

> High intake of sodium …, low intake of whole grains …, and low intake of fruits … were the leading dietary risk factors for deaths and DALYs globally and in many countries.

In this case study we evaluate the data used in this article to explore regional, age-specific, and gender-specific differences in dietary consumption patterns around the world in 2017. We particularly focus on dietary consumption patterns within the United States (US) and how these compare to those of other countries.

### Motivating questions

<b><u> Our main questions: </u></b>

1) What are the global trends for potentially harmful diets?  
2) How do males and females compare?  
3) How do different age groups compare for these dietary factors?  
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?  

### Data

In this case study we will be using data that we requested from the [Global Burden of Disease (GBD)](http://www.healthdata.org/gbd) about consumption of dietary factors associated with health risks.

We will also be using data from a PDF of an article about the optimal consumption guidelines for these dietary factors.


#### Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:


<u>**Data Science Learning Objectives:**</u>

1. Importing/extracting data from PDF (`pdftools`, `stringr`)  
2. How to reshape data by pivoting between "long" and "wide" formats (`tidyr`)    
3. Perform functions on all columns of a tibble (`purrr`)  
4. Data cleaning with regular expressions (`stringr`)  
5. Specific data value reassignment  
6. Separate data within a column into multiple columns (`tidyr`) 
7. Methods to compare data (`dplyr`)  
8. Combining data from two sources (`dplyr`)  
9. Make interactive plots (`ggiraph`)  
10. Make a zoom facet for plot (`ggforce`) 
11. Combine plots together (`cowplot`)

<u>**Statistical Learning Objectives:**</u> 

1.  Understanding of how the *t*-test and the ANOVA are specialized
    regressions
2.  Basic understanding of the utility of a regression analysis
3.  How to implement a linear regression analysis in R
4.  How to interpret regression coefficients
5.  Awareness of *t*-test assumptions
6.  Awareness of linear regression assumptions
7.  How to use Q-Q plots to check for normality
8.  Difference between fixed effects and random effects
9.  How to perform paired *t*-test
10. How to perform a linear mixed effects regression


#### Data import 

In this case study we demonstrate how to import data from a CSV file and from a PDF. 

#### Data wrangling 

This case study also covers many of the `stringr` functions to manipulate character strings, including `str_split()`, `str_subset()`, `str_replace()`, `str_replace_all()`,  `str_which()`, `str_count()`, `str_remove_all()`, and  `str_trim()`.
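As a rough illustration of several of these `stringr` functions, the sketch below cleans a small block of text. The text is hypothetical and only stands in for text extracted from a PDF; it is not taken from the case study's data.

```r
library(stringr)

# Hypothetical text standing in for one page extracted from a PDF
page <- "Diet low in fruits  250 g\nDiet high in sodium  3 g\nNotes: see appendix"

# Split the page into lines; str_split() returns a list, so take element 1
lines <- str_split(page, "\n")[[1]]

# Keep only the lines that describe a dietary factor
diet_lines <- str_subset(lines, "^Diet")

# Positions of those lines within the full vector
diet_index <- str_which(lines, "^Diet")

# Count how many digits each kept line contains
digit_counts <- str_count(diet_lines, "[0-9]")

# Replace runs of two or more spaces with a single separator
cleaned <- str_replace_all(diet_lines, " {2,}", ";")

# Remove the shared prefix and trim any stray whitespace
factors <- str_trim(str_remove_all(cleaned, "Diet "))
```

Here `str_subset()` returns the matching strings themselves, while `str_which()` returns their positions.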

This case study also covers how to use the `tidyr` functions such as `pivot_wider()` and `pivot_longer()` for reshaping data and the `separate()` function for creating new columns from an existing column. In addition, the case study covers how to replace `NA` values with  a specific value using the `replace_na()` function. 
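The reshaping steps might look like the following minimal sketch. The column names and values are made up for illustration, and replacing `NA` with 0 is an arbitrary choice rather than what the case study does.

```r
library(tidyr)
library(tibble)

# Hypothetical consumption values; not taken from the GBD data
wide <- tibble(
  location  = c("United States", "Japan"),
  Male_25   = c(40, 20),
  Female_25 = c(30, NA)
)

# Wide to long: one row per location and sex/age combination
long <- pivot_longer(wide, cols = -location,
                     names_to = "sex_age", values_to = "mean_intake")

# Split the combined column into two columns at the underscore
long <- separate(long, sex_age, into = c("sex", "age"), sep = "_")

# Replace the missing value with 0 (an arbitrary value for illustration)
long <- replace_na(long, list(mean_intake = 0))

# Long back to wide: one column per sex
wide_again <- pivot_wider(long, names_from = sex, values_from = mean_intake)
```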

This case study also goes over many of the `dplyr` functions for modifying, selecting, and filtering data, such as `rename()`, `mutate()`, `arrange()`, `select()`, and `filter()`. It covers functions for comparing data, namely `setequal()`, `all_equal()`, and `setdiff()`, along with `intersect()` for finding overlapping values, and describes how these functions differ. We also introduce how to recode data using the `if_else()` and `case_when()` functions and how to join data using the `full_join()` function.
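A minimal sketch of comparing, joining, and recoding with `dplyr` follows. Both tables, their column names, and their values are hypothetical and are used only to show how the functions behave.

```r
library(dplyr)
library(tibble)

# Hypothetical tables; the values are made up
observed <- tibble(food = c("fruit", "sodium", "red meat"),
                   intake = c(100, 6, 50))
guideline <- tibble(food = c("fruit", "sodium", "whole grains"),
                    optimal = c(250, 3, 125))

# Set comparisons on the food names
same_foods    <- setequal(observed$food, guideline$food)   # same set of values?
only_observed <- setdiff(observed$food, guideline$food)    # in first, not second
shared        <- intersect(observed$food, guideline$food)  # in both

# Keep all rows from both tables, filling gaps with NA
combined <- full_join(observed, guideline, by = "food")

# Recode with if_else() (two outcomes) and case_when() (several outcomes)
combined <- combined %>%
  mutate(
    harmful_if_high = if_else(food %in% c("sodium", "red meat"), "yes", "no"),
    status = case_when(
      is.na(intake) | is.na(optimal) ~ "unknown",
      intake > optimal ~ "above optimal",
      TRUE ~ "at or below optimal"
    )
  ) %>%
  arrange(food)
```

`setdiff()` depends on argument order, whereas `setequal()` and `intersect()` do not.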

We also cover how to use the `purrr` package `map()` function to apply the same function to multiple columns in a tibble.
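For example, assuming a small tibble of made-up numeric columns, `map()` can apply the same summary to every column:

```r
library(purrr)
library(tibble)

# Hypothetical numeric columns
intake <- tibble(fruit = c(100, 150, NA), sodium = c(6, 5, 4))

# map() applies the function to each column and returns a list
col_means <- map(intake, function(x) mean(x, na.rm = TRUE))

# map_dbl() does the same but returns a named numeric vector
col_means_vec <- map_dbl(intake, function(x) mean(x, na.rm = TRUE))
```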

#### Data Visualization

In this case study we show how to make faceted plots, as well as plots with a facet that is zoomed in using the `facet_zoom()` function of the `ggforce` package. We cover how to highlight specific data points, as well as how to add annotations and horizontal lines to make the plot more interpretable.
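A zoomed facet with a highlighted point, a horizontal line, and an annotation could be sketched as below. The data, the reference value of 20, and the label text are all hypothetical and chosen only to make the plot readable.

```r
library(ggplot2)
library(ggforce)

# Hypothetical intake values by age; the last point is an outlier
dat <- data.frame(
  age = c(25, 35, 45, 55, 65),
  intake = c(12, 18, 25, 22, 95)
)
# Flag the point we want to stand out
dat$highlight <- dat$intake > 50

p <- ggplot(dat, aes(x = age, y = intake, color = highlight)) +
  geom_point(size = 3) +
  # Horizontal reference line at a hypothetical guideline value
  geom_hline(yintercept = 20, linetype = "dashed") +
  # Text annotation placed just above the reference line
  annotate("text", x = 35, y = 23, label = "hypothetical guideline") +
  # Give the highlighted point its own color and drop the legend
  scale_color_manual(values = c("FALSE" = "grey40", "TRUE" = "darkorange"),
                     guide = "none") +
  # Add a second panel zoomed in on the lower part of the y-axis
  facet_zoom(ylim = c(0, 30))
```

Printing `p` draws the full plot alongside the zoomed panel, so the outlier stays visible while the smaller values remain distinguishable.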

We also demonstrate how to make interactive plots where the data points link you to other websites using the `ggiraph` package. Finally, we demonstrate how to combine plots using the `cowplot` package.

We also cover how to use the `viridis` package to make plots that are more interpretable for those who are colorblind.

### Analysis 

This case study has a particularly thorough analysis section, which examines the data in progressively more complex ways. We describe how the $t$-test and the ANOVA are actually specialized forms of the regression analysis. 
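The equivalence can be checked on simulated data using base R alone. The sketch below assumes equal variances in the two groups (`var.equal = TRUE`), which is what makes the $t$-test match the regression exactly. The data are simulated and are not the case study's data.

```r
# Simulated data for two groups; not the case study's data
set.seed(123)
dat <- data.frame(
  sex = rep(c("Female", "Male"), each = 20),
  intake = c(rnorm(20, mean = 50, sd = 10), rnorm(20, mean = 58, sd = 10))
)

# Two-sample t-test assuming equal variances
tt <- t.test(intake ~ sex, data = dat, var.equal = TRUE)

# The same comparison as a linear regression with a group indicator
fit <- lm(intake ~ sex, data = dat)
slope <- summary(fit)$coefficients["sexMale", ]

# One-way ANOVA table for the same model
aov_table <- anova(fit)

# The p-values agree, and with two groups F equals t squared
all.equal(unname(tt$p.value), unname(slope["Pr(>|t|)"]))
all.equal(unname(tt$statistic)^2, aov_table[["F value"]][1])
```

The sign of the $t$ statistic can differ between the two approaches because `t.test()` reports the first group minus the second, while the regression slope is the second group minus the first.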

We provide an introduction to regression analysis.

We also describe paired data and how to analyze it using a paired $t$-test, a linear model with fixed effects, or a linear model with mixed effects. We also describe the difference between random and fixed effects. 
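For paired data, the same kind of check works with a fixed effect for each pair. The sketch below uses simulated data in which each hypothetical location contributes one female and one male value; the variable names are illustrative only.

```r
# Simulated paired data: each hypothetical location has a female and a male value
set.seed(42)
n <- 15
location_effect <- rnorm(n, mean = 0, sd = 8)
dat <- data.frame(
  location = factor(rep(seq_len(n), times = 2)),
  sex = rep(c("Female", "Male"), each = n),
  intake = c(50 + location_effect + rnorm(n, sd = 3),
             55 + location_effect + rnorm(n, sd = 3))
)

female <- dat$intake[dat$sex == "Female"]
male   <- dat$intake[dat$sex == "Male"]

# Paired t-test on the within-location differences
paired <- t.test(male, female, paired = TRUE)

# Equivalent intercept-only regression on the differences
difference <- male - female
fit_diff <- lm(difference ~ 1)

# Fixed effects regression: one intercept per location plus a sex effect
fit_fixed <- lm(intake ~ sex + location, data = dat)
sex_row <- summary(fit_fixed)$coefficients["sexMale", ]

# The paired t-test and the fixed effects model give the same test for sex
all.equal(unname(paired$statistic), unname(sex_row["t value"]))
all.equal(unname(paired$p.value), unname(sex_row["Pr(>|t|)"]))
```

A mixed effects version would replace the fixed `location` term with a random intercept, for example `lmer(intake ~ sex + (1 | location), data = dat)` using the `lmerTest` package.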


See [this other case study](https://opencasestudies.github.io/ocs-bp-rural-and-urban-obesity/) for more introductory material about comparing groups, hypothesis testing, probability, distributions, normality, paired data, and the paired $t$-test.

### Other notes and resources

[RStudio](https://rstudio.com/products/rstudio/features/){target="_blank"}  
[Cheatsheet on RStudio IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf){target="_blank"}  
[Other RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}   
[RStudio projects](https://r4ds.had.co.nz/workflow-projects.html)

[Tidyverse](https://www.tidyverse.org/){target="_blank"}   

[Piping in R](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"}   

[String manipulation cheatsheet](https://rstudio.com/resources/cheatsheets/){target="_blank"}  
[Table formats](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"}

### Helpful Links

<u>Terms and concepts covered:</u>  

[Interpunct](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html){target="_blank"}  
[Regular expressions](https://www.r-bloggers.com/regular-expressions-every-r-programmer-should-know/){target="_blank"}  
[Inference](https://www.britannica.com/science/inference-statistics){target="_blank"}  
[Regression](https://lindeloev.github.io/tests-as-linear/){target="_blank"}  
[Different types of regression](https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/){target="_blank"}  
[Ordinary least squares method](http://setosa.io/ev/ordinary-least-squares-regression/){target="_blank"}  
[Residual](https://www.statisticshowto.datasciencecentral.com/residual/){target="_blank"}  
[$t$-tests](https://stattrek.com/statistics/dictionary.aspx?definition=two-sample%20t-test){target="_blank"}  
[ANOVA](http://onlinestatbook.com/2/analysis_of_variance/intro.html){target="_blank"}  
[$t$-tests and ANOVA are equivalent to regression](https://scientificallysound.org/2017/06/08/t-test-as-linear-models-r/){target="_blank"} also see [here](https://towardsdatascience.com/everything-is-just-a-regression-5a3bf22c459c){target="_blank"} and [here](https://lindeloev.github.io/tests-as-linear/){target="_blank"} about how many commonly known statistical tests are specialized forms of regression  
[Normal Distribution](https://www.physiology.org/doi/full/10.1152/advan.00064.2017){target="_blank"}  
[Q-Q plot](http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html){target="_blank"}  
[Guide to residual diagnostic plots](https://data.library.virginia.edu/diagnostic-plots/) and [Examples](http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/){target="_blank"}  
[Residual vs fitted plot](https://online.stat.psu.edu/stat462/node/118/){target="_blank"}  
[Scale-location plot](https://boostedml.com/2019/03/linear-regression-plots-scale-location-plot.html){target="_blank"}  
[Homoscedasticity](https://www.statisticssolutions.com/homoscedasticity/){target="_blank"}  
[Heteroscedasticity](https://statisticsbyjim.com/regression/heteroscedasticity-regression/){target="_blank"}  
[Interpreting `lm()` output](https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R){target="_blank"}  
[Coefficients](https://www.theanalysisfactor.com/interpreting-regression-coefficients/){target="_blank"}  
[Linear mixed effects regression](https://ourcodingclub.github.io/tutorials/mixed-models/){target="_blank"}  
[Satterthwaite formula](https://www.statisticshowto.datasciencecentral.com/satterthwaite-formula/){target="_blank"}  
[Mood's Two-Sample Scale Test](https://files.eric.ed.gov/fulltext/ED065559.pdf){target="_blank"}   
[Standard deviation](https://www.statsdirect.com/help/basic_descriptive_statistics/standard_deviation.htm){target="_blank"}  
[Homogeneity of Variances assumption](https://uc-r.github.io/assumptions_homogeneity){target="_blank"}   
[Polyunsaturated fatty acids](https://en.wikipedia.org/wiki/Polyunsaturated_fat){target="_blank"} 


<u>Tests of Homogeneity of Variance for 3 or more groups:</u>

[Bartlett's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm){target="_blank"}  
[Fligner-Killeen](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_NonParam_VarIndep){target="_blank"}  
[Levene's test](https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm){target="_blank"}  
 

<u>Other helpful links:</u>

[Long and Wide Data Formats](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}    
[Distributions](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}  
[Skewed Distributions](http://onlinestatbook.com/2/glossary/skew.html){target="_blank"}  
[Bimodal Distribution](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}  
[ggplot2](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}    
[Shapiro-Wilk Test](http://www.statistics4u.info/fundstat_eng/ee_shapiro_wilk_test.html){target="_blank"}   
[Paired Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5579465/){target="_blank"}  
[Welch's $t$-test](https://www.statisticshowto.datasciencecentral.com/welchs-test-for-unequal-variances/){target="_blank"}    
[Parametric and Nonparametric Methods](https://www.mayo.edu/research/documents/parametric-and-nonparametric-demystifying-the-terms/doc-20408960){target="_blank"}   
[Variance](https://stattrek.com/statistics/dictionary.aspx?definition=variance){target="_blank"}  
[Balanced Study Design](https://www.statisticshowto.datasciencecentral.com/balanced-and-unbalanced-designs/){target="_blank"}  
[Independent Observations](https://www.stat.cmu.edu/~cshalizi/36-220/lecture-5.pdf){target="_blank"}  
[Transformation](https://www.statisticshowto.datasciencecentral.com/transformation-statistics/){target="_blank"}  
[Permutation/Resampling Methods](https://jhu-advdatasci.github.io/2019/lectures/21-resampling-techniques.html){target="_blank"}   
[Central Limit Theorem](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/){target="_blank"}  
[Wilcoxon Signed Rank Test](http://www.biostathandbook.com/wilcoxonsignedrank.html)  
[Wilcoxon Rank Sum Test](http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html){target="_blank"}  
[Two-sample Kolmogorov-Smirnov Test](https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ks2samp.htm){target="_blank"}  
[Type 1 Error](https://web.ma.utexas.edu/users/mks/statmistakes/errortypes.html){target="_blank"}  
[p-value](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8){target="_blank"}  
[Multiple Testing](https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf){target="_blank"}    
[Bonferroni Method of Multiple Testing Correction](http://mathworld.wolfram.com/BonferroniCorrection.html){target="_blank"}

<u>Packages used in this case study: </u>

 Package   | Use in this case study                                                                       
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data  
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data   
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"}   | to read a PDF into R   
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text within the PDF of the data   
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator  
[purrr](https://purrr.tidyverse.org/){target="_blank"}      | to perform functions on all columns of a tibble   
[tibble](https://tibble.tidyverse.org/){target="_blank"}     | to create data objects that we can manipulate with  dplyr/stringr/tidyr/purrr  
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns 
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers  
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"}    | to easily add regression line equations to plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to change details about factors (categorical variables)  
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing   
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances   
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive   
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots  
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot with a color palette that is easier to interpret for those who are colorblind    
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 

#### For users 

There is a [`Makefile`](Makefile) in this folder that allows you to type `make` to knit the case study contained in the `index.Rmd` to `index.html` and it will also knit the [`README.Rmd`](README.Rmd) to a markdown file (`README.md`). 

Users can skip the Data Import and Data Wrangling sections to start with the Data Analysis and Visualization section if they wish. 

#### For instructors

Instructors can skip the Data Import and Data Wrangling sections and start with either the Data Exploration, Data Analysis, or Data Visualization sections if they wish. 


#### Target audience 

This case study is appropriate for those new to R programming. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some introductory knowledge of R programming, particularly for creating visualizations. 

#### Suggested homework

Students can evaluate consumption estimates of another dietary factor besides red meat.

#### Estimate of RMarkdown Compilation Time:

About 85 to 95 seconds

This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary with different machines and operating systems.