---
title: |
  ![](https://opencasestudies.github.io/img/icon-bahi.png){width=120px align=left style="padding-right: 20px"}
  Exploring global patterns of dietary behaviors associated with health risk
css: www/style.css
output:
  html_document:
    includes:
      in_header: www/GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
runtime: shiny_prerendered
---
<!-- Open all links in new tab-->
<base target="_blank"/>
<div align="left" id="google_translate_element"></div>
<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>
<script type="text/javascript">
function googleTranslateElementInit() {
new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(learnr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```
#### {.outline }
```{r, echo = FALSE}
knitr::include_graphics("www/img/mainplot.png")
```
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://www.opencasestudies.org){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). [https://github.com/opencasestudies/ocs-bp-diet](https://github.com/opencasestudies/ocs-bp-diet). Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).
####
To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-diet.
You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:
```{r, eval=FALSE}
install.packages("OCSdata")
```
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
***
The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below:
```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```
***
**Readability Score: **
A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.
```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>% # reading table into dataframe, selecting relevant factors
tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
age = round(as.numeric(age))
)
knitr::kable(df, format="markdown")
```
***
Please help us by filling out our survey.
<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>
## **Motivation**
***
An [article](https://www.thelancet.com/action/showPdf?pii=S0140-6736%2819%2930041-8){target="_blank"} recently published in The
Lancet evaluated global dietary trends and the relationship of dietary factors with mortality and fertility.
```{r, echo = FALSE}
knitr::include_graphics("www/img/thepaper.png")
```
#### {.reference_block}
GBD 2017 Diet Collaborators. Health effects of dietary risks in 195 countries, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. *The Lancet* 393, 1958–1972 (2019).
####
This article evaluated food consumption patterns in 195 countries for 15 different dietary risk factors that have probable associations with non-communicable disease (NCD). For example, over-consumption of sodium is associated with high blood pressure. These consumption levels were then used to estimate levels of mortality and morbidity due to NCD, as well as disability-adjusted life-years (DALYs) attributable to sub-optimal consumption of foods related to these dietary risk factors. The authors found that:
> "High intake of sodium ..., low intake of whole grains ..., and low intake of fruits ... were the leading dietary risk factors for deaths and DALYs globally and in many countries."
This figure from the paper's supplementary materials shows the ranking of the 15 dietary risk factors based on the estimated number of attributable deaths. Here, the numbers and colors of the small squares indicate the ranking of the risk factors (rows) within each region (columns). Red indicates risk factors associated with a larger number of attributable deaths. The column on the right shows the overall global data. As you can see, the top 3 risk factors are issues for many different regions of the world.
```{r, echo = FALSE, out.width = "700px"}
knitr::include_graphics("www/img/deaths.png")
```
This case study will evaluate the data reported in this article to explore regional, age, and gender specific differences in dietary consumption patterns around the world in 2017.
## **Main Questions**
***
#### {.main_question_block}
<b><u> Our main questions are: </u></b>
1) What are the global trends for potentially harmful diets?
2) How do males and females compare?
3) How do different age groups compare for these dietary factors?
4) How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?
####
## **Learning Objectives**
***
In this case study, we will walk you through importing data from PDF files and CSV files, cleaning data, wrangling data, comparing data, joining data, visualizing data, and <b> comparing two or more groups </b> using well-established and commonly used packages, including `stringr`, `tidyr`, `dplyr`, `purrr`, and `ggplot2`. We will especially focus on using packages and functions from the [Tidyverse](https://www.tidyverse.org/){target="_blank"}. The tidyverse is a collection of packages created by RStudio. Even if you are already familiar with other R packages, these packages make data science in R especially legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Importing/extracting data from PDF (`dplyr`, `stringr`)
2. How to reshape data by pivoting between "long" and "wide" formats (`tidyr`)
3. Perform functions on all columns of a tibble (`purrr`)
4. Data cleaning with regular expressions (`stringr`)
5. Specific data value reassignment
6. Separate data within a column into multiple columns (`tidyr`)
7. Methods to Compare data (`dplyr`)
8. Combining data from two sources (`dplyr`)
9. Make interactive plots (`ggiraph`)
10. Make a zoomed facet in a plot (`ggforce`)
11. Combine plots together (`cowplot`)
<u>**Statistical Learning Objectives:**</u>
1. Understanding of how the *t*-test and the ANOVA are specialized
regressions
2. Basic understanding of the utility of a regression analysis
3. How to implement a linear regression analysis in R
4. How to interpret regression coefficients
5. Awareness of *t*-test assumptions
6. Awareness of linear regression assumptions
7. How to use Q-Q plots to check for normality
8. Difference between fixed effects and random effects
9. How to perform paired *t*-test
10. How to perform a linear mixed effects regression
```{r, out.width = "20%", echo = FALSE, fig.align = "center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```
***
We will begin by loading the packages that we will need:
```{r}
library(here)
library(readr)
library(dplyr)
library(skimr)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(forcats)
library(lme4)
library(lmerTest)
library(car)
library(ggiraph)
library(ggforce)
library(viridis)
library(cowplot)
library(OCSdata)
```
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"} | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to arrange/filter/select/compare specific subsets of the data
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"} | to get an overview of data
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"} | to read a PDF into R
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the text within the PDF of the data
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"} | to use the `%<>%` piping operator
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to perform functions on all columns of a tibble
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to separate data within a column into multiple columns
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations with multiple layers
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"} | to easily add regression line equations to plots
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to change details about factors (categorical variables)
[lme4](https://cran.r-project.org/web/packages/lme4/lme4.pdf)| to fit a linear mixed effects model
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in a color palette that is easily interpreted by colorblind individuals
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
___
The first time we use a function, we will use `::` to indicate which package it comes from. Unless there are overlapping function names, this is not necessary, but we will include it to be informative about where the functions we use come from.
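To see what this looks like in practice, here is a quick illustration using the built-in `mtcars` data set (which is not part of this case study; it just keeps the example self-contained). With `dplyr` attached, the two calls are equivalent:

```r
library(dplyr)

# Explicit namespace: makes the source package clear, and works
# even in a script where dplyr was not attached with library()
dplyr::glimpse(mtcars)

# Same function, relying on dplyr being attached
glimpse(mtcars)
```

The explicit form is also a useful habit when two attached packages export functions with the same name.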
## **Context**
***
Here is an excerpt from the article itself about the context of the work:
```{r, echo = FALSE}
knitr::include_graphics("www/img/context.png")
```
Many dietary factors have well-established associations with health risk. The authors who generated this data set identified 15 dietary factors with probable health risks based on a literature search.
Here you can see a table of the sources for the health risks associated with the dietary factors. The first column shows the risk factors and the second column shows the health outcomes. This table is part of "Supplemental Table 1. Epidemiological evidence supporting causality between dietary risk factors and disease endpoints" from the paper’s [supplementary materials](https://www.thelancet.com/cms/10.1016/S0140-6736(19)30041-8/attachment/3d4c0258-c2ea-405f-8d11-e9ae65e6f996/mmc1.pdf){target="_blank"}.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/dietaryrisk.png")
```
In the article, the authors found that most of the mortality associated with each factor is related to cardiovascular disease.
```{r, echo = FALSE, out.width = "500px"}
knitr::include_graphics("www/img/cardiorisk.png")
```
## **Limitations**
***
There are some important limitations to keep in mind regarding the data from this article. The definition of certain dietary factors varied across some of the collection sources. Intakes of certain healthy foods, like vegetables and fruits, are likely positively correlated with each other and negatively correlated with intakes of unhealthy foods. Much of the data was collected with 24-hour recall surveys, which are prone to issues due to inaccurate memory recall and other biases, such as a tendency for some people to report healthier behaviors. The guidelines in the PDF are not specified by gender, even though it is known that optimal dietary requirements for certain nutrients differ by gender. The article discusses some limitations about accounting for overall food consumption when calculating consumption of particular foods:
> "To remove the effect of energy intake as a potential confounder and address measurement error in dietary assessment tools, most cohorts have adjusted for total energy intake in their statistical models. This energy adjustment means that diet components are defined as risks in terms of the share of diet and not as absolute levels of exposure. In other words, an increase in intake of foods and macronutrients should be compensated by a decrease in intake of other dietary factors to hold total energy intake constant. Thus, the relative risk of change in each component of diet depends on the other components for which it is substituted. However, the relative risks estimated from meta-analyses of cohort studies do not generally specify the type of substitution."
There are also important nuances to keep in mind regarding some of the dietary factors. For example, calcium consumption was calculated based on consumption of dairy products, although calcium can be acquired from other sources, including plant-based sources. However, in these data, the influence of plant-based calcium consumption was not accounted for, nor was supplementation through vitamin sources.
Also, while [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contains information for groups of individuals described as male or female.
## **What are the data?**
***
We will be using data that we requested from the [Global Burden of Disease (GBD)](http://www.healthdata.org/gbd){target="_blank"} of the [Institute for Health Metrics and Evaluation (IHME)](http://www.healthdata.org/about) about dietary intake, as well as the guideline data about optimal consumption amounts for different foods contained within the PDF of the article. We have two CSV files, dietary_risk_exposure_all_ages_2017.csv and dietary_risk_exposure_sep_ages_2017.csv. The first one includes consumption levels at the global level and for different countries for all ages combined.
Looking at the CSV file in Excel:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/csv.png")
```
Here you can see that the data contains mean consumption values for both men and women in various countries at the national level in 2017, for various foods that may be problematic for health. The units vary by food. For example, the `mean` column in a row that says "Diet low in fiber" indicates the average fiber consumption, in grams per day, per person of that gender in that region.
The second CSV file has similar data, but consumption levels for different age groups are separated.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/age_sep3.png")
```
The authors of this article obtained the data from a variety of sources, including household budget surveys, nutritional surveys using 24-hour recall of food consumption, and 24-hour urinary sodium analyses. Data were also derived from sales data from [Euromonitor](https://www.euromonitor.com/){target="_blank"}, and from estimates of the national availability of specific nutrients from the [United Nations Food and Agriculture Organization (FAO)](http://www.fao.org/home/en/){target="_blank"} and the [United States Department of Agriculture](https://www.usda.gov/){target="_blank"}'s [National Nutrition Database](https://data.nal.usda.gov/dataset/usda-national-nutrient-database-standard-reference-legacy-release){target="_blank"}.
## **Data Import**
***
If you have trouble accessing the [GitHub Repository](https://github.com/opencasestudies/ocs-bp-diet), the data can be downloaded from [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_all_ages_2017.csv) and [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/dietary_risk_exposure_sep_ages_2017.csv).
Let's import our data into R now so that we can explore the data further.
In our case, we downloaded this data and put it within a "raw" subdirectory of a "data" directory for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for importing this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file, which is created when you start a new RStudio project. We can specify that we want to look for the files within the "raw" subdirectory of the "data" directory where our `.Rproj` file is located by separating the name of the "data" directory, the "raw" subdirectory, and the file name with commas.
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects.
</details>
***
```{r, eval=FALSE}
diet_data <- readr::read_csv(here("data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
```{r, echo=FALSE}
diet_data <- readr::read_csv(here("www", "data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("www", "data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
You may also use the `OCSdata` package to download the raw data:
```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-diet", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" sub-folder
# in your current working directory
```
If you used the `OCSdata` package to download the raw data, you can import the data into R like so:
```{r, eval=FALSE}
diet_data <- readr::read_csv(here("OCSdata", "data", "raw",
                                  "dietary_risk_exposure_all_ages_2017.csv"))
sep_age_diet_data <- read_csv(here("OCSdata", "data", "raw",
                                   "dietary_risk_exposure_sep_ages_2017.csv"))
```
First let's just get a general sense of our data. We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).
```{r}
dplyr::glimpse(diet_data)
```
```{r}
glimpse(sep_age_diet_data)
```
Here we can tell that the `sep_age_diet_data` is much larger than the `diet_data`. The `diet_data` has only 5,880 rows while the `sep_age_diet_data` has 88,200 rows!
However, both files appear to have the same column structure with 11 variables each.
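If you would rather check these dimensions directly than read them off the `glimpse()` output, base R's `nrow()` and `ncol()` report them (this assumes the two data frames were loaded as above):

```r
# Dimensions reported by glimpse(), checked directly
nrow(diet_data)          # 5880 rows
nrow(sep_age_diet_data)  # 88200 rows
ncol(diet_data)          # 11 variables, same in both files
```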
The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.
```{r}
skim(diet_data)
```
Notice how there is a column providing the number of missing observations for each variable. It looks like our data is very complete and we do not have any missing data.
We also get a sense about the size of our data.
The `n_unique` column shows us the number of unique values for each of our columns.
Let's take a look at `sep_age_diet_data`.
```{r}
skim(sep_age_diet_data)
```
We can see that there are many more rows in this data set.
Let's change the variable name `rei_name` to `dietary_risk` so that it makes more sense. We can use the `rename()` function from the `dplyr` package.
```{r}
diet_data <- dplyr::rename(diet_data, dietary_risk = rei_name)
sep_age_diet_data <- dplyr::rename(sep_age_diet_data, dietary_risk = rei_name)
glimpse(diet_data)
glimpse(sep_age_diet_data)
```
Looks good!
We will then take a look at the different dietary risk factors considered.
To do this we will use the `distinct()` function of the `dplyr` package.
This function grabs only the distinct or unique rows from a given variable (`dietary_risk`, in our case) of a given data frame (`diet_data`, in our case).
```{r}
dplyr::distinct(diet_data, dietary_risk)
```
Both over- and under-consumption can be health problems!
We will be using the `%>%` pipe for sequential steps in our code later on.
This will make more sense when we have multiple sequential steps using the same data object.
We could do the same code as above using this notation. For example we first grab the `diet_data`, then we select the distinct values of the `dietary_risk` variable.
```{r}
diet_data %>%
  distinct(dietary_risk)
```
OK, so that gives us an idea of what dietary factors we can explore, and we can see that there are 15 of them.
Let's see if the `location_name` values are the same between both CSV files. To do this we will use the `setequal()` function of `dplyr`.
```{r}
dplyr::setequal(
  distinct(diet_data, location_name),
  distinct(sep_age_diet_data, location_name))
```
OK, we got the value of TRUE, so it looks like the same locations are in both files.
Note: In this case we're comparing two different objects, so using the pipe is not as useful.
Let's take a look at the locations included in the data.
#### {.scrollable }
```{r}
# scroll through the output!
sep_age_diet_data %>%
  distinct(location_name) %>%
  pull()
```
####
OK, so there are global values, as well as values for 195 countries.
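As a quick check on that count, `dplyr::n_distinct()` tallies the unique location names; with one global value plus 195 countries, we would expect 196:

```r
# Number of unique locations: 195 countries plus a global entry
dplyr::n_distinct(sep_age_diet_data$location_name)
```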
Let's take a look at the data when we order it by the mean consumption rate column. We can do so using the `arrange()` function of the `dplyr` package.
```{r}
diet_data %>%
  dplyr::arrange(mean) %>%
  glimpse()
```
OK, so it looks like people in Lebanon don't eat very many trans fatty acids.
Let's also figure out how many values there are in each age group of the data that is separated by age. We will use the `count()` function of the `dplyr` package to do this.
```{r}
sep_age_diet_data %>%
  dplyr::count(age_group_name)
```
That's a lot of values!
Let's look a bit deeper to try to understand why.
We can use the `count()` function again but get the number of values for each category within `sex`, `age_group_name` and `location_name` of the data.
```{r}
sep_age_diet_data %>%
  count(sex, age_group_name, location_name)
```
OK, so it looks like these are probably the consumption values for each of the different dietary factors (since there were 15 different factors) for each age group and gender combination within each country.
We can confirm this by filtering the data to one of the age groups, for a single gender, and for a single location. To do this we can use the `filter()` function of the `dplyr` package. Notice that we need to use two equal signs `==` to specify what values we would like for each variable.
```{r}
sep_age_diet_data %>%
  dplyr::filter(sex == "Female",
                age_group_name == "25 to 29",
                location_name == "Afghanistan")
```
This confirms that for each of the 15 dietary factors, our unit of observation is a combination of gender, age and country.
However, before we proceed with our analysis, we will want to perform some additional data wrangling. To do this, we will introduce the `pdftools` package, which will allow us to pull additional data from the manuscript itself.
While all of the mean consumption values are reported in grams, each dietary factor has a different amount that is considered optimal to consume. To make the consumption values more comparable across factors, let's also get some data from the PDF of the paper so that we can calculate consumption of these dietary factors as percentages of the optimal daily amount.
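To preview the idea with made-up numbers before we extract the real guideline values: the column names `mean` and `optimal_g` below are placeholders for illustration, and both values are invented, not taken from the data.

```r
library(dplyr)
library(tibble)

# Toy illustration of expressing consumption as a percentage of an
# optimal daily amount; all numbers here are invented
toy <- tibble(dietary_risk = "Diet low in fiber",
              mean = 40,       # hypothetical grams consumed per day
              optimal_g = 250) # hypothetical optimal grams per day

toy %>%
  mutate(percent_optimum = 100 * mean / optimal_g)  # 40/250 = 16%
```

The real calculation later will use the guideline amounts extracted from the paper's PDF in place of `optimal_g`.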
We are interested in this table on page 3:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Table.png")
```
First let's import the PDF using the `pdf_text()` function of the `pdftools` package.
You can find this file [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/raw/Afshin_et_al_2019.pdf).
```{r, eval=FALSE}
paper <- pdftools::pdf_text(here("data", "raw",
                                 "Afshin_et_al_2019.pdf"))
```
```{r, echo=FALSE}
paper <- pdftools::pdf_text(here("www", "data", "raw",
                                 "Afshin_et_al_2019.pdf"))
```
We can save our imported data as an rda file (stands for R data file) using the `save()` function.
```{r, eval=FALSE}
save(diet_data, sep_age_diet_data, paper, file = here::here("data", "imported", "imported_data.rda"))
```
## **Data Wrangling**
***
If you have been following along but stopped, we could load our imported data like so:
```{r, eval=FALSE}
load(here::here("data", "imported", "imported_data.rda"))
```
```{r, echo=FALSE}
load(here::here("www", "data", "imported", "imported_data.rda"))
```
***
<details> <summary> If you skipped the data import section click here. </summary>
First you need to install and load the `OCSdata` package:
```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```
Then, you may load the imported data using the following code:
```{r, eval=FALSE}
imported_data("ocs-bp-diet", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))
```
If the package does not work for you, an RDA file (R data file) of the data can alternatively be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-diet/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-diet/master/data/imported/imported_data.rda). Download this file and place it in a subdirectory called "imported" within a subdirectory called "data" in your current working directory to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.
```{r, eval=FALSE}
load(here::here("data", "imported", "imported_data.rda"))
```
```{r, echo=FALSE}
load(here::here("www", "data", "imported", "imported_data.rda"))
```
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("www", "img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.
</details>
***
</details>
***
Let's take a look at our manuscript data.
We can use the `base` `summary()` function to get a sense of what the data looks like. By `base` we mean that these functions are part of the `base` package and are loaded automatically on startup of R. Thus, `library(base)` is not required.
```{r}
summary(paper)
```
We can see that we have 15 different character strings, one per page: each string contains the full text of one of the 15 pages of the PDF.
Again, the table we are interested in is on the third page, so let's grab just that portion of the PDF. The top of this page looks like:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/page3.png")
```
```{r}
# Here we will select the 3rd value in the paper object
pdf_table <- paper[3]
summary(pdf_table)
# specifying nchar.max truncates the output
glimpse(pdf_table, nchar.max = 800)
```
Here we can see that the `pdf_table` object now contains the text from the 3rd page as a **single large character string**. However the text is difficult to read because of the column structure in the PDF. Now let's try to grab just the text in the table.
One way to approach this is to split the string by some pattern that we notice in the table.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Table.png")
```
All the rows of interest of the table appear to start with the word `"Diet"`. Moreover, only the capitalized form of the word `"Diet"` appears to be within the table, and it is not present in the preceding text (although `"diet"` is).
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/Diet_on_page3.png")
```
Let's use the `str_split()` function of the `stringr` package to split the data within the object called `pdf_table` by the word `"Diet"`. This splits the text of page 3 into a separate piece at every occurrence of `"Diet"` (but not `"diet"`, as this function is case-sensitive), removing the matched word itself.
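As a quick toy check (using a made-up string, not the manuscript text), we can confirm that `str_split()` only matches the capitalized form:

```{r}
library(stringr)

# only the capitalized "Diet" marks a split point;
# the lowercase "diet" is left inside its piece
str_split("a healthy diet Diet low in fruits Diet low in nuts",
          pattern = "Diet")
# [[1]]
# [1] "a healthy diet "  " low in fruits "  " low in nuts"
```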
In this case we are also using the magrittr assignment pipe, or double pipe, which looks like this: `%<>%` (from the `magrittr` package). This allows us to use the `pdf_table` data as input to the later steps and also reassign the output back to the same object name.
```{r}
pdf_table %<>%
stringr::str_split(pattern = 'Diet')
```
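As a toy illustration (invented string, not the table data), the assignment pipe does the piping and the reassignment in a single step:

```{r}
library(magrittr)
library(stringr)

x <- "Diet low in fruits"
# equivalent to: x <- x %>% str_split(pattern = "Diet")
x %<>% str_split(pattern = "Diet")
x
# [[1]]
# [1] ""               " low in fruits"
```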
Using the `base::summary()` and `dplyr::glimpse()` functions, we can see that we created a list of the pieces of the table that were separated by the word `"Diet"`. We can see that we start with the row that contains `"low in fruits"`.
```{r}
pdf_table %>%
summary()
```
```{r}
pdf_table %>%
glimpse()
```
In order to extract the values that we want from these character strings, we will use some additional functions from the `stringr` package. RStudio creates really helpful cheat sheets like this one which shows you all the major functions in the `stringr` package. You can download others [here](https://rstudio.com/resources/cheatsheets/){target="_blank"}.
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/strings-1_str_split.png")
```
You can see that we could have also used the `str_split_fixed()` function, which separates the substrings into the columns of a matrix; however, we would need to know in advance how many substrings, or pieces, we want returned.
For more information about `str_split()` see [here](http://rfunction.com/archives/1499){target="_blank"}.
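For instance, here is a small sketch (toy input, not the manuscript text) of how `str_split_fixed()` behaves when we specify `n`, the number of pieces:

```{r}
library(stringr)

# n = 3 forces exactly three matrix columns;
# the remaining delimiters stay inside the last piece
str_split_fixed("a,b,c,d", pattern = ",", n = 3)
#      [,1] [,2] [,3]
# [1,] "a"  "b"  "c,d"
```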
Let's separate the values within the list using the base `unlist()` function. This will allow us to easily select the different substrings within the object called `pdf_table`.
```{r}
pdf_table %<>%
unlist()
```
It's important to realize that the text before the first occurrence of `"Diet"` becomes the first value in the output. (This is why there are 17 elements in `pdf_table` rather than 15, the number of rows in the table.) We could use the `first()` function of the `dplyr` package to look at this value. However, we will suppress the output, as it is quite large.
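A toy example (made-up string) makes this concrete: splitting on two occurrences of a word yields three pieces, and the text before the first match is piece one:

```{r}
library(stringr)

pieces <- unlist(str_split("preceding text Diet low in fruits Diet low in nuts",
                           pattern = "Diet"))
length(pieces)
# [1] 3
pieces[1]
# [1] "preceding text "
```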
```{r, eval = FALSE}
dplyr::first(pdf_table)
```
Instead, we can take a look at the second element of the list using the `nth()` function of `dplyr`.
```{r}
nth(pdf_table, 2)
```
Indeed this looks like the first row of interest in our table:
```{r,echo = FALSE,out.width= "800px"}
knitr::include_graphics("www/img/firstrow.png")
```
Using the `last()` and the `nth()` functions of the `dplyr` package we can take a look at the last values of the list.
```{r}
# to see the second to last value we can use nth()
# the -2 specifies that we want the second-to-last value
# -3 would be third-to-last and -1 would be the last value
dplyr::nth(pdf_table, -2)
# to see the very last value we can use last()
dplyr::last(pdf_table)
```
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/end_of_table.png")
```
We don't need this part of the table or the text before the table if we just want the consumption recommendations.
So we will select the second through the second-to-last of the substrings. Since we have seventeen substrings, we will select the second through the sixteenth. However, rather than selecting by index, a better approach is to select based on phrases unique to the table text that we want. We will use the `str_subset()` function of the `stringr` package to select the table rows with consumption guidelines. Most of the rows contain the phrase "Mean daily consumption"; however, some rows use other phrases, including "Mean daily intake" and "24 h sodium". So we will subset for each of these phrases.
```{r}
# one could subset the pdf_table like this:
# pdf_table <- pdf_table[2:16]
pdf_table %<>%
str_subset(pattern = "Mean daily consumption|Mean daily intake|24 h")
```
Notice that we separate the different patterns to look for using the vertical bar character `"|"`, and that all of the patterns are within quotation marks together.
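As a small sketch with invented strings (not the real table rows), the alternation keeps any string matching at least one of the patterns:

```{r}
library(stringr)

rows <- c("Mean daily consumption of fruits",
          "Mean daily intake of calcium",
          "24 h sodium measurement",
          "Some footnote text we do not want")
str_subset(rows, pattern = "Mean daily consumption|Mean daily intake|24 h")
# [1] "Mean daily consumption of fruits" "Mean daily intake of calcium"
# [3] "24 h sodium measurement"
```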
#### {.think_question_block}
<u>Question opportunity:</u>
1) What other string patterns could you use to subset the rows of the table that we want?
2) Why might it be better to subset based on the text rather than the index?
####
Now the first row is what we want:
```{r}
first(pdf_table)
```
And the last row is what we want:
```{r}
last(pdf_table)
```
At this point, having a better look at the current representation of the table data in R, we might notice something that will need to be fixed. In the string above, the decimal points from the PDF have been read in as something called an interpunct instead of a period or decimal point. An interpunct is a centered dot, as opposed to a period, which is aligned to the bottom of the line.
The interpunct was previously used to separate words in certain languages, like ancient Latin.
<p align="center">
<img width="400" src="https://www.yourdictionary.com/image/articles/3417.Latin.jpg">
</p>
###### [[source](https://www.yourdictionary.com/image/articles/3417.Latin.jpg)]
You can produce an interpunct on a Mac like this:
<p align="center">
<img width="400" src="https://www.shorttutorials.com/mac-os-special-characters-shortcuts/images/middle-dot.png">
</p>
###### [[source](https://www.shorttutorials.com/mac-os-special-characters-shortcuts/middle-dot.html)]
It is important to replace these for later when we want these values to be converted from character strings to numeric. We will again use the `stringr` package. This time we will use the `str_replace_all()` function which replaces all instances of a pattern in an individual string. In this case we want to replace all instances of the interpunct with a decimal point.
```{r}
pdf_table %<>%
stringr::str_replace_all(pattern = "·",
replacement = ".")
last(pdf_table)
```
Looks good!
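To see why `str_replace_all()` matters here rather than `str_replace()`, compare the two on a toy string (invented, not from the paper):

```{r}
library(stringr)

x <- "3·0 (2·5 to 3·5)"
str_replace(x, pattern = "·", replacement = ".")      # only the first match
# [1] "3.0 (2·5 to 3·5)"
str_replace_all(x, pattern = "·", replacement = ".")  # every match
# [1] "3.0 (2.5 to 3.5)"
```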
Now we will try to split the string for each row based on the presence of two or more spaces to create the columns of the table, as there appears to be more than one space between the columns. In the printed output, each resulting substring will appear within its own pair of quotes.
For additional details, the second page of the `stringr` cheat sheet has more information about using "Special Characters" in `stringr`. For example `\\s` is interpreted as a space as the `\\` indicates that the `s` should be interpreted as a special character and not simply the letter s. The `{2,}` indicates two or more spaces, while `{2}` would indicate exactly two spaces.
```{r, echo = FALSE,out.width = "800px"}
knitr::include_graphics("www/img/strings-2_highlight.png")
```
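Before applying the pattern to the real rows, here is a toy check (invented string) of how `\\s{2,}` treats single versus multiple spaces:

```{r}
library(stringr)

# single spaces survive inside a piece;
# runs of two or more spaces mark the column boundaries
str_split("low in fruits  Mean daily consumption  250 g", pattern = "\\s{2,}")
# [[1]]
# [1] "low in fruits"          "Mean daily consumption" "250 g"
```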
#### {.scrollable }
```{r}
table_split <- str_split(string = pdf_table,
pattern = "\\s{2,}")
glimpse(table_split) #scroll the output!
```
####
Now we can see that each of our 15 strings has been split into pieces, but unfortunately, the split was not completely consistent across dietary factors. Why did this happen? If we look closely, we can see that the sugar-sweetened beverage and seafood categories had only one space between the first and second columns: the column with the dietary category and the one that describes in more detail what the consumption suggestion is about.
The values for these two columns appear to be together still in the same substring for these two categories. We can see this because there are no quotation marks adjacent to the word `"Mean"`.
Here you can see how the next substring should have started with the word `"Mean"` by the new inclusion of a quotation mark `"`. The red rectangles indicate the problematic substrings, while the green rectangles show examples where the split worked correctly.
```{r, echo = FALSE, out.width = "700px"}
knitr::include_graphics("www/img/substring_sep.png")
```
We can add an extra space in front of the word `"Mean"` for these particular categories and then try splitting again.
Since we originally split based on two or more spaces, we can add a single space in front of the word "Mean" for these cases and then try splitting again. We can use the `str_which()` function of the `stringr` package to find the indices of these particular cases.
```{r}
pdf_table %>%
str_which(pattern = "seafood|sugar")
```
Here we can use the `str_subset()` function of the `stringr` package to see just the strings that match these patterns within `pdf_table`:
```{r}
pdf_table %>%
str_subset(pattern = "seafood|sugar")
```
This is equivalent to using the `str_which()` function with `[]`:
```{r, eval = FALSE}
pdf_table[str_which(pdf_table, pattern = "seafood|sugar")]
```
Now we can replace these values within the pdf_table object after adding a space in front of "Mean":
```{r, eval=FALSE}
idx <- str_which(pdf_table, pattern = "seafood|sugar")
pdf_table[idx] <- str_replace(string = pdf_table[idx],
                              pattern = "Mean",
                              replacement = " Mean")
```
And now we can try splitting again by two or more spaces:
```{r, eval=FALSE}
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```
We could also just add a space in front of all the values of "Mean" in `pdf_table` since the split was performed based on two or more spaces. Thus the other elements in `pdf_table` would also be split just as before despite the additional space. Try this out yourself!
```{r, echo=FALSE}
save(pdf_table, file = here::here("www", "exercise", "dw_code1.rda"))
```
```{r DW_Code1-setup}
library(tidyverse)
library(magrittr)
load(here::here("www", "exercise", "dw_code1.rda"))
```
```{r DW_Code1, exercise=TRUE, eval=FALSE}
# fill in the blanks
pdf_table <- pdf_table %>%
stringr::___________(_______ = "Mean",
___________= " Mean")
table_split <- str______(pdf_table, pattern = "_______")
glimpse(table_split) # compare your output with the one below
```
```{r DW_Code1-hint-1}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str______(pdf_table, pattern = "_______")
glimpse(table_split) # compare your output with the one below
```
```{r DW_Code1-solution}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
glimpse(table_split) # compare your output with the one below
```
```{r, echo = FALSE}
pdf_table <- pdf_table %>%
stringr::str_replace(pattern = "Mean",
replacement = " Mean")
table_split <- str_split(pdf_table, pattern = "\\s{2,}")
```
***
<details> <summary> Click here to see desired output </summary>
```{r}
#scroll the output!
glimpse(table_split)
```
Looks better!
</details>
***
We want just the first (the food **category**) and third column (the optimal consumption **amount** suggested) for each row in the table. However, the table is currently stored as a list of character vectors, so it is not quite so simple to extract these values.
We can use the `map` function of the `purrr` package to accomplish this.
The `map()` function allows us to perform the same action on each element within an object, in this case, a list.
The following will allow us to select the first or third substring from each element of the `table_split` object.
```{r}
category <- map(table_split, 1)
amount <- map(table_split, 3)
head(category)
head(amount)
```
Now we will create a `tibble` using this data. However, currently both `category` and `amount` are of class `list`. To create a `tibble` we need to unlist the data to create vectors.
```{r}
class(category)
category %<>% unlist()
amount %<>% unlist()
class(category)
```
#### {.scrollable }
```{r}
category
amount
```
####
We could have done all of this in one command for each column like this:
```{r, eval = FALSE}
category <- unlist(map(table_split, 1))
amount <- unlist(map(table_split, 3))
```
Now we will create a `tibble`, the central data frame structure of the tidyverse, which allows us to use the other tidyverse packages with our data.
We will name our `tibble` columns as we create the `tibble` using the `tibble()` function of the `tibble` package (also re-exported by other tidyverse packages), as names are required in tibbles.
```{r}
guidelines <- tibble::tibble(
category = category,
amount = amount
)
guidelines
```
Looking pretty good!
### **Separating values within a variable**
***
Recall that the main goal of this data wrangling is to extract the optimal intake level for each dietary factor. So while we have managed to pull and organize the data from the pdf table, we need to further process the results to isolate this numeric value.
To do this, we want to separate the different numbers within the `amount` column to isolate the optimal amount and the optimal range, and eventually convert them to numeric values.
Recall what the original table looked like:
```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics("www/img/firstrow.png")
```
We can use the `tidyr::separate()` function to separate the data within the amount column into three new columns based on the optimal level and the optimal range. We can separate the values based on the open parentheses `"("` and the long dash `"–"` characters. Again we will use the bar `"|"` to indicate that we want to separate by either character.
```{r}
# The first column will be called optimal
# It will contain the 1st part of the amount column data before the "("
# The 2nd column will be called lower
# It will contain the data after the "("
# The 3rd column will be called upper
# It will contain the 2nd part of the data based on the "–"
# The "\\" are necessary - we will explain very soon
guidelines %<>%
tidyr::separate(amount,